LINEAR REGRESSION AND ITS APPLICATION TO ECONOMICS

BY ZDZISLAW HELLWIG

Translated from the Polish by J. STABLER
Translation edited by H. INFELD

PERGAMON PRESS
OXFORD · LONDON · NEW YORK · PARIS
PANSTWOWE WYDAWNICTWO EKONOMICZNE
WARSZAWA 1963

PERGAMON PRESS LTD., Headington Hill Hall, Oxford; 4 & 5 Fitzroy Square, London W.1
PERGAMON PRESS INC., 122 East 55th Street, New York 22, N.Y.
GAUTHIER-VILLARS, 55 Quai des Grands-Augustins, Paris 6
PERGAMON PRESS G.m.b.H., Kaiserstrasse 75, Frankfurt am Main

Copyright 1963 PERGAMON PRESS LTD.
Library of Congress Card Number 62-21781

Set by Panstwowe Wydawnictwo Ekonomiczne
Printed in Poland by Drukarnia im. Rewolucji Pazdziernikowej
CONTENTS

Introduction vii

1. Regression and correlation 1
1.1. General comments on regression and correlation 1
1.2. Two-dimensional random variables 12
1.2.1. Definitions and symbols 12
1.2.2. Two-dimensional discrete random variables 13
1.2.3. Two-dimensional continuous random variables 18
1.2.4. Moments of a two-dimensional variable 21
1.2.5. Regression I 25
1.2.6. Regression II 27
1.2.7. Linear regression 29
1.2.8. Correlation. Correlation ratio and correlation coefficient 35
1.2.9. The two-dimensional normal distribution 43
1.3. Non-linear regression in R2 49

2. The application of regression and correlation to economic research 52
2.1. On the relation between economics and mathematics, statistics and econometrics 52
2.2. More important applications of regression and correlation theory in economic research 55
2.2.1. Want curves 55
2.2.2. Income distribution curves 63
2.2.3. Demand curves 71
2.2.4. Cost curves 79
2.2.5. Time curves 88
2.2.6. Techno-economic curves 94

3. Estimating linear regression parameters 98
3.1. General remarks about methods of estimating 98
3.2. Estimating linear regression parameters by the method of least squares 100
3.2.1. The derivation of formulae. Examples 100
3.2.2. The technique of computing regression parameters in a small and a large sample. Contingency table 114
3.3. Estimating linear regression parameters by the two-point method 124
3.3.1. The derivation of formulae 124
3.3.2. The technique of computing regression parameters in a small and a large sample. Examples 128
3.3.3. The properties of estimates obtained by the two-point method 137
3.3.4. Comments on estimating the correlation coefficient by the two-point method 139

4. On testing certain statistical hypotheses 143
4.1. Two tests to verify the hypothesis that the distribution of the general population is normal 143
4.1.1. The formulation of the problem 143
4.1.2. Testing hypothesis H0 by rotating the coordinate system (method A) 145
4.1.3. Testing the hypothesis H0 by dividing the plane into quadrants (method B) 153
4.2. Checking the hypothesis that the regression lines in the general population are straight lines 159
4.2.1. General comments 159
4.2.2. Testing hypothesis HL in a small sample by a run test 160
4.2.3. Testing hypothesis HL in a large sample by Fisher's test 165
4.3. An analysis of the significance of regression parameters 166
4.4. An analysis of the significance of the correlation coefficient 176

5. The transformation of curvilinear into linear regression 180

6. The regression line and the trend 193
6.1. The definition of trend 193
6.2. Some tests for verifying hypothesis Ht 198
6.2.1. The run test 198
6.2.2. The χ² test 201
6.2.3. Pitman's test 202
6.3. The determination of trend ex post and ex ante 205

Appendix 213
List of symbols 223
Tables 229
Bibliography 235
INTRODUCTION

Increased interest in research methods employing numbers has been shown in economics in recent years. More and more tables, graphs and formulae are to be found in economic publications: in textbooks, monographs, articles and studies. Naturally, the use of formal mathematical methods of research must always be subordinate to qualitative analysis because of the complex character of socio-economic phenomena and processes. Economic research concentrates on the individual man as an economic agent, on his behaviour as a producer or consumer, on his individual and social needs, on his customs, psychological reactions, tastes, likes and dislikes.

Economics is one of the social sciences. It studies laws and relationships governing the social activities of men, particularly in the processes of production, distribution, exchange and consumption. Although in economic studies phenomena are primarily analysed in their qualitative aspects, it does not follow that their quantitative aspects may be neglected. Thus, for instance, economists analyse and attempt to explain the relationships between wages and the productivity of labour, costs and production, demand and price, as well as personal income, the productivity of labour and the mechanization and automation of the production processes, national income and investment expenditures, etc. In order to provide information that is as complete as possible, an analysis of these relationships, besides explaining the mechanism, must enable us to predict the behaviour of one of the related phenomena when the behaviour of the others is known. This is usually impossible when the phenomena studied are not measurable and the relationships existing between them cannot be presented in functional form. In economic studies it can be seen at almost every step how closely interrelated are the qualitative and quantitative aspects of phenomena. For this very reason a correct method of economic analysis cannot ignore the importance of the quantitative approach to the description and analysis of phenomena.
Every day new and more specialized methods of research using a more or less complex mathematical apparatus are being adapted to economic studies. Usually they are statistical methods¹, but because of the sphere of their application they are generally referred to as econometric methods.
Very great progress has been made recently in econometric research and many interesting books dealing with econometrics have been published. These books contain ample evidence that the most valuable statistical methods applicable to economic analysis are those belonging to the theory of correlation and regression.
In economic applications of regression theory, linear regression is of greatest importance. This is so for many reasons, of which the most important are:

1) linear regression is a simpler concept than curvilinear regression and the calculations involved are much less complicated;

2) linear regression appears most frequently in practice; as is well known, the regression lines in a two-dimensional normal distribution are straight lines; therefore, in studying two-dimensional populations we deal with linear regression at least as often as with a normal distribution. An explanation of why a normal distribution appears so frequently is, in turn, found in the Central Limit Theorem;

3) curvilinear regression can often be replaced by linear regression which provides an approximation close enough for practical purposes;

4) curvilinear regression may be reduced to linear regression by replacing the curve by linear segments;

5) linear regression is of particular importance for multi-dimensional variables. It is known that the nature of a function approximating regression I may be inferred from a scatter diagram. When the number of variables is greater than three such a diagram cannot be drawn, and common sense indicates that linear regression (being the simplest) should be used in such cases.

¹ Seldom mathematical in the narrow sense of this word. An example of a mathematical method may be found in the analysis of inter-branch flows (or input-output analysis), or in programming (linear, non-linear, dynamic).
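Point 4) can be illustrated numerically. The sketch below is an editorial illustration, not an example from the book: the curve y = x², over an arbitrarily chosen interval, is replaced by chords (linear segments) over an increasing number of equal sub-intervals, and the worst-case approximation error is seen to shrink.

```python
# Illustration of point 4): replacing a curve by linear segments.
# The curve y = x^2 and the interval [0, 4] are arbitrary choices.

def piecewise_linear_error(f, a, b, n_segments, n_probe=1000):
    """Largest absolute gap between f and its chord approximation on [a, b]."""
    worst = 0.0
    for i in range(n_probe + 1):
        x = a + (b - a) * i / n_probe
        # locate the sub-interval containing x
        k = min(int((x - a) / (b - a) * n_segments), n_segments - 1)
        x0 = a + (b - a) * k / n_segments
        x1 = a + (b - a) * (k + 1) / n_segments
        # chord (linear segment) joining (x0, f(x0)) and (x1, f(x1))
        y_lin = f(x0) + (f(x1) - f(x0)) * (x - x0) / (x1 - x0)
        worst = max(worst, abs(f(x) - y_lin))
    return worst

curve = lambda x: x * x
errs = [piecewise_linear_error(curve, 0.0, 4.0, n) for n in (1, 2, 4, 8)]
print(errs)
```

Doubling the number of segments roughly quarters the worst-case error, which is why a modest number of segments is usually "close enough for practical purposes".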
This book has been written primarily for scientists in economic, agricultural and technical colleges who deal with economic problems in their research. It is also addressed to graduates of economic and technical colleges employed in different branches of the national economy who because of their occupation have frequent occasion to use statistical methods in studying the relationships between phenomena. To this group belong primarily those engaged in planning, cost accounting, economic analysis, statistics, time and motion studies, inventory control, and technology. This book may also be of some help to students in day and correspondence courses run by schools of economics and business.

In order to use it with ease it is necessary to have a basic knowledge of calculus and of some elements of the theory of probability and mathematical statistics. Since economists are usually interested in the humanities and their knowledge of mathematics is often rather scanty, the outline of this book is so designed as to facilitate its use by enabling the reader to omit the more difficult parts without interfering with his understanding of the whole exposition. Those parts that require a better knowledge of mathematics and statistics are marked with one asterisk (*) at the beginning and with two asterisks (**) at the end. They may be omitted without lessening the understanding of the main ideas behind the methods presented, or the mastering of the computation technique. If the more difficult parts are omitted this book is accessible even to those whose background in mathematics and statistics is quite modest, so that the circle of its readers may be quite wide. Even though they will not be able to learn about all the more formal aspects of statistical research methods and of descriptions of relationships existing among phenomena which are presented in this book, they can learn the intuitive and computational aspects of these methods. This, of course, is most important from the point of view of the wide dissemination of the methods presented.
The book has been divided into 6 chapters. Chapter 1 constitutes the background for the whole work. It comprises the elementary concepts and the more important definitions and theorems concerning two-dimensional and multi-dimensional random variables. This chapter also contains an explanation of the symbols and terms used in the book. In Chapter 2 the more important applications of correlation methods to economics are reviewed. So far, correlation methods have rarely been used in economic analysis. The review of applications given in Chapter 2 and the numerous examples quoted in the following chapters will illustrate the usefulness of statistical methods in analysing the relationships among random variables.

In Chapter 3 methods of estimating regression parameters are discussed. Chapter 4 deals with methods of testing some statistical hypotheses important for practical applications of correlation analysis. Particularly worth noting are the non-parametric tests for verifying the hypothesis that the two-dimensional population is normal, and a non-parametric test for verifying the hypothesis that the regression in the population is a linear regression. In Chapter 5 methods of transformation of curvilinear regression into linear regression are discussed. In examples illustrating the computational technique of determining regression parameters a new method called the two-point method has been used. In the last chapter (6) an attempt has been made at a new approach to the problem of trend. It is known that the determination of trend parameters, in a formal sense, does not differ from the determination of regression parameters. There are, however, differences of substance between the trend line and the regression line, and, therefore, it is necessary to define the trend line in a way different from the definition of the regression line. This definition is given in Chapter 6, which is also a concluding chapter, in order to emphasize the fact that correlation methods can be used not only in static but also in dynamic research.

This work deals with two-dimensional variables. Most of the results obtained, however, may be generalized and applied to multi-dimensional variables. The author has tried to use diverse types of statistical data so as to create a broad basis for checking the usefulness of the two-point method he proposes for determining regression line parameters. For this reason, the work contains not only the results of the author's own research, but also statistical data from the works of other authors. Figures in rectangular brackets [ ] denote the numbers of the items in the Bibliography. Statistical data used in the book are quoted either in the text or in the Appendix at the end of the book.

The work is divided into chapters, sections and items. The decimal system is used in denoting them. The first figure denotes the chapter, the second the section, and the third the item. Thus, 2.2.3. denotes the third item of the second section of the second chapter. Formulae, tables and graphs are numbered separately for each numbered part of the book.
1. REGRESSION AND CORRELATION

1.1. General comments on regression and correlation
It is commonly known that one of the basic elements of the learning process is the scientific experiment. However, there are sciences in which it is very difficult to experiment, especially if the word experiment is understood to mean studying the object in question under conditions artificially created for this purpose. To this category belong, primarily, the social sciences, and among them again in first place economics interpreted in the broad sense of the word (i.e. not only political economy, but also all the related economic disciplines).
In those sciences in which experimenting is impossible, the process of learning is particularly difficult or cumbersome. One of the objectives of an experiment is to establish a causal relation between the phenomenon studied and other phenomena. To achieve this purpose a large number of experiments have to be carried out, and in the process the influence of those factors which may be related to the phenomenon studied is gradually eliminated and observations regarding the behaviour of the phenomenon are made in isolation. In this way experimenting may help in recognizing which factors exert an essential influence on the behaviour of the phenomenon studied, and which affect it slightly, non-essentially or not at all. If, for any reason, experimenting is impossible (e.g. when it creates a danger to human life, is too costly, or technically impossible) then the process of learning must take its course under natural conditions. In such cases the search for a causal relation between the phenomenon studied and its environment is particularly difficult, because then the question arises how to classify the phenomena into those which essentially affect the behaviour of the phenomenon studied and those with negligible influence. The answer to this question is provided by statistics. The importance of statistical methods of analysis is of the highest order for those sciences in which it is difficult to experiment, even if these sciences are as widely separated as demography and quantum physics.

Suppose that we are interested in two phenomena, A and B. We have to find out whether or not one affects the other.
If both phenomena can be expressed numerically then for the description of their mutual relationship we may use the mathematical apparatus provided by the theory of functions. Such sciences as physics and mechanics very often make use of mathematical functions to describe the relationship existing between two phenomena. The quantities with which these sciences deal may be considered as ordinary mathematical variables. Thus, for instance, distance travelled is a function of time and speed; voltage is a function of current intensity and resistance; work is a function of power and distance. Besides phenomena which have a relationship so close that it may be regarded, for practical purposes, as functional, there are others among which the relationship is weak and obscured by the impact of many other forces of a secondary nature which cannot be eliminated while making observations. This type of relationship occurs when there exists an interdependence between random variables. We say that two random variables are stochastically dependent (as distinct from functionally) if a change in the distribution of one of them causes a change in the distribution of the other (see [16], p. 364).

Example 1. We are interested in the problem of the effect of nutrition on the length of human life.
The scatter diagram is shown on Graph 1.

GRAPH 1. [Scatter diagram: average food consumption per person (thousand calories) on the x-axis, expectation of life on the y-axis. The numbered points correspond to the countries listed beside the graph: Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, Czechoslovakia, Denmark, Egypt, Germany, Greece, Honduras, Hungary, Iceland, India, Ireland, Italy, Japan, Mexico, the Netherlands, New Zealand, Norway, Panama, Poland, Portugal, Siam, Spain, Sweden, Switzerland, the United Kingdom and the USA.]

Note. The calories in the graph above are "vegetable calories", in which an allowance is made for a higher value in calories obtained from proteins.

The x-axis represents average food consumption, measured in calories, per person; the y-axis the expectation of life of man. On the graph we see a collection of points. Each point may be regarded as a realization (x, y) of a two-dimensional random variable (X, Y), where X denotes the consumption of food and Y the expectation of life of man. The points are numbered. On the right side of the graph there is a list of countries, all numbered for easy identification of the points corresponding to particular countries. The distribution of the points on the graph shows a
clear tendency. It is expressed by a curve drawn among the points. This curve is called a regression line. The ordinates of the curve give the expectation of life of man in different countries corresponding to different values of the average food consumption in those countries. It follows that a regression line is a functional expression of a stochastic relationship between random variables X and Y. If the regression line is a straight line we call it a linear regression. Its practical importance is of a very high order. When the relationship studied pertains not to two, but to a greater number of random variables, then instead of a regression line we get a regression plane (with three variables) or a regression hyperplane (with four or more variables). In such cases we deal with a multi-dimensional regression.
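The fitting of such a straight line can be sketched with the method of least squares (treated in Chapter 3; the author's own two-point method comes later). The code below is an editorial sketch with invented data, not an example from the book.

```python
# Ordinary least squares fit of the straight line y = a + b*x.

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # invented observations, roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))   # prints 0.05 1.99
```

The fitted line passes through the point of means and minimizes the sum of squared vertical deviations of the observed points from the line.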
We have said above that the regression line expresses a certain tendency in a distribution of points (x, y) on a scatter diagram. Particular points, as a rule, do not lie on a regression line, but are more or less removed from it, and the majority of points are grouped around this line. The regression line expresses a relationship between the random variables Y and X. The deviations of particular points from the regression line may be considered a result of the influence of a variety of random factors. Let us imagine that a study of the interdependence between the length of human life and the consumption of food may be carried out under conditions ensuring the absolute elimination of the influence of any other factors besides food consumption on the length of life. (In practice, of course, this is impossible.) It could be surmised that under such circumstances the points on the graph would lie almost exactly along a certain curve, which would mean that the relationship between variables Y and X is functional and not stochastic. It might be expected that the shape of such a curve would approximate the shape of the regression curve shown on Graph 1.
The importance of the regression line as a tool of learning consists in the fact that it permits us to relate to any value of one variable the expected or most probable value of the other variable. This is of particular importance in cases when accurate observation of the values of one of the variables encounters substantial difficulties. Let us suppose, for instance, that we want to estimate the amount of timber that can be obtained from 1,000 hectares of forest. If we know the regression line describing the relationship between the amount of timber in a tree trunk and the circumference of the trunk measured at a certain height, we can easily solve the problem.

The regression line is a tool of scientific prediction; if we know the value of one variable the regression line allows us to estimate the corresponding value of the other variable. The less the particular points deviate from the regression line, the better and the more accurate will be our estimate, because then a stochastic relationship is transformed into a functional relationship. The influence of random factors disappears and that of a regular factor is revealed more clearly. The bond between the two variables studied becomes stronger. The whole group of problems related to measuring the strength of the relationship between random variables is the subject of the branch of statistics called correlation theory¹. The measures of correlation most frequently used are: the correlation coefficient and the correlation ratio.

Statistical methods of studying interdependence between random variables allow us not only to measure the strength of this interdependence but also to verify the hypothesis that two variables are correlated with one another. The objective of every science is to discover and to explain causal relations existing between phenomena. Sometimes such relations are strong and immediately apparent. Often, however, they are weak and hidden among many diverse relationships existing between the phenomenon studied and the outside world. The researcher, on the basis of his scientific analysis, assumes the hypothesis that there exists a causal relationship between two defined phenomena. It may happen that it is impossible to test this hypothesis by a direct experiment. Correlation theory has at its disposal methods which allow us, in many cases, to verify such hypotheses.

¹ The word "correlation" in statistics means interdependence between random variables.
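In its most common form (the Pearson product-moment coefficient) the correlation coefficient mentioned above is r = cov(X, Y) / (sd(X) · sd(Y)), and it always lies between -1 and +1. The sketch below is an editorial illustration with invented data:

```python
# The (Pearson) correlation coefficient for paired observations.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # ~ covariance
    sxx = sum((x - mx) ** 2 for x in xs)                     # ~ variance of X
    syy = sum((y - my) ** 2 for y in ys)                     # ~ variance of Y
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # exact linear dependence: 1.0
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # exact negative dependence: -1.0
```

A value near zero indicates only that no linear relationship is visible, not that the variables are independent.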
Example 2. The hypothesis has been postulated that an increase in the consumption of animal protein reduces fertility. This hypothesis seems to be fairly unexpected. It cannot be tested by experimenting. Its verification, however, can be carried out on the basis of statistical data contained in Table 1 (see [5], p. 82).

TABLE 1. THE RELATIONSHIP BETWEEN THE BIRTH RATE AND CONSUMPTION OF PROTEIN
[Table data not recoverable.]

Even a casual glance at the data contained in this table indicates that the birth rate decreases as the consumption of animal protein increases. We are dealing here with a case of negative correlation. This term is used to define the type of correlation in which an increase in the value of one random variable is accompanied by a decrease in the value of the other. The relationship between the birth rate and the consumption of animal protein becomes even more apparent on a scatter diagram (Graph 2).

GRAPH 2. [Scatter diagram: consumption of protein (gm/person/day) on the x-axis against the birth rate.]

The trend in the distribution of the points on the graph is very distinctly marked. Of course, neither the table nor the graph provides a basis for accepting the hypothesis. Methods for testing hypotheses of this kind will be discussed later.
We shall here, however, take the opportunity to say a few words about apparent or spurious correlation. This is the type of correlation in which a relationship appears between statistical series, but there is no causal relation between the phenomena described by these series. For instance, let phenomenon A be causally related to phenomenon B and phenomenon C. There will be a correlation between the statistical series describing A and the series describing B. The same situation will exist with regard to C, which will show correlation with A. It is clear that owing to the correlation between the series describing A and B and the correlation between the series describing A and C, there may also appear a correlation between the series describing B and C. However, this type of relationship between statistical series is of a formal mathematical nature, since there is no direct causal relation between B and C.

Statistical experience supplies many illustrations of spurious relationships. For example Tschuprow [59] states that the statistics on compulsory fire insurance in prewar Russia showed an unusually close relationship between the average number of
buildings destroyed in one fire and the application of fire engines to extinguishing fires. The evidence shows therefore that the losses caused by a fire were greater in cases where fire engines were used, and smaller in cases where they were not used. This might be taken to indicate that in order to reduce losses caused by fires the use of fire engines should be abandoned. The explanation was of course that fire brigades were usually called only in the more serious cases. When a single building was on fire and there was no danger that it might spread to other buildings, fire brigades usually did not interfere. In this case, then, there is an interdependence between the intensity of a fire and the participation of a fire brigade which uses fire engines. There is no causal relationship, however, between the application of fire engines and the number of
buildings destroyed. This is an example of a spurious relationship. In prewar Russian statistics we find further interesting examples of spurious relationships. For instance, it has been established on the basis of abundant statistical material that when a doctor assisted in childbirth, the percentage of stillborn children was higher than in cases delivered by a midwife. At first glance it might appear that in order to reduce infant mortality, doctors should not be called, a conclusion which is obviously absurd. It turns out that the relationship observed is a spurious relationship. It should be remembered that formerly the doctor
was called at childbirth only in serious cases which often ended in the death of the infant. Hence the numerical relationship between infant mortality and the assistance of qualified
medical personnel.
The relationship between a drop in crop yields and the number of fires, observed by statisticians, also belongs to the category of spurious correlations. The number of fires increased in the years when precipitation was low. In the same years yields were lower than average.
An amusing case of spurious correlation between the number of registered births and the number of storks has been noted in Scandinavian countries on the basis of abundant statistical data.
The above examples of spurious correlation prove that categorical judgments about the existence of causal relations should not be formed on the basis of numerical relationships. A causal relation may exist but does not necessarily exist. When there is a causal relation between observed phenomena it may be expected that there will also be a numerical relationship. This relationship may sometimes appear very distinctly and sometimes less distinctly, or it may be so weak that it will hardly be noticeable. Such a relationship exists, however, when there is a causal relation between the phenomena studied. Inferring that there is a causal relation on the basis of a numerical interdependence may lead to an absurd conclusion, as we have seen from the above examples. Their very absurdity protects us from accepting them. However, if the conclusions resulting from a hypothesis based on mathematical premises that a causal relation exists between phenomena are not absurd, then the temptation to accept them may be very strong. One must not yield to such temptations. If there is a relationship between two statistical series the following cases are possible:

1) there is a causal relation between the phenomena described by these series;
2) there is no direct causal relation between the phenomena described by these series, because they are correlated with another, unknown series describing a phenomenon causally related to the phenomena studied by us;

3) the observed relationship is accidental.

Choosing the first possibility would be tantamount to giving it priority over the remaining two without any foundation. The mere existence of a correlation between statistical series is only a signal that there may exist some direct or indirect relation between the phenomena described by these series.
Summing up, we may formulate the following rules:

1) if a causal relation has been discovered between two phenomena then correlation analysis may be used to determine the strength of this relation;

2) if a causal relation has not been discovered, but it may be assumed that such a relation between the studied phenomena does exist, then the appearance of a distinct correlation on the basis of more abundant statistical material substantially strengthens the hypothesis that a causal relation exists;

3) finally, if before making observations there were no grounds for postulating a relationship between two phenomena, but after observations have been completed and statistical material compiled a distinct correlation between statistical series can be noticed, then there is reason to assume that a causal relation may exist between the phenomena studied. It follows that even a formal analysis of a numerical relationship is fully justified since it may lead to a scientific discovery.

The usefulness of the rules formulated above becomes particularly apparent in Example 2, where we deal with a very strong interdependence between two statistical series: between the series on the birth rate in different countries and the series on the daily consumption of animal protein in those
countries. In spite of a strong correlation between these series it should not be inferred on this basis that there exists a causal relation between fertility and the consumption of animal protein. Various sciences are engaged in discovering and explaining causal relations. Statistics facilitates these tasks for them by supplying useful research tools.
In economic research, regression analysis and correlation analysis find very many applications. In company economics, or "micro-economics", the whole cost and effectiveness theory¹ has been worked out for industrial enterprises on the basis of regression analysis. Amongst studies in this field the following should be named: [9], [12], [13], [18], [19], [23], [37], [49], [57]. Correlation analysis can also be applied in the economics of the firm to the analysis of the velocity of circulation of liquid assets, to studies on the productivity of labour and the degree of utilization of working time, to the analysis of wages and the wage fund, etc. A separate field for the application of correlation methods is that of the technology of production. Correlation analysis and particularly regression analysis can be of great service in studying the influence of technological processes on the quality and cost of the product and on the length of the production period. In macro-economic studies correlation is used primarily for determining Engel curves and supply and demand curves. There are many works in this field. We shall mention here only some of the more important: [3], [22], [36], [39], [45], [46], [50], [51], [60].
1
Since the author has been engaged so far in studying the applications of linear regression primarily to the analysis of the effectiveness of the industrial enterprise, most of the examples quoted in the book are from this field of research. The book also contains examples of other applications, since the author
is
interested
in
demonstrating by
diversified examples the usefulness of the
two-point method
by him for the determination of regression parameters.
many and proposed
Linear regression
12
An
important and now widely studied statistical problem the application of correlation methods to the analysis of
is
time series (the determination of the trend, the analysis of seasonal factors, the auto-correlation of time series, correlograms). Among the more important works should be mentioned: [9], [33], [62].
1.2. Two-dimensional random variables¹

1.2.1. Definitions and symbols

Let D be a given set of events forming a complete group (see [25], p. 22). If a pair of numbers has been assigned to each of the events, these numbers may be treated as the values of two functions determined on the set D.

Definition 1. A pair of functions of real variables determined on the set D is called a two-dimensional random variable.

Two-dimensional variables are usually denoted by the symbol

\xi = (X_1, X_2).   (1)

Definition 1 can easily be generalized to include multi-dimensional variables. In addition to notation (1), a two-dimensional variable may also be denoted as follows:

\xi = (X, Y).   (2)

In this work we shall use only notation (2). Multi-dimensional variables are generally denoted by

\xi = (X_1, X_2, \ldots, X_n).   (3)

Random variables are sometimes interpreted geometrically. To each event from set D there corresponds a certain point on the plane, so that to the set of events D there corresponds a set of points D' on the plane. The location of points on

¹ Items 1.2.1, 1.2.2 and 1.2.3 have been published in paper [29].
the plane R_2 is determined by two coordinates. These coordinates of the points of set D' are the components of the two-dimensional random variable (X, Y).

A random variable in statistics is the equivalent of a statistical characteristic. A population with two characteristics is called a two-dimensional population. Two characteristics of a population are equivalent to a two-dimensional random variable, and particular statistical observations expressing the values of each of these characteristics for particular statistical units belonging to the population analysed are equivalents of the realizations of the two-dimensional random variable. An example of a two-dimensional population is the labour force of a factory studied from the point of view of seniority in employment and earnings.

Similarly, as in the case of one-dimensional random variables, two-dimensional random variables may be treated as discrete random variables and continuous random variables.

1.2.2. Two-dimensional discrete random variables

Definition 1. The two-dimensional random variable (X, Y) is a discrete variable if the sets of values of variable X and variable Y are finite or denumerable.

Definition 2. The distribution function of the two-dimensional discrete random variable is a function which assigns appropriate probabilities to the values of this variable. The distribution of the two-dimensional discrete random variable is expressed in the following way:

P(X = x_i and Y = y_j) = p_{ij}.   (1)

If the set of values of the variable is finite, then these values of the variable and the probabilities corresponding to particular values can be set out in the following contingency table:
TABLE 1
CONTINGENCY TABLE

          y_1     y_2     ...    y_n   |  p_i.
  x_1     p_11    p_12    ...    p_1n  |  p_1.
  x_2     p_21    p_22    ...    p_2n  |  p_2.
  ...     ...     ...     ...    ...   |  ...
  x_m     p_m1    p_m2    ...    p_mn  |  p_m.
  --------------------------------------------
  p_.j    p_.1    p_.2    ...    p_.n  |  1
Since the set of events which determines the two-dimensional random variable forms a complete group of events, we have

\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = 1.   (2)

Sum (2) is obtained by adding together all the probabilities contained in Table 1. This can be done in two ways: by summing up the rows and then the sums of the rows in the last column, or the other way around, by summing up the columns and then the sums of the columns in the last row. It follows that the sum of the last row and the sum of the last column each equals one, i.e.

\sum_{i=1}^{m} p_{i.} = 1,   (3)

\sum_{j=1}^{n} p_{.j} = 1,   (4)

where

p_{i.} = \sum_{j=1}^{n} p_{ij},   (5)

p_{.j} = \sum_{i=1}^{m} p_{ij}.   (6)

It follows from equations (3) and (4) that the probabilities shown in the last column and in the last row of Table 1 form probability distributions. They are called the marginal distributions of the discrete random variable (X, Y).

Let us write the sum on the right side of formula (5) in a developed form:

p_{i.} = p_{i1} + p_{i2} + \ldots + p_{in}.   (7)
After dividing both sides of equation (7) by p_{i.} we get

1 = \frac{p_{i1}}{p_{i.}} + \frac{p_{i2}}{p_{i.}} + \ldots + \frac{p_{in}}{p_{i.}}.   (8)

Since sum (8) equals unity, we have a probability distribution. It is the conditional distribution of Y on X. Let us denote

p(y_j | x_i) = \frac{p_{ij}}{p_{i.}},   (9)

where p(y_j | x_i) is the conditional probability that Y = y_j, based on the assumption that X = x_i. Therefore, formula (8) may be written in the following short form:

\sum_{j=1}^{n} p(y_j | x_i) = 1.   (10)

Similarly, the conditional distribution of X on Y may be presented in the following form:

\sum_{i=1}^{m} p(x_i | y_j) = 1, \qquad p(x_i | y_j) = \frac{p_{ij}}{p_{.j}}.   (11)

On the basis of (9) it follows that

p_{i.} \, p(y_j | x_i) = p_{.j} \, p(x_i | y_j) = p_{ij}.   (12)

It follows from formula (12) that the two-dimensional joint probability equals the product of the marginal probability of one variable and the conditional probability of the other. The joint probability is denoted by p_{ij}. The term "joint probability" emphasizes the fact that p_{ij} is the probability of a two-dimensional variable, whereas p_{i.}, p_{.j}, p(x_i | y_j), p(y_j | x_i) are the probabilities of one-dimensional variables.
Definition 3. Two discrete random variables X and Y are independent if for all i, j

p(y_j | x_i) = p_{.j},   (13)

or, what amounts to the same thing:

p(x_i | y_j) = p_{i.}.   (14)

In this case formula (12) assumes a simpler form:

p_{ij} = p_{i.} \, p_{.j}.   (15)

Hence for r \le m, s \le n it follows on the basis of (12), (13) and (14) that the following equality holds:

\sum_{i=1}^{r} \sum_{j=1}^{s} p_{ij} = \sum_{i=1}^{r} p_{i.} \sum_{j=1}^{s} p_{.j},   (16)

and finally

P(X < x, Y < y) = P(X < x) \, P(Y < y).   (17)

When the two-dimensional random variable is discrete, it follows from the definition of the distribution function¹ that

F(x, y) = \sum_{x_i < x} \sum_{y_j < y} p_{ij}   (18)

and

F(+\infty, +\infty) = \sum_{i} \sum_{j} p_{ij} = 1.   (19)

The marginal distribution of variable X is expressed by the formula

F(x, +\infty) = \sum_{x_i < x} p_{i.}.   (20)

The formula for the marginal distribution of variable Y is analogous:

F(+\infty, y) = \sum_{y_j < y} p_{.j}.   (21)

It follows from formulae (17), (20) and (21) that if two discrete random variables X and Y are independent, the two-dimensional distribution function of (X, Y) is equal to the product of the distribution functions of the one-dimensional variables X and Y. The reverse statement is also true.

¹ The definition of the one-dimensional distribution function is analogous. The distribution function of X is the function F(x) = P(X < x).
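The relationships above for a discrete (X, Y) are easy to verify numerically. The following sketch (the joint probability table is invented for illustration, not taken from the book) checks formulae (2), (5), (6), (9) and (12); this particular table also happens to satisfy the independence condition (15):

```python
# Joint probability table p[i][j] for a discrete two-dimensional
# variable (X, Y); values are illustrative only.
p = [[0.10, 0.20, 0.10],
     [0.15, 0.30, 0.15]]

m = len(p)      # number of values of X
n = len(p[0])   # number of values of Y

# Formula (2): all probabilities sum to one (complete group of events).
total = sum(p[i][j] for i in range(m) for j in range(n))

# Formulae (5) and (6): marginal distributions p_i. and p_.j.
p_row = [sum(p[i][j] for j in range(n)) for i in range(m)]   # p_i.
p_col = [sum(p[i][j] for i in range(m)) for j in range(n)]   # p_.j

# Formula (9): conditional distribution p(y_j | x_i) = p_ij / p_i.
p_cond = [[p[i][j] / p_row[i] for j in range(n)] for i in range(m)]

# Formula (12): joint probability = marginal * conditional.
check_12 = all(abs(p[i][j] - p_row[i] * p_cond[i][j]) < 1e-12
               for i in range(m) for j in range(n))

# Formula (15): for this table p_ij = p_i. * p_.j, i.e. X and Y
# are independent (rows are proportional to one another).
check_15 = all(abs(p[i][j] - p_row[i] * p_col[j]) < 1e-12
               for i in range(m) for j in range(n))
```

Each row of `p_cond` is itself a probability distribution, as formula (10) requires.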
1.2.3. Two-dimensional continuous random variables

Definition 1. The density function f(x, y) of the two-dimensional random variable (X, Y) is the mixed derivative of the second order of the distribution function F(x, y) of this variable with respect to x and y at point (x, y), i.e.

f(x, y) = \frac{\partial^2 F(x, y)}{\partial x \, \partial y}.   (1)

Definition 2. The two-dimensional random variable (X, Y) is continuous if its distribution function F(x, y) is continuous and if the density function f(x, y) is also a continuous function, with the possible exception of a collection of points belonging to a finite number of curves.

On the basis of Definition 1 we have:

F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v) \, dv \, du   (2)

and

F(+\infty, +\infty) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(u, v) \, du \, dv = 1,   (3)

F(-\infty, -\infty) = F(-\infty, y) = F(x, -\infty) = 0.   (4)

The marginal distribution of variable X is expressed by the formula

F(x, +\infty) = \int_{-\infty}^{x} \int_{-\infty}^{+\infty} f(u, v) \, dv \, du = \int_{-\infty}^{x} f_1(u) \, du.   (5)

In formula (5)

f_1(x) = \int_{-\infty}^{+\infty} f(x, v) \, dv   (6)

denotes the marginal density of variable X. The formulae for the distribution function and marginal density of variable Y are analogous owing to the symmetry of the formulae. We shall denote the marginal distribution functions of variables X and Y by F_1(x) and F_2(y).
In discussing the two-dimensional discrete random variable we gave the definition of conditional probability (p. 15, formula (9)). When the random variable is continuous, we shall understand by the conditional probability the expression

P(y \le Y < y + \Delta y \,|\, x \le X < x + \Delta x),   (7)

assuming at the same time that P(x \le X < x + \Delta x) > 0.

Comparing formula (7) with formula (9) from the preceding section we can easily notice a formal similarity between them. Indeed, in formula (7), instead of the quantities x_i and y_j we have put the expressions x \le X < x + \Delta x and y \le Y < y + \Delta y respectively. For continuous random variables, probabilities corresponding to particular values of the variables always equal zero, since we have

P(X = x) = P(Y = y) = P(X = x, Y = y) = 0.

Conditional probability (7) is the probability that the point (X, Y), chosen at random, will be located in the rectangle x \le X < x + \Delta x, y \le Y < y + \Delta y, when it is known that this point lies within the strip x \le X < x + \Delta x (Graph 1).

GRAPH 1.

It follows from the definition of the distribution function of the continuous variable that

P(x \le X < x + \Delta x) = \int_{x}^{x+\Delta x} \int_{-\infty}^{+\infty} f(u, v) \, dv \, du   (8)

and of course

\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(u, v) \, du \, dv = 1.   (9)

The conditional distribution function is expressed by the formula

F(y \,|\, x \le X < x + \Delta x) = \frac{\int_{x}^{x+\Delta x} \int_{-\infty}^{y} f(u, v) \, dv \, du}{\int_{x}^{x+\Delta x} \int_{-\infty}^{+\infty} f(u, v) \, dv \, du},   (10)

and the density f(y|x) of the conditional distribution, obtained in the limit as \Delta x \to 0, by the formula

f(y \,|\, x) = \frac{f(x, y)}{f_1(x)}.   (11)

In formulating the definition of conditional probability (see formula (7)) we considered variable Y on the assumption that variable X satisfies the inequality x \le X < x + \Delta x. It is easy to find formulae for variable X, symmetrical to formulae (7)-(11), on the assumption that variable Y satisfies the inequality y \le Y < y + \Delta y.
On page 16 we gave the definition of independent discrete random variables. We shall now give the definition of independent continuous random variables.

Definition 3. Two continuous random variables X and Y are independent if

P(x_1 \le X < x_2, \; y_1 \le Y < y_2) = P(x_1 \le X < x_2) \, P(y_1 \le Y < y_2),   (12)

where x_1, x_2 and y_1, y_2 are any real numbers.

It is easy to prove that for two random variables to be independent it is necessary and sufficient that their joint two-dimensional distribution function F(x, y) equal the product of the marginal distribution functions of variable X and variable Y:

F(x, y) = F_1(x) \, F_2(y).   (13)

The same theorem was given above for discrete variables.

1.2.4. Moments of a two-dimensional random variable

Definition 1. The moment of the two-dimensional random variable (X, Y) relative to point (C, D) is the expected value of the product (X - C)^l (Y - D)^k, where l + k is the order of the moment, l and k are non-negative integers, and C and D, which can be considered as the coordinates of an arbitrary point, are any real numbers.

Definition 2. If C = D = 0, the expected value of the product X^l Y^k is called the ordinary moment (or simply the moment) of the two-dimensional random variable (X, Y). These moments are usually denoted by the symbol m_{lk}.

In accordance with this definition

m_{lk} = E(X^l Y^k) = \sum_{i} \sum_{j} x_i^l y_j^k p_{ij},   (1)

where variable (X, Y) is discrete, and

m_{lk} = E(X^l Y^k) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x^l y^k f(x, y) \, dx \, dy,   (2)

where variable (X, Y) is continuous.

The most frequent use is made of moments of the first and second order. Moments of the first order are the expected values of the variables X and Y:

m_{10} = E(X^1 Y^0) = E(X)   (3)

and

m_{01} = E(X^0 Y^1) = E(Y).   (4)

The moment of the second order defined by the formula

m_{11} = E(XY)   (5)

is called a product moment. The remaining two moments of the second order are expressed by the formulae

m_{20} = E(X^2 Y^0) = E(X^2)   (6)

and

m_{02} = E(X^0 Y^2) = E(Y^2).   (7)

Definition 3. Moments with reference points C = E(X) and D = E(Y) are called central moments. They are usually denoted by \mu_{lk}. We then have

\mu_{lk} = E[(X - m_{10})^l (Y - m_{01})^k].   (8)

Of course

\mu_{11} = E[(X - m_{10})^1 (Y - m_{01})^1]   (9)

and

\mu_{10} = \mu_{01} = 0.   (10)

In our further considerations three central moments of the second order will be of great importance:

\mu_{20} = E[(X - m_{10})^2] = V(X)   (11)

and

\mu_{02} = E[(Y - m_{01})^2] = V(Y).   (12)

These are the variances of random variable X and random variable Y. The mixed central moment of the second order,

\mu_{11} = E[(X - m_{10})(Y - m_{01})],   (13)

is known as the covariance. It is often denoted by C(X, Y).

Central moments of a two-dimensional random variable can be expressed by ordinary moments, and vice versa. It is easy to show, for instance, that

\mu_{11} = m_{11} - m_{10} m_{01}.   (14)

Indeed,

\mu_{11} = E[(X - m_{10})(Y - m_{01})] = E(XY) - m_{10} E(Y) - m_{01} E(X) + m_{10} m_{01} = m_{11} - m_{10} m_{01}.

Similarly, it can be proved that

\mu_{20} = m_{20} - m_{10}^2   (15)

and

\mu_{02} = m_{02} - m_{01}^2.   (16)

The following important theorem can be proved for covariance.

THEOREM 1. If random variables X and Y are independent, then the covariance C(X, Y) of these variables equals zero.

Proof.¹ When variables X and Y are independent, then

p_{ij} = p_{i.} \, p_{.j}

(see 1.2.2, formula (15)). Therefore

C(X, Y) = \sum_{i} \sum_{j} (x_i - m_{10})(y_j - m_{01}) \, p_{i.} \, p_{.j} = \sum_{i} (x_i - m_{10}) p_{i.} \cdot \sum_{j} (y_j - m_{01}) p_{.j}.

On the basis of (9) and (10) we obtain

C(X, Y) = \mu_{10} \cdot \mu_{01} = 0.

The converse theorem is not true.

In addition to the moments thus far defined there is another group of moments. They are called conditional moments. This term is used for moments of one of the variables X, Y, assuming that the remaining variable has a certain definite value. In our further considerations we shall use two conditional moments: the expected conditional value and the conditional variance. If variable (X, Y) is continuous, these parameters are expressed by the respective formulae

E(Y | X = x) = m_{01}(x) = \frac{\int_{-\infty}^{+\infty} y f(x, y) \, dy}{\int_{-\infty}^{+\infty} f(x, y) \, dy} = \int_{-\infty}^{+\infty} y f(y|x) \, dy   (17)

and

V(Y | X = x) = \frac{\int_{-\infty}^{+\infty} [y - m_{01}(x)]^2 f(x, y) \, dy}{\int_{-\infty}^{+\infty} f(x, y) \, dy} = \int_{-\infty}^{+\infty} [y - m_{01}(x)]^2 f(y|x) \, dy.   (18)

These are the conditional moments of variable Y. An analogous pair of formulae can be given for variable X. When variables X and Y are independent, the conditional densities reduce to the marginal ones:

f(y|x) = f_2(y), \qquad f(x|y) = f_1(x).

¹ The proof is for discrete variables. The situation is similar when the variables are continuous.
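As a numerical illustration of Definition 2, formula (14) and the variance formulae (15)-(16), the sketch below computes ordinary and central moments of a small discrete distribution (the points and probabilities are invented for the example, not taken from the book):

```python
# Discrete joint distribution given as (x, y, probability) triples;
# values are illustrative only.
points = [(1.0, 2.0, 0.2), (2.0, 3.0, 0.3), (3.0, 5.0, 0.4), (4.0, 4.0, 0.1)]

def moment(l, k):
    """Ordinary moment m_lk = sum of x^l y^k p_ij (formula (1))."""
    return sum((x ** l) * (y ** k) * p for x, y, p in points)

m10, m01 = moment(1, 0), moment(0, 1)          # expected values, (3) and (4)
m11, m20, m02 = moment(1, 1), moment(2, 0), moment(0, 2)

# Central moments via formulae (14), (15), (16).
mu11 = m11 - m10 * m01          # covariance C(X, Y)
mu20 = m20 - m10 ** 2           # variance V(X)
mu02 = m02 - m01 ** 2           # variance V(Y)

# Direct computation of mu11 from Definition 3 must agree with (14).
mu11_direct = sum((x - m10) * (y - m01) * p for x, y, p in points)
```

The agreement of `mu11` and `mu11_direct` is exactly the derivation shown under formula (14).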
1.2.5. Regression I

In a two-dimensional distribution of variable (X, Y) the expected conditional value E(Y | X = x) is a function of the variable x. We may thus write:

E(Y | X = x) = g(x).   (1)

Substituting the simple symbol \bar{y} for the expression E(Y | X = x) in the above formula, we get

\bar{y} = g(x).   (2)

Equation (2) is known in mathematical statistics as the equation of regression I of Y on X. By interchanging letters x and y in formulae (1) and (2) we get the regression equation of variable X on Y:

\bar{x} = h(y).   (3)

If variable (X, Y) is continuous, then the geometrical representations of functions (2) and (3) will be lines. These lines are called regression I lines.

GRAPH 1.

In Graph 1 the regression line of Y on X is shown. The ordinates of this curve represent the expected values of variable Y when variable X = x. If the equation of this line is known, then to each value of variable X we can assign an expected value of variable Y.

GRAPH 2.

We shall prove the following

THEOREM 1. The expected value of the sum of the squared deviations of Y from the regression I line is a minimum.

Proof. We are to prove that

E[Y - g(X)]^2 = minimum.

As we know, E(Y - u)^2 = minimum when u = E(Y), because then E(Y - u)^2 = V(Y). Hence, for E[Y - g(X)]^2 to have a minimum it is necessary that g(x) = E(Y | X = x), because then

\int_{-\infty}^{+\infty} [y - E(Y | X = x)]^2 f(y|x) \, dy

equals V(Y | X = x), i.e. is a minimum. Therefore

E[Y - g(X)]^2 = \int_{-\infty}^{+\infty} f_1(x) \, dx \int_{-\infty}^{+\infty} [y - m_{01}(x)]^2 f(y|x) \, dy = minimum.
1.2.6. Regression II

In practice the shape of function g(x) is rarely known. The usual procedure is to take a sample from a two-dimensional population and to draw a scatter diagram. The points on the graph follow a more or less distinct trend. This trend provides certain information; on its basis the hypothesis may be formulated that function g(x) belongs to a certain class of functions (e.g. to the class of linear, exponential or power functions, or to the class of polynomials).

Graph 3 presents a scatter diagram drawn on the basis of statistical material collected in connection with studies on the relationship between hop consumption and the production of wort. The statistics were obtained from the Piast Brewery in Wroclaw. The trend of the points on the graph is so distinct that we can safely postulate the hypothesis that g(x) belongs to the class of linear functions.

GRAPH 3. Hop consumption plotted against the production of wort (thousand hl/month).

In order to determine the parameters of function g(x), the expression E[Y - g(X)]^2 has to be made a minimum. If g(x) = ax, then to find the value of parameter a we have to minimize the value of the expression

S = E[Y - aX]^2.

We must calculate the derivative dS/da and equate it to zero. From the equation thus obtained a can be determined. The line determined in this way is called a regression II line.

If the hypothesis concerning the class of functions to which g(x) belongs is true, then the regression II line coincides with the regression I line. In applications we are always interested in regression I lines. Since the equations of these lines are usually not known, we substitute regression II lines for regression I lines because the former are easier to determine. While doing this we are seldom free from worry as to whether the regression line has been properly determined, because the information provided by a scatter diagram is scanty, and so the hypothesis regarding the class of functions to which g(x) belongs may easily turn out to be wrong.
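For the special case g(x) = ax discussed above, the condition dS/da = -2E[(Y - aX)X] = 0 gives a = E(XY)/E(X^2). A minimal numerical sketch (the sample values are invented, not the brewery figures):

```python
# Least-squares fit of the line y = a*x (a regression II line through
# the origin). Minimizing S = E[Y - aX]^2 gives dS/da = -2 E[(Y - aX) X],
# hence a = E(XY) / E(X^2). Sample values are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
e_xy = sum(x * y for x, y in zip(xs, ys)) / n   # sample E(XY)
e_x2 = sum(x * x for x in xs) / n               # sample E(X^2)
a = e_xy / e_x2

# At the minimum the normal-equation condition E[(Y - aX) X] = 0 holds.
residual_moment = sum((y - a * x) * x for x, y in zip(xs, ys)) / n
```

The vanishing of `residual_moment` is precisely the equation obtained by setting the derivative to zero.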
It happens sometimes that apart from a scatter diagram we may have at our disposal some additional information providing a basis for the hypothesis concerning the class of functions to which g(x) belongs. For instance, sometimes we know the equation of the asymptotes of the regression line, or we know that this line passes through the origin, or that it does not intersect the positive part of the x-axis and the negative part of the y-axis. Such information is very valuable. It always comes from sources outside statistics. One of the conditions of the effectiveness of statistical analysis is a thorough knowledge of the subject being studied and of the division of science which is concerned with it. This means, for instance, that to analyse by statistical methods the effectiveness of penicillin in fighting tuberculosis we need phthisiologists, and to study the effect of the price of butter on the consumption of edible oils we need economists. The knowledge of statistics alone is not sufficient. Only the combination of statistical and non-statistical information can make our analysis fruitful. This principle is fully applicable to the determination of regression lines.
1.2.7. Linear regression

Definition 1. If an equation of regression I is expressed by the formula

\bar{y} = \alpha_{21} x + \beta_{20},   (1)

we say that the regression of Y on X is linear.

Formula (1) is a regression equation of Y on X. Quantities \alpha_{21} and \beta_{20} are certain constants called regression parameters. The indices next to the parameters serve to distinguish the regression parameters of Y on X from those of X on Y. The first index is for the dependent variable in the regression equation and the second for the independent.

The linear regression equation of X on Y is expressed by the formula

\bar{x} = \alpha_{12} y + \beta_{10}.   (2)

Sometimes we shall write equation (1) omitting indices, i.e.

\bar{y} = \alpha x + \beta.   (3)

In such cases it should be understood that our considerations refer to both types of regression lines, i.e. the regression of Y on X and the regression of X on Y.

If we know that in the distribution of the two-dimensional random variable (X, Y) the regression lines are straight, then in order to determine the values of parameters \alpha and \beta we have to find the minimum of the expression

E[Y - \alpha X - \beta]^2 = \int_{R_2} (y - \alpha x - \beta)^2 \, dP,

where R_2 denotes the two-dimensional integration space and dP the differential of the two-dimensional distribution.
We calculate the partial derivatives of the expression on the left side with respect to \alpha and \beta. We have

\frac{\partial}{\partial \alpha} E(Y - \alpha X - \beta)^2 = -2E[(Y - \alpha X - \beta) X]

and

\frac{\partial}{\partial \beta} E(Y - \alpha X - \beta)^2 = -2E[Y - \alpha X - \beta].

By equating both these derivatives to zero we obtain a set of normal equations. After replacing the expected values by appropriate moments this set of equations can be written in the following form:

m_{11} - \alpha m_{20} - \beta m_{10} = 0,
m_{01} - \alpha m_{10} - \beta = 0.   (5)

From the solution of the set of equations (5) we get:

\beta = \beta_{20} = m_{01} - \alpha_{21} m_{10}   (6)

and

\alpha = \alpha_{21} = \frac{m_{11} - m_{10} m_{01}}{m_{20} - m_{10}^2}.   (7)

On the basis of 1.2.4, formulae (14), (15) and (16), formula (7) may be written thus:

\alpha = \alpha_{21} = \frac{\mu_{11}}{\mu_{20}}.   (8)

Similarly we get

\beta = \beta_{10} = m_{10} - \alpha_{12} m_{01}   (9)

and

\alpha_{12} = \frac{\mu_{11}}{\mu_{02}}.   (10)

Parameters \alpha_{21} and \alpha_{12} are called regression coefficients. Substituting (6) in (1) and (9) in (2) we obtain the regression equations in the following form:

\bar{y} = \alpha_{21}(x - m_{10}) + m_{01},   (11)

\bar{x} = \alpha_{12}(y - m_{01}) + m_{10}.   (12)

It follows from the above equations that both regression lines pass through the point with coordinates (m_{10}, m_{01}). We shall call this point the population centre of gravity.

The knowledge of regression equations allows us to express the stochastic dependence between random variables by a mathematical function describing numerically the relationship existing between these variables. The derivation of a function formula is very convenient, since it allows us to assign to each value of a random variable, appearing as an argument, an appropriate value of the other variable, appearing as a function.
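Formulae (8), (10), (11) and (12) can be checked on sample data: both fitted lines pass through the centre of gravity (m_{10}, m_{01}). A sketch with invented sample values:

```python
# Regression coefficients alpha_21 = mu11/mu20 and alpha_12 = mu11/mu02
# estimated from sample moments; data are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.8, 5.2, 5.8]
n = len(xs)

m10 = sum(xs) / n                     # centre of gravity, x-coordinate
m01 = sum(ys) / n                     # centre of gravity, y-coordinate
mu20 = sum((x - m10) ** 2 for x in xs) / n
mu02 = sum((y - m01) ** 2 for y in ys) / n
mu11 = sum((x - m10) * (y - m01) for x, y in zip(xs, ys)) / n

alpha21 = mu11 / mu20                 # slope of the regression of Y on X
alpha12 = mu11 / mu02                 # slope of the regression of X on Y

# Regression equations in the form (11) and (12).
def y_on_x(x):
    return alpha21 * (x - m10) + m01

def x_on_y(y):
    return alpha12 * (y - m01) + m10
```

Evaluating either line at the centre of gravity returns the other coordinate of that point, as the text asserts.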
We have stated on page 5 that the significance of regression line equations consists in the fact that they enable us to estimate particular values of one variable on the basis of the values assumed by the other variable. This estimate may be better or worse, more accurate or less accurate. When we use the word "estimate" we must also introduce the notion of the "accuracy of the estimate" and create a measure of this accuracy. Of course, the smaller the sum of errors¹ that we commit by replacing the real values of the random variable by the values obtained from the regression line equation, the better the estimate will be.

This statement lends itself to geometrical interpretation. The more closely the points are grouped around the regression line on the scatter diagram, or, what amounts to the same thing, the smaller the dispersion of the points around the line, the better the estimate will be.

Let us call the quantity defined by the formula

e_y = Y - \bar{Y},

where Y is a realization of the random variable, the residual of the regression of Y on X, or in brief a residual.

¹ We are not interested here in how these errors are measured: by absolute values, squared values, or in some other way.
As a measure of the dispersion of points around the regression line the residual variance V(e_y) is generally used. It is determined by the formula

V(e_y) = E(e_y^2) = E(Y - \bar{Y})^2.   (13)

If the regression line is a straight line, then

V(e_y) = E[Y - (\alpha_{21} X + \beta_{20})]^2 = E[(Y - m_{01}) - \alpha_{21}(X - m_{10})]^2 = E(W - \alpha_{21} U)^2,

where W = Y - m_{01} and U = X - m_{10}. And further

V(e_y) = \mu_{02} - 2\alpha_{21}\mu_{11} + \alpha_{21}^2 \mu_{20}.

On the basis of (8) we have

V(e_y) = \mu_{02} - \frac{\mu_{11}^2}{\mu_{20}}.   (14)

Analogously it can be shown that

V(e_x) = \mu_{20} - \frac{\mu_{11}^2}{\mu_{02}}.   (15)

The square root of the residual variance we shall call the standard error of the estimate, and we shall denote it by the symbols \sigma_{21} and \sigma_{12} respectively for the regression of Y on X and the regression of X on Y. In this case

\sigma_{21} = \sqrt{V(e_y)}   (16)

and

\sigma_{12} = \sqrt{V(e_x)}.   (17)

Using the symbols for variance and covariance we may present formula (14) in the following way:

V(e_y) = V(Y) - \alpha_{21} C(XY) = V(Y) - \alpha_{21}^2 V(X).   (18)

Hence, it follows that

V(e_y) = V(Y)\left[1 - \frac{\mu_{11}^2}{\mu_{20}\mu_{02}}\right],

or, what amounts to the same thing, that

\frac{V(e_y)}{V(Y)} = 1 - \frac{\mu_{11}^2}{\mu_{20}\mu_{02}}.   (19)

In the conclusion of our comments on the two-dimensional regression line let us quote two important theorems and their proofs.
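Formula (14) can be verified directly: the residual variance computed from the residuals themselves agrees with \mu_{02} - \mu_{11}^2/\mu_{20}. A sketch on invented sample values:

```python
# Residual variance V(e_y) for the linear regression of Y on X:
# the direct average of squared residuals is compared with formula (14),
# V(e_y) = mu02 - mu11^2 / mu20. Data are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.4, 2.9, 4.2, 4.9]
n = len(xs)

m10, m01 = sum(xs) / n, sum(ys) / n
mu20 = sum((x - m10) ** 2 for x in xs) / n
mu02 = sum((y - m01) ** 2 for y in ys) / n
mu11 = sum((x - m10) * (y - m01) for x, y in zip(xs, ys)) / n
alpha21 = mu11 / mu20

# Residuals e_y = y - y_bar, with y_bar from regression equation (11).
residuals = [y - (alpha21 * (x - m10) + m01) for x, y in zip(xs, ys)]
v_direct = sum(e ** 2 for e in residuals) / n     # E(e_y^2), formula (13)
v_formula = mu02 - mu11 ** 2 / mu20               # formula (14)
```

The two quantities agree to rounding error, which is the content of the derivation leading to (14).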
THEOREM 1. If a regression I line is a straight line, then the regression II line coincides with the regression I line.

Proof. By assumption we have

E(Y | X = x) = \alpha x + \beta.

Taking the mathematical expectation of both sides of the equation we obtain:

E(Y) = \alpha E(X) + \beta.

But E(Y) = m_{01} and E(X) = m_{10}. Hence

m_{01} = \alpha m_{10} + \beta.

This means that the regression I line passes through the centre of gravity. We have shown on page 31 that the regression II line also passes through this point. This enables us to introduce new variables:

U = X - m_{10}, \qquad W = Y - m_{01}.

Consequently the equation for regression I will assume the following form:

w = \alpha u,

and the equation for regression II will be expressed by the formula

w = \alpha' u,

where \alpha' denotes the regression coefficient in this equation. Therefore

E(W - \alpha' U)^2 = E[(W - \alpha U) + (\alpha U - \alpha' U)]^2
= E(W - \alpha U)^2 + 2E[(W - \alpha U)(\alpha - \alpha')U] + E(\alpha U - \alpha' U)^2.

The expression

E[(W - \alpha U)(\alpha - \alpha')U]

is zero for each determined value of U, since it follows from the assumption of linearity that

E(W - \alpha U \,|\, U = u) = 0.

Thus we have

E(W - \alpha' U)^2 = E(W - \alpha U)^2 + E(\alpha U - \alpha' U)^2.

Since the first term on the right side of the equation does not depend on \alpha', the expression E(W - \alpha' U)^2 has a minimum for the same value of \alpha' as the expression E(\alpha U - \alpha' U)^2, and that expression has a minimum when

E(\alpha U - \alpha' U)^2 = 0, \quad i.e. when \alpha = \alpha'.

But \alpha' is determined by the condition

E(Y - \alpha' X - \beta')^2 = minimum,

which is equivalent to the condition

E(W - \alpha' U)^2 = minimum.

When this condition is fulfilled and the regression I line is straight, the regression II line coincides with the regression I line.

THEOREM 2. When the regression I line is a straight line, then the residual variance is a minimum.

Proof. The parameters of the regression II line are determined by the condition (see p. 30):

V(e_y) = E(Y - \alpha' X - \beta')^2 = minimum.

It follows from Theorem 1 that the regression I and regression II lines coincide in the case of linear regression. This means that for the regression I line this condition is also fulfilled, which proves that the theorem is correct. The theorem proved is a special case of Theorem 1 on page 26.

1.2.8. Correlation. Correlation ratio and the correlation coefficient

On page 32, in formula (13), the definition was given of the residual variance as a measure of the dispersion of points (x, y) around the regression line.
However, the application of the residual variance is not limited to measuring only the dispersion of points around the regression line. Let us note that the smaller the dispersion, the closer is the bond between random variables X and Y. When all the points lie on the regression line there is no dispersion at all and V(e) = 0. This is a case of a functional relationship between variables X and Y, and not of a stochastic relationship. It follows that quantity V(e) may be used for measuring the dependence between two random variables. Indeed the residual variance is used for this purpose, although not in the form described in the definition.

A measure of dependence between two random variables should meet the following requirements:
1) it should have no dimension;
2) it should be normalized and assume values belonging to a certain finite numerical interval;
3) it should assume increasing values when the dependence becomes stronger, and decreasing values when it becomes weaker;
4) it should not depend on whether the dependence of Y on X is measured, or vice versa.

None of these requirements is satisfied by V(e). With the help of several simple mathematical operations, however, we can construct a quantity which will fully satisfy these requirements. In order to satisfy requirements 1) and 2) it is sufficient to divide V(e_y) by V(Y), or V(e_x) by V(X). Indeed, since V(e_y) and V(Y) are of the same dimension, the ratio V(e_y)/V(Y) has no dimension. It follows from formula (19) on p. 33 that

0 \le \frac{V(e_y)}{V(Y)} \le 1,

which means that this quantity is normalized in the interval [0, 1].

Requirement 3) will be met if instead of V(e_y)/V(Y) and V(e_x)/V(X) we introduce the quantities

\eta_y^2 = 1 - \frac{V(e_y)}{V(Y)}   (1)

and

\eta_x^2 = 1 - \frac{V(e_x)}{V(X)}.   (2)

Quantities \eta_y and \eta_x defined by formulae (1) and (2) are known as correlation ratios.¹ Of course

0 \le \eta_y \le 1.   (3)

The correlation ratio \eta_y equals one if and only if V(e_y) = 0, i.e. when the dependence between random variables X and Y is of a functional type. When \eta_y > 0 it is said that the random variables are correlated with one another. When \eta_y = 0 we say that they are uncorrelated. All that has been said about correlation ratio \eta_y also applies to \eta_x.

The lack of a correlation between random variables X and Y does not mean, by any means, that these variables are stochastically independent. As we know (see p. 21, formula (13)), these variables are independent if

F(x, y) = F_1(x) \, F_2(y),   (4)

and they are uncorrelated if

\eta_y = \eta_x = 0.   (5)

Relations (4) and (5) are not equivalent.

¹ This measure was introduced by K. Pearson.
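The correlation ratio of formula (1) can be estimated from grouped sample data by taking V(e_y) as the dispersion of y around the conditional means of its x-group. A sketch (grouping and values invented for illustration):

```python
# Sample correlation ratio eta_y^2 = 1 - V(e_y)/V(Y), where V(e_y) is
# the within-group dispersion of y around the conditional means m01(x).
# Groups and values are illustrative only.
groups = {                      # x-value -> observed y-values
    1.0: [2.0, 2.4, 2.2],
    2.0: [3.9, 4.1],
    3.0: [6.1, 5.8, 6.0],
}

all_y = [y for ys in groups.values() for y in ys]
n = len(all_y)
mean_y = sum(all_y) / n
v_y = sum((y - mean_y) ** 2 for y in all_y) / n        # V(Y)

# Residual variance: squared deviations from each group's own mean.
v_e = sum(sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
          for ys in groups.values()) / n

eta_y_sq = 1.0 - v_e / v_y      # formula (1); lies in [0, 1]
```

Here the y-values cluster tightly around their conditional means, so the ratio comes out close to one, indicating a near-functional dependence.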
In order to satisfy requirement 4) we assume that both regression I lines are straight lines. Therefore (see p. 33, formula (18))

V(e_y) = V(Y) - \alpha_{21} C(XY),

and analogously

V(e_x) = V(X) - \alpha_{12} C(XY).

Hence

\eta_y^2 = 1 - \frac{V(Y) - \alpha_{21} C(XY)}{V(Y)} = \alpha_{21}\frac{C(XY)}{V(Y)} = \alpha_{21}\alpha_{12}

(see (12), (13) on p. 23). Similarly

\eta_x^2 = \alpha_{12}\alpha_{21}.

Therefore, when the regression lines are straight lines,

\eta_y = \eta_x = \varrho.

The quantity

\varrho = \pm\sqrt{\alpha_{12}\,\alpha_{21}}   (6)

is called the correlation coefficient. Since the correlation coefficient is a special case of the correlation ratio, then on the basis of (1) we also have

\varrho^2 \le 1, \quad or \quad -1 \le \varrho \le 1.

When \varrho > 0 we say that between random variables X and Y there is positive correlation. In cases of positive correlation an increase in the value of one variable is accompanied by an increase in the conditional expected value of the other. Let us note that

\varrho = \frac{\mu_{11}}{\sqrt{\mu_{20}\,\mu_{02}}}.

By this convention the sign of the correlation coefficient depends on the sign of \mu_{11}. On the other hand, since \mu_{20} > 0 and \mu_{02} > 0, the signs of \alpha_{12} and \alpha_{21} also depend on the sign of \mu_{11}. Hence, when \varrho > 0, \alpha_{12} > 0 and \alpha_{21} > 0. But \alpha_{12} and \alpha_{21} are slopes of the regression lines. Therefore, when \alpha_{21} > 0 the regression line \bar{y} = \alpha_{21} x + \beta_{20} forms an acute angle with the x-axis, which means that \bar{y} is an increasing function of x. When \varrho < 0 we deal with negative correlation, and then an increase in the value of one variable is accompanied by a decrease in the conditional expected value of the other.

The position of the regression lines for a positive correlation is shown on Graph 1 and the position of these lines for a negative correlation is shown on Graph 2.

GRAPH 1.

GRAPH 2.

The correlation coefficient equals +1 or -1 only when all points (x, y) lie on the straight line. Let us now prove

THEOREM 1. When \varrho^2 = 1 the two regression lines coincide.
Proof. Let us assume that \varrho = +1 (the proof is analogous when we assume that \varrho = -1). In this case it follows from the definition of the correlation coefficient that

\varrho^2 = \alpha_{12}\,\alpha_{21} = 1.

The parameter \alpha_{21} of the regression line \bar{y} = \alpha_{21} x + \beta_{20} equals the tangent of the angle that this line forms with the horizontal x-axis. The tangent of the inclination of the regression line \bar{x} = \alpha_{12} y + \beta_{10} with reference to the horizontal axis equals 1/\alpha_{12}. The tangent of the angle \varphi contained between the two regression lines is then

\tan\varphi = \frac{\dfrac{1}{\alpha_{12}} - \alpha_{21}}{1 + \dfrac{\alpha_{21}}{\alpha_{12}}} = \frac{1 - \alpha_{12}\,\alpha_{21}}{\alpha_{12} + \alpha_{21}} = 0.

Since the tangent of an angle equals zero when the angle equals zero, the result obtained indicates that the lines coincide.
When the correlation coefficient equals zero, the random variables X and Y are uncorrelated. It can easily be shown (the proof is the same as for Theorem 1) that when \varrho = 0 the angle between the regression lines equals \pi/2. Let us note further that the correlation coefficient equals zero only when C(XY) = 0. However, since

\alpha_{21} = \frac{C(XY)}{V(X)}, \qquad \alpha_{12} = \frac{C(XY)}{V(Y)},

then \varrho = 0 implies \alpha_{21} = 0 and \alpha_{12} = 0. Thus, when variables X and Y are not correlated, the regression lines intersect at the angle \pi/2, and the regression line of Y on X is parallel to the horizontal axis while the regression line of X on Y is parallel to the vertical axis (Graph 3).
On page 23 we proved the theorem that if random variables X and Y are independent, the covariance C(XY) equals zero. We shall now express this theorem in a slightly different form.

THEOREM 2. If variables X and Y are independent, then they are also uncorrelated.

The converse theorem is not true. The theorem contrapositive to Theorem 2 is of great practical and theoretical importance. It is:

THEOREM 3. If random variables X and Y are correlated, they are also dependent.

GRAPH 3.

Theorem 3 does not have to be proved since it is a theorem contrapositive to Theorem 2, which is true. As we know, two contrapositive theorems must be either both false, or both true.
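A classic numerical illustration that the converse of Theorem 2 fails, i.e. that variables can be uncorrelated yet dependent (the example is the editor's, not the book's): take X symmetric about zero and Y = X^2. The covariance vanishes although Y is a function of X.

```python
# X takes the values -1, 0, 1 with equal probabilities; Y = X^2.
# Y is completely determined by X (dependent), yet C(X, Y) = 0.
xs = [-1.0, 0.0, 1.0]
p = 1.0 / 3.0

e_x = sum(x * p for x in xs)                   # E(X)  = 0 by symmetry
e_y = sum((x ** 2) * p for x in xs)            # E(Y)  = E(X^2)
e_xy = sum(x * (x ** 2) * p for x in xs)       # E(XY) = E(X^3) = 0

cov = e_xy - e_x * e_y                         # mu11 = C(X, Y)

# Dependence: P(Y = 1 | X = 1) = 1, while unconditionally P(Y = 1) = 2/3,
# so the conditional and marginal distributions of Y differ.
p_y1 = sum(p for x in xs if x ** 2 == 1.0)     # P(Y = 1)
```

Zero covariance here rules out linear correlation only; the functional bond between X and Y remains.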
To conclude our discussion of the correlation coefficient we shall prove the important

THEOREM 4. For any random variables X and Y there is always a linear transformation which brings these variables to a form in which the correlation coefficient ρ between the transformed variables equals zero.

Proof. Using the known formulae for translating and rotating the coordinate system we have

    X′ = (X − m₁₀) cos θ + (Y − m₀₁) sin θ,
    Y′ = −(X − m₁₀) sin θ + (Y − m₀₁) cos θ.

For the variables X′ and Y′ to be uncorrelated it is necessary and sufficient that C(X′Y′) = 0. In accordance with formula (13), item 1.2.4., C(X′Y′) = E(X′Y′), since E(X′) = E(Y′) = 0. Hence

    E(X′Y′) = E{[(X − m₁₀) cos θ + (Y − m₀₁) sin θ][−(X − m₁₀) sin θ + (Y − m₀₁) cos θ]}.

After multiplying and bringing the sign of the expected value within the brackets we get

    E(X′Y′) = E[(X − m₁₀)(Y − m₀₁)] cos 2θ − ½{E[(X − m₁₀)²] − E[(Y − m₀₁)²]} sin 2θ
            = μ₁₁ cos 2θ − ½(μ₂₀ − μ₀₂) sin 2θ.

Divide both sides of the above equation by cos 2θ (cos 2θ ≠ 0). If we choose the angle θ so that

    tan 2θ = 2μ₁₁ / (μ₂₀ − μ₀₂),    (7)

we get E(X′Y′) = 0 and hence ρ(X′Y′) = 0.

Let us note in passing that if we determine the rotation angle θ from formula (7), then the line having the equation

    y − m₀₁ = (x − m₁₀) tan θ    (8)

has the property that the sum of the squares of the distances of the points (x, y) from this line is a minimum. The straight line (8) is known as the orthogonal regression line. It follows from formula (8) that this line passes through the centre of gravity. After some elementary transformations we obtain a formula for the slope of line (8).
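Formula (7) may be checked numerically. The sketch below (with simulated data of our own; the variable names are illustrative) centres the observations, rotates them by the angle θ obtained from (7), and verifies that the covariance of the transformed variables vanishes:

```python
import math
import random

# Theorem 4 in action: with tan(2*theta) = 2*mu11/(mu20 - mu02), the rotated
# variables X', Y' are uncorrelated.
random.seed(2)
pairs = [(u, 0.8 * u + 0.3 * random.gauss(0, 1))
         for u in (random.gauss(0, 1) for _ in range(50_000))]

n = len(pairs)
m10 = sum(x for x, _ in pairs) / n
m01 = sum(y for _, y in pairs) / n
mu11 = sum((x - m10) * (y - m01) for x, y in pairs) / n
mu20 = sum((x - m10) ** 2 for x, _ in pairs) / n
mu02 = sum((y - m01) ** 2 for _, y in pairs) / n

theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)   # formula (7)

rotated = [((x - m10) * math.cos(theta) + (y - m01) * math.sin(theta),
            -(x - m10) * math.sin(theta) + (y - m01) * math.cos(theta))
           for x, y in pairs]
cov_rot = sum(xp * yp for xp, yp in rotated) / n
print(abs(cov_rot) < 1e-9)   # zero up to floating-point rounding
```

Because θ is computed from the same sample moments, the covariance of the rotated pair vanishes exactly, not merely approximately.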
1.2.9. The two-dimensional normal distribution

The distribution of the two-dimensional random variable (X, Y) with a density determined by the formula

    φ(x, y) = 1/(2πσ₁σ₂√(1 − ρ²)) ·
              exp{−1/(2(1 − ρ²)) [(x − m₁)²/σ₁² − 2ρ(x − m₁)(y − m₂)/(σ₁σ₂) + (y − m₂)²/σ₂²]}    (1)

is called a two-dimensional normal distribution. The great practical importance of distribution (1) follows from the generalized Central Limit Theorem on two-dimensional variables. On Graph 1 the density surface for a two-dimensional normal distribution is presented.

THEOREM 1. If the density of the random variable (X, Y) is expressed by formula (1) and if the correlation coefficient ρ between the variables X and Y equals zero, then the variables X and Y are independent.

GRAPH 1.

Proof. If ρ = 0, then

    φ(x, y) = 1/(2πσ₁σ₂) exp{−(x − m₁)²/(2σ₁²) − (y − m₂)²/(2σ₂²)}.    (2)

Denoting

    φ₁(x) = 1/(√(2π)σ₁) exp{−(x − m₁)²/(2σ₁²)},
    φ₂(y) = 1/(√(2π)σ₂) exp{−(y − m₂)²/(2σ₂²)},    (3)

we have

    φ(x, y) = φ₁(x) φ₂(y).

After integrating both sides of the above identity with respect to each variable, x and y, we get

    ∫∫ φ(x, y) dx dy = ∫ φ₁(x) dx · ∫ φ₂(y) dy.

Using the definition of the distribution function of the two-dimensional random variable we may write

    F(x, y) = F₁(x) · F₂(y).

We have obtained the necessary and sufficient condition for the independence of the variables (item 1.2.3., formula (13)). Thus the theorem has been proved.

THEOREM 2. The conditional density φ(y|x) in a two-dimensional normal distribution is the density of a normal distribution with the following parameters:

    E(Y | X = x) = m₂ + ρ(σ₂/σ₁)(x − m₁),    (4)
    V(Y | X = x) = (1 − ρ²) V(Y).    (5)

Proof. On the basis of formula (11), item 1.2.3., φ(y|x) = φ(x, y)/φ₁(x). For convenience let us carry out the division explicitly. Then

    φ(y|x) = 1/(√(2π) σ₂√(1 − ρ²)) exp{−[y − m₂ − ρ(σ₂/σ₁)(x − m₁)]² / (2σ₂²(1 − ρ²))},

which is the density of a normal distribution with the parameters (4) and (5).

Formula (4) represents the regression equation of Y on X. Since a regression line is a locus of conditional expected values, it follows from the theorem proved above that:

Corollary 1. Regression lines in a two-dimensional normal distribution are straight lines.

On the right-hand side of the equation in formula (5) only constant quantities appear. Hence follows

Corollary 2. In a two-dimensional normal distribution the conditional variance V(Y | X = x) is a constant quantity.

Both corollaries, as we shall see, are of great practical importance. The
density surface of a normal distribution shown on Graph 1 is a geometrical interpretation of equation (1). The intersection of this surface with a plane parallel to the plane XOY is called an equiprobability curve. Such curves in a normal distribution are ellipses. The equation of the family of these ellipses is as follows:

    (x − m₁)²/σ₁² − 2ρ(x − m₁)(y − m₂)/(σ₁σ₂) + (y − m₂)²/σ₂² = C,    (6)

where C is a variable parameter dependent upon the parameter of the intersecting plane. The centre of this family of ellipses is the centre of gravity, i.e. the point with the coordinates [m₁, m₂]. The regression lines are the diameters of the ellipses conjugate to the diameters parallel to the coordinate axes.

GRAPH 2. Equiprobability ellipses with the regression line of Y on X, the regression line of X on Y, and the orthogonal regression line.

The major axis of the ellipse coincides with the orthogonal regression line (Graph 2). When ρ = 0, equation (6) assumes the form

    (x − m₁)²/σ₁² + (y − m₂)²/σ₂² = C.    (7)

The major and minor axes of the ellipses are then parallel to the corresponding coordinate axes, X and Y. When ρ = 0 and σ₁ = σ₂, the ellipses become circles. Since the sum of the squared standardized random variables

    ((x − m₁)/σ₁)² + ((y − m₂)/σ₂)²

has the χ² distribution with two degrees of freedom, the probability that a random point (x, y) is located within the area determined by curve (7) can be obtained from the tables of this distribution. The probability that the point (x, y) is located within the area determined by curve (6) is the same.

The area determined by the equiprobability ellipse may be considered as a characteristic of the dispersion in a two-dimensional normal distribution. The measure of this area is, then, a generalized measure of dispersion comprising two-dimensional random variables.

The values of the distribution function and of the density function of a two-dimensional normal distribution are given in tables (see [40] and [44]). These tables, however, are not in general use and therefore are not easily accessible. For this reason the following expansion of the density function of a normal distribution into a series (see [7], [52], [54]) has great practical importance. It can be proved that when m₁ = m₂ = 0 and σ₁ = σ₂ = 1, then

    φ(x, y) = 1/(2π) exp{−(x² + y²)/2} Σ_{ν=0}^{∞} (ρ^ν/ν!) H_ν(x) H_ν(y),    (8)

where H_ν denotes the Hermite polynomial of order ν. Hence, when x = y = 0,

    φ(0, 0) = 1/(2π√(1 − ρ²)).

Integrating both sides with respect to ρ, we get

    ∫₀^ρ φ(0, 0) dρ = (1/2π) arc sin ρ.

On the other hand, after integrating formula (8) with respect to x and y over the region x < 0, y < 0, and using the fact that the density φ(u, v) integrates to 1 over the whole plane (see [7]), we finally have

    P(X < 0, Y < 0) = 1/4 + (1/2π) arc sin ρ.    (9)
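Formula (9) lends itself to a simple simulation check. The sketch below (an illustration of ours) constructs a standardized pair with correlation ρ as X = U, Y = ρU + √(1 − ρ²)V for independent standard normal U, V, and estimates the probability of the negative quadrant:

```python
import math
import random

# Monte Carlo check of P(X < 0, Y < 0) = 1/4 + arcsin(rho)/(2*pi)
# for a standardized two-dimensional normal pair.
random.seed(3)
rho = 0.6
n = 200_000
hits = 0
for _ in range(n):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    x = u
    y = rho * u + math.sqrt(1 - rho * rho) * v   # corr(x, y) = rho
    if x < 0 and y < 0:
        hits += 1

estimate = hits / n
exact = 0.25 + math.asin(rho) / (2 * math.pi)    # formula (9), about 0.352
print(abs(estimate - exact) < 0.01)
```

With ρ = 0.6 the exact probability is about 0.352; the Monte Carlo error at this sample size is of the order of 0.001.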
1.3. Non-linear regression

In 1.2.5. we defined the regression I line (formula (1), p. 25). To determine the regression equation of Y on X we used the notation y = g₁(x). The concrete form of the regression curve equation is usually not known. In order to know the equation of the regression curve it is necessary to know the distribution of the variable (X, Y), and this is seldom possible in practice. If the distribution of the variable (X, Y) is not known, various procedures may be used. We shall discuss the more important of them.
1. The hypothesis concerning the shape of curve g₁(x) based on the hypothesis concerning the distribution of variable (X, Y)

If the collection of values that can be assumed by the random variable (X, Y) is so large that it cannot be analysed as a whole, then the only statistical source of information about the distribution of variable (X, Y) is a sample. All other information about the distribution of variable (X, Y) is non-statistical information (see p. 29). In analysing the distribution of a variable we may, and should, take into consideration all the information in our possession, both statistical and non-statistical.

Suppose that on the basis of available information we have postulated a statistical hypothesis H, according to which the distribution of the random variable (X, Y) is normal. Let us also suppose that the testing of this hypothesis¹ has not provided a basis for rejecting it. In this case we may assume that the function g₁(x) is linear, because if the hypothesis is true then, according to Corollary 1 (1.2.9.), the regression lines are straight lines.

¹ It will be discussed in Chapter 4.
2. The hypothesis concerning the shape of curve g₁(x) based on non-statistical information

A hypothesis concerning the shape of the regression curve can often be based on information not directly related to the distribution of the random variable (X, Y), but pertaining to the phenomenon that this variable describes mathematically. For instance, in smoothing out a broken curve of a time series representing the growth of the population of a country within a certain period of time, there are reasons to assume that such a curve will be exponential, if only we make the generally acceptable assumption that the rate of population increase during the period under consideration was not subject to serious fluctuations.
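Such an exponential smoothing may be sketched as a least-squares fit to the logarithms of the observations; an exponential trend y = a·e^(bt) becomes the straight line ln y = ln a + bt. The data below are hypothetical, chosen so that the fit is exact:

```python
import math

def fit_exponential(ts, ys):
    """Fit y = a * exp(b * t) by ordinary least squares on ln(y)."""
    logs = [math.log(y) for y in ys]
    n = len(ts)
    mt = sum(ts) / n
    ml = sum(logs) / n
    b = (sum((t - mt) * (l - ml) for t, l in zip(ts, logs))
         / sum((t - mt) ** 2 for t in ts))
    a = math.exp(ml - b * mt)
    return a, b

# A population growing exactly 2% a year: the fit recovers b = ln(1.02).
ts = list(range(10))
ys = [30.0 * 1.02 ** t for t in ts]
a, b = fit_exponential(ts, ys)
print(abs(a - 30.0) < 1e-9 and abs(b - math.log(1.02)) < 1e-9)
```

With noisy observations the same procedure yields the least-squares exponential trend rather than an exact fit.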
3. The hypothesis concerning the shape of curve g₁(x) based on a scatter diagram

If non-statistical information is so scanty that we cannot postulate any hypothesis concerning the shape of the curve g₁(x), then the only way out is to take a sample, to draw a scatter diagram and to analyse it. If the points on the diagram are so distributed that a distinct tendency in the form of a clearly marked trend is noticeable, then on the basis of the shape of this trend a certain class of functions can be chosen that should be suitable for approximating the distribution of points on the scatter diagram. The parameters of a function approximating the distribution and belonging to this class are determined by an appropriate method (e.g. the method of least squares).

What is striking in this type of approach is the high degree of arbitrariness. This approach cannot be taken when there are more than three dimensions, since in such cases it is not possible to draw a scatter diagram. For this reason non-linear regression is much less often used than linear regression.
It is worth stressing at this point that non-linear regression can always be approximated by a broken curve, and thus non-linear regression can be reduced to linear regression. In many important applications non-linear regression can be reduced to a linear form by an appropriate choice and introduction of new variables¹. For these reasons, in recent mathematical studies on probability and statistics non-linear regression is either completely omitted (e.g. [7], [21]) or discussed only briefly and in general terms (e.g. [28], [33]).

¹ This will be discussed in Chapter 5.
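The reduction to a linear form by introducing new variables may be sketched on the hyperbola y = a + b/x: the substitution u = 1/x turns it into the straight line y = a + bu, to which ordinary linear least squares applies. The data below are hypothetical, exact points of y = 5 + 3/x:

```python
def linear_fit(us, ys):
    """Ordinary least-squares line y = a + b*u."""
    n = len(us)
    mu = sum(us) / n
    my = sum(ys) / n
    b = (sum((u - mu) * (y - my) for u, y in zip(us, ys))
         / sum((u - mu) ** 2 for u in us))
    a = my - b * mu
    return a, b

xs = [1.0, 2.0, 4.0, 5.0, 10.0]
ys = [5.0 + 3.0 / x for x in xs]            # exact points of y = 5 + 3/x
a, b = linear_fit([1.0 / x for x in xs], ys)  # new variable u = 1/x
print(abs(a - 5.0) < 1e-9 and abs(b - 3.0) < 1e-9)
```

The same device handles, for example, y = a·x^b (take logarithms of both variables) or polynomials in x (take u₁ = x, u₂ = x², ...).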
2. THE APPLICATION OF REGRESSION AND CORRELATION TO ECONOMIC RESEARCH

2.1. On the relation between economics and mathematics, statistics and econometrics

The constant changes to which production processes are subject as a result of the rapid development of science and technique pose more and more difficult problems to economic science. The historical and descriptive methods traditionally used by the social sciences are no longer adequate to solve these problems. The ability to predict is becoming an indispensable tool in the proper management of production processes. The need for acquiring this ability is felt both in a capitalist economy, which continually tries to free itself from the vicissitudes of market domination, and in a socialist economy, where the growth targets are determined on the basis of long-run economic plans.

The ordinary meaning of the word "predict" is obvious. The term "scientific forecasting", however, requires some explanation. By scientific forecasting we understand in this context every judgment concerning a random event whose accuracy is known with a probability determined to a degree of exactness sufficient for practical purposes. It follows from this definition that a scientific forecast is always a statistical hypothesis.

Scientific forecasting is impossible without comparing different quantities, without measuring, without using numbers. For this reason contemporary economics makes use of mathematics to a growing extent¹; especially useful are the theory of probability, mathematical statistics and econometrics.

¹ An exhaustive survey of applications of mathematics to economics can be found in studies [2], [58].
The knowledge of the causal relations existing between a given phenomenon and other phenomena is indispensable for scientific forecasting. If the relation between them is very strong, it can be presented as a mathematical function. In the social sciences, as a rule, we do not deal with functional relationships. This is due to the complexity of the nature of the phenomena studied. The relationships between them are usually of a stochastic nature (see p. 2). A specialized branch of mathematical statistics dealing with stochastic relationships is the theory of regression and correlation (which we shall also call correlation analysis).

So far, correlation analysis has not been very extensively used by economists in their normal work. It is not difficult to explain why this is so. If a research method is to find a wide range of applications, it is necessary for it to be: 1) sufficiently universal, i.e. suitable for solving a large number of different problems; 2) not too difficult, and easily popularized. We shall discuss the second of these conditions in Chapter 3. Here we shall try to show that the first condition is satisfied: economics provides many interesting problems which can be solved only by correlation methods.
Since the times of A. Cournot, political economy has employed functions in its research with ever greater frequency and daring. Cournot himself, being a mathematician, believed that economics, like mechanics, can freely use the concept of a function without the necessity of concerning itself unduly with the exact form of the function. Every function can be accepted as given, on the assumption that such a function actually exists in real life and, one way or another, can always be determined when the need arises. This point of view has been accepted by other economists such as Gossen, Pareto, Marshall and Keynes, to name just a few. This standpoint is acceptable if it is remembered that the curves used in mathematical economics are, in fact, regression curves when looked upon from a statistical point of view, because the variables employed in economics are not ordinary, but random variables. This is the reason for the great number of applications of correlation analysis to economics. We shall not confine ourselves to this general justification of the suitability of correlation methods for economic research, but shall also discuss the more important fields of their application. Before we do so, however, let us say a few words about econometrics.

Among the problems with which this science deals is the elaboration of numerical research methods for use in economics. The first place among them is occupied by statistical methods. Here is what Tintner has to say in his book "Econometrics": "Econometricians also make use of statistical methods to test certain hypotheses about the unknown population. This procedure is useful in the testing and verification of economic laws" (see [57], p. 18). The above sentence does not define the subject of econometrics but pertinently indicates with what this science deals. And here is how the founders of the Econometric Society interpret its scientific tasks: the Econometric Society is an international association whose objective is the development of economic theory in conjunction with statistics and mathematics (see [61], p. 5).
In the majority of works on econometrics we find examples of the application of statistical methods to economics. Correlation analysis, and particularly regression theory, are the most useful. Mathematical economics treats the economic categories with which it deals as mathematical variables, and analyses the relationships between these variables. When a dependence between a pair of variables is considered, it may be presented as a curve. The curves with which mathematical economics deals may be divided into three groups.

To the first group belong the curves describing the relationship between a pair of economic variables, e.g. the relationship between demand and price, costs and production, consumption and income, etc. The second group consists of the curves describing the relationship between an economic variable and time. They are called time curves. And finally, to the third group belong the curves describing the relationship between an economic variable and a technological variable. We deal with this type of relationship when we study the effect of the quality of raw materials on the cost of production or on the productivity of labour. The third group of curves we shall call techno-economic curves.

We shall discuss here the most important curves belonging to the first group: want curves, personal income distribution curves, demand curves and cost curves. Curves belonging to the second group will be considered together in the section entitled "Time Curves". Curves belonging to the third group will also be discussed together in the section called "Techno-Economic Curves".
2.2. More important applications of regression and correlation theory in economic research

2.2.1. Want curves

In order to live, man has to satisfy his various wants. Speaking most generally, these wants may be divided into physical and spiritual wants. To satisfy his wants man has to develop appropriate kinds of activities. One of the forms of such activity is labour. Labour creates use values, i.e. objects of nature with the ability to satisfy human needs. In a society where the division of labour is well developed, people receive objects necessary to satisfy their needs on the market in exchange for money. The more money they have, the better they can satisfy their needs. The desire to satisfy one's needs manifests itself externally as the pursuit of money. The feeling of displeasure that a man sometimes experiences when his needs are not satisfied is the driving power inducing him to develop an activity leading to the satisfaction of his needs. The lack of such a feeling is tantamount to the lack of wants.

Human wants are, as a rule, unlimited. However, the means that people have at their disposal for the satisfaction of their needs are limited. This results in a continuous conflict consisting in the necessity of making a decision as to which of the wants is to be satisfied.
Let S be the sum of financial resources which person Z has at his disposal to meet his needs during period T. Let Q₁, Q₂, ... be the various needs of this person, and S₁, S₂, ... the amounts required to meet needs Q₁, Q₂, ... The amounts S₁, S₂, ... we shall call the prices of needs Q₁, Q₂, ... The following inequality is obviously satisfied:

    S < Σᵢ Sᵢ,

and person Z can satisfy only some of his needs. The choice must be made so that

    S₁ + S₂ + ... + S_k ≤ S,

where k denotes the number of needs satisfied. In choosing those needs that are to be satisfied from the amount S, person Z, if he behaves rationally, must act so as to minimize the displeasure caused by the inability to satisfy all his needs.
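The choice described above may be sketched as follows (the prices are hypothetical, and taking the needs strictly in priority order is our simplifying assumption): needs are satisfied in the order of their intensity until the sum S is exhausted.

```python
def needs_satisfied(prices, s):
    """prices: amounts S_i listed by priority; s: the disposable sum S.
    Returns k, the number of needs satisfied, with S_1 + ... + S_k <= s."""
    total, k = 0.0, 0
    for price in prices:
        if total + price > s:
            break            # the next need in priority order is unaffordable
        total += price
        k += 1
    return k

prices = [300, 200, 150, 400, 250]   # hypothetical prices of needs Q_1..Q_5
print(needs_satisfied(prices, 700))  # → 3, since 300 + 200 + 150 = 650 <= 700
```

A fuller model would let the person skip an unaffordable need for a cheaper one of lower priority; the sketch only illustrates the budget constraint.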
Among the most important human wants are those which are necessary to his existence. These wants we shall call basic wants. If the sum S that person Z has at his disposal is small, it will be spent entirely on his basic wants. Until they are satisfied, person Z has no freedom in choosing the wants to be satisfied. Let us denote by S′ the minimum amount of money necessary to keep Z alive during period T under conditions that are not detrimental to the existence and development of his body. In this case, assuming that Z behaves rationally, the sum S will be spent entirely on the satisfaction of his basic needs unless S is greater than S′.

Suppose that at a certain moment t person Z has at his disposal the sum S₀ ≥ S′, and that at moments t + rT (r = 1, 2, ...) he will receive amounts Sᵣ increasing with r. Then, at the moment t + rT person Z will have at his disposal the amount Sᵣ = S′ + Sᵣ″. The sum Sᵣ″ is the amount that Z has at moment tᵣ after meeting his basic needs during period T. The sum S″ we shall call the sum of free decision.
Let us consider what behaviour should be expected from people when the sums of free decision that are at their disposal increase. Two types of situations will develop here:

1) The whole sum S″ will be earmarked to satisfy always the same, permanent group of needs.

2) As S″ grows, new needs will be satisfied. The group of needs satisfied will change and expand with the growth of S″.

We know from experience that the first type of situation rarely occurs in practice. In conformity with the psychological law¹, the displeasure related to an unsatisfied need decreases as the need is satisfied. This means that as personal incomes increase, people spend money from the sums of free decision at their disposal to satisfy new and different needs, and do not always spend these sums for the satisfaction of the same needs.

¹ Called Gossen's First Law.
In accordance with experience it should be assumed that as personal income increases, the number of new and different needs that people can satisfy also increases. This means that the number of wants felt depends on the sum of free decision. This may be presented on a graph. On the horizontal axis we have the sum of free decision S″ which is at the disposal of a single person. The vertical axis represents the sums Sᵢ which, on the average, are earmarked by the person who has at his disposal the sum of free decision S″, to satisfy need Qᵢ. The bisector of the angle between the coordinate axes is the locus of points whose ordinates represent the maximum amount of expenditure for the satisfaction of needs at a given value of S″.

GRAPH 1.

As S″ increases, new needs or groups of needs are satisfied. The sums spent on the satisfaction of needs Qᵢ are increasing functions of S″:

    Sᵢ = Sᵢ(S″).

The functions are represented on the graph by curves. As need Qᵢ is satisfied, as S″ increases, the sums Sᵢ asymptotically approach straight lines parallel to the horizontal axis. The distance between these lines, measured on the vertical axis, represents the maximum amounts of the sums Sᵢ that are earmarked, on the average, by a single person for the satisfaction
of certain needs. This means that if commodity C₁ satisfies need Q₁, then the total amount of this commodity that can be absorbed by the consumers, when the price of commodity C₁ is given and equals p₁, cannot exceed a certain constant value which depends upon the number of persons who possess sums of free decision enabling them to satisfy need Q₁.

If the sum of free decision S″ = S₀″, then the ordinate of the point located on the bisector and determined by S₀″ represents the maximum amount that can be spent on the satisfaction of needs by a person whose sum of free decision S″ = S₀″ (the ordinate in question is represented on the graph by segment MR). The points of intersection of this ordinate with the particular curves Sᵢ determine the amounts of expenditures incurred on the average by the person possessing the sum of free decision S″ = S₀″ for the satisfaction of his needs.

The particular curves Sᵢ, which we shall call want satisfaction curves, or want curves, run one above the other according to the priority of wants, i.e. according to their intensity. Wants that we find it difficult not to satisfy are represented by curves located low on the graph. As S″ increases new needs appear; they are represented by curves located higher up.

The segment determined on the ordinate MR by two adjacent curves Sᵢ and Sᵢ₊₁ reflects the average expenditure incurred for the satisfaction of need Qᵢ₊₁ by a person for whom S″ = S₀″. Segment MN represents the average total sum of expenditures for the satisfaction of the needs of a person for whom S″ = S₀″. Segment NR = MR − MN represents the average amount saved during a certain period of time by a person for whom S″ = S₀″.
The knowledge of the curves Sᵢ is of great practical importance. These curves enable us to determine the expected amount of expenditures incurred by person Z for the satisfaction of particular needs depending upon the sums of free decision at his disposal. They also enable us to determine the amount saved by person Z. This means that knowing the distribution of personal income and the curves Sᵢ, we know the character of the social demand for goods and services.

From the statistical point of view, the curves Sᵢ are regression curves reflecting the relationship between the expenditures for the satisfaction of particular needs and the sums of free decision. The determination of regression line parameters is the task of statistics. These parameters are determined on the basis of statistical material, which should be as complete as possible, collected in the course of making statistical observations.

The analysis of the relationship between the amount of maintenance expenditures and the size of income was started by Engel. Analysing family budgets, he noticed that the percentage share of maintenance expenditures decreases with an increase in income. This relationship is known in the literature as the Engel-Schwabe Law (see [61], p. 47).
GRAPH 2. Engel curves: expenditures on food, clothing, rent and other items plotted against total family expenditures (marks/annum).

Analysing the structure of family budgets, Engel derived an equation for the curves determining the stochastic relationship between the sum of maintenance expenditures and the size of income. These curves are known as Engel curves. It is particularly worth noting that Engel curves may be approximated by straight lines with an accuracy sufficient for practical purposes. The Engel curves shown on Graph 2 were determined by Allen and Bowley [3] on the basis of German statistical material for the years 1927-28. As can easily be seen, we have here a case of a very clear linear correlation. The study of Engel curves has shown that they can be approximated by linear regression lines within their interval of validity¹.

Engel curves are a special case of want satisfaction curves. The term "want curves" was used by the author for the first time in study [30]. It is worth noting that since Engel's time a lot of attention has been devoted to the analysis of family budgets.
On the basis of information derived from these budgets, research is being carried on concerning the relationship between the expenditures for the satisfaction of particular wants and the size of income. It has turned out that the terminology connected with these studies is not uniform. H. T. Davis in study [11] uses the term Engel curves for want curves. W. Winkler in study [61] objects to the use of this term and calls them Othmar Winkler's curves. Keynes [34] calls them consumption curves. A similar confusion developed with regard to the term used in the literature to describe the situation when consumption growth is slower than income growth. Besides the term "Engel's Law", this economic phenomenon is also called a "declining propensity to consume".

¹ The interval of validity of a regression line is a numerical interval containing all the measured values of the random variable of the sample appearing in the equation of regression as an argument.
If, after Keynes, we denote consumption expenditures by C and income by Y, we can express the relationship between consumption and income in the form of the following equation:

    C = f(Y).

Keynes (see [34], p. 114) calls the derivative dC/dY the marginal propensity to consume. On the same page of this work we read: "Our normal psychological law that, when the real income of the community increases or decreases, its consumption will increase or decrease but not so fast, can, therefore, be stated... in a formally complete fashion... ΔYw > ΔCw ..."
The above statement has been quoted here to show that, regardless of certain differences of a secondary nature, the main idea expressed by Engel's Law and the "psychological law" of Keynes is the same. It is interesting to note that a similar idea is expressed by Gossen's First Law: "The intensity of a given need steadily decreases as it is satisfied until the level of saturation is reached" (see [26], p. 4). Without becoming involved in a criticism of the conclusions derived by various economists from the laws quoted above, we can say that all three laws express the following important economic truth, binding both in a capitalist and in a socialist economy: individual wants weaken as they are satisfied.
To conclude our considerations concerning the numerous applications of want satisfaction curves, we shall mention an interesting study by Wald [60] devoted to the problem of determining indifference surfaces. In his work Wald assumes that when three commodities x, y and z are considered, the indifference function W(x, y, z) is a second order polynomial of the type

    W(x, y, z) = a₀₀ + a₀₁x + a₀₂y + a₀₃z + a₁₁x² + a₁₂xy + a₁₃xz + a₂₂y² + a₂₃yz + a₃₃z²,

where all the coefficients a₀₁, a₀₂, ..., a₃₃ can be found if the equations for the Engel curves are known. The constant term a₀₀, however, cannot be determined. This is not of great importance, since it is necessary to know the indifference surface equation only in order to determine the equations of the indifference curves (isoquant equations), and these do not depend on a₀₀.
It follows that want curves, regardless of their names, are of fundamental importance to all contemporary mathematical economics. A number of important concepts and methods used in mathematical economics could certainly be used in the political economy of socialism. For this to happen, however, it is necessary to expand the studies of family budgets, since the material they provide enables us to learn about the relationship between the amount of expenditure for the satisfaction of particular needs and the size of income. Regression lines are a statistical way of expressing this relationship (see 4.3., Example 1).

2.2.2. Income distribution curves

In the preceding section the relationship between the expenditures for the satisfaction of particular needs and the size of income has been described. This is undoubtedly one of the most important and interesting economic relationships. Naturally, it does not appear in isolation but is causally related to other dependencies which form the whole, complex economic system.

For the full utilization of information supplied by want curves it is necessary to know the distribution of income. By the distribution of income of the population we shall understand a statistical relationship between the size of income and the number of people in a given income bracket, or the relationship between the size of income and the frequency of its appearance. These formulations are actually equivalent to one another; the formal difference between them consists in the fact that in one case we deal with frequencies and in the other with relative frequencies.

Below is an example of the distribution of income per person, taken from the Statistical Yearbook for 1956 ([65], pp. 284-285). Table 1 shows the distribution of monthly earnings for September 1955.
TABLE 1. EMPLOYMENT ACCORDING TO MONTHLY EARNINGS FOR SEPTEMBER 1955

The number of employed persons in particular classes of gross earnings (in zlotys), in absolute figures (thousands):

    181.9   625.1   1,017.8   979.2   1,516.3   565.9   200.2   76.4   57.4

For each particular wage group there is a corresponding figure showing the number of employees whose earnings fall into this class.
Table 2 also shows the distribution of monthly earnings in September 1955. It differs from Table 1 only in that relative frequencies have been substituted for frequencies.

A graph called a frequency histogram is a graphic presentation of the distribution. One condition that has to be fulfilled if such a graph is to be drawn is that the class intervals of the frequency distribution be equal. Since the distribution shown in Table 1 has unequal class intervals, then in order to draw a graph we have to calculate cumulative frequencies corresponding to particular class intervals of the frequency distribution:
TABLE 2
EMPLOYMENT ACCORDING TO MONTHLY EARNINGS FOR SEPTEMBER 1955

Relative frequencies (per cent):
3.5   12.0   19.5   18.8   29.0   10.8   3.8   1.5   1.1

Cumulative frequencies (per cent):
3.5   15.5   35.0   53.8   82.8   93.6   97.4   98.9   100.0

Below is a graph showing cumulative frequencies based on Table 2.
GRAPH 1. Cumulative frequencies of monthly earnings (salaries in hundreds of zlotys along the horizontal axis).
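The step from Table 1 to Table 2 is pure arithmetic: divide each class count by the total employment and accumulate the shares. A short Python sketch using the Table 1 counts:

```python
# Employment counts by gross-earnings class, September 1955 (thousands) -- Table 1.
counts = [181.9, 625.1, 1017.8, 979.2, 1516.3, 565.9, 200.2, 76.4, 57.4]

total = sum(counts)   # total employment (thousands)

# Relative frequencies (per cent), rounded to one decimal as in Table 2.
relative = [round(100 * c / total, 1) for c in counts]

# Cumulative frequencies (per cent): running sums of the relative frequencies.
cumulative = []
running = 0.0
for r in relative:
    running += r
    cumulative.append(round(running, 1))

print(relative)    # [3.5, 12.0, 19.5, 18.8, 29.0, 10.8, 3.8, 1.5, 1.1]
print(cumulative)  # [3.5, 15.5, 35.0, 53.8, 82.8, 93.6, 97.4, 98.9, 100.0]
```

The cumulative column is what the broken-line graph above plots.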
The cumulative frequency curve is shown on the graph as a broken line. This is the result of dividing statistical material into classes. If the classification of statistical observations and the preparation of frequency distributions could be avoided, the curve of cumulative frequencies would certainly be different. It would give a true picture of earnings in the month of September 1955 and would be free from distortions caused by the classification of statistical data. Treating the income of the population as a continuous variable, we can assume that the cumulative frequency curve would also be continuous. Its shape could be guessed in different ways. The simplest way would be to construct a polygon of relative frequency and to smooth out by hand the broken line obtained in this way. This is a really good and simple method but it is used rather reluctantly because:
1) it involves a certain degree of arbitrariness in drawing a curve;
2) it does not provide an equation of the curve;
3) it is not conducive to probability reasoning.
For these reasons analytic methods are preferred for smoothing out curves. Although they require many cumbersome computations they are free from the last two of the above-mentioned drawbacks.
They also make it possible to determine the curve in the "best" way, i.e. with proper consideration given to a maximum or minimum condition, such as that the sum of the squared deviations of the variable from the resultant curve be a minimum. A smoothed-out curve of cumulative frequency enables us to guess the shape of the frequency distribution curve. Indeed, a cumulative frequency curve is nothing but an empirical distribution function. Hence, using the known formula

f(x) = F'(x),

we can guess the shape of the distribution curve with a fairly good approximation. On Graph 1 the distribution curve is presented as a broken line.
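Passing from the cumulative curve to the frequency curve is differentiation, f(x) = F'(x); numerically it amounts to differencing. In the sketch below the salary grid points are assumed for illustration (they are not the Yearbook's actual class boundaries); the cumulative values are those of Table 2:

```python
# Estimate the frequency (density) curve from a cumulative curve by finite
# differences: f(x) ~ (F(x+h) - F(x)) / h on each interval.
# The salary grid xs is hypothetical; F holds the cumulative values of Table 2.
xs = [4, 6, 8, 10, 12, 14, 16, 18, 20]   # salaries, hundreds of zlotys (assumed)
F  = [3.5, 15.5, 35.0, 53.8, 82.8, 93.6, 97.4, 98.9, 100.0]  # cumulative, %

# One-sided differences give a piecewise-constant estimate of the density.
density = [(F[i + 1] - F[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
print(density)   # per cent of employees per hundred zlotys, interval by interval
```

The largest difference marks the modal income interval, which is how the broken-line distribution curve of Graph 1 arises from the cumulative curve.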
An analysis of personal income, like an analysis of family budgets, should be based on current research data providing statistical material that can be used to determine particular want curves and the income distribution curve.

Let us denote by V the income per person, and by W_i the consumption per person of commodity A_i expressed in physical units, where i is the number of a commodity. In this case (V, W_i) is a two-dimensional random variable. Let f = f(v, w_i, t) denote the empirical distribution of this variable. This distribution changes in time; it is a function of time. Let t_1, t_2, ..., t_r, ... be consecutive moments in time and let t_r - t_{r-1} = Δt.
For a sufficiently small interval Δt we can assume that prices on the market are constant. If for every given moment t_r supply S is greater than demand D, and if the distribution f is known for all t's, then we have enough information to decide how demand changes with price. In a planned economy this would provide sufficient information for solving the problem of how to fix the volume of production to ensure market equilibrium. Unfortunately in practice this is not possible. There are too many goods and services to make it feasible to carry on statistical research on each of them in order to determine the function f(v, w_i, t). The processing of statistical material pertaining to income and expenditures per person and a comprehensive analysis of this material require a lot of time and work. We can learn about only those relationships and dependencies which are of greatest importance to the national economy.

The knowledge of the shape of the income distribution curve and of the want satisfaction curves is of special importance in a planned economy. Let us discuss this matter in greater detail. Let us assume that we know the shape of the income distribution curve and the shape of the want satisfaction curve for A (for example, A could be sugar). In this case for each size of income there are two corresponding figures: the number of persons in this income bracket, or the relative frequency with which this income appears, and the average degree of satisfying need A, i.e. the average consumption of sugar by persons in this income bracket. On Graph 2 the relationship between the size of income and the average consumption of sugar per person is presented. Variable V is the argument and stands for the size of income; variable W denotes the average consumption of sugar per person. The dependence of W on V is described on the graph by the regression curve of W on V.
GRAPH 2.

About a dozen points are distinctly marked on the curve. They are determined on the basis of statistical data. From these points perpendicular segments are drawn to the plane VOW. These segments represent the relative frequencies of the occurrence of particular magnitudes of income per person. Graph 2 is a three-dimensional graph. The relative frequencies shown on it are denoted by f(v) and measured along the vertical axis. It can be clearly seen from Graph 2 that if we know the shape of want satisfaction curve A and the shape of the income distribution curve we can easily determine both the
average consumption per person of commodity A corresponding to a particular income group, and the total consumption of this commodity in particular groups. This enables the authorities to control it by using an appropriate wage and price apparatus, and to see that particular needs of the population are satisfied to a sufficiently high degree, with special consideration given to the protection of the interests of those in the lowest income groups.
If we assume that people in different income groups do not differ from each other with respect to the intensity of desire to satisfy want A, i.e. that people whose income is v_k and whose consumption is w_k would consume w_{k+1} if their incomes increased to v_{k+1}, then it follows from Graph 2 that, knowing the income distribution curve and the want satisfaction curve, we can predict by how much the total consumption of commodity A will increase when the earnings of population group k grow from v_k to v_{k+1}. Denoting by ΔA the total increase in the consumption of commodity A by the total population, and by N the total population number, we get the following equality:

ΔA = (w_{k+1} - w_k)·f(v_k)·N

(see Graph 2). The knowledge of
the want satisfaction curve and of the income distribution curve allows us to predict how the demand for a given commodity will change in consequence of changes in income. The latter are not the only cause of changes in demand. Besides income, price essentially affects the size of demand. In our considerations so far we have assumed that prices are constant. This was permissible since our analysis was limited to a sufficiently short period of time Δt. When research on income, consumption, prices, demand and supply is carried on continuously, we can disregard those periods of time Δt during which a change in prices occurs, similarly as we proceed in analysing a function with a finite number of points of discontinuity. If economic research is conducted continuously then in every period Δt the incomes, average consumption and prices are known, as is the demand. This is enough for the purpose of the current management of the economy. However, it is not enough for planning, or for the scientific forecasting of the course of economic processes in the future.
It is well known that for an economy to be in equilibrium it is required that at a given price of commodity A its supply be equal to the demand for it. Suppose that during a certain period of time the following situation prevails on the market: the supply of commodity A is small, the demand for it is large and the price high. This price considerably exceeds the social cost of producing commodity A. This means that the production of this commodity is very profitable. If the economy is based on the principle of profitability this situation is bound to provide a stimulus to increasing production. An increase in production cannot occur instantaneously but requires a certain period of time. Thus the need for accurate scientific forecasting stems from the fact that the adjustment of supply to demand does not take place directly, as was the case in a primitive economy, but through the market. As long as the market exists, so long will scientific forecasting be needed, regardless of whether the market is in a capitalist or a socialist economy.
If a forecast of changes in demand is to be accurate we must know the relationship between demand and price as well as the relationship between demand and size of income. The knowledge of the relationship between demand and price allows us to answer the question: what will the probable demand be at a given price, and what will the probable price be at a given demand? An accurate answer to this question is of great importance both in planning production and in fixing prices. To provide a correct answer it is necessary to know the demand curve. We shall now discuss these curves.
2.2.3. Demand curves

As we know, there is a distinct interdependence between price and demand in a free market economy; when the price rises demand drops, when the price declines demand increases. In this case

D = φ(P),   (1)

where P stands for the price of a commodity and D for the amount that can be sold at this price. Naturally, a functional description of the complex relationships that exist between particular phenomena or economic processes is always to some extent a scientific abstraction. In fact, the relationship between demand and price is not of a functional, but of a stochastic nature. This means that it is possible to express this relationship in mathematical language only by statistical methods.
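Since the relationship is stochastic, the curve has to be estimated from observed price-demand pairs; statistically there are two regression lines, D on P and P on D, a point the text returns to below. A least-squares sketch on invented observations:

```python
# Two regression lines for a price-demand sample: D on P and P on D.
# The observations are invented -- a falling demand curve seen with noise.
prices  = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]        # P
demands = [95.0, 80.0, 72.0, 60.0, 51.0, 42.0]  # D

n  = len(prices)
mp = sum(prices) / n
md = sum(demands) / n
spp = sum((p - mp) ** 2 for p in prices)
sdd = sum((d - md) ** 2 for d in demands)
spd = sum((p - mp) * (d - md) for p, d in zip(prices, demands))

# Regression of D on P (minimizes vertical deviations) ...
b1 = spd / spp
a1 = md - b1 * mp
# ... and of P on D (minimizes horizontal deviations): two different lines.
b2 = spd / sdd
a2 = mp - b2 * md

print(f"D on P: D = {a1:.1f} + ({b1:.2f})*P")
print(f"P on D: P = {a2:.2f} + ({b2:.4f})*D")
# The two lines coincide only when the correlation is perfect: b1*b2 = r**2 <= 1.
```

Both slopes come out negative, as the falling demand curve requires.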
Functions that are used in mathematical economics are a tool of learning only when they can be statistically verified. All theoretical utterances about these functions are actually only scientific hypotheses until their correctness is checked by statistical methods. They become laws only after statistical verification. We have made these comments because in many textbooks on economics no justification is given for representing the relationship between demand and price by a concave and monotonically downward sloping curve (see Graph 1).

GRAPH 1. The demand curve.
Very little can be said a priori about the shape of the demand curve. All that we know is that it should fall as the price increases. The shape of the demand curve can be determined only by statistical methods. The demand curve, from a statistical point of view, is the regression line of D on P. This means that in order to express the relationship between price and demand in a functional language it is not enough to write down a function inverse to function (1), but that we have to find the other regression line, i.e. the regression line
of P on D.

In a socialist economy in which the monetary commodity exchange system prevails, an analysis of demand not only does not lose its importance, but acquires new significance which is essentially different from the significance it has in a capitalist economy. The main purpose of analysing demand in a socialist economy is to learn about the needs of the society and to adapt the production apparatus to the best possible satisfaction of these needs.

Let us denote by W the average consumption per person of commodity A, by V the income per person, and by P the price of commodity A. Between the random variables V, P and W there is a stochastic dependence. A certain defined value w(v,p) corresponds to each pair of numbers (v,p). Suppose that as a result of statistical analysis we have obtained numerical material on the basis of which we have constructed a three-dimensional model of the relationships between random variable W and random variables V and P (see Graph 2). Along the V-axis the centres of class intervals of the income distribution series are measured, and along the P-axis the prices that have been observed during the study¹. Along the W-axis the average consumption per person of commodity A is measured. The segments perpendicular to plane VOP drawn from the middle of each square represent the volume of the average consumption per person of commodity A for various incomes and prices. The model would not be complete if it did not take into account the distribution of the random variable (V,P,W). Since in this model the values of the distribution function constitute a fourth variable, we cannot draw the required number of coordinate axes in a three-dimensional space. We have overcome this difficulty by presenting on the graph the values of the empirical distribution function (i.e. frequencies) as squares of different areas located inside the squares of the chessboard.

¹ Segments on the V-axis are not in the same units as those representing price intervals.
Let us note that by fixing the price and moving along the V-axis we find the relationship between the amount of the average consumption per person of commodity A and the size of income. This relationship, expressed by a regression line, is the want satisfaction curve of commodity A. By fixing the size of income and moving along the P-axis we find the relationship between the consumption per person of commodity A and the price. The regression curve describing this relationship is called the demand curve. If by D we denote the demand for commodity A, then
D(v,p) = W(v,p)·f(v,p)·N,   (2)

where D(v,p) denotes the demand for commodity A on the assumption that the price of this commodity is p and the income per person is v; similarly, W(v,p) denotes the average consumption per person of commodity A if V = v and P = p; f(v,p) denotes the frequency of the random variable (V,P) at the point (V = v, P = p); and N stands for the total population.

The above graph presents the distribution of the three-dimensional random variable (V,P,W). Suppose that this distribution is known. In this case, other things being equal, we also know the dependence of the size of demand for commodity A on income per person and on the price of this commodity. Thus we have enough information to be able to adjust the supply to the demand for commodity A, i.e. to satisfy the economic equilibrium requirements.
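Formula (2) turns three ingredients, average consumption, frequency and population, into a demand figure; summed over income groups it gives total demand at a fixed price (cf. formula (3) below, which the text derives from Graph 3). A sketch with hypothetical figures:

```python
# Formula (2), D(v,p) = W(v,p)*f(v,p)*N, group by group, and the total demand
# at a fixed price as the sum over income groups (cf. formula (3)).
# All figures are hypothetical, for illustration only.
N = 5_220_000   # total population

# For each income group v_t at the fixed price p_j:
# (average consumption W, kg per person; relative frequency f of the group)
groups = [
    (0.6, 0.16),   # low incomes
    (1.1, 0.54),   # middle incomes
    (1.8, 0.30),   # high incomes
]

per_group = [W * f * N for W, f in groups]   # formula (2) for each group
total_demand = sum(per_group)                # N times the sum of W*f over groups
print(per_group, total_demand)
```

The group frequencies must sum to one; the group terms are the "parallelepiped volumes" of the three-dimensional model.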
The knowledge of the distribution of the random variable (V,P,W) allows us to solve many important economic problems. Let us consider one of them. Suppose that the production and import potential does not permit us to saturate the market with commodity A to a degree sufficiently high to meet the demand for it at the constant price P = p₀. This means that, since the economic equilibrium requirements have been impaired, in addition to the rigid price p₀ fixed by the government a new market equilibrium price p₁ (or rather a black market price) will appear. This price will shift on to the shoulders of the society the burden of maintaining speculators and smugglers, and will constitute a temptation for dishonest employees of the socialized trade apparatus to hoard the scarce commodity. To prevent such a development, extremely harmful from an economic point of view, the State regulates the price of commodity A, trying, at a given distribution of income per person, to fix the price in such a way as to ensure the sales of the stock of commodity A and at the same time to make revenues a maximum. To regulate prices without causing undesirable social and economic results it is necessary to decide:
1) how to determine the equilibrium price of commodity A;
2) to what extent the demand for this commodity will be satisfied in population groups of different incomes per person.
We know that if at price P = p₀ the demand exceeds the supply, the equilibrium price p₁ will be higher than price p₀. The higher price will cause a drop in demand, thus adjusting the latter to the available supply. Of course, the drop in demand will occur because less well-to-do population groups will be forced to satisfy a smaller amount of their wants in consequence of the high price (too high a price of commodity A). Each forced renunciation of the satisfaction of social wants is an undesirable development from an economic point of view. However, it may sometimes be tolerated as a necessary evil if its social and economic consequences are not dangerous. They may be dangerous when basic needs are not satisfied. They should not be dangerous, however, when the restrictions affect the satisfaction of the needs for luxury goods. Naturally, if the interests of poorer people are to be protected, we have to know the process of satisfying wants in population groups of different income brackets, their respective purchasing powers and the market demand for commodity A.
Answers to these questions are facilitated by Graph 3. In this graph we can see what the distribution of the amount of commodity A would be at the price P = p₁.

GRAPH 3.

The areas of the bases of the parallelepipeds shown on the graph are f(v,p). The heights W(v,p) of these parallelepipeds represent the average consumptions of commodity A, or, what amounts to the same thing, the average individual satisfactions of need A. Hence the volume of a parallelepiped, equal to the product W(v_t, p_j)·f(v_t, p_j), multiplied by the population number N, represents the demand for commodity A that exists in the population group whose income is V = v_t when price P = p_j (see formula 2). The total demand for commodity A equals the sum of the volumes of particular parallelepipeds multiplied by N. This means that

D = N·Σ_t W(v_t, p_j)·f(v_t, p_j).   (3)
It follows that, other things being equal, we can answer the questions concerning the size of the demand for commodity A when price P = p, the price of this commodity when the supply S of this commodity equals s, and the distribution of the amount s of commodity A among the population groups whose incomes are v₁, v₂, ... If we know the distribution of variable (V,P,W) we can solve the equations of the three regression surfaces. These equations are presented below in a general form:

v = g₁(p,w),   (4)
p = g₂(v,w),   (5)
w = g₃(v,p).   (6)
These formulae express in a functional way the relationships between three economic quantities: the average consumption per person of commodity A, the price of commodity A, and the average income per person. The determination of equations (4), (5) and (6) is very difficult in practice. This is due to the complex nature of economic phenomena, which are interrelated by strong causal relations. In the course of observing economic phenomena we try to consider them in isolation. This approach always tends to diminish the accuracy of the results of observations and economic analysis because of the strength of the causal relations existing between economic phenomena. In order to improve the accuracy, we must consider in the process of learning ever new relationships between the phenomena, trying to bring them into focus one by one, according to the diminishing strength of their influence.

If in equation (2) we substitute the expected value w = g₃(v,p) for W(v,p), then the equation

d(v,p) = g₃(v,p)·f(v,p)·N   (7)

will represent the expected size of the demand for commodity A on the assumption that income V = v and price P = p. Formula (7) expresses explicitly the dependence of demand on income and price. These two economic factors undoubtedly exert the greatest influence on the size of demand. However, they are not the only factors. If income and the price of commodity A are determined, the demand for it may change
depending on changes in the prices of the complementary goods of, and substitutes for, commodity A. Since the determination of all the relationships of this type is practically impossible, we either take into account only the most important relationships or content ourselves with an analysis of the relationship between demand, income and price.

Equation (5) shows the dependence of the price of commodity A on the average consumption per person of this commodity and on income. In a free market economy the supply of commodity A depends on price. Let us introduce new variables U, Y, Z, where U denotes sales revenues, Y costs and Z profit. In this case

U = X·P,   (8)
U = Y + Z   (9)

(symbol X denotes the volume of production). At given values P = p, Y = y and Z = z the supply is

x = (y + z)/p.   (10)

In an economy governed by the principle of profitability, profit is an economic factor providing an inducement to increase production. Of course, under these circumstances production should be increased until profit reaches its maximum value. To decide how high the volume of production should be to obtain maximum profit we have to know the relationship between costs and production. The knowledge of this relationship is a condition for a rational management of production. It follows from the equation

p = (y + z)/s   (11)
The application of regression that at given values s
to
and z the
economic research
price of
79
commodity
A
de-
when the cost of its production decreases. Since a drop in the price of commodity A results in an increase
creases
in the
demand
for
it
the lowering
leads to a better satisfaction
of the cost of production
of wants.
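Relations (8)-(11) are simple identities, and a few lines of arithmetic make them concrete; the figures below are hypothetical, chosen only to make the identities visible:

```python
# Relations (8)-(11) tie revenue, cost, profit, production and price together:
# U = X*P and U = Y + Z give supply x = (y + z)/p and price p = (y + z)/s.
# All figures are hypothetical.
y = 800_000.0    # total cost
z = 200_000.0    # profit
p = 25.0         # given price

x = (y + z) / p           # production/supply consistent with (8) and (9)
s = x                     # suppose the whole output reaches the market
p_check = (y + z) / s     # formula (11) recovers the price

# At the same supply s and profit z, a lower cost means a lower price, as the
# text concludes from (11):
y_lower = 700_000.0
p_lower = (y_lower + z) / s

print(x, p_check, p_lower)   # 40000.0 25.0 22.5
```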
2.2.4. Cost curves

As we know from mechanics, the ratio of the amount of energy received from a given machine to the amount of energy supplied to it is called the coefficient of efficiency and is denoted by the symbol η. Thus

η = E₁/E₂,

where E₁ stands for the energy produced and E₂ for the energy used. Graph 1 shows the curve for coefficient η as a function of variable E₁. This variable represents the output of energy produced by the machine studied.
GRAPH 1.

It follows from the characteristic shape of curve η that if the productive capacity of the machine is not fully utilized the coefficient of efficiency η is low. As production increases, the efficiency of the machine rises and eventually reaches its maximum point. The abscissa of this point represents the optimum size of production E₁₀. At this level of production the efficiency of the machine is the highest. If production increases beyond the optimum value E₁₀, the efficiency of the machine decreases, since it is overloaded and consequently operates under unfavourable conditions.

It follows from the shape of curve η that the curve representing the dependence of variable E₂ on variable E₁ must be a monotonically increasing curve. This curve rises rapidly at first, then a little more slowly, and then rapidly again (see Graph 2).

GRAPH 2.

If instead of a single machine (a boiler, engine or generator) we consider a whole enterprise, then production X will be an equivalent of variable E₁, and cost Y will be an equivalent of variable E₂. Hence we can postulate the hypothesis that the shape of the total cost curve is similar to the shape of the curve shown on Graph 2. In other words, this curve may be considered as a hypothetical total cost curve. We have used the term "hypothetical curve" since in reality both variable X (production) and variable Y (cost) are random variables. The joint distribution of variable (X,Y) is different for each enterprise. In order to learn about the shape of the regression line describing the relationship between cost and production we have to carry out appropriate statistical research in the enterprise which interests us. As we shall see later, the regression line is usually a straight line.

GRAPH 3. (Production along the horizontal axis.)
On Graph 3 the total cost curve is shown. On this graph we can also see a portion of a straight line which does not differ much from the curve between the points marked by two vertical dashes. We shall denote the total cost by the symbol Y. On Graph 4 the hypothetical shape of curve Y is shown. It appears from the graph that costs are incurred even when production equals zero. The amount of this cost is represented on the graph by the segment determined on the positive part of the Y-axis by the point of intersection of curve Y with this axis.
GRAPH 4. (According to Paulsen [43]).

GRAPH 5.
These costs are known in the literature under different names, e.g. fixed costs, or independent costs. The fixed cost is denoted on Graph 4 by the symbol Y_s. If we deduct the fixed cost from the total cost and if we divide the difference obtained by the volume of production, we get the variable cost, or dependent cost. The variable cost is denoted by the symbol Y_z. Thus

Y_z = (Y - Y_s)/X.

In a geometrical representation the variable cost is the tangent of angle α (see Graph 4). By dividing the total cost by the volume of production we get the average, or unit, cost, which on Graph 5 is denoted by the symbol Ȳ. In this case

Ȳ = Y/X.

The average cost Ȳ is equal to the tangent of angle β (Graph 4). If the total cost is a continuous function of production, and if at every point of a certain interval within which this function is determined the derivative of this function exists, then Y′ is called the marginal cost. Marginal cost is equal to the tangent of the angle between the line tangent to the curve Y and the horizontal axis (Graph 4).

Variable cost, average cost and marginal cost are all functions of production. On Graph 5 we can see three curves representing these functions. The cost curves shown on Graphs 4 and 5 are not only graphic presentations of the interdependence between total cost on the one hand, and marginal, average and variable costs on the other; they are also a valuable tool of research. On the basis of these curves we can make several important observations, assuming that the hypothesis concerning the shape of the total cost curve is true, i.e. that:
1) the curve is located in the 1st quadrant of the coordinate system;
2) the ordinate of the curve at the point where the abscissa equals zero is not negative;
3) the curve is continuous within the whole interval of its validity and has a derivative at each point of this interval;
4) the curve has one point of inflexion separating the convex from the concave part.
The correctness of these observations follows directly from the graph. We can prove their validity in a formal way.

Observation 1. The minimum marginal cost is lower than the minimum variable cost which, in turn, is lower than the minimum average cost:

min Y′ < min Y_z < min Ȳ.

Observation 2. The minimum variable cost is determined by the point of intersection of the marginal cost curve with the variable cost curve (point B on Graph 5).

Observation 3. The minimum average cost is determined by the point of intersection of the marginal cost curve with the average cost curve (point C on Graph 5).
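The orderings in these observations can be checked numerically on a concrete cubic total cost curve; the coefficients below are invented, and any curve satisfying hypotheses 1)-4) would serve equally well:

```python
# A concrete total cost curve satisfying hypotheses 1)-4):
#   Y(X) = 0.1*X**3 - 1.5*X**2 + 10*X + 20   (fixed cost Y_s = 20).
# The coefficients are invented for illustration.
def Y(x):      return 0.1 * x**3 - 1.5 * x**2 + 10.0 * x + 20.0
def Ymarg(x):  return 0.3 * x**2 - 3.0 * x + 10.0     # marginal cost Y'
def Yvar(x):   return (Y(x) - 20.0) / x               # variable cost Y_z
def Yavg(x):   return Y(x) / x                        # average (unit) cost

xs = [i / 100 for i in range(1, 2001)]                # production grid (0, 20]
m_marg = min(Ymarg(x) for x in xs)
m_var  = min(Yvar(x) for x in xs)
m_avg  = min(Yavg(x) for x in xs)

print(m_marg, m_var, m_avg)
assert m_marg < m_var < m_avg              # Observation 1: min Y' < min Y_z < min Ybar

# Observation 2: the marginal cost curve crosses the variable cost curve
# exactly at the latter's minimum (here at X = 7.5).
assert abs(Ymarg(7.5) - Yvar(7.5)) < 1e-9
```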
Cost curves allow us to find correct solutions to the problem of the "optimum production size". Assuming that an economy is based on the principle of profitability, the optimum size of production means the size at which total profit is a maximum. Some economists consider that the optimum size of production can be determined with the help of the unit cost curve. They maintain that the optimum level of production is one at which the average cost is a minimum, and thus presumably profit is a maximum. In spite of its apparent correctness this statement is not true. We can easily prove the following:
THEOREM 1. Total profit reaches its maximum value when marginal revenue equals marginal cost, or, which amounts to the same thing, when marginal cost equals price.

Proof. Let U denote revenue, X production, Y cost, P price and Z profit. In this case

Z = U - Y.

If profit Z is to be a maximum it is necessary that

dZ/dX = dU/dX - dY/dX = 0.

But U = X·P and, at a given price,

dU/dX = P.

Hence

P = Y′,

i.e. marginal cost equals price.
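Theorem 1 can be illustrated numerically on the same kind of invented cost curve: profit peaks where marginal cost equals the price, not where unit cost is lowest:

```python
# Profit Z(X) = P*X - Y(X) peaks where marginal cost Y'(X) equals the price P,
# not where unit cost Y(X)/X is minimal (Theorem 1). Figures are illustrative.
P = 12.0
def Y(x):      return 0.1 * x**3 - 1.5 * x**2 + 10.0 * x + 20.0
def Ymarg(x):  return 0.3 * x**2 - 3.0 * x + 10.0

xs = [i / 1000 for i in range(1, 20001)]              # production grid (0, 20]
x_profit = max(xs, key=lambda x: P * x - Y(x))        # profit-maximizing output
x_unit   = min(xs, key=lambda x: Y(x) / x)            # minimum-unit-cost output

print(x_profit, x_unit)                               # the two "optima" differ
assert abs(Ymarg(x_profit) - P) < 0.01                # marginal cost = price there
assert abs(x_profit - x_unit) > 1.0
```

Producing at the minimum-unit-cost output leaves profit on the table, which is exactly what Graph 6 of the text shows geometrically.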
Graph 6 is a geometric representation of Theorem 1. It follows from the graph that profit attains its maximum value at X = x₁, and not at the production size that minimizes unit cost. This means that if we determine the volume of production in such a way as to minimize unit cost, the profit is less than it would be if the production were X = x₁. The correct optimum production size is obtained by using marginal cost, and not average cost. Both Theorem 1 and its proof have been known in economics since the days of Cournot. This theorem is valid in a socialist economy if the principle of profitability is observed. So far this theorem has not been applied consciously (it is, perhaps, applied to planning production in such a way as to maximize profit at a given price, but this is done somewhat unintentionally, more on the basis of experience and intuition than of theory). It seems that the chief reason for this attitude is the reluctance of economists to use mathematical and statistical methods of research in their professional work.
Cost analysis is undoubtedly one of the most important and difficult problems facing the economist. The lower the social cost of production, the higher is the social productivity of labour and the better the satisfaction of the needs of the society. In economic literature the subject of costs occupies the most prominent position. The determination of the equation of a regression line which is an approximation to the total cost curve is a typical econometric problem. The practical aspect of this problem is sufficiently important to justify dealing with it in greater detail.

The human desire to satisfy wants induces people to produce such goods as will satisfy them. The production of these goods requires sacrifices on the part of the society; it requires labour power, materials, power and all those factors of production without which the production process would be impossible. The society is willing to make these sacrifices only because without them it is impossible to produce and thus to satisfy human needs. It follows that production is the only economic and social justification of the cost of production.

It
follows from economic considerations that there should be an interdependence between costs and production. In mathematical economics it is assumed that cost is a function of production. The total cost curve is a geometrical representation of this function. The representation of the relationship between cost and production as a mathematical function is, of course, a scientific abstraction. In real life neither cost nor production is a variable in the general sense; each is a random variable. We can learn about the interdependence between cost and production only by statistical research. The procedure leading to the knowledge of this interdependence has to follow a certain sequence. Each manufacturing enterprise keeps a record of cost and production. This record provides periodically, usually monthly, statistical data pertaining to the size of production and cost. If
we denote production by
X and
quantities as the realization (x
t,
cost by y ) (i= t
Y we
can
1,2,..., ri)
treat these
of the two-
dimensional random variable (X,Y). A point on a plane correcollection of such sponds to each pair of numbers (x >>,.).
A
t,
points
may
be regarded as a sample selected from an infinite it is true that there is an interdepend-
general population. If
ence between cost and production and if production and records are properly kept, the distribution of points
cost
on the
scatter
diagram
will
show a
trend.
The regression line, being a statistical representation of this trend, is a functional way of expressing the interdependence between cost and production. It is especially worth noting that, as numerous studies have shown, the regression line describing the relationship between cost and production is usually a straight line. Let us quote a few opinions on this subject. Falewicz (page 61 of his book quoted here) says: "In spite of the fact that theoretically the cost line best representing the relationship between cost and production, if it could be established for all possible sizes of production from zero to the highest that the capacity of the enterprise permits, would be a curve of an equation of a high degree, probably not less than the 3rd, in practice, when it is possible to study this relationship only within certain limits of production size, we can assume, with a sufficiently high degree of accuracy, that it can be represented by a straight line".
And here is what Tinbergen has to say: "It has been established that in many industries the shape of the curve of total cost with respect to the volume of production can generally be represented by a straight line". Similar opinions are expressed by Dean [12], Lyle [37] and many other statisticians who have studied the interdependence between cost and production. Very characteristic and to the point is a comment by Tintner. On page 49 of his book [57] we read: "It is remarkable that (in the relevant interval covered by the data) the total cost of making steel seems to be a linear function of the amount of product. Hence the marginal cost is constant. The importance of this fact of constant short-run marginal cost discovered by all investigators of statistical cost functions contradicts the a priori assumptions of the economists."

We have quoted the above opinions in order to show that in practice the regression line describing the relationship between cost and production is a straight line. This is of great importance, since the determination of the linear regression parameters is relatively easy and, therefore, a statistical analysis of the relationship between cost and production could and should be made widely known.
2.2.5. Time curves
Time series constitute a wide field for applications of regression theory. It is well known that a trend is one of the characteristics of a time series. The notion of trend is interpreted in the literature in a variety of ways. Below we describe two generally accepted interpretations.

Interpretation I. The following time series is given:

x₁, x₂, …, x_t, …,

where t assumes integer values. An illustration of a time series is provided by the corn crop yields in the USSR in 1922-1934.
TABLE 1
CROP YIELDS IN THE USSR
(see [41], p. 171).
The statistical data of Table 1 are shown on Graph 1. The broken line on this graph is called a time curve. It shows various irregular breaks which are a result of random factors. These breaks in the time curve not only do not help in the process of learning but, on the contrary, make it difficult to detect the influence of the regular factor which causes crop yields per hectare to show a tendency to increase.
GRAPH 1.
(time, 1922-1936)
According to the first interpretation a trend is a line expressing a general tendency in the shape of a time curve. A trend line is determined by the elimination of random oscillations from the time curve. The parameters of the trend line are obtained by appropriate statistical methods, e.g. the moving average method, or the method of least squares. This interpretation of trend is accepted by O. Lange in his textbook on statistics [35], in which he writes: "Analysing time series we notice that they show a certain development tendency" (p. 181). And further on: "Table 46 gives yields in metric quintals per hectare in the USSR in 1922-1934; these data are shown on Graph 46¹. A development tendency can be clearly seen. The yield per hectare fluctuates from year to year but on the whole there is, undoubtedly, an increase in yield... A development tendency may be emphasized by a procedure called the smoothing out of a time series" (p. 182).

¹ See Table 1, Graph 1.
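The moving-average smoothing mentioned above can be sketched in a few lines of code; the series below is invented purely for illustration (the yield figures themselves are not reproduced in this copy), and an odd window is assumed so that the average is centred.

```python
def moving_average(series, window=3):
    """Smooth a time series with a centred moving average (odd window)."""
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

# Illustrative values only, with an upward tendency and irregular breaks.
print(moving_average([6.0, 8.0, 7.0, 9.0, 8.5, 10.0, 9.5], window=3))
```

Each smoothed value replaces an observation by the mean of its neighbourhood, which removes much of the random oscillation while preserving the general tendency.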
In this interpretation a trend line is a regression II line: by a visual inspection of the time series we select a family of approximation functions and determine the parameters of one of the functions belonging to this family. The curve of this function is the trend line. The role of the trend line in this interpretation is to smooth out the time curve. The value of the trend line as a tool of learning is much smaller than the value of the regression I line. In an analysis of the relationship between two random variables X and Y, when both are independent of time, the regression I line assigns conditional expected values of one variable to the values of the other variable. When we deal with a time series there are no conditional expected values involved. Trend is a smoothed out time curve. In the first interpretation, trend can be used only to describe a time series; it cannot constitute the basis for a prediction concerning the development of a stochastic process in the future. Even the most careful extrapolation is not permissible.

Interpretation II.
The values of a time series are a realization of a certain stochastic process. This process may be subject to some law which can be described by an appropriate mathematical function. The nature of this law, and consequently the shape of the function, are known. The parameters of the function, however, are not known. They can be estimated by statistical methods on the basis of the statistical material contained in the time series. In this interpretation the trend line is a geometric presentation of the function which is known, a priori, to be a mathematical expression of a law governing the stochastic process under consideration.
This interpretation often leads one astray and results in mathematical formalism. The law governing a stochastic process is rarely known a priori¹. Hence a temptation to proceed in the following way: on the basis of the visual inspection of the shape of a time curve we select an appropriate approximation function and then argue "theoretically" that this curve expresses, in fact, a "law" governing the realization of a stochastic process.

¹ In economic applications this law is sometimes known in the statistical quality control of production.

Such a temptation is particularly strong when a time curve shows only minor fluctuations, like the curve shown on Graph 2. It would seem that this kind of procedure
is too obviously against common sense to be used. This is not so, however. O. Lange, in his book mentioned above, quotes two examples of an improper application of a logistic curve to smoothing out a time series. In both cases a logistic curve was used not only because it "fitted" well to the statistical material, but primarily because it allegedly expressed the "law of growth" which can be presented in a mathematical form as a differential equation:

dx/dt = x(a − x)g(t),    (0 < x < a)
where a is a constant called the "level of saturation".

The application of this equation to the analysis of the trend line is deprived of all economic justification. Both cases cited by Lange are examples of mathematical formalism; he emphasizes this fact very strongly.
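For a constant g(t) = g — an assumption made here purely for illustration — the differential equation integrates to the familiar logistic curve x(t) = a / (1 + c·e^(−agt)). A short numerical check, not part of the original text, confirms that this closed form does satisfy the equation:

```python
import math

def logistic(t, a, c, g):
    """Closed-form integral of dx/dt = x*(a - x)*g for constant g."""
    return a / (1.0 + c * math.exp(-a * g * t))

# Euler integration of the differential equation itself, starting from
# x(0) = a/(1 + c); the constants a, c, g are arbitrary illustrations.
a, c, g = 1.0, 9.0, 1.0
x, dt = logistic(0.0, a, c, g), 1e-4
for _ in range(int(10.0 / dt)):
    x += x * (a - x) * g * dt

print(x, logistic(10.0, a, c, g))  # the two values nearly coincide
```

The curve rises slowly at first, then steeply, and finally flattens towards the level of saturation a — the S-shape that tempts its mechanical application to time series.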
The equation of a logistic curve is the integral of the differential equation mentioned above, presumably expressing the "law of growth". One of the examples given in Lange's book comes from "Theory of Econometrics" by H. T. Davis, who also wrote "The Analysis of Economic Time Series". The latter book prompted M. G. Kendall to express the following short but pointed opinion: "Davis's book on 'The Analysis of Economic Time Series' (1941) contains a great deal of interesting material but should not be read uncritically" (see [33], p. 437).
GRAPH 2.
(1790-1950; see [64])
It might be worth while to mention here a very pertinent comment on the subject of formalism by A. Hald: "The logistic curve has been frequently used to illustrate the growth of 'populations' (cells, human populations, telephone subscribers, etc.), the development of business transactions between different countries, education of persons in various manual and mental accomplishments, etc. Regarding most of these applications it may be said that the theoretical analysis of the growth process in hand is so uncertain that it is doubtful whether or not the process is governed by a differential equation such as (20.7.2)¹, wherefore the application of the logistic curve is mainly based on its descriptive properties. The results of the extrapolations regarding population figures, production, etc., which have been carried out on this basis² should therefore be regarded with great scepticism" [28].
J. M. Keynes expresses his opinion on the matter very frankly: "Too large a proportion of recent 'mathematical' economics are mere concoctions, as imprecise as the initial assumptions they rest on, which allow the author to lose sight of the complexities and interdependencies of the real world in a maze of pretentious and unhelpful symbols" ([34], p. 298).
It would appear from the above quotations that if there are economists who use mathematics improperly, there also are those who see and criticize their mistakes.

The more correct of the two interpretations of trend described above is the first. It follows from the definition given on p. 88 that from a formal point of view the trend line can be considered as a regression II line. All those who prefer the first interpretation of the trend line agree on this. However, there is one doubt. The regression II line is a function g(x) for which

E[Y − g(x)]² = minimum    (1)

(see p. 36). With respect to the time series x₁, x₂, …, x_t, …, formula (1) will assume the following form:

E[X_t − g(t)]² = minimum,    (2)
¹ I.e. by the equation dx/dt = x(a − x)g(t), (0 < x < a).
² A. Hald: Statistical Theory with Engineering Applications, New York, 1952, p. 661.
where g(t) is the symbol of the trend equation. Formula (2) states that the trend is a function of time g(t) for which the mathematical expectation of squared deviations of the values of the time series from the values of this function at moments t = 1, 2, … is a minimum.

The doubt mentioned above is due to the fact that it is not really known exactly how to interpret the notion of mathematical expectation with respect to a random variable dependent upon time. With reference to a random variable which is independent of time, mathematical expectation is a distribution parameter of this variable. This parameter is a number. The situation is different when the random variable depends on time. In this case its distribution depends on time, and consequently the parameters of this distribution, which also depend on time, are functions of time. Naturally this case is not covered by the definition of the mathematical expectation of a random variable independent of time. This means that if the trend is understood as a regression II line of a random variable correlated to time, then there are certain aspects that require explanation. It should be stressed that the problem of the definition of a trend is, so far, an open problem in the literature on statistics. For instance, Hald [28] mentions a textbook by Kendall [33] in the bibliography of studies related to time series. Indeed, this textbook can be considered as the most important one as far as time series are concerned, because of the amount of space and attention given to this subject. However, even Kendall does not give a definition of a trend that is free from the reservations mentioned above (see [33], p. 371). This fact has prompted the author to attempt to formulate such a definition. It is given in 6.1.
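In the spirit of formula (2), a linear trend g(t) = αt + β can be fitted to an observed series by minimising the average squared deviation of the series from g. The sketch below uses made-up values (the actual yield figures are not reproduced in this copy):

```python
def linear_trend(values):
    """Fit g(t) = alpha*t + beta, t = 1, 2, ..., n, minimising the
    mean squared deviation of the series from g (cf. formula (2))."""
    n = len(values)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    x_bar = sum(values) / n
    alpha = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, values)) \
        / sum((ti - t_bar) ** 2 for ti in t)
    beta = x_bar - alpha * t_bar
    return alpha, beta

# Illustrative series with an upward tendency plus irregular breaks.
alpha, beta = linear_trend([7.6, 6.2, 8.1, 7.9, 9.4, 8.8, 10.1])
print(alpha, beta)  # slope and intercept of the smoothed-out trend line
```

The fitted line describes the general tendency of the series; on the first interpretation discussed above it may be used for description only, not for extrapolation.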
2.2.6. Techno-economic curves

Correlation analysis finds many applications in different branches of technology, where it is often necessary to discover relationships between various random variables. Among such relationships are: the relationship between the amount of gas or liquid sucked in by a suction pump and the degree of vacuum created in the pump, expressed in percentages; the relationship between the durability of alloys used for heat resistors and the temperature in which they operate; the relationship between the hardness of steel used for tool manufacturing and temperature or carbon content, etc. There is no need to give further examples of the applications of correlation analysis to technological research.
They are numerous and increase almost every day. It should be emphasized that studies on relationships in the field of technique very often have an important economic aspect in addition to a technical aspect. For instance, if some changes have been introduced in a technological process (i.e. in the process of manufacturing scarce goods) in consequence of technological research, these changes may, and usually do, have economic as well as technical effects. From an economic point of view these effects may be positive, neutral, or negative. The criterion for this type of classification is found in production costs.

Let us discuss the already-mentioned relationship between the durability of heat resistant alloys and the temperature in which they are used. This type of research is conducted in connection with a search for the most durable alloys. In electrical engineering various alloys are used: constantan, manganin, nickeline, nichrome, chromel, alumel, kanthal and others. Each of these alloys has a different durability, a different resistance to high temperatures, to changes in the frequency of heating, cooling, etc. Depending on the technical requirements, one or another type of alloy is used. In making a choice economic consequences have to be taken into consideration. Some alloys can be produced at home and others have to be imported; for some of them the raw materials
are available at home, for some they are not. Costs of production are different for each alloy. The more durable alloys are usually more expensive. It follows that studies of the relationship between the durability of the alloy and temperature are of interest not only to the technician but also to the economist.

All examples of statistical relationships which have both technological and economic aspects we shall call techno-economic relationships. The curves which are a graphic representation of these relationships we shall call techno-economic curves.

An interesting example of techno-economic curves is presented on Graph 1, taken from a study by Vernon L. Smith (see [53])¹.
GRAPH 1.
(weight of loaded cars, thousand pounds)
The regression lines shown on this graph describe the relationship between the consumption of fuel and the weight of a car together with its load. Number R is a measure of the slope of the road (the slope coefficient). It is a fraction expressing the increase in height in feet per 100 feet of the length of the road. It can be seen from the graph how the consumption of fuel has decreased in consequence of technical improvements. The greater the value of R, the smaller is the drop in the consumption of fuel. The relationship between fuel consumption and the weight of the car with a load is a clear technical problem, but its economic consequences are so far reaching that no comments are required. The regression curves shown on the graph can be used as a basis for setting fuel consumption norms. Much has been written in the literature about the uselessness of "statistical" norms and the necessity of replacing them by "technical" norms. It can be seen from the above example that without the help of statistics (in this case correlation analysis) it would be difficult to set a technical norm.

¹ See also 3.2.2., Example 2.
3. ESTIMATING LINEAR REGRESSION PARAMETERS

3.1. General remarks about methods of estimating

There are several methods of estimating the parameters of a general population on the basis of statistical data supplied by a random sample of the population. The most important are: the maximum likelihood method, the minimum variance method, the minimum χ² method, and the method of least squares. So far, only the method of least squares has been used in regression theory because of its many advantages. This method is easier to comprehend than the others since it requires only knowledge of how to find the maximum or minimum of a function by differential calculus, and it is not necessary to know mathematical statistics. The method of least squares is very general. It may provide solutions in cases when other methods have failed. For both these reasons the
method of least squares is known by astronomers and surveyors, physicists and biologists, technicians and economists. Since a basic knowledge of calculus is necessary to learn the method of least squares, it is used almost exclusively by scientists. Practical workers rarely use it.

The method of least squares has a very valuable formal quality, important in cases of linear regression. There is a theorem known as the Markoff Theorem [8] which states that the estimates obtained by this method are consistent, unbiased and most efficient. In this theorem it is not assumed that the distribution of the random variables is normal, or even that these variables are independent.
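The unbiasedness claimed by the Markoff Theorem can be illustrated by a small simulation — a sketch added here, not part of the original text. The errors are deliberately drawn from a uniform rather than a normal distribution, since the theorem does not assume normality; the true parameters α = 2, β = 5 are arbitrary.

```python
import random

def ols(xs, ys):
    """Least-squares estimates a, b of the line y = a*x + b
    (formulae (1) and (2) of 3.2.1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    return a, y_bar - a * x_bar

random.seed(1)
alpha, beta = 2.0, 5.0
xs = [float(i) for i in range(1, 21)]
a_values = []
for _ in range(2000):
    ys = [alpha * x + beta + random.uniform(-1, 1) for x in xs]
    a_values.append(ols(xs, ys)[0])
print(sum(a_values) / len(a_values))  # averages out close to alpha = 2
```

Averaged over many samples, the slope estimates cluster around the true α, with no systematic bias despite the non-normal errors.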
In spite of these unquestionable advantages of the method of least squares, there are two reasons why we here propose a new method of estimating regression parameters besides the method of least squares. Until it is finally given a name we shall call it the two-point method. However, it should be understood that this is only a temporary term. In cases of linear regression the two-point method also provides consistent and unbiased estimates of regression parameters, but:

1) the computations of the values of the estimates are much easier than those required in the method of least squares, and
2) to learn the two-point method it is not necessary to know the calculations for maxima and minima required in the method of least squares.

The efficiency of the estimates obtained by the two-point method is a little lower than the efficiency of the estimates obtained by the method of least squares, but when a sample is large this consideration does not carry great weight. The important advantages that are gained by the introduction of the two-point method in the theory of estimating linear regression parameters consist, first of all, in the fact that this method is conducive to the popularization of regression and correlation theory among practical workers. This is of particular importance in economic research. In Chapter 2 we have discussed the most important applications of regression and correlation to economic research. These applications are diverse and important to the economy. However, they will be of real service in expediting the control of economic processes only when correlation analysis becomes a handy tool of economic analysis, known and willingly used by those engaged in economic activities. The main obstacle to popularizing regression and correlation methods among economists is the undoubtedly too high requirement of mathematical knowledge for the determination of regression parameters by the classical method¹. The two-point method is, to a large extent, free of these difficulties. Let us hope that many people, after they learn the two-point method and see the advantages in the application of statistical methods to studies of the relationships between random variables, will make an effort to improve their knowledge and gradually to master the classical method.
3.2. Estimating linear regression parameters by the method of least squares

3.2.1. The derivation of formulae. Examples

All our further considerations concerning the two-dimensional variable (X, Y) will be based on the following assumptions [28]²:

1) the conditional distributions of variable Y, corresponding to any values of variable X, are normal distributions;
2) the regression line of Y on X in a general population is a straight line with the equation ŷ = αx + β, where α and β are constant parameters;
3) the conditional variance V(Y|X = x) is a constant;
4) the points (xᵢ, yᵢ) (i = 1, 2, …, n) drawn for the sample, where n is the size of the sample, are stochastically independent.

Let us denote by Ω a two-dimensional general population. The random variable (X, Y) is defined by the elements of this population. From the population Ω we draw a random sample ω comprising n items.

¹ I.e. the method of least squares.
² A. Hald: Statistical Theory with Engineering Applications, New York, 1952, p. 528.
Problem. On the basis of the data from sample ω, estimate the parameters α and β of the regression line ŷ = αx + β in the general population Ω.

This type of problem is usually solved by the method of least squares. Let us denote by a and b the estimates, obtained from the sample ω, of the unknown parameters α and β in Ω. In order to determine the values of these estimates we have to minimize the expression

Σ (yᵢ − axᵢ − b)²,

where the summation extends over i = 1, 2, …, n. After simple transformations (see 1.2.7.) we get a set of normal equations:

Σ (yᵢ − axᵢ − b) = 0,

Σ (yᵢ − axᵢ − b) xᵢ = 0.

Solving them with respect to a and b we obtain

b = ȳ − ax̄,    (1)

a = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)².    (2)

In formula (1) x̄ and ȳ are the arithmetic means of the sample, i.e.

x̄ = (1/n) Σ xᵢ,    ȳ = (1/n) Σ yᵢ.    (3)

The equation of the regression line of the sample assumes the form:

ŷ = ax + b.    (4)
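Formulae (1) and (2) can be checked numerically on arbitrary illustrative pairs; the residuals of the fitted line should satisfy both normal equations:

```python
def fit_line(xs, ys):
    """Least-squares estimates: b = y_bar - a*x_bar (formula (1)),
    a = S(x - x_bar)(y - y_bar) / S(x - x_bar)^2 (formula (2))."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    return a, y_bar - a * x_bar

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # invented data
a, b = fit_line(xs, ys)
# Both normal equations should hold up to rounding error:
r1 = sum(y - a * x - b for x, y in zip(xs, ys))
r2 = sum((y - a * x - b) * x for x, y in zip(xs, ys))
print(a, b, r1, r2)
```

That both residual sums vanish is exactly the content of the normal equations: the fitted line leaves errors with zero mean and no linear relation to x.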
Let us denote further

s_x = √[Σ(xᵢ − x̄)² / (n − 1)],    s_y = √[Σ(yᵢ − ȳ)² / (n − 1)].    (5)

These are the standard deviations of variables X and Y of the sample. Further,

s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1).    (6)

This is the covariance of the sample. In this case the regression parameters of Y on X can be found in the following way:

a₂₁ = s_xy / s_x²,    (7)

b₂₀ = ȳ − a₂₁x̄.    (8)

It is not difficult to notice the similarity of formulae (7) and (8) to formulae (6) and (8) from 1.2.7. Similarly, for the regression of X on Y we have

a₁₂ = s_xy / s_y²,    (9)

b₁₀ = x̄ − a₁₂ȳ.    (10)

Parameters a₂₁ and a₁₂ we shall call the regression coefficients of the sample.

In the case of the regression of Y on X the standard error of the estimate in the sample, by analogy to formula (16) in 1.2.7., is defined as follows:

s_y·x = √[Σ(yᵢ − a₂₁xᵢ − b₂₀)² / (n − 2)].    (11)

Similarly, for the regression of X on Y:

s_x·y = √[Σ(xᵢ − a₁₂yᵢ − b₁₀)² / (n − 2)].    (12)

The sample correlation coefficient r is an estimate of the correlation coefficient of the general population Ω. We shall define coefficient r, by analogy to formula (6) of 1.2.8., by the formula

r² = a₂₁ · a₁₂.    (13)
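The quantities (5)-(13) can be computed for any sample; the pairs below are invented for illustration. Note the identity behind formula (13) — the product of the two regression coefficients gives r², which the code verifies:

```python
import math

def sample_statistics(xs, ys):
    """Covariance (6), regression coefficients (7)-(10) and the
    correlation coefficient r of formula (13) for a sample."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    syy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    a21, a12 = sxy / sxx, sxy / syy
    b20, b10 = y_bar - a21 * x_bar, x_bar - a12 * y_bar
    r = math.copysign(math.sqrt(a21 * a12), sxy)
    return a21, b20, a12, b10, r

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]    # invented, strongly correlated data
a21, b20, a12, b10, r = sample_statistics(xs, ys)
print(r)
```

Since a₂₁ and a₁₂ share the numerator s_xy, their product s_xy²/(s_x²s_y²) is the squared correlation coefficient, and the sign of r is the common sign of the two coefficients.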
We shall illustrate by an example the method of determining the numerical values of the estimates of regression line parameters obtained by the method of least squares.

Example 1. Analyse the relationship between the consumption of compressed air Y and the amount of coal X extracted from mine Z.

The relationship between the amount of input used and the volume of production, expressed in the form of increased input consumption in consequence of increased production, is a result of a causal relation existing between input consumption and production; the only economic reason for increased input consumption is increased production. The amount of consumption is influenced not only by the volume of production, but also by secondary causes such as differences in attitude toward work among employees during working hours, changes in the condition of equipment, differences in the quality of raw materials, damage to machines and many others. As a result of these causes, the relationship between the amount of input consumption and the volume of production appears to be of a stochastic nature.

The consumption of compressed air is related to the use of machines and technical equipment needed primarily for the extraction of coal. The most typical are: hammer drills, drills and punching machines. The characteristic feature of such machines is that at the moment they stop operating the consumption of air used as operating power also stops. In this respect they are different from other machines driven by other sources of energy, like steam or oil; these use up substantial amounts of energy during their unproductive work.

The monthly data on the volume of coal production and the amount of compressed air used up cover a period of three years. Both production and consumption are expressed in physical and not monetary units. This eliminates the disturbances which might appear in the relationship as a result of price changes. The data on which the analysis of the relationship between the consumption of compressed air and the production of coal is based are shown in Table 1.

TABLE 1
CONSUMPTION OF COMPRESSED AIR AND THE PRODUCTION OF COAL IN A MINE

x — the extraction of coal in thousands of tons per month,
y — the consumption of air in millions of cubic metres per month.
The corresponding scatter diagram is shown on Graph 1.

GRAPH 1.
(coal extraction, thousand tons/month)

A graph should always be made before the equations of the regression lines are computed, because it supplies initial
information about the nature of the relationship between the variables studied. This information enables us: 1) to form an opinion whether the relationship between the variables is strong or weak, 2) to choose a mathematical function to serve as an approximation to the relationship between the variables. In Graph 1 the relationship between the consumption of compressed air and the extraction of coal is presented. Both regression lines are shown; their equations are computed below. The distribution of points in the graph indicates a linear trend. We can see from the graph that the correlation in this case is positive, since an increase in one variable is accompanied by an increase in the other. The correlation between the variables studied can be considered fairly strong. This statement is based on observations indicating that the direction and magnitude of changes in consumption and production generally correspond to one another. In other words, when production increases, consumption also increases, and the greater the increase in production, the greater the increase in consumption; and vice versa, in most cases a drop in production causes a drop in consumption, and the greater the former, the greater the latter.
106
We
have completed reading the graph. Let us now determine the parameters of the regression line. To calculate a
,
2
y); Z(xx) 2 involved in (10)). Computations (see (7), (8), (9), Z(y-y) are these usually placed in a table determining quantities
aw
^20
an d A10 we have to find
Z(xx) (y
;
l
(see
Table
2).
In the last row of the table, marked
Z(x-x)* =
x) (y
read:
65-14,
- y) =
280-24
Zx ==
3,672, hence
=
630-0, hence
Ey
we
1,774,
2^-30* = Z(x -
27,
Having the above data we
-
1-9
= y= x
=
278-34,
102, 17-5.
calculate the parameters of both
regression lines:
Z Z(x-x)*
1,774
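The arithmetic above can be re-checked directly from the printed sums; slight rounding differences from the text's 0·156 are to be expected:

```python
# Sums as printed in the text (Table 2):
S_xx = 1774.0   # S(x - x_bar)^2, x in thous. tons of coal/month
S_yy = 65.14    # S(y - y_bar)^2, y in mln m3 of air/month
S_xy = 278.34   # S(x - x_bar)(y - y_bar)

a21 = S_xy / S_xx   # mln m3 of air per thous. tons of coal
a12 = S_xy / S_yy   # thous. tons of coal per mln m3 of air
print(a21, a12)     # roughly 0.157 and 4.27
```

The (n − 1) divisors of formulae (5)-(7) cancel in the ratios, so the regression coefficients can be computed from the raw sums of squares and products alone.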
In order to determine the dimensions of parameter a₂₁ we insert into the formula

a₂₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

the dimensions: x — thousands of tons of coal per month, y — millions of cubic metres of air per month. We obtain

(mln m³ of air/month × thous. tons of coal/month) / (thous. tons of coal/month × thous. tons of coal/month) = mln m³ of air / thous. tons of coal,

i.e.:

a₂₁ = 0·156 mln m³ of air / thous. tons of coal = 156 m³ of air / tons of coal.
Estimating linear regression parameters
107
TABLE 2
METHOD OF LEAST SQUARES APPLIED TO TABLE
x
-
102,
17-5.
1
We calculate b₂₀:

b₂₀ = ȳ − a₂₁x̄
= 17,500,000 m³ of air/month − 156 (m³ of air / ton of coal) × 102,000 tons of coal/month
= 17,500,000 m³ of air/month − 15,912,000 m³ of air/month
= 1,588,000 m³ of air/month.
Thus the equation determining the average consumption of air in relation to the extraction of coal has the following form:

ŷ = 1,588,000 m³ of air/month + 156 (m³ of air / ton of coal) · x.

The equation may be called a characteristic of the consumption of compressed air. For any data concerning production, providing they are taken from the interval of validity of the function, this equation provides the estimate of the average consumption of air for a given volume of production. The interval of validity of the function lies between the lowest and the highest value of the random variable appearing in the regression line equation as an argument. In our example this interval is: (84,000 tons, 118,000 tons).

The regression line is a geometric representation of the input consumption characteristic; after Falewicz, we shall call it the line of normal input consumption.
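The characteristic can be applied directly, provided the argument stays within the interval of validity; a sketch using the equation and interval derived above:

```python
def average_air_consumption(coal_tons_per_month):
    """Normal input consumption line of Example 1:
    y = 1,588,000 m3/month + 156 m3 per ton * x,
    valid only for x between 84,000 and 118,000 tons/month."""
    if not 84_000 <= coal_tons_per_month <= 118_000:
        raise ValueError("outside the interval of validity of the function")
    return 1_588_000 + 156 * coal_tons_per_month

print(average_air_consumption(100_000))  # 17,188,000 m3 of air per month
```

Comparing a month's recorded consumption with this "normal" value is precisely the efficiency control discussed below: consumption under the line suggests improved efficiency, consumption above it the opposite.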
To satisfy their needs, people have to produce various goods. "The labour process, resolved, as we have seen, into its simple elementary factors, is purposive activity carried on for the production of use-values, for the fitting of natural substances to human wants; it is the general condition requisite for effecting an exchange of matter between man and nature; the condition perennially imposed by nature upon human life..." ([38], p. 177).
Thus, labour is a tribute that man pays to nature for her products. He tries for natural reasons to minimize this tribute; he tries to achieve his economic aims with the minimum of effort. This principle lies at the basis of all economic activities.

We shall introduce the following definition of the efficiency of an enterprise. By the efficiency we shall understand the totality of the activities of the enterprise aimed at the attainment of its economic objectives with the least outlay both in the form of "living" and "stored up" labour. The control of the efficiency of a socialist enterprise is one of the most important tasks of socialist economics. The normal input consumption line is an effective tool of such control. This line determines the average, the most probable, and thus the normal amount of input consumption corresponding to different levels of production. If the consumption is lower than "normal" we can say that the enterprise has successfully raised its efficiency; if it is higher, it means that the enterprise has failed in its efforts to increase its efficiency¹.
Let us compute the values of the regression parameters of X on Y. It follows from formulae (9) and (10) and from Table 2 that

a₁₂ = 278·34 / 65·14 = 4·27;

the dimension of a₁₂:

(mln m³/month × thous. tons/month) / (mln m³/month × mln m³/month) = thous. tons / mln m³ = 0·001 tons / m³.

¹ An extensive discussion of applications of linear regression to the economic control of an enterprise can be found in studies [23], [37].
Hence

a₁₂ = 4·27 × 0·001 tons of coal / m³ of air = 0·00427 tons of coal / m³ of air.

In this case

b₁₀ = x̄ − a₁₂ȳ
= 102,000 tons of coal/month − 0·00427 (tons of coal / m³ of air) × 17,500,000 m³ of air/month
= 27,275 tons of coal/month.
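A quick numerical check of b₁₀ from the printed values (with the rounding used in the text):

```python
# Values as printed above:
a12 = 0.00427            # tons of coal per m3 of air
y_bar = 17_500_000       # m3 of air per month
x_bar = 102_000          # tons of coal per month

b10 = x_bar - a12 * y_bar
print(b10)               # approximately 27,275 tons of coal per month
```

The subtraction 102,000 − 74,725 reproduces the intercept of the X-on-Y regression line used in the next equation.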
the regression line which gives the average
production of coal in relation to the amount of compressed air used, has the following form:
x
- 27,275
tons of coal/month
---
+ 0-00427 m
3
.
y.
of air
The regression lines are shown on Graph 1. From the formal statistical point of view both regression lines are of equal importance but in an economic interpretation this is not so. practical use of the regression line determining the most
The
probable value of production corresponding to a given consumption of one production factor, is rather limited. On the is great practical importance in an analysis of the relationship between the volume of production and several production factors. The method of multiple correla-
other hand, there
tion
used for
is
The
this type
correlation coefficient
-
correlation between the
follows
of analysis.
The correlation coefficient is a measure of the degree of correlation between the random variables studied. Since it follows from our previous calculations that

a₂₁ = 156 m³ of air / tons of coal  and  a₁₂ = 0.00427 tons of coal / m³ of air,

then on the basis of formula (13)

r² = 0.00427 · 156 = 0.66612.

Hence

r = 0.815.
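The identity r² = a₂₁ · a₁₂ used in this computation is easy to check mechanically. A minimal sketch (hypothetical figures, not the book's coal data; the helper names are ours):

```python
# Estimate both least-squares regression slopes and combine them into the
# correlation coefficient via r^2 = a21 * a12. Data are hypothetical.

def mean(v):
    return sum(v) / len(v)

def slope(xs, ys):
    """Least-squares slope of the regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. air consumption
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # e.g. coal production

a21 = slope(x, y)   # regression of Y on X
a12 = slope(y, x)   # regression of X on Y
r2 = a21 * a12      # square of the correlation coefficient
r = r2 ** 0.5
```

For least-squares slopes the product a₂₁ · a₁₂ always falls between 0 and 1, which is exactly what makes it usable as r².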
The correlation in this case is fairly strong.

Example 2. Analyse the relationship between the average level of inventories and the cost.

A socialist enterprise is an independent economic entity (within the framework of accounting regulations) which tries to fulfil its production targets in the most rational and economical way. This tendency manifests itself in efforts to fulfil and surpass production targets, to observe the production schedules, to improve the quality of the product, to economize and to lower the cost of production. To fulfil these tasks the enterprise is equipped with an appropriate amount of capital goods and liquid assets. It is desirable that the amount of both capital goods and liquid assets necessary to carry out production targets be as low as possible. It is difficult to realize this situation in practice.
The amount of capital goods needed is determined by an analysis of the effectiveness of the investments. In the determination of the liquid assets requirements, however, a study of the liquid assets turnover is involved. The purpose of an analysis of the relationship between the average stock of liquid assets and costs is to determine the parameters of the regression equation which enable us to assign an appropriate amount of average stock to particular costs.
The scatter diagram on Graph 2 describes the relationship between the average level of stocks and costs in a clothing factory. The regression line shown on the graph expresses this relationship statistically. On the Y-axis average stocks are measured in millions of zlotys, on the X-axis costs in millions of zlotys per quarter. The scatter diagram is based on the statistical data shown below, comprising a period of three years. The data come from quarterly accounting reports. The statistical material and the computations involved in determining the regression parameters are shown in Table 2.
TABLE 2
AVERAGE LEVEL OF STOCKS AND COSTS IN A CLOTHING FACTORY

x̄ = 13.7,  ȳ = 14.4.

Let us calculate the parameters of the regression line of Y on X:

a₂₁ = 42.89 / 88.42 = 0.476,

b₂₀ = 14.4 − 0.476 · 13.7 = 7.9.

The regression equation that we are trying to find has the following form:

y = 7.9 million zlotys + 0.476 quarters · x.

In this example we are not trying to determine the regression line of X on Y or the correlation coefficient.
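A minimal sketch of the fitted line's practical use (the figures are the book's; the function name is ours): the intercept follows from the sample means, and the line then assigns an average stock level to any cost figure.

```python
# Intercept from b20 = y_bar - a21 * x_bar, then prediction from the line.

x_bar = 13.7   # mean quarterly costs, million zlotys/quarter
y_bar = 14.4   # mean stock level, million zlotys
a21 = 0.476    # regression slope (quarters)

b20 = y_bar - a21 * x_bar            # intercept, million zlotys

def average_stock(cost):
    """Average stock level implied by the regression line."""
    return b20 + a21 * cost

print(round(b20, 1))   # 7.9, as in the text
```

Checking that the line passes back through the point of means (x̄, ȳ) is a quick safeguard against slips in the arithmetic.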
The regression line is a good tool for appraising the average level of stocks; if the points corresponding to new reporting data appear above the regression line, this means (assuming that the points belong to the same population on which the regression line is based) that there was a set-back in the efforts to increase the turnover of liquid assets; and vice versa, if the points are below the line it indicates that the turnover of liquid assets has risen.
GRAPH 2. (X-axis: costs, million zlotys per quarter)
Correlation analysis can be applied to other problems related to the analysis of liquid assets. By studying the relationship between the volume of production in a given period of time and the average level of warehouse stocks we could determine whether such a relationship exists and what its degree is. This would provide valuable material for setting stock control norms and appraising the efficiency of the merchandise control department. Another yardstick for measuring its efficiency is provided by a study of the relationship between the flow of incoming and outgoing warehouse stocks measured in predetermined periods of time.
3.2.2. The technique of computing regression parameters¹ in a small and a large sample. Contingency table

The computation process connected with the determination of regression parameters in a small sample can be simplified by replacing the differences (xᵢ − x̄) and (yᵢ − ȳ) by (xᵢ − u) and (yᵢ − w), where u and w are certain constants selected so as to facilitate computation.
Let us denote

x̄ = u + A_u,  ȳ = w + A_w.

Since

Σᵢ (xᵢ − u)² = Σᵢ [(xᵢ − x̄) + (x̄ − u)]² = Σᵢ (xᵢ − x̄)² + nA_u²,

then

Σᵢ (xᵢ − x̄)² = Σᵢ (xᵢ − u)² − nA_u².  (1)

In consequence of similar transformations it is easy to show that

Σᵢ (yᵢ − ȳ)² = Σᵢ (yᵢ − w)² − nA_w²  (2)

and

Σᵢ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ (xᵢ − u)(yᵢ − w) − nA_u A_w.  (3)

Therefore

a₂₁ = [Σᵢ (xᵢ − u)(yᵢ − w) − nA_u A_w] / [Σᵢ (xᵢ − u)² − nA_u²]  (4)

and

a₁₂ = [Σᵢ (xᵢ − u)(yᵢ − w) − nA_u A_w] / [Σᵢ (yᵢ − w)² − nA_w²].  (5)

¹ Samples comprising not more than 30 items we shall call small. Other samples we shall call large.
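The identities above are easy to verify numerically. The sketch below, with made-up figures, checks that the slope computed from deviations about the means equals the slope computed from the shifted sums once the correction terms are subtracted, as in formula (4):

```python
# Shifting both variables by constants u and w leaves the least-squares
# slope unchanged after subtracting n*Au*Aw and n*Au**2. Data are made up.

x = [12.0, 14.5, 13.2, 15.1, 16.3, 14.8]
y = [7.1, 8.0, 7.6, 8.4, 9.0, 8.2]
u, w = 14.0, 8.0               # working origins chosen for convenience
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Au, Aw = x_bar - u, y_bar - w

# direct computation with deviations from the means
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)

# shifted-origin computation with the correction terms
sxy_u = sum((a - u) * (b - w) for a, b in zip(x, y)) - n * Au * Aw
sxx_u = sum((a - u) ** 2 for a in x) - n * Au ** 2

a21_direct = sxy / sxx
a21_shifted = sxy_u / sxx_u
```

The point of the shift in hand computation is that (xᵢ − u) and (yᵢ − w) can be made small integers, which is what made the method attractive before mechanical calculation.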
When the sample is large the contingency table is often used for computing regression parameters (Table 1). The top row of the table contains the centres of class intervals of the distribution of y's, and in the left column the centres of class intervals of the distribution of x's are shown. The bottom row contains the frequencies n.₁, n.₂, ..., n.ₗ of the distribution of y's, and the extreme right-hand column the frequencies n₁., n₂., ..., nₖ. of the distribution of x's.

TABLE 1
CONTINGENCY TABLE
[cell frequencies nᵢⱼ; marginal frequencies nᵢ. and n.ⱼ; total n]

In the contingency table the frequency distribution in the sample is shown. The number n in the extreme lower right-hand panel denotes the size of the sample. Let us write down three important relationships following directly from the contingency table:

nᵢ. = Σⱼ nᵢⱼ,  (6)

n.ⱼ = Σᵢ nᵢⱼ,  (7)

n = Σᵢ nᵢ. = Σⱼ n.ⱼ.  (8)
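The marginal relationships can be read directly off any frequency matrix; a sketch with a hypothetical 3 × 4 table:

```python
# Row totals n_i., column totals n_.j and the grand total n of a
# contingency table of cell frequencies n_ij. The table is hypothetical.

table = [
    [2, 5, 1, 0],
    [1, 7, 4, 2],
    [0, 3, 6, 3],
]

row_totals = [sum(row) for row in table]        # n_i.
col_totals = [sum(col) for col in zip(*table)]  # n_.j
n = sum(row_totals)                             # sample size

assert n == sum(col_totals)                     # both margins sum to n
print(row_totals, col_totals, n)
```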
Assuming that the values of the random variable X and of the random variable Y which belong to particular classes of the frequency distribution are equal to the central values of these classes, we get

x̄ = (1/n) Σᵢ nᵢ. xᵢ,  (9)

ȳ = (1/n) Σⱼ n.ⱼ yⱼ,  (10)

a₂₁ = [Σᵢ Σⱼ nᵢⱼ (xᵢ − x̄)(yⱼ − ȳ)] / [Σᵢ nᵢ. (xᵢ − x̄)²],  (11)

a₁₂ = [Σᵢ Σⱼ nᵢⱼ (xᵢ − x̄)(yⱼ − ȳ)] / [Σⱼ n.ⱼ (yⱼ − ȳ)²],  (12)

b₂₀ = ȳ − a₂₁ x̄,  (13)

b₁₀ = x̄ − a₁₂ ȳ.  (14)
Computations connected with the application of formulae (11) and (12) may be simplified when the ranges of all the intervals for each variable are the same. Instead of x and y we then introduce the new variables

u = (x − u′)/d,  (15)

w = (y − w′)/h,  (16)

where u′ and w′ are constants, d is the range of the class interval for the distribution of x's and h is the range of the class interval for the distribution of y's. We have:

x − x̄ = u′ + ud − u′ − ūd = d(u − ū),  (17)

and similarly

y − ȳ = h(w − w̄).  (18)
Introducing (17) and (18) in (11) and (12) we get

a₂₁ = (h/d) · [Σᵢ Σⱼ nᵢⱼ (uᵢ − ū)(wⱼ − w̄)] / [Σᵢ nᵢ. (uᵢ − ū)²]  (19)

and

a₁₂ = (d/h) · [Σᵢ Σⱼ nᵢⱼ (uᵢ − ū)(wⱼ − w̄)] / [Σⱼ n.ⱼ (wⱼ − w̄)²].  (20)

Now we have to take only one step to arrive at the formulae which are needed to compute parameters a₂₁ and a₁₂ on the basis of the data in the contingency table. After simple transformations we have

a₂₁ = (h/d) · [(1/n) Σᵢ Σⱼ nᵢⱼ uᵢ wⱼ − ū w̄] / [(1/n) Σᵢ nᵢ. uᵢ² − ū²]  (21)

and

a₁₂ = (d/h) · [(1/n) Σᵢ Σⱼ nᵢⱼ uᵢ wⱼ − ū w̄] / [(1/n) Σⱼ n.ⱼ wⱼ² − w̄²].  (22)
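A sketch of the whole grouped-data computation on a hypothetical table, coding the class centres and applying the product-moment forms (21) and (22):

```python
# Grouped-data slopes from a hypothetical 4x4 contingency table, using
# coded variables u = (x - u0)/d, w = (y - w0)/h.

x_centres = [160.0, 164.0, 168.0, 172.0]   # class centres of x, range d = 4
y_centres = [87.0, 89.0, 91.0, 93.0]       # class centres of y, range h = 2
table = [                                  # n_ij, rows indexed by x
    [3, 2, 1, 0],
    [2, 5, 3, 1],
    [1, 4, 6, 2],
    [0, 1, 3, 4],
]

d, h = 4.0, 2.0
u0, w0 = 168.0, 91.0                       # arbitrary working origins
u = [(c - u0) / d for c in x_centres]
w = [(c - w0) / h for c in y_centres]

n = sum(map(sum, table))
u_bar = sum(table[i][j] * u[i] for i in range(4) for j in range(4)) / n
w_bar = sum(table[i][j] * w[j] for i in range(4) for j in range(4)) / n
suw = sum(table[i][j] * u[i] * w[j] for i in range(4) for j in range(4)) / n
suu = sum(sum(table[i]) * u[i] ** 2 for i in range(4)) / n
sww = sum(sum(row[j] for row in table) * w[j] ** 2 for j in range(4)) / n

a21 = (h / d) * (suw - u_bar * w_bar) / (suu - u_bar ** 2)
a12 = (d / h) * (suw - u_bar * w_bar) / (sww - w_bar ** 2)
```

The result agrees exactly with the ungrouped least-squares slopes computed on the class midpoints repeated with their frequencies, which is the content of the derivation above.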
We shall illustrate the technique of calculating regression parameters by two examples. The first will show the calculation of regression parameters from a small sample and the second from a large sample.
Example 1. An analysis and comparison of welfare in different countries is important and interesting, although difficult. The difficulties arise because of differences in national mentalities, traditions, cultures and customs which cause such substantial differences in the average structure of wants in different countries that comparisons present a multitude of problems. The different price and wage ratios in the various countries studied and the necessity of rate of exchange computations magnify these difficulties. However, it is relatively easy to achieve at least some of the research objectives with the help of correlation analysis. The higher the level of welfare in the country, the greater the number of wants included in the basic wants group (see 2.2.1.). The characteristic distinguishing basic wants from others is the fact that the relationship between the degree of satisfaction of these wants and income is rather weak. Food constitutes the most important group of basic wants. In our further considerations we shall assume that the consumer in the country studied is able to find on the market every food product he demands. This assumption means that in all countries studied, the buying inducements to which the consumer is exposed with regard to food products are the same. Let us also assume that people in all countries, if they had sufficient financial means at their disposal, would satisfy their nutrition requirements in such a way as to maximize their satisfaction. On the basis of these assumptions we can say that if incomes were sufficiently high people would satisfy their food requirements in the best possible way, earmarking a sufficiently large portion of their income for this purpose. Once food requirements have been met in an optimum way, the portion of income earmarked for food will not increase with a further increase in income. This means that, other things being equal, the correlation between the expenditures for food and the size of income becomes weaker as income increases.
We can surmise, therefore, that the correlation coefficient between the level of food expenditures and the size of the consumer's income is one of the welfare characteristics of a group of people. Of course, there are other yardsticks for measuring welfare.
TABLE 2
MONTHLY INCOME (x) AND EXPENDITURES (y) IN 20 FOUR-MEMBER LOWER SILESIAN FAMILIES

x̄ = 280.95,  u = 280,  x̄ − u = 0.95;
ȳ = 133.85,  w = 130,  ȳ − w = 3.85.
When the correlation coefficient is close to zero, food requirements are met in an optimum way; when it approaches unity, the satisfaction of food requirements is so poor that the possibility of starvation cannot be excluded. However, when the correlation coefficient is close to zero or unity it ceases to perform its function as a yardstick of welfare. If, for instance, in two countries A and B the correlation coefficient is close to zero, we cannot say that the level of welfare is the same in both, but we can say that it is so high that it allows the citizens of both countries to achieve an optimum satisfaction of their food requirements. To compare the level of welfare in the two countries we have to introduce another measure which takes into consideration their unsatisfied needs. Similarly, we cannot contend that the level of welfare is equally low if, in the two countries studied, the coefficient of correlation between food expenditures and income is close to unity. We can say that the standard of living in both countries is low. To decide in which it is lower and in which it is higher we have to obtain additional information.

The correlation coefficient should be used with care in measuring welfare. We should remember that we are measuring a complex phenomenon which depends upon many factors. If we heed this warning the correlation coefficient will be a useful tool for measuring the welfare of a nation. In accordance with formulae (4) and (5) we have
a₂₁ = (26,447 − 20 · 0.95 · 3.85) / (85,391 − 20 · (0.95)²) = 26,374 / 85,373,

a₁₂ = (26,447 − 20 · 0.95 · 3.85) / (17,053 − 20 · (3.85)²) = 26,374 / 16,757,

r² = (26,374 · 26,374) / (85,373 · 16,757),

r = 0.69.
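The same arithmetic in code (variable names ours; the sums are those printed above, corrected by the shift terms of formulae (4) and (5)):

```python
# Corrected sums for Example 1, with Au = x_bar - u = 0.95,
# Aw = y_bar - w = 3.85 and n = 20, as in the book.

n = 20
Au, Aw = 0.95, 3.85
Sxy = 26_447 - n * Au * Aw      # corrected cross-product sum
Sxx = 85_391 - n * Au ** 2      # corrected sum of squares of x
Syy = 17_053 - n * Aw ** 2      # corrected sum of squares of y

a21 = Sxy / Sxx
a12 = Sxy / Syy
r = (a21 * a12) ** 0.5
print(round(r, 2))   # prints 0.7; the book, rounding differently, reports 0.69
```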
Column x in Table 2 shows the average monthly size of income in tens of zlotys; column y of this table contains the statistical data on average monthly food expenditures, also in tens of zlotys. The statistics pertain to twenty four-member families living in the Lower Silesian area.

With reference to this example let us make the following observation: although the calculation of the correlation coefficient is expedited by the method of least squares, it is a time-consuming operation.

Example 2. One of the most difficult problems that confronts the clothing industry is that of deciding the range of sizes of clothes in order to ensure a good fit for a large number of people. Each size range is characterized by a set of several numbers. The problem consists in assigning these numbers to particular characteristics in such a way as to obtain an appropriate combination. Until 1955 the clothing industry used to solve this problem in a very simple way: a well-proportioned woman and man would be selected as typical representatives of the majority of Poles, and ready-to-wear clothes were made according to their measurements. This method resulted in the production of clothes that could be worn only by a small number of people. Warehouses were overstocked with a large number of unsaleable products. In 1955 anthropologists and mathematicians were called in to help solve the problem. The anthropologists have taken about 85 thousand anthropometric pictures of men, women and children. The results were analysed by mathematicians under the direction of Professor Hugo Steinhaus. The sample was large enough to provide reliable information about the measurements of the whole population. In order to select an appropriate set of characteristics for the model the correlation was calculated for pairs of such characteristics as: height, chest, waist, shoulder, neck, arm measurements. The characteristics selected had a low degree of correlation with one another and strong correlation with other characteristics.
Table 3 is a contingency table containing statistical data and computations connected with the determination of the coefficient of correlation between the chest measurements and the height of 500 men selected at random from all the men included in the anthropometric studies¹.
The above example is a good illustration of how technological and economic problems are interrelated. The preparation of models is a technological problem, but its consequences have an economic aspect. A poorly constructed model results in ill-fitting clothes which nobody wants to wear and consequently in the waste of thousands of metres of expensive material and in thousands of hours wasted by tailors. Without correlation analysis it would be difficult to find proper measurements for the models. This shows how useful and valuable correlation analysis can be for practical purposes if it is skilfully used.

The contingency table contains all the data necessary for the computation of the correlation coefficient.
Thus we have:

n = 500,  w̄ = 8/500 = 0.016,  ū = 87/500 = 0.174.

¹ These statistics were obtained through courtesy of Professor Adam Wanke.
TABLE 3
CONTINGENCY TABLE: CHEST MEASUREMENTS AND HEIGHT OF 500 MEN
[cell frequencies and working computations not reproduced]
h = 2, d = 4. Hence, on the basis of (21) and (22) we get:

a₂₁ = (2/4) · (1.268 − 0.003) / 2.464 = (2/4) · (1.265 / 2.464) = 0.258,

a₁₂ = (4/2) · (1.265 / 4.994) = 2.530 / 4.994 = 0.506.

Therefore

r² = 0.258 · 0.506 = 0.130548.

And finally

r = 0.361.
3.3. Estimating linear regression parameters by the two-point method

3.3.1. The derivation of formulae

In section 3.1. we gave a brief justification for introducing into regression analysis a new method of estimating regression parameters, which we called the two-point method. It is easy to master and convenient to use. We shall now describe this method.

Let us denote by Ω, as in 3.2.1., a two-dimensional general population. A pair of values (x, y) of random variable (X, Y) corresponds to each item of this population. We assume that the regression I lines of the general population are straight lines, i.e.

y = α₂₁ x + β₂₀  (1)

and

x = α₁₂ y + β₁₀,  (2)
where α₂₁, α₁₂, β₂₀ and β₁₀ are regression parameters of the general population. We take from this population a random sample ω comprising n items. We get n pairs of numbers (xᵢ, yᵢ) (i = 1, 2, ..., n) corresponding to the items drawn. These numbers can be interpreted as the coordinates of points located on a plane. Such a random point corresponds to each item of population Ω.

We compute:

x̄ = (1/n) Σᵢ xᵢ,  ȳ = (1/n) Σᵢ yᵢ.

We divide set ω into two subgroups in such a way that we include in the first subgroup the points with abscissae not greater than x̄, and in the second subgroup all the remaining points. If in the second subgroup there are k points, then in the first there will be n − k points. Let us note that in this division of set ω into two subgroups, quantity k is a random variable which may assume the values 1, 2, ..., n − 1.

Let us denote by x̄⁽¹⁾, ȳ⁽¹⁾ the arithmetic means of the abscissae and ordinates of the points of the first subgroup, and by x̄⁽²⁾, ȳ⁽²⁾ the corresponding means for the second subgroup, and let us compute these four averages.
The following theorem can be proved:

THEOREM 1.

(ȳ⁽²⁾ − ȳ)/(x̄⁽²⁾ − x̄) − (ȳ − ȳ⁽¹⁾)/(x̄ − x̄⁽¹⁾) = 0.  (4)

The proof of this theorem is given on p. 213 of the Appendix at the end of the book.

It follows from Theorem 1 that the three points (x̄⁽¹⁾, ȳ⁽¹⁾), (x̄⁽²⁾, ȳ⁽²⁾), (x̄, ȳ) lie on one straight line. As the estimate of parameter α₂₁ we are proposing to accept the slope of this line; it can be expressed by any one of the following three formulae:

a₂₁ = (ȳ⁽²⁾ − ȳ)/(x̄⁽²⁾ − x̄),  (5)

a₂₁ = (ȳ − ȳ⁽¹⁾)/(x̄ − x̄⁽¹⁾),  (6)

a₂₁ = (ȳ⁽²⁾ − ȳ⁽¹⁾)/(x̄⁽²⁾ − x̄⁽¹⁾).  (7)

An estimate of parameter β₂₀ can also be expressed in one of the following three ways:

b₂₀ = ȳ⁽¹⁾ − a₂₁ x̄⁽¹⁾,  (8)

b₂₀ = ȳ⁽²⁾ − a₂₁ x̄⁽²⁾,  (9)

b₂₀ = ȳ − a₂₁ x̄.  (10)
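The whole procedure reduces to drawing a line through two of the three collinear points. A sketch (hypothetical data; names ours) that splits the sample at x̄ and joins the second-subgroup centroid to the overall centroid:

```python
# Two-point estimate of the regression of Y on X: average the points whose
# abscissae exceed x_bar, take the slope of the line joining that centroid
# to the overall centroid, and get the intercept from the means.

def two_point_y_on_x(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    second = [(x, y) for x, y in zip(xs, ys) if x > x_bar]
    x2 = sum(p[0] for p in second) / len(second)
    y2 = sum(p[1] for p in second) / len(second)
    a21 = (y2 - y_bar) / (x2 - x_bar)   # slope estimate
    b20 = y_bar - a21 * x_bar           # intercept estimate
    return a21, b20

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
a21, b20 = two_point_y_on_x(x, y)       # roughly 2 and 0 for these data
```

No squaring or cross-multiplication is needed, which is the practical appeal of the method.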
In order to obtain analogous formulae for estimates of parameters α₁₂ and β₁₀ we have to divide the set of points ω into two subgroups in such a way that in the first subgroup are points with ordinates not greater than ȳ, and in the second all the remaining points.

Let us further denote by x̄₍₁₎, ȳ₍₁₎ the arithmetic means of the abscissae and ordinates of the points of the first subgroup of this new division, and by x̄₍₂₎, ȳ₍₂₎ the corresponding means for the second subgroup. Letter m stands for the number of points which are in the second subgroup as a result of the division of the set ω into two subgroups. Of course, m is a random variable which may assume the values 1, 2, ..., n − 1. By interchanging letters in formulae (5)–(10) we get the formulae for the regression parameters of X on Y:

a₁₂ = (x̄₍₂₎ − x̄)/(ȳ₍₂₎ − ȳ),  (12)

a₁₂ = (x̄ − x̄₍₁₎)/(ȳ − ȳ₍₁₎),  (13)

a₁₂ = (x̄₍₂₎ − x̄₍₁₎)/(ȳ₍₂₎ − ȳ₍₁₎),  (14)

b₁₀ = x̄₍₁₎ − a₁₂ ȳ₍₁₎,  (15)

b₁₀ = x̄₍₂₎ − a₁₂ ȳ₍₂₎,  (16)

b₁₀ = x̄ − a₁₂ ȳ.  (17)
It follows from the definition of these regression parameters that all that is required to determine the position of the regression line by the two-point method is to know how to draw a straight line through two points. When we want to determine the position of the regression line of Y on X we draw a line through any two of the three points (x̄⁽¹⁾, ȳ⁽¹⁾), (x̄⁽²⁾, ȳ⁽²⁾), (x̄, ȳ). Similarly, when we want to determine the position of the regression line of X on Y we draw a line through any two of the three points (x̄₍₁₎, ȳ₍₁₎), (x̄₍₂₎, ȳ₍₂₎), (x̄, ȳ).

3.3.2. The technique of computing regression parameters in a small and a large sample. Examples
It is easy to use the formulae given in 3.3.1. We shall illustrate this by two examples. In the first example the statistical material covers a period of two years. The regression parameters have been calculated by two methods: by the method of least squares and by the two-point method. This will enable us to show the advantages of using the two-point method. Comparing the computation tables for the two methods we can see that the two-point method is simpler than the classical, and easier to comprehend and to compute. In Example 2, illustrating the determination of regression parameters by the two-point method in a large sample, we shall not calculate these parameters by the method of least squares. However, to make possible in Example 2 a comparison of the two methods and of the results obtained by them, we shall use the statistical data of Example 2 from 3.2.2.
Example 1. Table 1 contains data on the number of car-kilometres driven and the number of kWh used up by the cars of the City Transport Corporation in Wroclaw (monthly data).

TABLE 1
KILOMETRES DRIVEN AND KWH USED BY ELECTRIC CARS IN WROCLAW

We want to calculate the regression parameters by the method of least squares and by the two-point method; we shall start by rounding off the figures to the nearest ten thousand car-kilometres and ten thousand kWh. Below is shown the sequence of computing regression parameters by the method of least squares:
TABLE 2
METHOD OF LEAST SQUARES APPLIED TO TABLE 1

a₂₁ = 0.699,  a₁₂ = 1.180,
r² = 0.699 · 1.180 = 0.83,
r = 0.91.

In the following table the computation of regression parameters by the two-point method is shown:
TABLE 3
TWO-POINT METHOD APPLIED TO TABLE 1

x̄ = 130.8,  ȳ = 99.5,
x̄⁽²⁾ = 148.8,  ȳ⁽²⁾ = 112.5,
x̄₍₂₎ = 145.2,  ȳ₍₂₎ = 110.9,
a₂₁ = 0.72,  a₁₂ = 1.26,
r² = 0.72 · 1.26 = 0.91,  r = 0.95.
Comparing Tables 2 and 3 we can easily see that the computations connected with the determination of regression parameters by the two-point method are much simpler and less time-consuming than those required for the method of least squares.

Let us explain the sequence of computations that have been made in order to fill in Table 3 and to find the values of the regression parameters:

1) the figures in columns x and y have been added and the arithmetic means calculated:

x̄ = 130.8,  ȳ = 99.5;

2) in column x the numbers greater than x̄ = 130.8 have been marked with a *; there are ten of them;

3) the marked values of column x have been written down in column x⁽²⁾ and the corresponding values of y in column y⁽²⁾;

4) in column y the numbers greater than ȳ = 99.5 have been marked with a ∨; there are twelve of them;

5) the marked values of y have been written down in column y₍₂₎, and the corresponding values of x in column x₍₂₎;

6) the following averages have been calculated:

x̄⁽²⁾ = 1,488/10 = 148.8,  ȳ⁽²⁾ = 1,125/10 = 112.5,  x̄₍₂₎ = 145.2,  ȳ₍₂₎ = 110.9;

7) using formulae (5) and (12), a₂₁ and a₁₂ have been calculated:

a₂₁ = (112.5 − 99.5)/(148.8 − 130.8) = 0.722,

a₁₂ = (145.2 − 130.8)/(110.9 − 99.5) = 1.263.
Knowing the estimates of the regression parameters we can estimate the value of the correlation coefficient. We have:

r² = a₂₁ · a₁₂ = 0.722 · 1.263 = 0.9119.

Hence:

r = 0.95.
To fill in the computation table in the two-point method does not require any calculations. We simply write down in the appropriate columns the numbers marked * and ∨, and their corresponding "joint" numbers¹. It is not necessary to subtract, square and multiply numbers with different signs as was the case when the method of least squares was used. If the assumption about the linear character of the correlation between variables X and Y is valid, both methods give approximately the same results, as can be seen from our example:

GRAPH 1. (X-axis: car-kilometres driven, tens of thousands)

¹ The word "joint" is used here in the following sense: since we deal with a two-dimensional random variable, for every abscissa xᵢ there is a joint ordinate yᵢ, and vice versa, for every ordinate yᵢ there is a joint abscissa xᵢ.
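A small simulation (entirely hypothetical data) makes the remark concrete: when the relationship really is linear, the least-squares and two-point slope estimates nearly coincide.

```python
import random

random.seed(7)

def ls_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def tp_slope(xs, ys):
    """Two-point slope: line through the upper-subgroup centroid and the
    overall centroid, the sample being split at the mean abscissa."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    upper = [(x, y) for x, y in zip(xs, ys) if x > mx]
    x2 = sum(p[0] for p in upper) / len(upper)
    y2 = sum(p[1] for p in upper) / len(upper)
    return (y2 - my) / (x2 - mx)

# linear model y = 0.7 x + 10 plus noise
xs = [random.uniform(100, 150) for _ in range(200)]
ys = [0.7 * x + 10 + random.gauss(0, 3) for x in xs]

print(round(ls_slope(xs, ys), 2), round(tp_slope(xs, ys), 2))
```

Both estimates land close to the true slope 0.7 even though the points are noticeably dispersed, mirroring what Graph 1 shows for the Wroclaw data.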
On Graph 1 are shown two regression lines determined by the method of least squares (broken lines), and two regression lines determined by the two-point method (continuous lines). There is very little difference in the location of the lines obtained by the two methods, although the dispersion of points is considerable.

Example 2. On the basis of the data contained in Table 3 of 3.2.2., calculate by the two-point method the value both of the regression parameters and of the correlation coefficient.
The solution of this problem is facilitated by Table 4. In addition to the symbols already denoted and defined, some new ones appear there: the frequencies of the values of the variables falling into the subgroups used by the two-point method.
It follows from the general form of Theorem 1 of 3.3.1. (see note on p. 214) that set ω can be divided into two subgroups not only by using the numbers x̄ and ȳ, but also by using any numbers x₁ and y₁ that satisfy the inequalities

x_min ≤ x₁ ≤ x_max,  y_min ≤ y₁ ≤ y_max.

It was assumed in Table 4 that

x₁ = 172,  y₁ = 91.

The division of set ω into subgroups was marked in the table by two thick lines perpendicular to one another and intersecting the middle part of the table cross-wise.
Let us compute the arithmetic means appearing in the formulae for the regression coefficients. All the information needed for calculating these averages is provided by Table 4. Thus we have:

x̄⁽¹⁾ = 168.22,  x̄⁽²⁾ = 178.52,  ȳ⁽¹⁾ = 90.32,  ȳ⁽²⁾ = 93.08,
x̄₍₁₎ = 170.81,  x̄₍₂₎ = 173.40,  ȳ₍₁₎ = 88.30,  ȳ₍₂₎ = 95.52.

Hence

a₂₁ = (93.08 − 90.32) / (178.52 − 168.22) = 0.267,

a₁₂ = (173.40 − 170.81) / (95.52 − 88.30) = 0.359,

r² = 0.267 · 0.359 = 0.0958,

r = 0.309.

On Graph 2 two pairs of regression lines are shown. The broken lines are the regression lines determined by the method of least squares (see 3.2.2., Example 2) and the continuous lines are the regression lines determined by the two-point method.
GRAPH 2. (X-axis: height, cm)
As can be seen from the graph, the positions of the lines determined by the two methods are not very different, in spite of the fact that the correlation is fairly weak (i.e. the points are widely scattered on the scatter diagram).

3.3.3. The properties of estimates obtained by the two-point method
In both examples discussed above we have seen that the numerical results of estimating regression parameters by the method of least squares and by the two-point method did not differ much. The computations in both cases were based on actual statistical material, so that there is no question of selecting the figures on purpose in such a way as to obtain similar results. The similarity can be explained by certain general properties of estimates obtained by both methods.

As we know (see 3.1.), the estimates of regression parameters obtained by the classical method are consistent and unbiased. Estimates obtained by the two-point method have similar properties. This explains the similarity of the results noticeable in Examples 1 and 2 of 3.3.2.
We shall give below theorems concerning the most important properties of the regression coefficient a₂₁ defined in 3.3.1. by one of the formulae (5), (6) and (7). These theorems can also be adapted to apply to the regression coefficient a₁₂.

THEOREM 1. Regression coefficient a₂₁ is a consistent estimate of regression coefficient α₂₁ for the general population Ω, i.e.

lim(n→∞) P(|a₂₁ − α₂₁| < ε) = 1 for every ε > 0.  (1)

THEOREM 2. Regression coefficient a₂₁ is an unbiased estimate of regression coefficient α₂₁, i.e.

E(a₂₁) = α₂₁.  (2)
In order to appraise the effectiveness of estimates obtained by the two-point method we should compare the variance of these estimates with the variance of the estimates obtained by the method of least squares. We know from the Markoff Theorem that the estimates obtained by the classical method have a minimum variance. To distinguish between estimates obtained by the two methods we shall denote them as follows: a₂₁ class, the regression coefficient obtained by the method of least squares; a₂₁ point, the regression coefficient obtained by the two-point method.

Let us also denote:

V(a₂₁ class) = V(a₂₁ class | X₁ = u₁, ..., Xₙ = uₙ),
V(a₂₁ point) = V(a₂₁ point | X₁ = u₁, ..., Xₙ = uₙ),

where u₁, u₂, ..., uₙ are certain constants.
THEOREM 3.

e = V(a₂₁ class | X₁ = u₁, ..., Xₙ = uₙ) / V(a₂₁ point | X₁ = u₁, ..., Xₙ = uₙ).  (3)

Applying the Slutsky Theorem to the right side of formula (3), we find that e converges in probability to D₁²/σ₁², where

D₁ = ∫ |x − μ₁| f₁(x) dx,  σ₁² = ∫ (x − μ₁)² f₁(x) dx,

the integrals being taken from −∞ to +∞. The definition of symbol f₁(x) is given in 1.2.3., formula (6). If the distribution of population Ω is normal (see 1.2.9., formula (1)), then e converges in probability to 2/π. It is interesting to note that if the random variable X has a normal distribution N(m, σ), then the effectiveness of the median as an estimate of parameter m is also equal to 2/π.

THEOREM 4. The distribution of the random variable

√n (a₂₁ point − α₂₁)  (4)

approaches a normal distribution N(0, σ) for n → ∞.

The proofs of Theorems 1–4 are given in [31].

3.3.4. Comments on estimating the correlation coefficient by the two-point method

In Examples 1 and 2 we have estimated the correlation coefficient ρ in population Ω by the formula

r² = a₂₁ · a₁₂,

where a₂₁ and a₁₂ are the regression coefficients obtained by the two-point method. This procedure is justified by the Slutsky Theorem, which states that if random variables Xₙ, Yₙ, ..., Zₙ are stochastically convergent to the constants x, y, ..., z, then any rational function of these variables R(Xₙ, Yₙ, ..., Zₙ) is stochastically convergent to the constant R(x, y, ..., z).
that r
= j/<0
21
.
a12
tends in probability to correlation coefficient Q the following theorem.
THEOREM
Zn
1.
Let the
sequence
{Xn }
we
shall
prove
of random variables
to the number a, and let y (x) denote tend with probability the continuous function of x. Then the sequence {y (Xn)} of 1
random
variables converges in probability
Proof.
Pflim \n-oo
Xn = a] =
to
^
(a).
.
J
For each continuous function lim
means
1
1
ip(x),
Xn =
the condition
a
that
Hence the events lim
X =
a
and
lim
are equivalent and therefore
Pllim
It
follows
that if r 2 converges in
from the above theorem 2
then r tends to g. , probability to In the conclusion of our discussion on estimating the cor-
by the two-point method we should menmore problem. As we know, the product of the regression coefficients obtained by the method of least squares relation coefficient
tion one
a& is
class* ^12 class
a positive quantity with zero-one
norm
(see
1.2.8.).
The
Estimating linear regression parameters
product of the regression coefficients obtained point method does not have this property.
We
141
by the two-
can give numerical examples in which the product 12 point
is
greater than one or less than zero. If the sample
is suffi-
on
the assumption that there is linear correlation between the variables studied, the probability of large
ciently
the
then,
of such an event
realization
is
negligible because the
tends stochastically to Q*. product 31 p 0l nt -ah point In estimating the correlation coefficient by the two-point
method we should observe the following convention: when a 21 > and alz 1) correlation coefficient r > when azl < and an 2) correlation coefficient r < 3) if
#JX
.
a12
>
1
4) if coefficients
that r
When
= 0.
we assume
that r
=
1
> <
;
;
;
a ai and alz have different signs we assume
case 3 or 4 occurs in practice
we
suspect that the
assumption about the linearity of correlation is not true. We also suspect this when the product a 2l -an I but the points
=
on the (see
scatter
Graph
3).
diagram are not located
on the
straight line
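The convention can be stated as a small function (naming ours; the case analysis follows points 1)–4) above, with r = −1 for case 3 when both coefficients are negative, by analogy with point 2):

```python
# Correlation coefficient from two-point regression coefficients, with the
# sign convention of the text.

def r_from_two_point(a21, a12):
    p = a21 * a12
    if p < 0:                    # case 4: different signs
        return 0.0
    if p > 1:                    # case 3: product outside the zero-one norm
        return 1.0 if a21 > 0 else -1.0
    r = p ** 0.5                 # cases 1 and 2: sign follows the slopes
    return r if a21 > 0 else -r

print(round(r_from_two_point(0.72, 1.26), 2))   # 0.95, as in Example 1
```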
It follows that there are situations in which the two-point method provides reasons for postulating the hypothesis that at least one of the regression lines is not a straight line. This is an important advantage of the two-point method.
4. ON TESTING CERTAIN STATISTICAL HYPOTHESES

4.1. Two tests to verify the hypothesis that the distribution of the general population Ω is normal¹

4.1.1. The formulation of the problem

In many theorems of probability theory and mathematical statistics it is assumed that the distribution of the random variable is normal. In practical applications it is often very difficult to check to what extent this assumption is justified.
When the subject of statistical research is an ordinary random variable, there are several methods of testing the hypothesis that the distribution of the population is normal. These methods do not provide sufficient grounds for accepting the hypothesis, but in some cases enable us to reject it. The usual procedure in practice is to assume that if the
information obtained from the sample does not give grounds it can be regarded as true and
for rejecting the hypothesis,
can be accepted in the sense that the population is normal, without any further justification. Although this procedure is open to objection it has to be accepted because there is no other sensible
For
way
statistical
variables,
it
is
out.
research involving multi-dimensional
more
random
difficult to test the hypothesis that the
population is normal. We shall again be concerned here with continuous two-dimensional variables. As we know (e.g. see [7]
1
,
section 29.6) in
many theorems
involving such a variable
¹ Published in Przegląd Statystyczny (Statistical Review), No. 3, 1957.
it is assumed that its distribution is normal, i.e. that the two-dimensional density function of this distribution is expressed by the formula

(1)  f(x, y) = 1/(2π σ₁ σ₂ √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) · [ ((x − m₁)/σ₁)² − 2ρ((x − m₁)/σ₁)((y − m₂)/σ₂) + ((y − m₂)/σ₂)² ] }

(see 1.2.9., formula (1)). Let us denote by H the hypothesis that the two-dimensional random variable (X,Y) has a normal distribution with the density given by (1).
It follows from the generalization of the Central Limit Theorem to two-dimensional variables (see [7]) that in practice we often deal with this type of distribution. It is difficult to establish this fact by experiment because it is inconvenient to construct spatial diagrams of the distribution. This increases the importance of statistical methods in testing hypothesis H.

The role of these methods is particularly important in selecting a function for the equation of the regression line.
As we know (see 1.2.9., Corollary 1), regression I lines are straight lines when the joint distribution of random variable (X,Y) is normal. This explains why we so often deal with linear regression in practice. However, a distribution does not necessarily have to be normal every time a visual inspection of the scatter diagram based on a sample suggests that we deal with linear regression. When a population is normal the regression lines are straight lines, but when the regression lines are straight lines the distribution of the population may or may not be normal. Under these circumstances the results of testing the hypothesis that the distribution is normal may be of great practical importance.
On testing certain statistical hypotheses

The difficulties encountered in verifying this hypothesis with reference to a two-dimensional random variable are caused by the fact that the tables of the two-dimensional density function for a normal distribution are not easily accessible¹. In this section we shall discuss two methods of testing the hypothesis that the distribution of a two-dimensional random variable is normal. When these methods are used such tables are not required. Both methods can be applied to large samples.
4.1.2. Testing hypothesis H by rotating the coordinate system (method A)
The consistency of the two-dimensional distribution of a general population with a normal distribution can be checked by the χ² test. As we know (1.2.9., Theorem 1), variables X and Y are stochastically independent when parameter ρ in a two-dimensional normal distribution equals zero. If parameter ρ ≠ 0, we replace variables X and Y by variables X′ and Y′ using a linear transformation:

X′ = (X − m₁) cos θ + (Y − m₂) sin θ,
Y′ = −(X − m₁) sin θ + (Y − m₂) cos θ.

Then E(X′Y′) = 0, and hence ρ(X′, Y′) = 0 (see 1.2.8., Theorem 4). Thus we can see that if a joint distribution of a two-dimensional random variable (X′, Y′) is normal, then these variables are stochastically independent. Hence we can write

φ(x′, y′) = φ₁(x′) φ₂(y′),

where φ(x′, y′) denotes the two-dimensional density of the normal distribution of variable (X′, Y′), and φ₁(x′) and φ₂(y′)
¹ However, such tables exist. See [40], [44].

are the symbols of the marginal densities of variables X′ and Y′ in this distribution. Since the joint distribution is normal, E(X′), D²(X′), E(Y′) and D²(Y′)
are expressed by the formulae:
E(X′) = E[(X − m₁) cos θ + (Y − m₂) sin θ] = 0,
D²(X′) = E[X′ − E(X′)]² = E[(X − m₁) cos θ + (Y − m₂) sin θ]² = σ₁² cos² θ + ρσ₁σ₂ sin 2θ + σ₂² sin² θ,
E(Y′) = E[−(X − m₁) sin θ + (Y − m₂) cos θ] = 0,
D²(Y′) = E[Y′ − E(Y′)]² = E[−(X − m₁) sin θ + (Y − m₂) cos θ]² = σ₁² sin² θ − ρσ₁σ₂ sin 2θ + σ₂² cos² θ.

The construction of the test for hypothesis H is based on
the fact that variables X' and Y' are stochastically independent.
By using the χ² test we can easily check whether or not the empirical marginal distributions are essentially different from a normal distribution. Let us denote by ν₁(x′) the empirical marginal distribution of variable X′ from the sample and by n the size of the sample. In this case the divergence between this empirical distribution of variable X′ from the sample and the theoretical normal distribution φ₁(x′) is measured by the expression

(2)  χ²ₓ′ = Σ (ν₁ − n p₁)² / (n p₁),

where the sum runs over the classes of the frequency distribution, p₁ denotes the probability of a class computed from the normal distribution φ₁(x′), and ν₁ the empirical frequency of that class. For variable Y′ we get an analogous expression

(3)  χ²_y′ = Σ (ν₂ − n p₂)² / (n p₂),

where ν₂(y′) denotes the empirical distribution of variable Y′.

¹ Tables for a normal distribution in one dimension are given at the end of the book.
We shall denote by α the probability of an event that we consider as practically impossible. There exists a positive real number χₐ², dependent upon α, such that

P(χ²ₓ′ > χₐ²) = α,

and analogously

P(χ²_y′ > χₐ²) = α.

It should be remembered that when hypothesis H is true, variables X′ and Y′ are independent, and so are variables χ²ₓ′ and χ²_y′. We reject hypothesis H when

(4)  (χ²ₓ′ > χₐ²) ∪ (χ²_y′ > χₐ²),

where symbol ∪ means "or". The probability of this event is

1 − (1 − α)² = 2α − α² < 2α.

The values of χₐ² are taken from appropriate tables¹. We call the number 2α the level of significance.
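The steps of method A can be sketched in code. The sketch below is our own illustration (function names and the class boundaries are ours, not the book's): it rotates the centred sample so that the transformed variables are uncorrelated, then forms a χ² statistic of type (2)/(3) for one rotated marginal against normal class probabilities.

```python
import math

def normal_cdf(z):
    # standard normal distribution function, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rotate_sample(xs, ys):
    """Centre the sample at (x-bar, y-bar) and rotate it by the angle g with
    tan 2g = 2*Sxy / (Sxx - Syy), so that x' and y' are uncorrelated."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    g = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(g), math.sin(g)
    xp = [(x - mx) * c + (y - my) * s for x, y in zip(xs, ys)]
    yp = [-(x - mx) * s + (y - my) * c for x, y in zip(xs, ys)]
    return xp, yp

def chi2_marginal(v, k=8):
    """Chi-square statistic of formulae (2)/(3): empirical class frequencies
    of one rotated marginal against normal class probabilities."""
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((t - m) ** 2 for t in v) / n)
    edges = [m + sd * (-3 + 6 * i / k) for i in range(1, k)]  # equal-width classes
    counts = [0] * k
    for t in v:
        counts[sum(1 for e in edges if t > e)] += 1
    cuts = [0.0] + [normal_cdf((e - m) / sd) for e in edges] + [1.0]
    return sum((counts[j] - n * (cuts[j + 1] - cuts[j])) ** 2
               / (n * (cuts[j + 1] - cuts[j])) for j in range(k))
```

The two marginal statistics would then each be compared with χₐ², as in (4).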
Example 1. In an electric power station the relationship between the consumption of coal and the output of electric power is studied. In Table 1 the statistical material for a 6-year period is shown. The x-column of this table represents the monthly output of electric current (in tens of thousands of kWh) measured at the generator contacts², and the y-column contains the data on the consumption of slack coal (in tens of tons) used for production. The data come from a metallurgical electric power station equipped with two SKODA turbogenerators, each of 3,100 kW capacity, and two DUQUESNE boilers heated by slack coal. The consumption of slack coal is given in gross terms as recorded during control weighing. The output of power is also given in gross terms because it was measured at the contacts of the generators.

TABLE 1
CONSUMPTION OF COAL AND ELECTRICITY GENERATED IN AN ELECTRIC POWER STATION

A scatter diagram on the basis of the data given in Table 1 is shown on Graph 1.

¹ χ² distribution tables are given at the end of the book.
² The statistical data for this example have been obtained through the courtesy of Professor J. Falewicz.
GRAPH 1. Scatter diagram of the data in Table 1 (abscissa: output of electricity, ten thousand kWh/month).
The distribution of the points on the diagram suggests that the distribution of the variable (X,Y) is normal. This hypothesis should be tested. To prepare the statistical material for testing the hypothesis we use the formulae for translating and rotating the coordinate system. We find the angle of rotation γ from formula (7) in 1.2.8., replacing the population parameters by the sample parameters. In this way we test the hypothesis that the population has a normal distribution with parameters m₁ = x̄, m₂ = ȳ.

Let us denote:

x̄ = (1/n) Σ x,  ȳ = (1/n) Σ y,  where n is the size of the sample.

It follows from the calculations (see Appendix, pp. 215-216) that

Σ(x − x̄)(y − ȳ) = 56,344,  Σ(x − x̄)² = 79,124,  Σ(y − ȳ)² = 43,952,

therefore

tan 2γ = 112,688 / 35,172 = 3.20.

Further, we have

γ = 36°21′,  sin γ = 0.592,  cos γ = 0.806.
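The arithmetic of the angle of rotation can be checked directly from the quoted sums (an illustrative computation of our own):

```python
import math

# Sums computed from Table 1 (see Appendix, pp. 215-216)
sxy = 56_344   # sum of (x - x_bar)(y - y_bar)
sxx = 79_124   # sum of (x - x_bar)**2
syy = 43_952   # sum of (y - y_bar)**2

tan_2g = 2 * sxy / (sxx - syy)    # = 112,688 / 35,172, approx. 3.20
gamma = 0.5 * math.atan(tan_2g)   # angle of rotation, in radians
print(round(tan_2g, 2), round(math.degrees(gamma), 1))       # 3.2 36.3
print(round(math.sin(gamma), 3), round(math.cos(gamma), 3))  # 0.592 0.806
```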
After a linear transformation, according to formula (1) in 4.1.2., we obtain the values of the new variables x′, y′, which are given in Table 2.

TABLE 2
TRANSFORMATION COMPUTATIONS APPLIED TO TABLE 1

The scatter diagram for the data in Table 2 is shown below (Graph 2).
GRAPH 2. Scatter diagram of the transformed variables x′, y′.
On the basis of the data from Table 2 we now construct a frequency distribution for variable x′ and for variable y′. Then on the basis of formulae (2) and (3) we calculate χ²ₓ′ and χ²_y′ (see Table 3 and Table 4).

TABLE 3
χ² TEST APPLIED TO VARIABLE x′

TABLE 4
χ² TEST APPLIED TO VARIABLE y′
Let the level of significance 2α = 0.02. In this case α = 0.01. In the χ² distribution tables for 4 degrees of freedom we find χₐ² = 13.277. Since

χ²ₓ′ = 13.86 > χₐ² = 13.277,

we reject hypothesis H. We have rejected it on the basis of the same sample on which the hypothetical parameters of the distribution were determined. Thus the reason for rejecting the hypothesis is that the distribution in the sample is significantly different from a normal distribution.

If we do not want to stop at comparing the marginal distributions of random variables X′ and Y′ with a one-dimensional normal distribution by the consistency test, we may check the consistency of the joint distribution of variable (X′, Y′) with a two-dimensional normal distribution. For this purpose we have to construct a special contingency table and compare the frequencies in particular panels of this table with the theoretical frequencies. The latter are calculated by multiplying the size of the sample by the product of the probabilities corresponding to a given panel, taken from the marginal distributions. Knowing the frequencies in particular panels of the contingency table and the theoretical frequencies, we can check by the χ² test whether these frequencies differ significantly from one another.
The above method of testing hypothesis H requires many cumbersome computations connected with the use of formula (1). We do not refer here to the computations connected with the calculation of the parameters x̄, ȳ and the sums Σ(x − x̄)², Σ(y − ȳ)², Σ(x − x̄)(y − ȳ); these computations are needed for all methods of testing hypothesis H. We refer to the computations involved in the process of testing the hypothesis. The verification of the hypothesis by the χ² test in the way described above requires computations which become more time-consuming as the size of the sample increases. This is the drawback of this test. Tests described in textbooks (e.g. see [28]¹) suffer from the same drawback. Only a test for which the time needed for computations does not greatly depend upon the size of the sample is convenient and practical. This type of test is described below. We shall call it the B test, to distinguish it from the test described above, which we shall call the A test. The main advantage of the B test is its simplicity: it requires very few computations. The main disadvantage is its low "sensitivity".

4.1.3. Testing the hypothesis H by dividing the plane into quadrants (method B)
Let us denote by A the event that variable X assumes a value greater than the average value m₁ and variable Y assumes a value greater than the average value m₂. This means

¹ A. Hald: Statistical Theory with Engineering Applications, New York, 1952, pp. 602-604.
that

(1)  A = (X > m₁) ∩ (Y > m₂),

where symbol ∩ means "and". Let us further denote

(2)  B = (X < m₁) ∩ (Y < m₂),
(3)  C = (X > m₁) ∩ (Y < m₂),
(4)  D = (X < m₁) ∩ (Y > m₂).

When m₁ = m₂ = 0, then A, B, C, D denote events in which point (x, y) lies respectively in the 1st, 3rd, 4th and 2nd quadrant of the coordinate system. It can be shown (see 1.2.9., formula (9)) that

(5)  p₁ = P(A) = P(B) = 1/4 + (1/2π) arcsin ρ

and

(6)  p₂ = P(C) = P(D) = 1/4 − (1/2π) arcsin ρ.

Thus we know the probability that the random point (X, Y) is located in a particular quadrant of the plane. The knowledge of these probabilities allows us to construct a test to verify hypothesis H. This hypothesis may be verified by many tests, but the χ² test seems to be the most convenient.
Knowing the probability of a random occurrence of a point in the individual quadrants of the plane, we can easily calculate the hypothetical numbers of points in these quadrants. We shall call these numbers the hypothetical frequencies of the quadrants of the plane. The hypothetical number of points in the individual quadrants will, of course, differ from the empirical number of points. We shall call the empirical number of points the empirical frequency of the quadrant of the plane. Let us denote the hypothetical frequencies by nᵢ′ and the empirical frequencies by nᵢ (i = 1, 2, 3, 4).

The measure of divergence between the hypothetical and empirical frequencies is expressed by the quantity

χ² = Σᵢ (nᵢ − nᵢ′)² / nᵢ′,

which is a random variable and has the χ² distribution with three degrees of freedom. We reject hypothesis H when

χ² > χₐ²,

where χₐ² is a number dependent upon the significance coefficient α.

The construction of this test is based on the assumption that the population parameters m₁, m₂ and ρ are known. In practice this happens very rarely. Therefore, when we do not know the values of these parameters, we have to substitute for them the estimates from the sample. (A similar procedure was also used in test A.)
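Formulae (5) and (6), and the χ² statistic built from them, are easy to evaluate. The sketch below is our own illustration (function names are ours, not the book's):

```python
import math

def quadrant_probabilities(rho):
    """Formulae (5) and (6): probabilities of the four quadrants around
    (m1, m2) for a bivariate normal variable with correlation rho."""
    p1 = 0.25 + math.asin(rho) / (2 * math.pi)  # quadrants of events A and B
    p2 = 0.25 - math.asin(rho) / (2 * math.pi)  # quadrants of events C and D
    return p1, p2

def quadrant_chi2(counts, rho):
    """Chi-square statistic comparing empirical quadrant frequencies
    (counts of events A, B, C, D) with the hypothetical ones."""
    n = sum(counts)
    p1, p2 = quadrant_probabilities(rho)
    expected = [n * p1, n * p1, n * p2, n * p2]
    return sum((c - e) ** 2 / e for c, e in zip(counts, expected))
```

For ρ = 0.96 this gives p₁ ≈ 0.455 and p₂ ≈ 0.045 (Table 2 below tabulates 0.456 and 0.044).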
Example 1. A sample of 72 items was taken from a two-dimensional population. Each item in the sample may be treated as a point on the plane. The coordinates of these points represent the two-dimensional random variable (X,Y). On the basis of statistical data obtained from the sample we want to check hypothesis H that the distribution of the two-dimensional population is normal. The statistical data are shown in Table 1 in 4.1.2. It follows from the calculations, which we shall not quote here¹, that

x̄ = 190.4,  ȳ = 179.0,  r = 0.96.

¹ See the Appendix, pp. 215-216.
Assuming that m₁ = x̄, m₂ = ȳ and ρ = r, we calculate the frequencies of the actual occurrences of events A, B, C and D. The occurrence of event A is equivalent to a random chance of a point being located in the 1st quadrant of the plane. It is assumed that the origin of the coordinate system lies at point (x̄, ȳ). It is easy to check (using Table 1 in 4.1.2.) that event A has occurred 24 times, event B 34 times, event C 7 times, and event D also 7 times.

Assuming that ρ = 0.96 we determine p₁ and p₂ (see formulae (5) and (6)). The values of p₁ and p₂ are functions of ρ; different values of p₁ and p₂ correspond to different values of parameter ρ. They are shown in Table 2. Using this table we find that p₁ = 0.456 and p₂ = 0.044 correspond to the number ρ = 0.96. Since we know p₁ and p₂, we can calculate the hypothetical frequencies of points in particular quadrants of the plane, and we can check by the χ² test the significance of the deviations of the empirical frequencies from the hypothetical frequencies. The calculations are shown in Table 1.
TABLE 1
χ² TEST APPLIED TO VERIFICATION OF HYPOTHESIS H BY DIVIDING THE PLANE INTO QUADRANTS
Let us assume that the level of significance α = 0.02. For this level of significance, with two degrees of freedom¹, the corresponding value of χₐ² is 7.8. Therefore, hypothesis H should be rejected since

χ² = 11.41 > χₐ² = 7.8.

We can see from the above example that test B is very simple to use. For this reason it has a variety of applications, particularly when the testing of hypothesis H is conducted on a large sample.
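The decision in the example can be reproduced numerically (our own check; the small difference against the book's χ² = 11.41 comes from using exact rather than tabulated values of p₁ and p₂):

```python
import math

counts = [24, 34, 7, 7]                      # events A, B, C, D among n = 72 items
n = sum(counts)
p1 = 0.25 + math.asin(0.96) / (2 * math.pi)  # formula (5), approx. 0.455
p2 = 0.25 - math.asin(0.96) / (2 * math.pi)  # formula (6), approx. 0.045

# combine the last two classes (C and D), as in Table 1, leaving
# three classes and hence two degrees of freedom
observed = [counts[0], counts[1], counts[2] + counts[3]]
expected = [n * p1, n * p1, 2 * n * p2]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2 > 7.8)   # True: hypothesis H is rejected at this level
```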
In conclusion it might be worth while to say a few words about both tests. They can be used in two cases:
1) when all the parameters of the distribution are known and we are checking only its shape;
2) when we are testing a simple hypothesis that the population is two-dimensional and normal, with given parameters.

¹ Since, as a result of combining the last two classes in Table 1, we now have only three instead of four classes.
TABLE 2
PARAMETERS p₁ AND p₂ AS FUNCTIONS OF ρ
The verification of hypothesis H by both tests jointly should be carried out in two stages (as in two-stage sequence analysis). In the first stage we use test B. If the analysis provides grounds for rejecting hypothesis H, it is finished. If test B does not enable us to reject hypothesis H, we move on to the second stage, i.e. we apply test A.
4.2. Checking the hypothesis that the regression lines in general population Ω are straight lines

4.2.1. General comments

As we said in 1.2.5., the most difficult problem in the process of estimating regression parameters in a general population, on the basis of a random sample taken from the population, is a proper choice of the approximation function. The amount of information that we have when we make such a choice is usually small: as a rule, all we have are the numerical data from the sample and the scatter diagram. From the distribution of the points on the scatter diagram we attempt to guess to which class the function appearing in the regression equation of the general population belongs. The word "guess" reflects very well the idea behind this procedure. When searching for this class we are groping in the dark: we cannot state anything; we can only guess. In this guessing the information supplied by the sample is helpful; it allows us to formulate a statistical hypothesis that the function appearing in the regression equation belongs to a certain class of functions. The data from the sample enable us to test this hypothesis.
In this section we shall discuss methods of testing the hypothesis that the regression equation is a linear function, i.e. that

g(x) = αx + β.

This hypothesis we shall denote by the symbol HL. The verification of this hypothesis is of great practical importance: as long as there are no grounds for rejecting hypothesis HL, we can consider that the regression lines in the general population are straight lines. No other hypothesis, even if from the statistical point of view it is equivalent to hypothesis HL, will be better; therefore we can abandon all other hypotheses and retain hypothesis HL. The acceptance of some other hypothesis results in serious inconveniences connected with dealing with a regression curve instead of a regression line, and thus with the necessity of determining the parameters of a curve instead of those of a line.

In the literature on the subject we can easily find (e.g. see [16], p. 397) a description of methods of testing hypothesis HL by a large sample. In practice, however, it is often necessary to test this hypothesis on the basis of a small sample. In 4.2.2. we propose a test which enables the verification of hypothesis HL when the sample is small. In the following item we describe, after Barkowski and Smirnow, a method of testing hypothesis HL in a large sample.

4.2.2. Testing hypothesis HL in a small sample by a run test
Hypothesis HL can be verified by a run test. We shall describe this test briefly.

Let x₁, x₂, ..., xₙ denote the realization of random variable X determined on the basis of the elements of general population Ω, and let F(x) denote the distribution of variable X. If ω is a sample composed of n items and taken from Ω, then x₁, x₂, ..., xₙ can be treated as the values of items selected for the sample (by "the values of items" we mean the actual values of random variable X corresponding to particular items of the sample). In repeated sampling, the values drawn may be treated as the realization of a finite sequence of variables X₁, X₂, ..., Xₙ corresponding to the numbers of the items of sample ω. If the sample is random, then:
1) random variables X₁, X₂, ..., Xₙ are independent;
2) they have the same distribution.

In this case

(1)  P(x₁, x₂, ..., xₙ) = P(x₁) P(x₂) ... P(xₙ).

The probability is not dependent upon the order of the variables. It follows that if sample ω is random, then the n! permutations formed from the numbers x₁, x₂, ..., xₙ all have the same probability of occurrence.
Suppose that we have a sequence of n items composed exclusively of elements A and B. Here is an example of such a sequence:

(2)  A, A, A, B, A, B, B, A, B, A.

We have here n = 10 items, among which there are n₁ = 6 elements A and n₂ = 4 elements B. Each sub-sequence with the largest possible number of items of the same kind is called a run. The number K of items comprising a given run is called the length of the run. Both the length of the run K and the number of runs R are random variables. The distributions of these variables are known. This enables us to test the hypothesis that sample ω was taken at random. Below we show a table taken from [16], p. 340. (A more detailed discussion of the problems related to run theory can be found in Chapter XIII of [20].) This table helps to verify the hypothesis by a run test. Symbol RK appearing at the heading of the second column denotes the total number of runs with lengths not less than K; symbol R1K denotes the number of runs composed of elements A of length no less than K, and R2K denotes the number of runs composed of elements B of length no less than K. In the table are given the maximum values of the number of observations n which satisfy one of the inequalities shown at the head of the second, third or fourth column with probability less than 0.05. The method of using the run test described above for the verification of hypothesis HL will be explained by an example.
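Runs and their lengths are easy to extract mechanically. A small sketch of our own (the function name is ours, not the book's):

```python
def runs(seq):
    """Split a sequence of symbols into its runs, e.g. the sequence (2)
    'AAABABBABA' -> ['AAA', 'B', 'A', 'BB', 'A', 'B', 'A']."""
    out = []
    for s in seq:
        if out and out[-1][0] == s:
            out[-1] += s          # extend the current run
        else:
            out.append(s)         # start a new run
    return out

sample = "AAABABBABA"             # sequence (2): n1 = 6 elements A, n2 = 4 elements B
r = runs(sample)
print(len(r))                     # 7 runs in total
print(max(len(run) for run in r)) # maximum run length K = 3
```

The counts R, R1K and R2K used in Table 1 below can all be read off from such a decomposition.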
TABLE 1
THE RUN TEST
The greatest number of observations n for which the probability of satisfying the inequalities shown below is less than 0.05

Example 1. Table 2 contains statistical data collected in
a Wroclaw brewery. The monthly production of beer in hectolitres is given in column x, and the cost of labour in zlotys in
column y.

TABLE 2
BEER PRODUCTION AND LABOUR COSTS IN A WROCLAW BREWERY
There is a relationship between the cost of labour and production. The regression line is a statistical expression of this relationship. The scatter diagram shown in Graph 1 suggests that the regression line is a straight line. This supposition may be treated as hypothesis HL.

GRAPH 1. Scatter diagram (abscissa: production of beer, thousand hl/month).
The consecutive stages of the verification of this hypothesis are stated below.

1. Assuming that hypothesis HL is true, we estimate (by any method) the parameters of the regression line. In our example the equation of this line is¹

y = 23.8 + 0.141x,

where the coefficient b = 23.8 is measured in hundred thousand zlotys per month and the coefficient a = 0.141 in hundred thousand zlotys per hectolitre.

¹ The computation table is shown in the Appendix, p. 217.
2. We denote by A the event that a point lies above the regression line, and by B the event that a point lies below this line. Points lying directly on the line are not taken into consideration. In practice this last event has no chance of being realized, since in the case of continuous variables its probability is zero.

3. We
We
arrange points according to increasing values of
the
abscissa.
On
the
HL
assumption that hypothesis
true we can consider that the deviations of particular 23-8 f 0-141;c are of random points from the line character and do not depend upon the order of the succession of the points. In our example the points are
is
y=
in the following order:
arranged
1,2,3,11,12,21,10,20,22,6,4,7,13,9,19,5,8,14,17,15,18,16.
The It
figures denote the
numbers of points
sides of the regression line, but with the
In practice this variable (X,Y)
is is
possible only
same
on both abscissa.
the abscissa
X
of
a discrete variable. In this case, how-
needed and therefore we 4.
when
verification of hypothesis
the
ever,
in Table 2.
that in a sample there are points
may happen
shall
HL
no longer
is
not consider this case.
4. Using the accepted arrangement of the points, we write out the sequence of the realized events A and B. In our example the following sequence of events has been obtained:

A, B, B, B, B, A, A, A, A, B, B, B, A, A, B, A, B, A, B, A, B.

5. We find the maximum length of run K and, using Table 1, we analyse whether there are grounds for rejecting hypothesis HL at the level of significance α = 0.05. We reject this hypothesis when the test shows that the deviations of the points from the regression line are not random, but that the points show a certain tendency in their location above or below the line. In our example, using Table 1, we can state that there are no grounds for rejecting hypothesis HL.

When the sample is small, the testing of hypothesis HL by a run test is very convenient since, as a rule, there are no computations involved other than those related to the determination of the regression parameters. Checking whether a point lies below or above the regression line is done from the graph. To avoid difficulties the scale of the graph
has to be properly selected. If a point on the graph appears to be located directly on the regression line (on Graph 1, points No. 6 and 11), we have to make appropriate calculations and check whether this point really lies on the regression line or whether it only appears to lie on the line because of the scale used in the graph and the drawing technique (upon which the thickness of the line and the size of the point depend). If random variable (X,Y) is continuous and if the sample is small and the graph sufficiently large, then it is seldom necessary to carry out computations to check whether a point lies on the line or close to it.
4.2.3. Testing hypothesis HL in a large sample by Fisher's test

The verification of hypothesis HL by a run test becomes troublesome when the sample is very large. In such cases the checking of each point, whether it lies below or above the regression line, takes too much time even if we do not make calculations but use only the graph.

The verification of hypothesis HL in a large sample can be done by Fisher's test F. It can be proved that the random variable

(1)  F = [(η̂² − r²)/(l − 2)] : [(1 − η̂²)/(n − l)]

has the F distribution with the numbers of degrees of freedom k₁ = l − 2 and k₂ = n − l. In formula (1) symbol η̂ denotes an estimate of parameter η on the basis of the sample (see 1.2.8.); n, as usual, denotes the size of the sample, and l the number of values that variable X assumes in the contingency table (in other words, l is the number of rows in this table). A detailed description of how to verify hypothesis HL, together with a numerical example, can be found in [16], p. 397.

4.3. An analysis of the significance of regression parameters

The regression coefficient in a sample is a random variable with its own distribution, its own expected value and variance. We shall mention in passing that this is one of the reasons why in 4.1. we gave two tests for verifying the hypothesis that the distribution of the variable (X,Y) is normal: Bartlett has shown (see [4]) that if the distribution of the two-dimensional random variable (X,Y) is normal¹, then the variable

(1)  t′ = (a₂₁ − α₂₁) √(n − 2) √(Σ(xᵢ − x̄)²) / √(Σ(yᵢ − ŷᵢ)²)

has Student's distribution with n − 2 degrees of freedom.

¹ Tables for Student's distribution are given at the end of the book.
In formula (1)

ŷᵢ = a₂₁ xᵢ + b,  a₂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².

By elementary transformations we can show that

(2)  t′ = (a₂₁ − α₂₁) S₁ √(n − 2) / S,

where

S₁² = (1/n) Σ(xᵢ − x̄)²  and  S² = (1/n) Σ(yᵢ − ŷᵢ)².
The knowledge of the distribution of variable t′ enables us to determine the confidence region which will cover the unknown value of parameter α₂₁ with probability a. Knowing the distribution of variable t′, we can write that

(3)  P( a₂₁ − tₐ S/(S₁√(n − 2)) < α₂₁ < a₂₁ + tₐ S/(S₁√(n − 2)) ) = a.

We can also prove that the variable

(4)  t″ = (b − β) √(n − 2) / (S √(1 + x̄²/S₁²))

has Student's distribution with n − 2 degrees of freedom. Hence

(5)  P( b − tₐ S √(1 + x̄²/S₁²)/√(n − 2) < β < b + tₐ S √(1 + x̄²/S₁²)/√(n − 2) ) = a.
In formulae (4) and (5) the parameter

b = ȳ − a₂₁ x̄.

It can also be proved that the variable

(6)  t‴ = [ŷ(x) − Y(x)] √(n − 2) / (S √(1 + (x − x̄)²/S₁²)),

where ŷ(x) = a₂₁x + b is the sample regression line and Y(x) = α₂₁x + β the population regression line, has Student's distribution with n − 2 degrees of freedom. Therefore

(7)  P( ŷ(x) − tₐ S √(1 + (x − x̄)²/S₁²)/√(n − 2) < Y(x) < ŷ(x) + tₐ S √(1 + (x − x̄)²/S₁²)/√(n − 2) ) = a.
We shall show the formal relationship between formulae (2) and (4), and formula (7). Let us consider the regression line equation in the sample

(8)  ŷ = a₂₁ x + b,  b = ȳ − a₂₁ x̄.

Let us assume that

(9)  S² = (1/n) Σ(yᵢ − ŷᵢ)² = S₂² (1 − r²).

Therefore, on the basis of (8) and (9),

(10)  t′ = (a₂₁ − α₂₁) S₁ √(n − 2) / (S₂ √(1 − r²)),

(11)  t″ = (b − β) √(n − 2) / (S₂ √(1 − r²) √(1 + x̄²/S₁²)),

(12)  t‴ = [ŷ(x) − Y(x)] √(n − 2) / (S₂ √(1 − r²) √(1 + (x − x̄)²/S₁²)).
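The interval (3) is simple to evaluate from raw data. The sketch below is our own illustration (the function name is ours), with S₁ and S defined as above:

```python
import math

def slope_confidence_interval(xs, ys, t_a):
    """Confidence interval (3) for the population regression coefficient,
    given the sample and the Student multiplier t_a."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a21 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a21 * mx
    s1 = math.sqrt(sxx / n)                          # S1
    s = math.sqrt(sum((y - (a21 * x + b)) ** 2
                      for x, y in zip(xs, ys)) / n)  # S, residual deviation
    half = t_a * s / (s1 * math.sqrt(n - 2))
    return a21 - half, a21 + half
```

For a sample lying exactly on a straight line the residual deviation S is zero and the interval collapses to the slope itself.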
All comments concerning the regression parameters of Y on X also apply to the regression parameters of X on Y.

Example 1. The scatter diagram on Graph 1 shows the relationship between the average monthly expenditures for consumption and the average monthly income of twenty four-member families drawn from among four-member families included in family budget studies in Lower Silesia. The statistical data on which the diagram is based come from Table 2 in 3.2.2.
GRAPH 1. Scatter diagram (abscissa: average monthly income, ten zlotys/month).
The two straight lines shown on Graph 1 are regression lines determined by the classical method and the two-point method. The positions of the two lines do not differ much. The equation of the regression line determined by the method of least squares is

y = 47 + 0.31x,

and the equation of the regression line determined by the two-point method is

y = 55 + 0.28x¹.
The equations of the two continuous curves shown on Graph 1 are expressed by the formula

(13)  y = 47 + 0.31x ± (t S/√(n − 2)) √(1 + (x − x̄)²/S₁²).

In our example

ŷ = 47 + 0.31x,  S₁² = 4,269,  S = 20.69,  x̄ = 281,  n = 20.

Thus the equation of the continuous curve located above the regression lines and corresponding to the value t = 1 will assume the following form:

y = 47 + 0.31x + (20.69/√18) √(1 + (x − 281)²/4,269),

¹ The computation table is shown in the Appendix, p. 218.
and the equation of the continuous curve located below the regression lines and corresponding to the value t = −1 is expressed by the formula

y = 47 + 0.31x − (20.69/√18) √(1 + (x − 281)²/4,269).

Curves (13) determine the confidence region which will cover the regression line in the general population Ω with probability a. For a = 0.98 with 18 degrees of freedom, we find in Student's distribution table t = 2.55. The two interrupted curves on Graph 1 are curves (13) for t = ±2.55¹.
Using the confidence region we can decide whether the position of the regression line obtained by the method of least squares differs significantly from the position of the regression line obtained by the two-point method. Let us denote by Hr the statistical hypothesis that there is only a random difference between the position of the regression line determined by the method of least squares on the basis of statistical data from sample ω and the position of the regression line obtained on the basis of the same data by the two-point method. To test hypothesis Hr we have to select a number a and accordingly draw the two lines determining the confidence region. We reject hypothesis Hr when the line determined by the two-point method intersects one of the curves determining the confidence region. We can see from the graph that in our example there are no grounds for rejecting hypothesis Hr, since the line determined by the two-point method does not intersect either of the two broken curves drawn on Graph 1. These lines

¹ See the Appendix, p. 219.
correspond to the confidence coefficient
We know
that if a regression I line in a general population a straight line the regression parameters in the population calculated by the two methods will be the same. This means is
that the estimates of regression parameters obtained
from the
sample by the method of least squares and by the two-point method should not show any significant difference if the as-
sumption a straight
H
r
is
true that the regression line of the population
is
line.
It
is
follows that the verification of hypothesis
equivalent to the verification of hypothesis
H
example hypothesis
r
has not been rejected.
HL
.
It is
In our easy to
find out (using the run test described in 4.2.2.) that there are also no grounds for rejecting hypothesis L
H
.
H
The verification of hypothesis L by the determination of the confidence region for the regression line is too cumbersome. Let us remember, however, that the regression lines of the sample determined by both methods go through point (x,y). Therefore, instead of checking whether the positions of the two regression lines differ significantly from one another,
it
sufficient to find
is
significantly.
We f r a zi
To do
this
out whether their slopes differ
we proceed
as follows.
number a and determine the confidence region according to formula (3). Then we check whether
select class 1
#21 point
when a&
^ es within polnt
lies
this
region.
We
reject
outside the confidence
hypothesis
region,
i.e.
is
HL in
the critical area.
1
Let us remember that a^
by the
classical
c i ass is
method, and a2i
the regression coefficient obtained
point denotes the regression coefficient
obtained by the two-point method (see List of Symbols at the end of the book).
In our example we have:

a₂₁ class = 0.31,  a₂₁ point = 0.28,  the standard error of estimate s = 20.69,  Σ(xᵢ - x̄)² = 4,269 (so that √Σ(xᵢ - x̄)² ≈ 65),  1 - α = 0.98,  t = 2.55,  n = 20.

Therefore

0.31 - (20.69 · 2.55)/(65 √18) < a₂₁ < 0.31 + (20.69 · 2.55)/(65 √18),

or

0.12 < a₂₁ < 0.50.

Since 0.12 < a₂₁ point = 0.28 < 0.50, the slopes of the two lines are not significantly different, so that there are no grounds for rejecting hypothesis H_L.

The application of the two-point method to the verification of hypothesis H_L is very convenient, since few extra computations are involved. If we have made all the calculations required to determine parameter a₂₁ class, the determination of parameter a₂₁ point is simple, because most of the calculations needed for the classical method can be used in the two-point method.

We can see from this example that the two-point method is not only useful for estimating the regression parameters, but can also be applied to the verification of hypothesis H_L.
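The slope comparison just described can be sketched in code. This is only an illustrative sketch: the data are invented, and the critical value of Student's t (2.306 for 8 degrees of freedom at α = 0.05) is supplied by hand, as it would be taken from tables.

```python
import math

# Invented data (x, y): any roughly linear sample will do.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 2.8, 4.2, 4.4, 5.9, 6.2, 7.8, 8.1, 9.3, 10.2]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Least-squares ("classical") slope.
a_class = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

# Two-point slope: means of the lower and upper halves of the sample.
half = n // 2
x1, y1 = sum(xs[:half]) / half, sum(ys[:half]) / half
x2, y2 = sum(xs[half:]) / half, sum(ys[half:]) / half
a_point = (y2 - y1) / (x2 - x1)

# Residual standard error about the least-squares line.
b_class = ybar - a_class * xbar
s = math.sqrt(sum((y - (a_class * x + b_class)) ** 2
                  for x, y in zip(xs, ys)) / (n - 2))

# Confidence interval for the slope; t for n - 2 = 8 d.f., alpha = 0.05.
t = 2.306
halfwidth = t * s / math.sqrt(sxx)
lo, hi = a_class - halfwidth, a_class + halfwidth

# H_L is not rejected when the two-point slope falls inside the interval.
print(lo < a_point < hi)
```

When the two-point slope falls inside the interval, as here, the two lines do not differ significantly in slope.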
Example 2. Table 1 contains data from monthly reports on the production of beer in hundreds of hectolitres (x) and on the cost of electric power in thousands of zlotys (y). The figures follow the chronological order.
TABLE 1
BEER PRODUCTION AND POWER COSTS IN A WROCLAW BREWERY

The data from this table were used for Graph 2.

GRAPH 2. Power costs plotted against the production of beer (hundred hl/month).
There are three straight lines on this graph. The line with the equation y = 359 + 0.69x was determined from the full statistical material, i.e. from the data pertaining to all points. We shall call it line I. At first glance this line does not arouse any doubts. Let us note, however, that 8 consecutive points (marked on the graph by crosses) are located below the regression line. These points correspond to the data for the last eight months shown in Table 1. An event consisting of 8 consecutive points, from among 20 points distributed at random on both sides of the regression line, being located on one side of the line has a small probability of occurrence. A run test indicates that the line y = 359 + 0.69x cannot be considered as a regression line. For this reason the data contained in Table 1 should be divided into two parts¹. In the first part the first 12 observations are included. The equation of the regression line based on these data is y = 277 + 1.93x. Let us call it line II. The second part comprises the remaining 8 observations. The equation of the corresponding regression line is y = 300 + 0.75x. We shall call it line III.

Using formula (6) we can discover that the position of line II differs significantly from that of line III. This much information has been provided by a formal analysis of the data shown in Table 1. Let us now comment upon the economic aspect of the problem. The efforts of the factory personnel to reduce the cost of production were effective: the cost of electric power was substantially lowered. Success came imperceptibly. The daily efforts of each worker to save electric power finally had a visible cumulative effect over a period of several months. The nature of the relationship between the cost of power and production² changed significantly. The variable part of the cost of electric power was lowered: in the first 12 months this cost amounted to 19.3 zlotys/hl; in the next 8 months it amounted to only 7.5 zlotys/hl. This is a very important achievement by the workers of the enterprise.

Let us here make the following comment: an analysis as to whether the position of the regression line in population Ω₁ differs significantly from the position of the regression line in population Ω₂ is possible only when the hypothesis is true that the standard errors of estimates (see 1.2.7.) in Ω₁ and Ω₂ are the same. This hypothesis can be verified by Fisher's F test or (especially when the sample is small) by Sadowski's test [48].

¹ See the Appendix, computation table on pp. 220-222.
² Let us note in passing that we could not learn about this relationship without the assistance of correlation analysis.
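The splitting step of this example can be sketched as follows. The 20 monthly figures below are invented stand-ins chosen to echo lines II and III, not the brewery data of Table 1.

```python
def ols(xs, ys):
    """Least-squares slope and intercept of y = a*x + b."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
        (sum(x * x for x in xs) - n * xbar * xbar)
    return a, ybar - a * xbar

# Invented stand-in data: 20 months of (production, power cost).
# The first 12 months follow one line, the last 8 a flatter one.
xs = [110, 125, 98, 140, 132, 120, 150, 145, 137, 128, 118, 142,
      155, 160, 149, 170, 165, 158, 172, 168]
ys = [489, 518, 466, 547, 532, 509, 566, 557, 541, 524, 505, 551,
      416, 420, 412, 428, 424, 418, 429, 426]

a_all, b_all = ols(xs, ys)          # line I: the whole series
a_1, b_1 = ols(xs[:12], ys[:12])    # line II: the first 12 months
a_2, b_2 = ols(xs[12:], ys[12:])    # line III: the last 8 months
print(round(a_1, 2), round(a_2, 2))
```

As in the text, the slope of the second segment (the variable cost per unit of production) comes out much smaller than that of the first.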
4.4. An analysis of the significance of the correlation coefficient

If the distribution of the random variable (X, Y) defined on the elements of the general population Ω is normal, and if the correlation coefficient ρ of the population is close to zero, then, for a sufficiently large n, the distribution of the correlation coefficient r in sample ω drawn from Ω does not differ much from a normal distribution with parameters N(ρ, 1/√(n - 1)). Therefore, for ρ close to zero and for a large n, the distribution of the random variable

(r - ρ) √(n - 1)   (1)

is close to the normal distribution N(0, 1) (see [28])¹.

It can also be proved that if ρ = 0, then the random variable

t = r/√(1 - r²) · √(n - 2)   (2)

has Student's distribution with n - 2 degrees of freedom. Thus we can easily test the hypothesis that ρ = 0. We shall denote this hypothesis by H_ρ.

¹ A. Hald, Statistical Theory with Engineering Applications, New York, 1952, p. 608.
In practice it is often necessary to test hypothesis H_ρ; therefore we shall illustrate below the procedure involved in checking this hypothesis.

Fisher, who studied the distribution of the correlation coefficient, obtained interesting results with wide practical implications, by introducing the variable

z = (1/2) ln (1 + r)/(1 - r) = 1.1513 log (1 + r)/(1 - r),   (3)

where -1 < r < 1 and -∞ < z < +∞; in formula (3) ln denotes the natural logarithm, and log stands for the logarithm to the base 10.

As n increases the distribution of z converges rapidly to the normal distribution. Since

E(z) ≈ (1/2) ln (1 + ρ)/(1 - ρ) + ρ/(2(n - 1))   (4)

and

V(z) ≈ 1/(n - 3),   (5)

the distribution of the random variable [z - E(z)] √(n - 3) is close to the normal distribution N(0, 1). In this case

P{z - t_α/√(n - 3) < E(z) < z + t_α/√(n - 3)} = 1 - α,   (7)

where 0 < α < 1. Knowing the confidence region for E(z), we can easily write the confidence region for E(r) = ρ. Let us denote

ζ = (1/2) ln (1 + ρ)/(1 - ρ);   (6)

in this case E(z) ≈ ζ + ρ/(2(n - 1)).
If we omit the second component of this sum between the inequality signs, ρ/(2(n - 1)), since it is small in comparison with the remaining components, then after elementary transformations we obtain

P{(e^(2(z - t_α/√(n-3))) - 1)/(e^(2(z - t_α/√(n-3))) + 1) < ρ < (e^(2(z + t_α/√(n-3))) - 1)/(e^(2(z + t_α/√(n-3))) + 1)} = 1 - α.   (8)

The double inequality within the brackets determines the confidence region for ρ.

Let us now illustrate by an example the verification of hypothesis H_ρ.

Example 1.
Table 1 contains the data on the average monthly income (x) and the average monthly expenditures on drink (y) for 30 four-member families drawn from among the four-member families included in family budget studies in Lower Silesia.

TABLE 1
MONTHLY INCOMES AND MONTHLY EXPENDITURES ON DRINK FOR 30 FOUR-MEMBER FAMILIES

The scatter diagram shown on Graph 1 is based on the data from this table.

GRAPH 1. Monthly expenditures on drink plotted against monthly income (hundred zlotys/month).

The distribution of the points on the scatter diagram suggests that the relationship between the expenditures on drink and the size of income is very weak. In economic language this means that the want called "drink" is fully satisfied in Poland. This rather regrettable fact is generally known, and statistics serve only to confirm it. It follows from the calculations that the correlation coefficient between the expenditures on drink and income is 0.19. Let us formulate the hypothesis H_ρ that the correlation coefficient ρ = 0. To test this hypothesis let us calculate the value of t according to formula (2). We have

t = 0.19/√(1 - (0.19)²) · √28 = 1.04.

In Student's distribution tables for α = 0.05 with 28 degrees of freedom we find that t_α = 2.045. Since t = 1.04 < 2.045, there are no grounds for rejecting the hypothesis that the correlation coefficient between the expenditures on drink and income equals zero.
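The test just performed, together with the Fisher transformation of formulae (3) to (8), can be sketched as follows. The critical values (2.045 from Student's tables, 1.96 from the normal tables) are supplied by hand, and roundings may differ from the text's in the last digit.

```python
import math

r = 0.19   # sample correlation coefficient
n = 30     # sample size

# Formula (2): t = r / sqrt(1 - r^2) * sqrt(n - 2).
t = r / math.sqrt(1 - r * r) * math.sqrt(n - 2)

# Critical value for alpha = 0.05 and n - 2 = 28 degrees of freedom.
t_alpha = 2.045
reject = abs(t) > t_alpha

# Fisher's z, formula (3), and the confidence region (8) for rho;
# 1.96 is the normal critical value for a 0.95 confidence coefficient.
z = 0.5 * math.log((1 + r) / (1 - r))
half = 1.96 / math.sqrt(n - 3)
lo = math.tanh(z - half)   # (e^(2x) - 1)/(e^(2x) + 1) equals tanh(x)
hi = math.tanh(z + half)

print(round(t, 2), reject, round(lo, 2), round(hi, 2))
```

The confidence region for ρ contains zero, which agrees with the conclusion that H_ρ cannot be rejected.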
5. THE TRANSFORMATION OF CURVILINEAR INTO LINEAR REGRESSION

It may happen that the points on the scatter diagram are so distributed that we should reject the hypothesis H_L (see 4.2.) stating that the correlation is linear. We can then proceed in one of the following two ways: either by determining the parameters of the segments of straight lines forming a broken line, or by selecting a family of curves and determining the parameters of one of the curves belonging to this family. We shall not discuss here the method of determining the broken line, since this can be reduced to the determination of the parameters of a linear regression, but we shall deal with certain cases of curvilinear regression.

Suppose that the stochastic relationship between variables x and y can be well described by a function which is graphically presented as a curve. Let us consider a family of such functions which by appropriate transformations can be reduced to the linear form:

z = Av + γ,   (1)

where z = z(y), v = v(x), and A and γ are constants. Table 1 shows the most commonly used transformations and the functions that can be obtained by them. In equation (1) there are two parameters. However, it may happen that for the approximating curve to describe properly the distribution of the points on the diagram, the equation of the curve must be a function with more than two parameters. Then, of course, linear transformation cannot be used. In economic research, however, it is seldom necessary to use a type of function having more than two parameters.
TABLE 1
The functions reviewed in Table 1 have many applications in econometrics. For instance, in studies on demand we deal with hyperbolic relationships (see 2.2.3.). A parabola, an exponential curve and a logarithmic curve are used in the theory of wants (see 2.2.1.). In demographic research exponential curves are most frequently used (this will be discussed in Example 1). A hyperbola is used in the theory of costs (see 2.2.4.). To the analysis of time series various functions are applied, those most frequently used after the linear function being trigonometric functions and the function that is geometrically represented by a "logistic curve". Winkler's work [61] provides a review of the more important functional relationships in economics; for this reason the book is well worth reading. In studying techno-economic relationships (see 2.2.6.) and in determining distribution curves (see 2.2.2., Graph 1) various functions are used. However, in these cases, too, the functions most frequently employed are those shown in Table 1.

Linear transformation is used mainly because it enables us to satisfy the conditions required by the Markoff Theorem.
If the parameters of any approximating function obtained by formula (1) are determined by the method of least squares, then we know from the Markoff Theorem that these parameters are consistent, unbiased and the most effective estimates. Linear transformation is also useful because it considerably simplifies calculations. This is of great practical importance and, therefore, we shall discuss this aspect in greater detail.

As we know, in order to determine the values of the constant parameters in the equation of the approximating function y = g(x) by the method of least squares, the partial derivatives have to be calculated for the expression

S = Σᵢ [yᵢ - g(xᵢ)]².
GRAPHS 1-7.

These derivatives are to be equated to zero and the set of normal equations so obtained is then solved. It is usually difficult, and sometimes impossible, to solve this set by algebraic methods. The application of approximation methods further complicates the computation procedure, so that it is of little value in practice.
To illustrate, let us consider the exponential function

y = b aˣ.

We want to determine the parameters of this function and, therefore, we want

S = Σᵢ (yᵢ - b a^xᵢ)²

to be a minimum. We calculate the partial derivatives ∂S/∂a and ∂S/∂b and equate them to zero:

Σᵢ (yᵢ - b a^xᵢ) xᵢ a^(xᵢ-1) = 0,

Σᵢ (yᵢ - b a^xᵢ) a^xᵢ = 0.

As we can see, there are difficulties in solving this set of normal equations (containing only two unknowns). These difficulties disappear when we apply linear transformation. Thus we calculate ln yᵢ and minimize the expression

S' = Σᵢ (ln yᵢ - A xᵢ - γ)²,

where A = ln a, γ = ln b. From the solution of the corresponding set of normal equations we obtain A and γ by the known formulae for the parameters of a straight line. Having the values of A and γ we can calculate parameters a and b without difficulty.

It can be seen from the above comments that there are good reasons why linear transformation should be used. We have to remember, however, that in spite of the fact that the computation is thus considerably facilitated, it is still fairly difficult to determine the parameters of the regression line by the method of least squares. This is due to the fact that, by performing the calculations on the numbers xᵢ and yᵢ required for linear transformation, we obtain three-, four- or five-digit numbers which cannot be rounded off radically without endangering the accuracy of the calculations. Under these circumstances the two-point method is very useful, because it enables us to determine the regression parameters without cumbersome computations. This method is used in the numerical examples given below.

Example 1.

TABLE 2
GROWTH OF POPULATION IN SWEDEN 1750-1935

In Table 2 the population of Sweden is shown for 1750-1935 (see [61], p. 158). In column t the consecutive numbers of the years are given, and in column x the population figures rounded off to three significant digits.

Parameters A and γ are calculated by formulae (7) and (8), 3.3.1. We have

t̄(1) = 5.500000,  t̄(2) = 15.450000,  z̄(1) = 54.41207/10 = 5.44121,

A = 0.075548,

γ = ln b = 5.81756 - 0.075548 · 10.48 = 5.02582,  b = 148.

A straight line with the equation z = 5.02582 + 0.075548t is shown on Graph 8. As we can see, it fits very well the distribution of the points on the graph. These points show a clear linear tendency. Note that the points have been plotted in the coordinate system t0z and not in the system t0x, because Graph 8 shows the distribution of the points after the linear transformation. This transformation was needed for the determination of parameters a and b. After these parameters are found we can return to the original distribution, i.e. the distribution before the transformation. This distribution, together with the regression line fitted to it, is shown on Graph 9, p. 189. As can be seen, an exponential curve represents well the dynamics of population growth in Sweden. A major deviation can be noticed only in the first and last years of the period studied. It can be seen from the graph that the parameters of the curve have been properly chosen, and so it can be said that the dependence of the population growth in Sweden upon time, in the period 1750-1935, can be approximated by the exponential function with the equation

x = 148 e^(0.075548 t).

The application of the linear transformation enabled us to determine parameters a and b without difficulty; the known formulae for determining the parameters of the straight line were used. Substantial simplifications in the computation were also achieved by the application of the two-point method to the determination of the regression parameters.

GRAPH 8. The transformed points (t, z) with the fitted straight line; time on the horizontal axis.
GRAPH 9. The original distribution (t, x) with the fitted exponential curve; time on the horizontal axis.
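Both computations used in this section, the least-squares fit on the transformed data and the two-point shortcut, can be sketched together. The series below is invented (an exact exponential), not the Swedish data of Table 2, so both methods recover the same parameters.

```python
import math

# Invented series: x_t = 150 * e^(0.075 t) for t = 1..20.
ts = list(range(1, 21))
xs = [150 * math.exp(0.075 * t) for t in ts]
zs = [math.log(x) for x in xs]          # linear transformation z = ln x

# Method 1: least squares on (t, z).
n = len(ts)
tbar, zbar = sum(ts) / n, sum(zs) / n
A_ls = sum((t - tbar) * (z - zbar) for t, z in zip(ts, zs)) / \
       sum((t - tbar) ** 2 for t in ts)
b_ls = math.exp(zbar - A_ls * tbar)     # gamma = ln b

# Method 2: two-point method on the same transformed data.
h = n // 2
t1, z1 = sum(ts[:h]) / h, sum(zs[:h]) / h     # means of the first half
t2, z2 = sum(ts[h:]) / h, sum(zs[h:]) / h     # means of the second half
A_tp = (z2 - z1) / (t2 - t1)
b_tp = math.exp(zbar - A_tp * tbar)           # line through (t-bar, z-bar)

print(round(A_ls, 4), round(A_tp, 4), round(b_ls, 1), round(b_tp, 1))
```

On real data the two estimates differ slightly; the point made in the text is that the two-point values are obtained with far less arithmetic.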
Example 2. W. Stys has noticed during his demographic studies that there is an interesting relationship between the number of children in peasant families and the size of the farms possessed by these families. This relationship can be considered significant, since it appears very clearly on the basis of abundant statistical data (the sample covers 8,505 families). The relationship noticed by Stys can very well be described by a power function, whose parameters were calculated by the method of least squares with the application of a linear transformation (see [55]).
Since in the linear transformation it is necessary to take logarithms, which provide five-digit numbers, and since the frequency of the distribution is expressed by numbers with three or four significant digits, the computations connected with the determination of the parameters of the regression curve by the method of least squares are cumbersome and time-consuming. Incomparably simpler calculations are required for the two-point method. To illustrate, we shall compute the parameters of the power curve by this method. The data are taken from Table 3.
TABLE 3
NUMBER OF CHILDREN IN, AND FARMS OWNED BY, PEASANT FAMILIES

Variable Y denotes the average number of children born to a farmer of the previous generation, and variable X stands for the size of the farm. Assuming that the relationship between variables X and Y is expressed by the formula

y = b xᵃ,

then after using logarithms we get the linear expression

z = Av + γ,

where z = log y, v = log x, A = a, γ = log b.
It should be explained that the division of the sample ω into the two subgroups ω₁ and ω₂ required for the two-point method has been done in such a way as to make the frequencies of these subgroups as close to one another as possible. Because of the asymmetry of the distribution of the variable (X, Y), subgroup ω₁ contains 4 classes of the frequency distribution, and subgroup ω₂ 8 classes (with frequencies 4,437 and 4,068 respectively).

Below are the calculations connected with the determination of the values of parameters A and γ:

A = 0.147,

γ = 0.8531 - 0.147 · 0.7869 = 0.7374,

hence a = 0.147, b = 5.46.

Therefore the equation of the regression curve determined by the two-point method is

y = 5.46 x^0.147.
GRAPH 10. Number of children plotted against the size of farm (in hectares).
On Graph 10 a scatter diagram is shown with the two regression curves determined by the classical method (broken line) and by the two-point method (continuous line). It can be seen from the graph that both curves are good representations of the distribution of the points. However, the determination of the regression curve by the two-point method is much easier and, therefore, this method turns out to be the more useful in this example.
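The two-point fit of the power curve can be sketched as follows. The class means below are invented (generated exactly from a power law), not the figures of Table 3, and the 4/8 split of the classes mirrors the one described above.

```python
import math

# Invented grouped data: class means of farm size x (hectares) and the
# average number of children y, generated exactly from y = 5.5 * x^0.15.
xs = [1, 2, 3, 4, 6, 8, 10, 12, 15, 20, 25, 30]
ys = [5.5 * x ** 0.15 for x in xs]

# Logarithmic transformation: v = log10(x), z = log10(y), z = A*v + gamma.
vs = [math.log10(x) for x in xs]
zs = [math.log10(y) for y in ys]

# Two-point method with unequal subgroups (4 and 8 classes).
k = 4
v1, z1 = sum(vs[:k]) / k, sum(zs[:k]) / k
v2 = sum(vs[k:]) / (len(vs) - k)
z2 = sum(zs[k:]) / (len(zs) - k)

A = (z2 - z1) / (v2 - v1)                          # a
gamma = sum(zs) / len(zs) - A * sum(vs) / len(vs)  # gamma = z-bar - A*v-bar
a, b = A, 10 ** gamma
print(round(a, 3), round(b, 2))
```

Since the base-10 logarithm was used in the transformation, b is recovered as 10 raised to γ, exactly as in the example above.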
6. THE REGRESSION LINE AND THE TREND

6.1. The definition of trend¹

Let us denote by Ω_t a collection of items composing a general population. To each item of this collection a value of the random variable X_t is assigned; it is known that its distribution depends on the time t. The relationship is such that

E(X_t) = ψ(t),   (1)

where ψ(t) is a function determined for at least those values of t that satisfy the inequality 0 ≤ t ≤ m. In the special case when ψ(t) is linear, formula (1) assumes the following form:

ψ(t) = αt + β,   (2)

where α and β are constants.

Let us denote by r and s two moments of time of which we know that 0 ≤ r < s ≤ m. Points r and s determine a certain time interval T whose length is T = s - r. Let us divide the length of time T into n + 1 parts by the points t = 1, 2, ..., n. These parts we shall call segments. At every moment of time t (t = 1, 2, ..., n) we draw from population Ω_t, and return to it, k ≥ 1 items, examine which values of random variable X_t correspond to these items, and calculate the arithmetic mean of these values. In this way we get n pairs of numbers (1, x̄₁), (2, x̄₂), ..., (n, x̄ₙ). These numbers constitute a time series. If function ψ(t) is linear, it is to be expected that if we represent the whole time series on the graph the time curve will also display a linear tendency.

¹ Published in Przegląd Statystyczny (Statistical Review), No. 3/4, 1958.
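The sampling scheme just described can be sketched as a simulation. Everything here is invented for illustration: the linear trend ψ(t) = 2t + 10, the noise, and the sample size k = 5 per moment of time.

```python
import random

random.seed(0)

def psi(t):
    # Invented linear trend psi(t) = alpha*t + beta.
    return 2 * t + 10

n, k = 24, 5
series = []
for t in range(1, n + 1):
    # Draw k items (with replacement) from population Omega_t:
    # values of X_t scattered around psi(t).
    draws = [psi(t) + random.gauss(0, 1) for _ in range(k)]
    series.append((t, sum(draws) / k))   # the pair (t, x-bar_t)

# The sample means track the trend, so the time curve of the pairs
# (t, x-bar_t) displays a linear tendency.
print(series[0][1] < series[-1][1])
```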
The above model of a random process dependent upon time enables us to formulate the following definition:

Definition 1. Function ψ(t) is a trend of the random variable X_t depending upon time (see also [15], [27], [32]).

Every time series can, of course, be regarded as a realization of the hypothetical random variable X_t corresponding to items from the general population drawn for the sample at the moments of time 1, 2, ..., n. The advantage of the above definition of trend is that it does not use any non-statistical concepts such as "law" or "tendency", but employs only notions having a definite meaning in statistics. For this reason this definition is not subject to any reservation of a formal nature.
In connection with our remarks concerning a correct definition of trend, one important problem requires explanation. Before we describe its nature let us discuss an example of a time series. Suppose that we are conducting statistical research on the dynamics of the average sugar-beet yield per hectare on individually owned peasant farms in the whole country. Every year we draw a sample from the total number of farms, and on the basis of the data on sugar-beet crops obtained from this sample we calculate the average sugar-beet yield on the farms drawn for the sample. In this way we get a time series. The function ψ(t) in this example is a function assigning average sugar-beet yields in the whole country to consecutive years.

This function is not known. We can estimate it on the basis of the data from the time series. It is not easy, however, because:

1) statistical research can be conducted only once a year, after harvests; therefore, the flow of statistical data is very slow;
2) the function ψ(t) is certainly not an expression of the functioning of some law, and its graphical presentation will not be a smooth, "nice" curve; on the contrary, it can be expected that the curve will be irregular, "whimsical", and will have bends and twists.
We are now going to formulate the problem mentioned before. It consists in finding a way of estimating the function ψ(t) on the basis of a time series, if we know that the curve may have a completely irregular shape. If we reject the assumption, which is difficult to accept, that every time series is governed by some law, and if we define the trend as we did above, then we have to recognize that the curve can have any shape, which means that it may also depend on random factors.

A line determined on the basis of time series data is an estimate of the shape of the curve ψ(t). The time series may be interpreted as the data from a sample taken from a population Ω_t which changes with time. When the time series is interpreted in this way we can speak not only of a trend in the population but also of a trend in the sample, which can be considered as an independent population. The trend in the sample is, of course, the time series curve x̄₁, x̄₂, ..., x̄ₙ. Let us denote the trend in the sample by Λ(t). When the whole population is analysed, i.e. when the sample is identical with the population, then ψ(t) = Λ(t).

The trend of the sample is an estimate of the trend of the population. Let ω_t be a sample composed of k items drawn at the moment t. Because of our assumption that items for the sample are drawn from population Ω_t and then returned, the conditions of the theorem on the stochastic convergence of the arithmetic mean of the sample to the arithmetic mean of the population are satisfied. On the basis of this theorem we can state that

P{|x̄_t - ψ(t)| < ε} → 1  as  k → ∞.   (3)
The above formula explains how to estimate the trend of a population by the trend of a sample. Note that all considerations concerning random variable X_t, which depends on time, apply exclusively to the interval 0 ≤ t ≤ m.

Let us now consider the situation in which the sample contains only one item at a given moment of time t = 1, 2, ..., n, and the situation in which at every moment of time the number of items in the sample is equal to the number of items in the population. In both these situations, of course, ψ(t) = Λ(t). In practice it seldom happens that the trend line is expressed by a simple, uncomplicated function. Hence, when we know from experience the values of function ψ(t) (and this is so in both cases mentioned), function ψ(t) can be replaced by another function Θ(t) which will be represented on the graph by a smooth curve, free of the irregular breaks of the curve representing ψ(t). We shall call function Θ(t) trend II. Let us define this term. Let Θ(t) be a function of variable t determined for r ≤ t ≤ s, and let H_t be the statistical hypothesis that the values of the time series x̄₁, x̄₂, ..., x̄ₙ at the moments of time t ∈ T can differ at most at random from the corresponding values of function Θ(t).

Definition 2. The set of functions R determined for t ∈ T, and such that for a certain α satisfying the condition 0 < α < 1 the hypothesis H_t cannot be rejected, is called trend II of the time series x̄₁, x̄₂, ..., x̄ₙ.

It follows from this definition that any function for which the deviations of the time series are of a random nature is a trend II. In particular, a trend II is a function represented by a broken line passing through all points (t, x̄_t).

Definition 2 poses the problem of the choice of function Θ(t).
It might appear at first glance that from among the functions Θ(t) belonging to R we should select the function for which the sum of the absolute deviations of the values of the time series from the values of Θ(t) is a minimum. However, this is not so. The minimum sum of absolute deviations will belong to the broken line passing through all points (1, x̄₁), (2, x̄₂), ..., (t, x̄_t), .... This sum, of course, will equal zero. However, the following considerations are against the choice of this function. The notion of the trend has been introduced into the analysis of time series for two reasons:

1) because, on the assumption of a status quo, the trend line enables us to predict the development of the phenomenon to be studied in the future;

2) because it permits us to describe functionally the development of this phenomenon in the past; this description plays an important role in all cases in which it is necessary to eliminate a tendency from the time series¹.

It follows from point 1) that the more the scientific predictions based on the trend line extend into the future, the more valuable they are in practice. From the formal point of view such predictions are an ordinary extrapolation of the trend line. Naturally, objectively justified extrapolation is possible only for simple functions which are graphically represented by continuous curves not having many fluctuations and breaks. If we were to extrapolate a curve having an irregular, complicated shape, we would be unable to provide a sufficiently convincing explanation for a bend or break in the extrapolated part of the curve. This leads to the necessity of selecting only the simplest functions from among those belonging to set R. There is a contradiction between these two criteria for choosing function Θ(t) from set R: that the shape of function Θ(t) be as simple as possible, and that the sum of the absolute deviations of the values of the time series from the values of function Θ(t) be as small as possible. It is the author's opinion that in selecting function Θ(t) we have first to consider the requirement that the shape of the function be as simple as possible, and then the requirement that the sum of the deviations be a minimum.

In Definition 2 it is stated that every function Θ(t) can be a trend II line if the deviations of the values of the time series from the values of the function are random. An analysis whether a function belongs to set R consists in the verification of hypothesis H_t. There are many tests that can be used to verify hypothesis H_t. We shall discuss here only the most useful.

¹ Such a necessity is most likely to appear when the problems studied deal with stationary stochastic processes, and in particular with the theory of correlation of stationary random variables.

6.2. Some tests for verifying hypothesis H_t

6.2.1. The run test

The run test (see 4.2.2.) can be used to verify hypothesis H_t.
We shall demonstrate this by an example.

Example 1. Table 1 contains monthly operating data from the Wroclaw Transport Corporation. The data cover a period of two years and are expressed in thousands of car-kilometres.

TABLE 1
CAR-KILOMETRES OPERATED IN WROCLAW

The time curve based on these data is shown on Graph 1. This curve displays a clearly marked growth tendency.
GRAPH 1. Car-kilometres operated (thousands) plotted against time (months).
A linear function has been used to express this tendency:

x = at + b,   (1)

in which

b = x̄ - a t̄,   (2)

where x̄ = (1/n) Σ x̄_t and t̄ = (1/n) Σ t, the sums running over t = 1, 2, ..., n. When n is an even number (in practice this condition can always be satisfied), then

a = (4/n²) (Σ_(t>n/2) x̄_t - Σ_(t≤n/2) x̄_t).   (3)

In our example we should check whether the straight line expressed by the equation x = at + b is a trend II line. The parameters a and b of this line are determined by (2) and (3). We have

a = (17,389 - 14,022)/12² = 23.4,

x̄ = (14,022 + 17,389)/24 = 1,309,   t̄ = 12.5,

b = 1,309 - 23.4 · 12.5 = 1,017.

We would like to draw the reader's attention to the simplicity of the computations involved in the determination of parameters a and b by formulae (2) and (3). These computations are much simpler than those required for the method of least squares. The method of determining the parameters of the trend line proposed by the author is a special case of the determination of the regression parameters by the two-point method¹.

¹ The usefulness of this method for determining trends was indicated to the author by J. Oderfeld.
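Formulae (2) and (3) can be sketched directly. The 24 monthly values below are invented stand-ins, not the figures of Table 1.

```python
# Invented stand-in series of n = 24 monthly values x-bar_t,
# built around the line 23*t + 1040 with a small alternating wobble.
xbar_t = [1040 + 23 * t + (7 if t % 2 else -7) for t in range(1, 25)]
n = len(xbar_t)

# Formula (3): a = 4 * (sum of second half - sum of first half) / n^2.
a = 4 * (sum(xbar_t[n // 2:]) - sum(xbar_t[:n // 2])) / n ** 2

# Formula (2): b = x-bar - a * t-bar.
x_mean = sum(xbar_t) / n
t_mean = sum(range(1, n + 1)) / n
b = x_mean - a * t_mean
print(round(a, 1), round(b, 1))   # prints: 23.0 1040.0
```

The wobble cancels between the two halves, so the formulae recover the underlying line exactly, with only two sums and a handful of arithmetic operations.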
Let us denote by A an event in which the point with coordinates (t, x̄_t) is located above the trend line x = at + b, and by B an event in which such a point is located below the trend line. We do not take into account the points located exactly on the trend line; in practice this situation has no chance of occurring, since its probability is zero. As we can see from the calculations and from Graph 1, in our example the following sequence of events occurred:

ABABABAABBBBABABBAAABAAA.

The maximum length of run in this sequence is k = 4. It follows from Table 1, 4.2.1., that for n = 24 the value of k required to reject hypothesis H_t at the level of significance α = 0.05 would have to be at least 8. This means that there are no grounds for rejecting the hypothesis that the values of the time series deviate from the line x = 23.4t + 1,017 only at random. In accordance with Definition 2 the line with this equation is a trend II line of the time series under consideration.
The x 2
The x* test
assume that
from the
if
line
test
H
can be used to verify hypothesis Let us the deviations of the values of the time series t
x
= at+b
.
are random, then the probability
of a positive deviation equals the probability of a negative deviation, so
P(A) Let us denote by sufficiently
large
r
the
= P(B)=
number of
sample,
the
1/2.
events A. In this case, for a
distribution
of the random
variable
2 approximates the x distribution with one degree of freedom.
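Applied to the event sequence of the previous example, the statistic can be computed as follows; 3.84 is the χ² critical value for one degree of freedom at α = 0.05, taken from tables.

```python
events = "ABABABAABBBBABABBAAABAAA"
n = len(events)
r = events.count("A")          # number of positive deviations

chi2 = (2 * r - n) ** 2 / n    # approximately chi-square with 1 d.f.
chi2_critical = 3.84           # alpha = 0.05, one degree of freedom
print(round(chi2, 2), chi2 < chi2_critical)
```

As with the run test, the statistic gives no grounds for rejecting H_t for this series.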
6.2.3. Pitman's test

Both tests described above are very simple to use. Their drawback is that they react only to the sign of the deviations of the values of the time series from the trend line and not to the magnitude of these deviations. Pitman's test is free of this drawback. We shall now discuss this test (see [33], pp. 128-131).
Suppose that we have two samples ω₁ and ω₂ with frequencies m and r respectively. In sample ω₁ the sequence of values y₁, y₂, ..., y_m has been obtained, and in sample ω₂ the sequence of values z₁, z₂, ..., z_r. Let us define

v = ȳ − z̄.   (1)

The number of combinations of m + r items taken m at a time equals N = C(m+r, m). This is the number of ways a set of m + r items can be divided into two subgroups numbering m and r items respectively. Samples ω₁ and ω₂ form one such division into subgroups of m and r items.

Let us denote by M the number of such divisions which have a certain property W distinguishing them from the remaining N − M divisions. In this case the probability of the occurrence of a division with property W is equal to the fraction M/N. Let us introduce the quantity

R = |ȳ − z̄|,   (2)

which we shall call the range of the division, or briefly the range. Let property W consist of R ≥ R₀, where R₀ is a certain positive number. Let us select R₀ so that M/N ≤ α, where α is a real number satisfying the condition

0 < α < 1.   (3)

We shall assume that samples ω₁ and ω₂ come from the same population, i.e. that they can differ from one another only at random, if the corresponding range R is smaller than R₀. Otherwise we shall consider that ω₁ and ω₂ come from two different populations.
Let us denote by y_i (i = 1, 2, ..., m) the positive deviations of the values of the series from the probable trend line, i.e.

y_i = [x_t − x̂_t] > 0,   (4)

and by z_j (j = 1, 2, ..., r) the negative deviations of this series, i.e.

−z_j = [x_t − x̂_t] < 0.   (5)

The computations connected with the verification of hypothesis H_t by Pitman's test will be explained on a numerical example.
Example 1. The average employment in Poland in 1949-1955 was as follows (see [65], p. 277):

Years                             1949  1950  1951  1952  1953  1954  1955
Employment in tens of thousands   43.5  51.6  56.3  58.9  62.7  65.2  67.6

Assuming α = 0.05, check whether the line x_t = 3.5t + 44 may be considered a trend II line. The deviations of the time series from a line with this equation are given in Table 1.
TABLE 1. EMPLOYMENT IN POLAND 1949-1955

[table body not reproduced]

The total number of combinations in our example equals

N = C(7, 2) = 21.

Hence M ≤ 0.05 · 21 = 1.05, i.e. M = 1.
To arrange the combinations according to declining values of R we shall use the formula

R_v = |(S − Σ_j z_jv)/m − (Σ_j z_jv)/r|,   (6)

where z_1v, ..., z_rv are the values assigned to the second subgroup in the v-th division and S denotes the sum of all m + r values. In our example r = 2, and for the observed division formula (6) gives R₀ = 2.12. Here are 3 out of the 21 combinations for which the corresponding pairs of numbers inserted into formula (6) give a value not less than 2.12:

4.0  1.8
4.0  0.9
4.0  0.9

Since to reject hypothesis H_t the number of combinations M may not be greater than 1, and in our example there are 3, there are no grounds for rejecting hypothesis H_t.

Pitman's test is awkward to use because of the necessity of finding combinations with property W. The computations become cumbersome when the total number of combinations N is large. In such cases, however, we can use the random variable

w = (ȳ − z̄) / (s · sqrt(1/m + 1/r)),   (7)

which has a distribution similar to Student's distribution with the number of degrees of freedom k = m + r − 2, where s denotes the standard deviation calculated on the basis of the data from both samples ω₁ and ω₂:

s² = [Σ_i (y_i − ȳ)² + Σ_j (z_j − z̄)²] / (m + r − 2).   (8)
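For small samples the divisions can be enumerated directly. The following sketch (ours, not the book's) applies the permutation logic to the employment example; the deviations are recomputed here from x_t = 3.5t + 44, and since the book's Table 1 is not reproduced its printed values (R₀ = 2.12, M = 3) differ slightly from the recomputed ones. Either way M/N far exceeds 0.05:

```python
from itertools import combinations

# Average employment in Poland 1949-1955 and the candidate trend x_t = 3.5t + 44
employment = [43.5, 51.6, 56.3, 58.9, 62.7, 65.2, 67.6]
dev = [x - (3.5 * t + 44.0) for t, x in enumerate(employment, start=1)]

# Sample omega_1: magnitudes of positive deviations; omega_2: of negative ones
values = [abs(d) for d in dev]
m = sum(1 for d in dev if d > 0)                 # m = 5
r = len(dev) - m                                 # r = 2
obs = [i for i, d in enumerate(dev) if d < 0]    # observed division
R0 = abs(sum(values[i] for i in obs) / r
         - sum(values[i] for i in range(len(dev)) if i not in obs) / m)

# Enumerate all N = C(m+r, r) divisions into subgroups of m and r items
N, M = 0, 0
for idx in combinations(range(len(values)), r):
    z = [values[i] for i in idx]
    y = [values[i] for i in range(len(values)) if i not in idx]
    N += 1
    if abs(sum(y) / m - sum(z) / r) >= R0 - 1e-9:
        M += 1

print(N, m, r, round(R0, 2), M)  # M/N plays the role of the p-value
```

With M/N = 4/21 ≈ 0.19 > α = 0.05, the observed division is not unusual among all divisions, which matches the book's conclusion that there are no grounds for rejecting H_t; for large N the Student-type approximation described in the text avoids the enumeration.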
6.3. The determination of trend ex post and ex ante

Let us consider two examples, taken from real life, in which it is necessary to determine trends.

Example 1. It is desired to study the relationship, in a certain enterprise, between the amount of production outlay Y and the volume of production X. The purpose of this research is to find the regression equation of Y on X. The knowledge of this equation is of great practical importance since it enables us to assign to a given volume of production the expected size of outlay.

The question arises, however, whether production and outlay, or at least one of these quantities, are not correlated with time, since the efforts of the employees are constantly concentrated on increasing production and lowering costs, thus creating a regular factor which would explain such a correlation of variables X and Y with time. To answer this question we have to check whether variables X and Y show a time tendency. There are no difficulties as far as variable X, i.e. production, is concerned. The situation is more difficult with respect to variable Y, i.e. outlay, which may be correlated not only with time, but also with variable X. If variable Y is correlated with time then instead of the equation of the regression line we have to calculate the equation of the regression surface. When any of the variables shows a time tendency, we always have to remember that it is not advisable to use the same regression equation for two periods which are far apart. As we can see, an analysis of tendencies plays an important part in proper research on the relationship between outlay and production.
Example 2. The workers in a mine have reported that they have extracted more coal than in the preceding month. They consider that this is an achievement deserving notice. It seems, however, that only those production effects can be regarded as achievements which are of a permanent nature. The workers of a mine can claim a worth-while achievement in increasing production only when the production trend is an increasing function of time. There is hardly any economic advantage in increasing production in one month in comparison with the preceding month if it drops considerably in the following month. Such fluctuations in production may be caused by random factors and they cannot constitute a basis for an appraisal or an economic decision.
In the two examples given above the trend line was needed to appraise ex post the phenomenon studied. The conclusions reached on the basis of the trend line pertain to the past. Below is the procedure connected with the determination of the trend line needed for the analysis of a given phenomenon in the past:

1) the accumulation of statistical data for the period to be studied;
2) the preparation of a time graph on the basis of the accumulated statistical data;
3) the determination of the equation of the line which is to express the tendency of the time curve;
4) the verification by appropriate methods of whether this line can be considered a trend line.

If the determination of the trend line follows the order mentioned above, then we shall regard this line as determined ex post.
The situation is different when the trend line is determined currently as the statistical data become available, when the trend appears before the statistical analysis is finished, and when it is used not so much for the appraisal of a given phenomenon in the past as for predicting its behaviour in the future. In this case the procedure involved in the determination of the trend is as follows:

1) the selection of the significance coefficient α;

2) the determination of the minimum value of n which enables us to reject hypothesis H_t at the level α; if we verify hypothesis H_t by a run test, then for α = 0.05, n = 10 (see Table 1, 4.2.2.). Note: when the two-point method is used for the determination of the parameters of the trend, then n has to be an even number;

3) on the basis of n consecutive points of the time curve the equation of the straight line x_t^(1) = a₁t + b₁ is determined, and the hypothesis H_t^(1) is formulated that in the interval [1, n] the equation of the trend line is x = a₁t + b₁. If hypothesis H_t^(1) is not rejected, we formulate the hypothesis that this equation is the equation of the trend line in the interval [1, n+1] (or [1, n+2] when the two-point method is used). We continue this procedure until there appear grounds for rejecting hypothesis H_t^(r), where r is the number of the hypothesis at the time of rejecting it;

4) after rejecting hypothesis H_t^(r), the new line x_t^(r) = a_r t + b_r is determined on the basis of the n − 1 last points of the time curve and a new hypothesis is formulated; it is considered that, since none of the remaining points provides grounds for rejecting hypothesis H_t^(r−1), the equation x_t^(r−1) = a_{r−1} t + b_{r−1} is the equation of trend II of these points.

The procedure is then repeated according to the instructions in points 3 and 4.
This procedure prevents us from recognizing as a trend line a curve which does not satisfy the condition of random deviations formulated in Definition 2. On the other hand, this procedure enables us to determine the trend line currently, without waiting until "the law governing the time series" emerges. The trend line obtained in this way we shall call the ex ante line, since it is determined before the statistical analysis is finished. The procedure involved in the determination of the ex ante line is a sequential procedure. The ex ante trend line is composed of different straight-line segments following each other, and the equation of this line is written as a sequence of linear functions corresponding to these straight lines. This is a little troublesome, but it should be remembered that after the accumulation of sufficient statistical data the ex ante trend line can always be replaced by the ex post trend line.
Example 3. Table 1 contains the data on the monthly production of automobiles in the United States in 1905-1928. The time curve based on the data from Table 1 is shown on Graph 1. It is a broken line shown as a thick line on the Graph. The thin broken line composed of two segments is a trend II line determined by the sequential procedure. The equation of the first segment of the trend line is x_t = 3.8t − 5. The straight line of this equation is a trend line for 1905-1912. The trend for the following years is the line x_t = 22.5t − 168.

TABLE 1. MONTHLY AUTOMOBILE PRODUCTION IN THE USA 1905-1928 (see [35], pp. 193-194)

[table body not reproduced]
The dotted line shown on Graph 1 is the ex post trend line. It is a logistic curve with the equation

x_t = 320.83 / (1 + e^(1.4925 − 0.1569t)).

To determine the equation of this curve the data for 1903-1941 were used. The statistical data for 1929-1941 are not shown in Table 1. The author believes that the deviations from the trend line can only be of a random nature; however, two events which took place in the period 1929-1941 made it impossible to analyse the majority of economic phenomena by trend methods. These events were the economic crisis and the war.
GRAPH 1. [Graph not reproduced: the time curve, the broken trend II line, and the continuous logistic ex post trend, plotted for t = 0, 2, 4, ..., 24.]

As can be seen from Graph 1, the broken line with the equation

x_t = 3.8t − 5      for 1 ≤ t ≤ 8,
x_t = 22.5t − 168   for 9 ≤ t ≤ 24,   (1)

and the continuous line with the equation

x_t = 320.83 / (1 + e^(1.4925 − 0.1569t)),   0 ≤ t ≤ 24,   (2)

have approximately the same shape. Both lines can be regarded as trend II lines because, according to Definition 2 in 6.1., these lines are equivalent since they belong to set R.
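The two equations can be compared numerically by tabulating them over the common interval. A minimal sketch using the constants as printed (the units are those of Table 1, which is not reproduced; function names are ours):

```python
import math

def trend_broken(t):
    """Trend II broken line, equation (1)."""
    return 3.8 * t - 5 if t <= 8 else 22.5 * t - 168

def trend_logistic(t):
    """Ex post logistic trend, equation (2)."""
    return 320.83 / (1 + math.exp(1.4925 - 0.1569 * t))

for t in range(1, 25):
    print(t, round(trend_broken(t), 1), round(trend_logistic(t), 1))
```

The logistic curve rises towards its ceiling 320.83, while the broken line switches to the steeper slope at t = 8; both describe the same accelerating growth.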
This does not mean that it is a matter of indifference which of these lines should be considered as more useful in practical applications. The great advantage of line (1) is the simplicity of the computations involved in the determination of its parameters and the lack of any suggestion that there exists a law governing the stochastic process under consideration. Its drawback is that it is a broken line composed of straight-line segments. The advantage of the line with equation (2) is that it is a continuous line without breaks and can be expressed analytically as one function. Its disadvantages are the more difficult computations involved in the determination of its parameters and the temptation to interpret the equation of the line as a law, expressed in mathematical language, governing the development in time of the phenomenon studied. According to the author, both lines can be of service in the analysis of time series provided the notion of the trend is properly interpreted.

The definition of the trend proposed in this work has important practical implications:

a) it introduces the concept of the set R of functions which can be considered as trend functions; in the interpretation hitherto prevailing each function could be a trend function;
b) it simplifies the computations involved in the determination of the parameters of the trend line;
c) it makes possible the discovery and recognition of regular fluctuations (seasonal or cyclical) in the time series, if such fluctuations exist; they will be shown by a broken line determined by the sequential procedure;
d) it enables us to reduce the random variable X_t to a form in which this variable is not dependent upon time; this transformation is accomplished by the formula

X'_t = X_t − x̂_t.
The determination of the trend is of great practical importance to economic research because the correct knowledge of economic processes is possible only when these processes are interpreted dynamically. Statistical methods of determining trends are part of a branch of mathematical statistics known as the theory of stochastic processes, which has developed rapidly in recent years. It is difficult to take time into account in scientific research, and it is not surprising that the more important achievements of statistics in this field are only a matter of recent years. As we know, correlation analysis is one of the main research tools used in the theory of stochastic processes. Thus new and broad fields for applications are opening up before the correlation and regression methods. This leads us to believe that correlation and regression theory will be studied with interest and, in consequence, will be further developed.
APPENDIX

PROOFS OF THEOREMS AND STATISTICAL DATA USED IN THE BOOK

Proof of Theorem 1 from 3.3.1.

To prove the theorem we have to show that the points (x̄₁, ȳ₁), (x̄₂, ȳ₂) and (x̄, ȳ) are located on the same straight line, i.e. that

(ȳ − ȳ₁)(x̄₂ − x̄₁) = (ȳ₂ − ȳ₁)(x̄ − x̄₁).

In the proof we shall use the following

Lemma 1. x̄ − x̄₁ = ((n − k)/n)(x̄₂ − x̄₁).

Proof of the Lemma: since n·x̄ = k·x̄₁ + (n − k)·x̄₂, we have

x̄ − x̄₁ = ((n − k)/n)(x̄₂ − x̄₁).

Similarly we can prove

Lemma 2. ȳ − ȳ₁ = ((n − k)/n)(ȳ₂ − ȳ₁).

The correctness of the theorem follows directly from Lemmas 1 and 2.

Note: a similar proof can be given for the more general theorem that the points (x̄₍₁₎, ȳ₍₁₎), (x̄₍₂₎, ȳ₍₂₎), (x̄, ȳ) are located on the same straight line when the set ω is arbitrarily divided into two subgroups, e.g. by means of a number x₀ satisfying the inequality

x_min ≤ x₀ ≤ x_max,

where x_min denotes the smallest abscissa and x_max the greatest abscissa of the points belonging to ω.
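The collinearity of the three mean points is easy to verify numerically for any division of a sample. A small check of ours (arbitrary illustrative data; the cross-product of the two difference vectors must vanish):

```python
def mean_point(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Arbitrary sample, divided into two subgroups of k and n - k points
sample = [(1, 2), (2, 3), (3, 7), (4, 8), (5, 9), (6, 13)]
g1, g2 = sample[:2], sample[2:]

x1, y1 = mean_point(g1)
x2, y2 = mean_point(g2)
x, y = mean_point(sample)

# (x̄, ȳ) lies on the line joining (x̄₁, ȳ₁) and (x̄₂, ȳ₂):
cross = (y - y1) * (x2 - x1) - (y2 - y1) * (x - x1)
print(cross)  # 0 up to rounding error
```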
COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.1.2.

[table body not reproduced; the summary values are x̄ = 190.4, ȳ = 179, u = 190, w = 180]
COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.2.2.

[table body not reproduced; the summary values are x̄ = 120.2, ȳ = 40.7, u = 120, w = 40, Σ(x − x̄)(y − ȳ) = 7,026, Σ(x − x̄)² = 49,745, Σ(y − ȳ)² = 1,341]
COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.3.

[table body not reproduced]

b = 133.85 − 0.28 · 280.95 ≈ 55.
COMPUTATION TABLE FOR EXAMPLE 2 FROM 4.3.

[table body not reproduced]

The determination of the parameters of line I:

x̄ = 117.4, ȳ = 439.9, u = 120, w = 450, x̄ − u = −2.6, ȳ − w = −10.1,
Σ(x − x̄)(y − ȳ) = 35,454 − 20 · (−2.6)(−10.1) = 34,929,
Σ(x − x̄)² = 51,087 − 20 · (−2.6)² = 50,952,
a₁ = 34,929 / 50,952 = 0.69,
b₁ = 439.9 − 0.69 · 117.4 = 359.

The determination of the parameters of line II:

x̄ = 93.9, ȳ = 457.3, u = 120, w = 450, x̄ − u = −26.1, ȳ − w = 7.3,
Σ(x − x̄)(y − ȳ) = 32,405 − 12 · (−26.1)(7.3) = 34,691,
Σ(x − x̄)² = 26,099 − 12 · (−26.1)² = 17,953,
a₂ = 34,691 / 17,953 = 1.93,
b₂ = 457.3 − 1.93 · 93.9 = 276.5.

The determination of the parameters of line III:

x̄ = 152.5, ȳ = 414, u = 120, w = 450, x̄ − u = 32.5, ȳ − w = −36,
Σ(x − x̄)(y − ȳ) = 3,049 − 8 · (32.5)(−36) = 12,409,
Σ(x − x̄)² = 24,988 − 8 · (32.5)² = 16,538,
a₃ = 12,409 / 16,538 = 0.75,
b₃ = 414 − 0.75 · 152.5 = 300.
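The computations above all use the shortcut with arbitrary constants u and w, Σ(x − x̄)(y − ȳ) = Σ(x − u)(y − w) − n(x̄ − u)(ȳ − w). A small check of the identity (on illustrative data of ours) and of the slope of line I recomputed from the printed totals:

```python
def corrected_sum(xs, ys, u, w):
    """Shortcut: the sum about (u, w), corrected to the sum about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    raw = sum((x - u) * (y - w) for x, y in zip(xs, ys))
    return raw - n * (xbar - u) * (ybar - w)

xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 2.5, 5.0, 9.0]
xbar, ybar = sum(xs) / 4, sum(ys) / 4
direct = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
shortcut = corrected_sum(xs, ys, u=3.0, w=4.0)
print(direct, shortcut)  # the two agree for any choice of u and w

# Slope of line I recomputed from the totals printed above
a1 = 34929 / 50952
print(round(a1, 2))      # 0.69
```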
LIST OF SYMBOLS

Symbol            The meaning of the symbol                                        Page*

α                 the significance level                                           147
a₂₁               the regression coefficient of Y on X in a population             29
a₁₂               the regression coefficient of X on Y in a population             29
a                 the estimate of a regression coefficient                         101
a₂₁ class         the estimate of a₂₁ obtained by the method of least squares      138
a₂₁ point         the estimate of a₂₁ obtained by the two-point method             138
A                 a constant term in the equation of the linear regression
                  of Y on X in a population                                        29
A'                a constant term in the equation of the linear regression
                  of X on Y in a population                                        29
b                 the estimate of the constant term                                102
C                 consumption                                                      62
cov               covariance in a population                                       23
c                 covariance in a sample                                           102
D                 set of events                                                    12
d                 the class interval of X                                          116
D_i               demand for the i-th commodity                                    67
ΔC                increase in consumption                                          69
e_i               the i-th residual                                                32
E₁                energy produced                                                  79
E₂                energy used up                                                   79
ε                 a non-negative constant                                          138
e                 the relative efficiency of the regression coefficient
                  obtained by the two-point method                                 139
f(x, y)           the density function of a two-dimensional random variable        18
f₁(x)             marginal density                                                 18
F(x, y)           two-dimensional distribution function                            17
f(y | x)          the conditional density function                                 20
φ                 rotation angle in a sample                                       25
H                 hypothesis that the regression line in the population
                  is a straight line                                               159
H_t               hypothesis that the line belongs to set R                        196
H                 hypothesis that the regression lines obtained by the
                  classical method and the two-point method do not
                  differ significantly                                             149
η                 correlation ratios                                               37
k                 the number of classes in the frequency distribution of X         115
k                 the number of points with abscissa greater than x̄                125
l                 the number of classes in the frequency distribution of Y         115
E(X), E(Y)        the expected values of X and Y                                   21
E(XY)             the expected value of the product XY                             22
m_kl, μ_kl        the moments of a two-dimensional variable                        22-23
n                 the size of the sample                                           100
N                 population                                                       69
N(0, 1)           normal distribution with parameters m = 0, σ = 1                 139
Ω                 general population                                               100
ω                 sample                                                           100
Θ                 the rotation angle of a population                               42
p_ij              probability that X = x_i and Y = y_j                             13
p_i., p._j        marginal probability                                             14
p(y_j | x_i)      conditional probability                                          15
P_i               the i-th need                                                    56
p                 price                                                            71
ρ                 the correlation coefficient in a population                      38
σ                 the standard deviation in a population                           33
S                 the standard error of estimation in a sample
                  (random variable)                                                167
s                 the standard error of estimation in a sample
                  (realization of a random variable)                               102
S                 the standard deviation in a sample (random variable)             166
s                 the standard deviation in a sample
                  (realization of a random variable)                               102
S                 the sum of financial resources                                   57
S                 supply                                                           176
t                 time                                                             67
U                 revenue                                                          78
u                 an arbitrary constant introduced to simplify the calculations    114
Z                 profit                                                           78

* Page on which the symbol was used for the first time.
TABLES

TABLE 1. NORMAL DISTRIBUTION (distribution function)

[table body not reproduced]

TABLE 4. NORMAL DISTRIBUTION (density function)

[table body not reproduced]
BIBLIOGRAPHY

LITERATURE CITED

1. Allen, R. G. D.: Statistics for Economists, London, 1949.
2. Allen, R. G. D.: Mathematical Analysis for Economists, London, 1938.
3. Allen, R. G. D., and Bowley, A. L.: Family Expenditure, London, 1935.
4. Bartlett, M. S.: On the Theory of Statistical Regression, Proc. Roy. Soc. Edinb., 53 (1933).
5. de Castro, J.: Geografia głodu, Warszawa, 1954.
6. Cournot, A.: Recherches sur les principes mathématiques de la théorie des richesses, Paris, 1838.
7. Cramér, H.: Mathematical Methods of Statistics, USA, 1946.
8. David, F. N., and Neyman, J.: Extension of the Markoff Theorem on Least Squares, Statistical Research Memoirs, Vol. 2 (1938).
9. Davies, O. L.: Statistical Methods in Research and Production, London, 1949.
10. Davis, H. T.: The Analysis of Economic Time Series, The Cowles Commission for Research in Economics, Monograph No. 6, Bloomington, Indiana, 1941.
11. Davis, H. T.: Theory of Econometrics, Bloomington, Indiana, 1941.
12. Dean, J.: Statistical Cost Functions of a Hosiery Mill, Studies in Business Administration, School of Business, University of Chicago, Vol. 11, No. 4, Chicago, 1941.
13. Dean, J.: The Relation of Cost to Output for a Leather Belt Shop, Technical Papers 2, National Bureau of Economic Research, New York, 1941.
14. Dlin, A. M.: Matematicheskaya statistika v tekhnike, Moscow, 1951.
15. Doob, J. L.: Stochastic Processes, New York, 1953.
16. Dunin-Barkowski, J. W., and Smirnow, N. W.: Teoriya veroyatnostiej s tekhnicheskimi prilozheniami, Moscow, 1956.
17. Ezekiel, M.: Methods of Correlation Analysis, New York, 1947.
18. Falewicz, J.: Bieżąca kontrola gospodarności przedsiębiorstw przemysłowych, Wrocław, 1949.
19. Falewicz, J.: Kontrola niezmienności związku korelacyjnego, Zeszyty Naukowe WSE we Wrocławiu, No. 1, 1956.
20. Feller, W.: An Introduction to Probability, New York, 1950.
21. Fisz, M.: Rachunek prawdopodobieństwa i statystyka matematyczna, Warszawa, 1954.
22. Frisch, R.: Pitfalls in the Statistical Construction of Demand and Supply Curves, Leipzig, 1933.
23. Gardner, F.: Profit Management and Control, New York, 1955.
24. Gliwienko, W.: Rachunek prawdopodobieństwa, Warszawa-Wrocław, 1953.
25. Gniedienko, B. W.: Kurs teorii veroyatnostiej, Moscow-Leningrad, 1953.
26. Gossen, H. H.: Entwicklung der Gesetze des menschlichen Verkehrs und der daraus fliessenden Regeln für menschliches Handeln, third edition, Berlin, 1926.
27. Grenander, U., and Rosenblatt, M.: Statistical Analysis of Stationary Time Series, New York, 1957.
28. Hald, A.: Statistical Theory with Engineering Applications, New York, 1952.
29. Hellwig, Z.: Elementy rachunku prawdopodobieństwa i statystyki matematycznej, Łódź, 1957.
30. Hellwig, Z.: Uwagi i wnioski z zakresu teorii potrzeb, Zeszyty Naukowe WSE we Wrocławiu, No. 2, 1957.
31. Hellwig, Z.: Wyznaczanie parametrów regresji liniowej metodą dwóch punktów, Zastosowania Matematyki, Vol. 3, 1956.
32. Jagłom, A. M.: Obshchaya teoriya statsyonarnykh sluchaynykh funkcii, Uspekhi matematicheskikh nauk, Vol. 7, series 5 (51).
33. Kendall, M. G.: The Advanced Theory of Statistics, Vol. 2, London, 1946.
34. Keynes, J. M.: Ogólna teoria zatrudnienia, procentu i pieniądza, Warszawa, 1952.
35. Lange, O.: Teoria statystyki, Warszawa, 1952.
36. Leontief, W. W.: Ein Versuch zur statistischen Analyse von Angebot und Nachfrage, Weltwirtschaftliches Archiv, Vol. 30, 1929.
37. Lyle, P.: Regression Analysis of Production Costs and Factory Operations, London, 1946.
38. Marks, K.: Kapitał, Vol. 1, Warszawa, 1951.
39. Moore, H. L.: Economic Cycles: Their Law and Their Cause, New York, 1914.
40. Nicholson, C.: The Probability Integral for Two Variables, Biometrika, 33, 1943.
41. Ostroumow, S. S.: Sudebnaya statistika, Moscow, 1949.
42. Pareto, V.: Cours d'économie politique, Lausanne, 1896-1897.
43. Paulsen, A.: Allgemeine Volkswirtschaftslehre, Band 2, Berlin, 1956.
44. Pearson, K.: Tables for Biometricians and Statisticians, London, 1931.
45. Pigou, A. C.: A Method of Determining the Numerical Value of Elasticities of Demand, Economic Journal, Vol. 20 (1910), p. 636 ff.
46. Pigou, A. C.: The Statistical Derivation of Demand Curves, Economic Journal, Vol. 40 (1930), p. 384 ff.
47. Richter-Altschäffer, H.: Einführung in die Korrelationsrechnung, Berlin, 1931.
48. Sadowski, W.: O nieparametrycznym teście na porównywanie rozsiewów, Zastosowania Matematyki, Vol. 2, No. 2.
49. Schmalenbach, E.: Grundlagen der Selbstkostenrechnung und Preispolitik, Leipzig, 1930.
50. Schultz, H.: Statistical Laws of Demand and Supply, with Special Application to Sugar, Chicago, 1928.
51. Schultz, H.: The Theory and Measurement of Demand, Chicago, 1938.
52. Sheppard, W. F.: On the Application of the Theory of Error to Cases of Normal Distribution and Normal Correlation, Phil. Trans. Roy. Soc. (1898), p. 101.
53. Smith, V. L.: Engineering Data and Statistical Techniques in the Analysis of Production and Technological Change in Fuel Requirements of the Trucking Industry, Econometrica, Vol. 25, No. 2, 1957.
54. Stieltjes, T. J.: Extrait d'une lettre adressée à M. Hermite, Bulletin Scientifique Math., series 2, 13 (1889), p. 170.
55. Styś, W.: Złudzenia statystyczne wywołane przez wpływ czasu w badaniach zjawisk demograficznych w ich ruchu i rozwoju, Przegląd Antropologiczny, Vol. 23.
56. Tinbergen, J.: Wprowadzenie do ekonometrii, Warszawa, 1957.
57. Tintner, G.: Econometrics, New York, 1954.
58. Tintner, G.: Mathematics and Statistics for Economists, London, 1954.
59. Tschuprow, A. A.: Principles of the Mathematical Theory of Correlations, London-Edinburgh-Glasgow, 1939.
60. Wald, A.: The Approximate Determination of Indifference Surfaces by Means of Engel Curves, Econometrica, Vol. 8, 1940.
61. Winkler, W.: Podstawowe zagadnienia ekonometrii, Warszawa, 1957.
62. Wold, H.: A Study in the Analysis of Stationary Time Series, Uppsala, 1938.
63. Yule, G. U.: On the Theory of Correlation for Any Number of Variables, Treated by a New System of Notation, Proc. Roy. Soc., London, series A 79 (1907).
64. The USA (joint work), Chicago, 1957.
65. Rocznik Statystyczny 1956, GUS, Warszawa.

LIST OF SUPPLEMENTARY LITERATURE

Anderson, T. W.: Introduction to Multivariate Statistical Analysis, New York, 1958.
Bartlett, M. S.: Fitting a Straight Line When Both Variables are Subject to Error, Biometrika, 5, 1949.
Ezekiel, M., and Fox, K.: Methods of Correlation and Regression Analysis, New York, 1959.
Lange, O.: Wstęp do ekonometrii, Warszawa, 1960.
Pawłowski, Z.: Ekonometryczne metody badania popytu konsumpcyjnego, Warszawa, 1961.
Wald, A.: The Fitting of Straight Lines if Both Variables are Subject to Error, Ann. of Math. Statist., 11, 3, 284, 1940.
Williams, E. J.: Regression Analysis, New York, 1951.