This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
p(x) is a n odd function and, consequently, (i2r + a =
x*'
+1
4>(x)dx = 0
82
STATISTICS
(e) The mean-moment
generating function
is
+
Mm(t) = 6{et*) = f " e>*ij,(x)dx •— ' 00 1 /- + 00 = —7= / exp [te — *a/2a8]
=r =
e
h C
aVE
=
ex
x
i(*2" +°H2) + 7 ]
p
P
/
exp
J
6XP
[ -
~
^
exp ( - y»/2a»)«fy
where y = x — GH. MJt) or
= _ L = exp (*,««») • 1
.
.
.
(5.4.2)
Since the coefficients of all the odd powers of t in the expansion of this function are zero, we see again t h a t the mean-moments of odd order are all zero. To find the meanmoments of even order, we first find t h e coefficient of t2r\ this is (i^Ylr!; then the coefficient of t2rl2r! is (ia2)r . 2 r \ j r \ Hence
jjL2r = (i)ra2r2r \jr ! = 1.3.5 . . . (2r — l)a 2 r a
I n particular,
(5.4.3)
4
|i 2 = a and (x4 = 3d .
W e also have a useful recurrence relation (V = (2r -
l)<jV 2 r- a
.
.
.
(5.4.4)
1
By 3.C.3.; alternatively, the probability that x shall take a value somewhere between — <*> and +oo is the total area under f + °° 1 the probability curve, i.e., J is certain.
Hence + »
/
— SO
aV2v
exp
x 2 /2a i )dx.
exp (— x%j2at)dx = 1
+ co
i
exp (— x'(2a')dx = csVZn
But this
d
*
STATISTICAL
MODELS.
ILL
83
(/) Since the area under the probability curve between x = 0 and x = X is t h e probability t h a t the variate x will assume some value between these two values, P(0 < * < X) =
P (—x2/2a2)dx .
a
V2nJg
(5.4.5)
Now, referring t o Fig. 5.4.1, we see t h a t - 4 = f" exp ( °V2 nj0
x*/2a2)dx (-x*/2c2)dx
= —~J*exp
+
exp ( - * * / 2 o * ) d x
B u t t h e integral on the left-hand side of this equation gives the half the area under t h e entire curve and is, therefore, equal to also
(-x*/2o*)dx
- 4 = f°
W2tz)x
= P(0 < x ^ X) a n d
exp ( - x2/2a*)dx = P(x >
X).
Therefore P(x > X) = i - P(x < X)
.
.
(5.4.6)
If, however, we are concerned only with the absolute value of the variate, measured f r o m its mean, then P(|*| > X > 0) = 1 - 2P(0 < x ^ X ) This is frequently the case. thetical, problem:
(5.4.7)
Consider the following, hypo-
In a factory producing ball-bearings a sample of each day's production is taken, and from this sample t h e mean diameter a n d t h e s t a n d a r d deviation of t h e d a y ' s o u t p u t are estimated. The mean is in fact t h e specified diameter, say 0-5 c m ; t h e s t a n d a r d deviation is 0-0001 cm. A bearing whose diameter falls outside the range 0-5 ± 0-0002 cm is considered substandard. W h a t percentage of the day's o u t p u t m a y be expected to be substandard? This is w h a t is generally called a two-tail problem, for we are concerned with t h e probability of a bearing having a diameter which deviates f r o m t h e mean b y more t h a n 0-0002 cm above
84
STATISTICS
the mean and of a bearing having a diameter which deviates by more t h a n 0-0002 cm from t h e mean below the mean, i.e., we are concerned with finding the two (in this case, equal) areas under Normal curve between —00 and - 0 - 0 0 0 2 cm and between + 0-0002 cm and +00. This will be given by 1 — 2P(0 < x < 0-0002). How, then, do we find probabilities such as P ( 0 < * < 0-0002)? Consider ( 5 . 4 . 5 ) again. Let us take the standard deviation of the distribution as the unit for a new variate and write t = x/o. Then ( 5 . 4 . 5 ) becomes P ( 0 < i < T) = — L j ^ exp (-< 2 /2)rf< where T =
(5.4.8)
X/a.
Table
5.4.
P(t
Area under the Normal Curve: =
exp ( -
T (= X/a).
0.
1.
2.
3.
0-0 0-1 0-2 0-3 0-4 0-5 0-6 0-7 0-8 0-9 1-0 1-1 1-2 1-3 1-4 1-5 1-6 1-7 1-8 1-9 2-0 2-1 2-2 2-3 2-4 2-5 2-6 2-7 2-8 2-9
•0000 •0398 •0793 •1179 •1654 •1915 •2257 •2580 •2881 •3159 •3413 •3643 •3849 •4032 •4192 •4332 •4452 -4554 •4641 •4713 •4772 •4821 •4861 •4893 •4918 •4938 •4953 •4965 •4974 •4981 3-0 •4987
•0040 •0438 •0832 •1217 •1591 •1960 •2291 •2611 •2910 •3186 •3438 •3665 •3869 •4049 •4207 •4345 •4463 >4564 <4649 •4719 •4778 •4826 •4866 •4896 •4920 •4940 •4955 •4966 •4976 •4982 31 •4990
•0080 •0478 •0871 •1255 •1628 >1985 •2324 •2642 •2939 •3212 •3461 •3686 •3888 •4066 •4222 •4367 •4474 •4673 <4656 •4726 <4783 <4830 •4868 •4898 •4922 •4941 •4966 •4967 •4976 <4983 3-2 -4993
•0120 •0617 •0910 •1293 •1664 •2019 •2357 •2673 •2967 •3238 •3485 •3708 •3907 •4082 •4236 •4370 •4485 •4582 •4664 •4732 •4788 •4834 •4871 •4901 •4925 •4943 •4967 •4968 •4977 •4983 3-3 >4995
T,
.
4.
5.
•0159 •0199 •0557 •0596 •0948 •0987 •1331 •1368 •1700 •1736 •2054 •2088 •2389 •2422 •2704 •2734 •2995 •3023 •3264 •3289 •3508 •3631 •3729 •3749 •3925 •3944 •4099 •4116 •4251 •4265 •4382 •4394 •4495 •4605 •4591 •4599 •4671 •4678 •4738 •4744 •4793 •4798 -4838 -4842 •4875 , 4 8 7 8 •4904 •4906 •4927 •4929 •4945 •4946 -4959 •4960 >4969 •4970 •4977 •4978 •4984 •4984 3-5 3-4 •4997 •4998
mdt
6.
7.
8.
9.
•0239 •0636 •1026 •1406 •1772 <2123 •2454 •2764 •3051 •3315 •3554 •3770 •3962 •4131 •4279 •4406 •4516 •4608 •4686 •4750 •4803 <4846 •4881 •4909 •4931 •4948 •4961 •4971 •4979 •4986 3-6 •4998
•0279 •0676 •1064 •1443 •1808 •2167 •2486 •2794 •3078 •3340 •3577 •3790 •3980 •4147 •4292 •4418 •4525 •4616 <4693 •4766 •4808 •4850 •4884 •4911 •4932 •4949 •4962 •4972 •4980 •4985 3-7 •4999
•0319 •0714 •1103 •1480 •1844 •2190 •2518 •2823 •3106 •3365 •3599 •3810 •3997 •4162 •4306 •4430 <4636 <4625 •4699 •4762 •4812 •4854 •4887 >4913 •4934 •4951 •4963 •4973 •4980 •4986 3-8 •4999
•0359 •0753 •1141 •1517 •1879 •2224 •2549 •2852 •3133 <3389 •3621 <3830 •4015 •4177 •4319 •4441 •4645 •4633 •4706 •4767 •4817 •4857 •4890 •4916 •4936 •4952 •4964 •4974 •4981 •4986 3-9 •5000
STATISTICAL
MODELS.
ILL
85
» Now P(t ^ T) is t h e area under the curve y — 2
(— / /2) between t = 0 and t = T.
1
. exp
The integral
vrj0 exp(-<2/2)<" frequently called t h e probability integral, is a function of T. I t cannot, however, be evaluated in finite form, b u t if we expand t h e integrand a n d integrate t e r m b y term, t h e integral can be computed t o a n y degree of accuracy required. Table 5.4 gives values of this integral for T = 0 t o T = 3-9, correct t o four decimal places. j /• 0-800 Worked Example : Find the value of-j= / exp (—Pj2)dt correct 0 to four decimal places using four terms of the expansion of the integrand. We have exp ( - P/2) = 1 - *8/2 + <4.2 ! - <«/8.3 ! + . . . Therefore tT ^ L J exp ( - t'/2)dt = (T - ra/6 + r*/40 - T'/336 . . .) Now r = 0-50000; T 3 = 0-12500 and r®/6 = 0-02083; T® = 0-03125 and T 6 /40 = 0-00078; T> = 0-00781 and r 7 /336 = 0-00002 _
.
/• 0.500
exp ( - t 2 j 2 ) d t ^
Taking 1/V2ir = 0-39894, we have — V2=nJ o/
0-1914, which should be compared with that of 0-1915 given in Table 5.4. This method is satisfactory when T ^ 1, but, for larger values, we use what is called an asymptotic expansion (see Whittaker and Watson, Modern Analysis, Ch. 8, or Nicholson, Fundamentals and Techniques of Mathematics for Scientists, Ch. 14). ~=f
exp ( - t*!2)dt 1
- ^ j
F °
exp(-t'/2)dt
exp ( - <*/2)A = 0-5--j=J
1
/ " I -f
. exp ( - t*\2)tdt
and, integrating successively by parts, we have, for T > 1,
[1 - l / r a + 1 . 3 / r 1 - 1 . 3 . 5IT' + 1 . 3 . 5 . 7IT* - . . .] where 1/V2^ = 0-39894228...
STATISTICS
86
Exercise: Find
.
I
e xp
(—t'/2)dt correct to four decimal places
and check the result with the value given in Table 5.4. We m a y now return to our ball-bearing problem. The standard deviation of the sample was 0-0001 cm. Since we have to find P(x > 0-0002), X = 0-0002 and T = 0-0002/0-0001 = 2. Hence the probability t h a t the diameter of a bearing will lie between 0-5 and 0-5002 cm is 0-4772. Therefore the probability t h a t the diameter will exceed 0-5002 cm is 0-5 — 0-4772 = 0-0228. Since the Normal distribution is symmetrical, t h e probability of a bearing with a diameter less t h a n 0-4998 cm will also be 0-0228. Hence the probability t h a t the diameter of a bearing will lie outside the tolerance limits will be 0-0456. This means t h a t we should expect, on the data available, just over 4^% of the bearings produced on the day in question to be substandard. (g) If we pick at random a value of a variate known to be distributed normally about zero-mean with variance a2, what is the probability that this random value will deviate by more than a, 2a, 3a, from the mean? Entering Table 5.4 at T = 1 -00, we find t h a t the area between the mean ordinate and t h a t for T = 1-00 is 0-3413. This is the probability t h a t the random value of the variate will lie between 0 and a. By symmetry, the probability t h a t it will lie between 0 and — a is also 0-3413. Thus the probability of it lying between — a and + a is 2 x 0-3413 = 0-6826. Consequently the probability t h a t it will deviate from the mean by more t h a n a in either direction is 1 — 0-6826 = 0-3174, or less than -J. Similarly, the probability t h a t a random value will lie between — 2a and -)- 2a is 0-9544; this means t h a t the probability t h a t it will lie outside this range is 0-0456, or only about of a normally distributed population deviate from the mean by more than 2a. Likewise, as the reader may ascertain for himself, the probability of a deviation greater t h a n 3a is only 0-0027. (h) Suppose now that we were to ,plot values of the integral
against the value of X. This is not possible over the full range, — oo to -)- oo, but since we have just found that
STATISTICAL
MODELS.
ILL
87
deviations of more than 3<j from the mean are very rare, we can confine our attention to t h a t section of the range lying between — 3-9ar and + 3-9o, the range covered by Table 5.4. If we do this we obtain a cumulative probability curve for the Normal distribution. The function F(X) defined by F(X) E= — f
X
exp ( _ x2/2a2)dx .
(5.4.9)
—CO
is called the Normal distribution function. Clearly F(X) = i + P(0 < x < X) . (5.4.10) The graph of a typical F(X) is shown in Fig. 5.4.2. From (5.4.9) it follows t h a t the value of the ordinate of this graph at X is equal to t h e area under the curve of the probability
density function
88
STATISTICS
•999
/
•995 •99
/
•98 •95 •90
1
•80
J X
•70 2 p<
•60
•50 •40
y
t,/
A
/
s4 v/
w
>—« / 7
•<
•30 •20
A •10
/
'
u
•05 •02
•01
•005 '
- 2<J
- 18
<)
+1(5
+2 <5
FIG. 5.4.3.—Cumulative Probability Curves on Normal Probability Paper. Worked Example : Use the Normal distribution to find approximately the frequency of exactly 5 successes in 100 trials, the probability of a success in each trial being p = 0-1. The mean of the binomial distribution is np = 10 and the variance, npq, = 9. The standard deviation is, therefore, 3. The
STATISTICAL MODELS. ILL 89 binomial class-interval for 5 will then correspond to the interval — 4-5 to — 5-5 of the normal distribution (referred to its mean as origin) and dividing by a = 3, this, in standardised units, Is — 1-50 to — 1-83. Owing to the symmetry of the normal curve, we may disregard the negative signs, and entering Table 5-4 at 1-50 and 1-83, we read 0-4332 and 0-4664 respectively. Hence the probability of 5 successes is approximately 0-4664 — 0-4332 = 0-0332. Thus in 100 trials the frequency of 5 successes will be approximately 3-32. The reader should verify for himself that direct calculation of the binomial frequency gives 3-39, while the frequency obtained from the Poisson series with m — 10 gives 3-78. 5.6. Three Examples. We conclude this chapter with two other typical problems in the treatment of which we make use of some of the properties of the Normal distribution. The reader should work each step himself, following carefully the directives given.
Example 1: To fit a Normal curve to the distribution given in Table 5-1. First Treatment: (1) Draw the frequency histogram for the data. (2) Calculate the Mean and Standard Deviation, correcting for grouping (2.16) the value of the latter. It will be found that the mean is 172-34 cm, and the standard deviation 6-60 cm. (3) The normal curve corresponding to these values for the mean and standard deviation is drawn using Table 6.5 on page 90. For each class-interval, work out the deviation, from the mean, of each of the boundary values. For instance, the boundary values of the first interval are 150 and 153 whose deviations from the mean are 22-34 and 19-34, i.e. 3-385
90
STATISTICS T A B L E 5.5.
Ordinates of Normal Curve Multiplied by Standard Deviation D i v i d e each value b y a t o o b t a i n y =
, T (= Xla).
0.
1.
2.
0-0 0-1 0-2 0-3 0-4 0-5 0-6 0-7 0-8 0-9 1-0 1-1 1-2 1-3 1-4 1-5 1-6 1-7 1-8 1-9 2-0 2-1 2-2 2-3 2-4 2-5 2-6 2-7 2-8 2-9 3-0 3-1 3-2 3-3 3-4 3-5 3-6 3-7 3-8 3-9
•3989 •3970 •3910 •3814 •3683 •3521 •3332 •3123 •2897 •2661 •2420 •2179 •1942 •1714 •1497 •1295 •1109 •0940 •0790 •0656 •0540 •0440 •0355 •0283 •0224 •0175 •0136 •0104 •0079 •0060 •0044 •0033 •0024 •0017 •0012 •0009 •0006
•3989
•3989 •3961 •3894 •3790 •3653 •3485 •3292 •3079 •2850 •2613 •2371 •2131 •1895 •1669 •1456 •1257 •1074 •0909 •0761 •0632 •0519 •0422 •0339 •0270 •0213 •0167 •0129 •0099 •0075 •0056 •0042 •0031 •0022 •0016 •0012 •0008 •0006 •0004
>
S A
•3902 •3802 •3668 •3503 •3312 •3101 •2874 •2637 •2396 •2155 •1919 •1691 •1476 •1276 •1092 •0925 •0775 •0644 •0529 •0431 •0347 •0277 •0219 •0171 •0132 •0101 •0077 •0058 •0043 •0032 •0023 •0017 •0012 •0008 •0006
•0002
3. ' •3988 •3956 •3885 •3778 •3637 •3467 •3271 •3056 •2827 •2589 •2347 •2107 •1872 •1647 •1435 •1238 •1057 •0893 •0748 •0620 •0508 •0413 •0332 •0264 •0208 •0163 •0126 •0096 •0073 •0055 •0040 •0030 •0022 •0016 •0011 •0008 •0005
8.
4.
6.
6.
7.
•3986 •3951 •3876 •3765 •3621 •3448 •3251 •3034 •2803 •2565 •2323 •2083 •1849 •1626 •1415 •1219 •1040 •0878 •0734 •0608 •0498 •0404 •0325 •0258 •0203 •0158 •0122 •0093 •0071 •0053 •0039 •0029 •0021 •0015 •0011 •0008 •0005
•3984 •3945 •3867 •3752 •3605 •3429 •3230 •3011 •2780 •2541 •2299 •2059 •1826 •1604 •1394 •1200 •1023 •0863 •0721 •0596 •0488 •0395 •0317 •0252 •0198 •0154 •0119 •0091 •0069 •0061 •0038 •0028 •0020 •0016 •0010 •0007 •0005 v. /
•3982 •3939 •3857 •3739 •3589 •3410 •3209 •2989 •2756 •2516 •2275 •2036 •1804 •1582 •1374 •1182 •1006 •0848 •0707 •0584 •0478 •0387 •0310 •0246 •0194 •0151 •0116 •0088 •0067 •0050 •0037 •0027 •0020 •0014 •0010 •0007 •0005
•3980 •3932 •3847 •3725 •3572 •3391 •3187 •2966 •2732 •2492 •2251 •2012 •1781 •1661 •1354 •1163 •0989 •0833 •0694 •0573 •0468 •0379 •0303 •0241 •0189 •0147 •0113 •0086 •0066 •0048 •0036 •0026 •0019 •0014 •0010 •0007 •0005
?
/
•3977 •3925 •3836 •3712 •3555 •3372 •3166 •2943 •2709 •2468 •2227 •1989 •1768 •1639 •1334 •1145 •0973 •0818 •0681 •0662 •0459 •0371 •0297 •0236 •0184 •0143 •0110 •0084 •0063 •0047 •0035 •0025 •0018 •0013 •0009 •0007 •0005 •0003 •0002 s •0001
9. •3973 •3918 •3825 •3697 •3638 •3352 •3144 •2920 •2685 •2444 •2203 •1965 •1736 •1518 •1316 •1127 •0967 •0804 •0669 •0651 •0449 •0363 •0290 •0229 •0180 •0139 •0107 •0081 •0061 •0046 •0034 •0025 •0018 •0013 •0009 •0006 •0004 v
L. fC
(2) Draw the theoretical cumulative normal frequency curve with mean 172-34 and s.d. 6-60. This is done using Table 5.4 as follows: > To find the ordinate of the cumulative frequency curve at, say, the lower end-point of the interval 153-156, i.e., at 153, we have to find the area under the normal curve from — oo to X = 153. But this is i — (area under curve between mean, 172-34, and the ordinate at 153). The deviation from the mean is — 19-34. But the area under the normal curve between X = 0 and X = —19-34 is, by symmetry, the area under the curve between X = 0 and
STATISTICAL
MODELS.
ILL
91
X = + 19-34. Dividing 19-34 by 6-60, we obtain 2-9.3; entering Table 5.4 at this value, we read 0-4983. The required ordinate of the cumulative frequency curve is then given by 58 703 x (0-5000 — 0-4983) = 99-8. The reader should calculate the other ordinates in a similar manner and complete the curve. (3) If now we mark upon the vertical axis percentage cumulative frequencies (with 58 703 as 100%), we can find the position of the median and other percentiles. (Median: 167-3 cm; quartiles: 158 and 177 cm; deciles: 164 and 181 cm.) Example 2: To find, using probability graph paper, approximate values for the mean and standard deviation of an observed frequency distribution which is approximately normal. Treatment: When plotted on probability graph paper, the cumulative frequency curve of a normal distribution is a straight line. If, then, we draw the cumulative relative-frequency polygon of an observed distribution on such paper and find that it is approximately a straight line, we may assume that the distribution is approximately normal. We next draw the straight line to which the polygon appears to approximate. Then, working with this " filled " line : (a) since, for the normal distribution, mean and median coincide and the median is the 50th percentile, if we find the 50th percentile, we shall have a graphical estimate of the mean of the observed distribution; (b) the area under the normal curve between — oo and /i + o is 0-5000 + 0-3413 = 0-8413. Thus 84-13% of the area under the normal curve lies to the left of the ordinate at /t + a. So the 84-13 percentile corresponds to a deviation of + a from the mean. If, therefore, from the fitted cumulative frequency line we find the position of the 84th percentile, the difference between this and the mean will give us an estimate of a for the observed distribution. Likewise, the difference between the 16th percentile and the median will also be an estimate of a. Example 3: The frequency distribution f(x) in obtained from the normal distribution N(t) = exp ( - i*2), by means of the equations "vl-n and
(ii) t = a log (x — 1).
If exp (1 /a') = 4, show that the median of f(x) is 2, the mean is 3 and the mode is 1-25. (L.U.) Treatment-. As * — 1 , > — oo ; as *—>.+ co, /—> + oo.
92
STATISTICS /OO
Jl
^+0O
f(x)dx = I N{t)dt = 1. — OO
is given by j f(x)dx — J = j 1
Hence the median value of f(x) N(t)dt, i.e., by 0 = a log (x — 1),
— to
i.e., since log 1 = 0, x = 2. x = / Xf(x)dx.
But
x = 1 +
Hence r+00
x^J
i
/•+»
— 00
,/
2i
^'("-DI dt
(1 + e'l")N(t)dl = 1 + ^ L J — 00
dt i.e.,
* = 1 + e»/»* = 1 + (4)» = 3.
Differentiating (i) with respect to t, we have N(t) = ^ or
/ W*r
/(*) = ae-l'Nit)
=
= \±^ -
/(,)*] exp
|
= f(x) . J e
(<• + J ) ]
. exp [ - Hf + 2t/a)] • ~ [' +
If then ^l^L1 = 0, which defines the modal value, we must have ax a Thus
x — 1 = e »' = £ or
* = 1-25.
EXERCISES ON CHAPTER FIVE 1. Fit a normal curve to the distribution of lengths of metal bars given in 2.1. 2. The wages of 1,000 employees range from 45p to ^1-95. They are grouped in 15 classes with a common class interval of lOp. The class frequencies, from the lowest class to the highest, are 6, 17, 35, 48, 65, 90, 131, 173, 155, 117, 75, 52, 21, 9, 6. Show that the mean
STATISTICAL MODELS. ILL 93 wage is £1-2006 and the standard deviation 26-26p. Fit a normal distribution, showing that the class frequencies per thousand of the normal distribution are approximately 6-7, 11-3, 26-0, 48-0, 79-0, 113-1, 140-5, 151-0, 140-8, 113-5, 79-5, 48-1, 25-3, 11-5 and 6-7. (Weatherburn, Mathematical Statistics.)
3. A machine makes electrical resistors having a mean resistance of 50 ohms with a standard deviation of 2 ohms. Assuming the distribution of values to be normal, find what tolerance limits should be put on the resistance to secure that no more than , ' 0 0 of the resistors will fail to meet the tolerances. (R.S.S.) 4. Number of individual incomes in different ranges of net income assessed in 1945-46 : Range of Income after tax (x). £ 150-500 500-1,000 1,000-2,000 2,000 and over
Number of Incomes. 13,175,000 652,000 137,500 35,500 Total
14,000,000
Assume that this distribution of incomes, f(x), is linked with the normal distribution
by the relationship I
N(t)dt = J f{x)dx, where t = a log (x — 150) + b. 150
Obtain estimates for a and b from the data, and find the number of incomes between £250 and £500. 5. Show that (3a for a normal distribution is equal to 3. 6. If p — A, use the normal distribution to estimate the probability of obtaining less than five or more than 15 successes in 50 trials. What is the actual probability ? Solutions 3. ± 6-6 ohms.
6. 0-0519; 0-0503.
4. See Example 3 of 5.6. a = 0-71(5) 6 = - 2 - 5 8 ; number of incomes between £250 and £500 is 2-5 x 10s. (2-48), to two significant figures.
CHAPTER
SIX
MORE VARIATES THAN O N E : BIVARIATE DISTRIBUTIONS, R E G R E S S I O N AND CORRELATION 6.1. Two Variates. In the last chapter we discussed the distribution of height among 58,703 National Servicemen born in 1933 who entered the Army in 1951. We could have discussed the distribution of weight among them. In either case, the distribution would have been univariate, the distribution of one measurable characteristic of the population or sample. B u t had we considered the distribution of both height and weight, a joint distribution of two variates, we should have had a bivariate distribution. 6.2. Correlation Tables. How do we tabulate such a distribution ? To each National Serviceman in the total of 58,703, there corresponds a pair of numbers, his weight, x kg say, and his height, y cm. Let us group the heights in 5-cm intervals and the weights in 4-5-kg intervals. Some men will be classed together in t h e same weight-group (call it the xi group) b u t will be in different height-groups; others will occupy the same height-group (the y} group, say) but different weight groups; b u t there will be some in t h e same weight-group and t h e same height-group, the group (x it yj) for short. Denote t h e number of men in this class-rectangle by fy. The joint distribution may then be tabulated as in Table 6.2.1. A general scheme is given in Table 6.2.2. NOTE :
(i) x{ is the mid-value of the class-interval of the ith *-array; yt is the mid-value of the _/th y-array. (ii) If the data is not grouped—to each value of x corresponds but one value of y and to each y corresponds but one value of x, the correlation Table becomes : x y
Vi
x*. y*
*3
...
*t
Vi
... ...
*B ViV
The (xt, y,) group in Table 6.2.2, for instance, corresponds to t h a t of those men whose weights are in the 58-5-63-0 kg weight class and whose heights are in the 164 cm height class, a n d / , , = 3,879. Such a table is called a correlation table. Each row 94
95
as a> — i i
11111
OJ
1 I I I | WHNOCI^WHHH | | \ I I
o00 I—)
I I I 1 OICNOM^ONIOMH 1 [ I 1 M M NIO^^NH M M
00
TT,
3 d aj sn
C l>5
N««H H«ONOOOOIOOT|(I>®HM« || Hf ^ oo co I-h
VI >
TX I-
|
— i i 'C e 01 CJ W -o; •S <3 S ^M §-S
a> CO
i |
* S
OS >a
•o o CO e s •S e fe; 5 I £
>o
U 0> 31 CO
-
a
to
O CO ^O ^«
s 2 -o j: s 2 I
H « EH
3s (3 K ai ® ^ "-i W> . W+; -• C3
ia C5
1 1
i rH
1 1 1
1
I-H |
cq W O © M O « H i-T co co i-h
- cn
RY OS S
i 1 1 - 1 - 1 1 11 I I
|
t*ONCOOl>OXt*T(iHH i i t i-H 03 t> CO t- CO r-t M l l"H CO iO »H I NH^OOffi05C5HLOO«OlCH HWWNl^^^OlOMH i i j | i | CO W5 CO CO i-H | | | (j N^COH MNMOONCOMH 1 1 J 1 II lOOOOJW 1 1 II 11
II
1 1 II 1
TH rH
1
1
a> C — i Oi
II
1 1« 1 1 I I 1 I I 1 1 I I
aw
CO
2 frg •M t3 t. +J ^ a. cS •a C
II
1 - 1 - 1 1 1 1 1 1 1 1 1 1 II I
>0010010 OlOOWOlOOlOOlOOj^^^^^ s / ol Oi oOdl O s^ccrt^NiHioo^fflOOHHN rfT^T^lOlOOtOt^r-OOOOOlOSOJiHHHHiH o /
//
00
I
I
1
1
1 II
11 1 1 1 1
HHHH
96
STATISTICS
and each column tabulates the distribution of one of the variates for a given value of the other. Thus each row (or column) gives a univariate frequency distribution. A row or column is often called an array : the xe array, for example, is t h a t row of y-values for which x = xs. TABLE 6.2.2. y
X
*.*
VI
Vt
fu tn hx
A,
fu
fi 2 fzt
/as As
...
h
fa
...
h Xp TOTALS
Correlation Table for Grouped Data TOTALS
y> u,
...
ft, K
h h»
fp8
f'n
/•i
Vz
A
AU /»•
Au-
f.
f;
N
6.3. Scatter Diagrams, Stereograms. How do we display such a distribution ? If we confine ourselves to two dimensions, we make a scatter-diagram. A pair of rectangular axes is taken, the abscissae being values of one of the variates, the ordinates those of the other. So t o every pair of values (xi, yj) there will correspond a point in the plane of the axes. If we plot these points, we have a scatter-diagram. The main disadvantage of this method of display is t h a t it is not well suited t o the representation of grouped data, for it is difficult t o exhibit a number of coincident points ! Nevertheless, a scatter-diagram is very often suggestive of directions along which further investigation may prove fruitful. (Figs. 6.3.1 (a), (6), (c) and (d).) To represent a grouped bivariate distribution in three dimensions, mark off on mutually perpendicular axes in a horizontal plane the class-intervals of the two variates. We thus obtain a network of class-rectangles. On each of these rectangles we erect a right prism of volume proportional to the occurrence-frequency of the value-pair represented by t h e rectangle in question. In this way we obtain a surface composed of horizontal rectangular planes. This is a prismogram or stereogram, corresponding in three dimensions to the histogram in two. Alternatively, at the centre of each class rectangle, we may erect a line perpendicular to the horizontal
MORE
VARIATES
THAN
ONE
IO105
plane proportional in length to the frequency of t h e variates in that class-rectangle. If we then join up all t h e points so obtained by straight lines, we obtain the three-dimensional analogue of the two-dimensional frequency polygon. Now, if we regard our distribution as a sample from a continuous bivariate population parent distribution, we can also think of a relative-frequency prismogram as a rough sample approximation t o t h a t ideal, continuous surface—the correlation surface—
_l 150
1 155
1 160
1 165
—i 170
1 175
1 180
1 185
1 190
1— 195
HEIGHT IN CENTIMETRES
FIG. 6.3.2.—Frequency Contours of Bivariate Distribution. which represents t h e continuous bivariate probability distribution in the parent population. Three-dimensional figures, however, also have their disadvantages, and we frequently find it convenient t o return to two-dimensional diagrams representing sections through the three-dimensional surface. Thus if we cut the surface with a horizontal plane we obtain a contour of the surface corresponding to the particular frequency (or probability) represented by the height of the plane above the plane of t h e variate axes. Fig. 6.3.2 shows frequency contours of ten, a hundred and a
98
STATISTICS
thousand m e n ' i n the groups of Table 6.2.1. I t also shows mean weights at each height and mean heights at each weight. If, however, we cut the surface by a plane corresponding t o a given value of one of the variates, we obtain the frequency (or probability) curve of the other variate for t h a t given value of the first. 6.4. Moments of a Bivariate Distribution. We confine ourselves here to the discussion of bivariate distributions with both variates discrete. A brief treatment of continuous bivariate distributions is given in the Appendix. We define the moment of order r in x and s in y about x = 0, y = 0 for the distribution of Table 6.2.2 as follows: Nm™' =fux1ry1s +fi2x1ry2" + . . . + Aix/yi* + A A W + • • • • • • + faxi'yf + . . . • • • + /w*pW mrS' = ^ j 2 2 f i j X i r y f . . . (6.4.1) ' i t where i = 1, 2, 3 , . . .p; j= 1, 2, 3, . . . q and N= 2 2 f y , the toted frequency of the distribution. *' I n particular, we have Nmla' = 2 S fyxi, and, summing for j, i ) Nmw' = S« { f a + fa + . . . + /(,)*<. or
Writing 2 fa= fi., the total frequency of the value Xi, m 1 0 ' = i — £ fi.Xi, the mean value of x in the sample. Denoting this •iV mean by x, we have nt10 ' = x (6.4.2) and, likewise, max' = y (6.4.3) Again,
mt<s' =
moment of x Y) ' Vi ~ V. m 20 ' = i
2 2 fyx? = i 2 fi.xi*, the second JSI j j JSI ( about the origiji. Writing Xi = Xi — x,
2 fi, (Xi + *)» = ^ 2 fi. (Xi* + ZxXi + x')
= I 2/,X,s +
since 1 2
=
0.
MORE
VARIATES
Denoting the variance of x by
THAN
sx%
io105
O N E IO105 2
and var (y) by % , we have
2
m 20 ' = s* + ** = w 20 + (m 10 ') a and, similarly, m02' = s,f -f y 2 = m02 + (m 0 ,') a
. .
(6.4.4) (6.4.5)
where, of course, m 20 and m 02 are the moments of order 2, 0 and 0, 2 about t h e mean. Now consider m
ii
iV
^
= iV 4 s
y
+ *)(v,' + y)
= * s s (/ijX.y, + i / i , y , + yfijXi + fijxy) i\ | ^ = ivJ S £ fijXi Yj + i j The quantity
fyXiYj is called the covariance of x and iv i j y and is variously denoted by Sxy or cov (x, y). We may therefore write win' = "in + io' • m01' or cov (x, y) = sxy= m n ' — mw' • m0i'= m x l ' — xy (6.4.6) 6.5. Regression. When we examine the d a t a provided by a sample from a bivariate population, one of the things we wish to ascertain is whether there is any evidence of association between the variates of the sample, and whether, if such an association is apparent, it warrants the inference t h a t a corresponding association exists between the variates in the population. We also wish to know what type of association, if any, exists. Frequently, if our sample is of sufficient size, a scatterdiagram of the sample data provides a clue. If, for instance, there is a fairly well-defined locus of maximum " dot-density " in the diagram, and if, when we increase the sample size, this locus *' condenses ", as it were, more and more t o a curve, we may reasonably suspect this curve to be the smudged reflection of a functional relationship between the variates in t h e population, the smudging resulting from the hazards of random sampling. In Fig. 6.3.1 (a) and (b) we have scatter diagrams of samples from populations in which the variates are linearly related. If, however, the dots do not appear t o cluster around or condense towards some fairly definitely indicated curve, and yet are not distributed at random all over the range of the
+ +
+ + +V +
+
x
•
L I N E A R REGRESSION; r
POSITIVE
Fig. 6.3.1 (a).—Scatter Diagram (I).
V+ + +
+++ +
+
+ + + + + X
LINEAR
REGRESSION;
V
NEGATIVE
FIG. 6.3.1 (6).—Scatter Diagram (IX).
IOI
+ + + T
+ +
4- + +
+
++ + +v + + +
X
—
T
> O
Fig. 6.3.1 (c).—Scatter Diagram (III).
X — •
T - O
FIG. 6.3.1 (d) .—Scatter Diagram (IV).
102
STATISTICS
sample, occupying rather a fairly well-limited region, as for example in 6.3.1 (c), it is clear that, while we cannot assume a functional relationship, nevertheless perhaps as a result of the operation of unknown or unspecified factors, the variates do tend to vary together in a rough sort of way; we then say t h a t the chances are t h a t the variates are statistically related. I t may be, however, t h a t the scatter-diagram is such t h a t the dots are p r e t t y uniformly distributed over the whole of t h e sample range and exhibit no tendency to cluster around a curve or to occupy a limited region; in this case, we may suspect t h a t there is no association between the variates, which, if this were so, would be called statistically independent (Fig. 6.3.1 (<*)). We cannot rest content with such a purely qualitative test and must devise a more sensitive, analytical technique. Now it is reasonable to assume t h a t if there is some tendency for x and y to vary together functionally or statistically, it will be more evident if we plot the mean value of each y-array against the corresponding value of x, and the mean value of each *-array against the corresponding value of y. In practice, it is customary to denote the means of ^-arrays by small circles and those of y-arrays by small crosses. Let the mean of the y-array corresponding to the value x = Xi be yi, and the mean of the #-array corresponding t o y = yj be x}. If we plot the set of points (xi, yi) and the set (xj, yj), we shall find t h a t in general each set will suggest a curve along which or near which the component points of t h a t set lie. Increasing the sample size will generally tend more clearly to define these curves. We call these curves regression curves and their equations regression equations: that curve suggested by the set of points (x,, y() is the regression curve of y on x\ t h a t suggested by the set of points (xj, yj) is the regression curve of x on y. The former gives us some idea how y varies with x, the latter some idea of how x changes with y. And it is intuitively fairly obvious t h a t if there is a direct functional relationship between the variates in the population sampled these two regression curves will tend to coincide. 6.6. Linear Regression and the Correlation Coefficient. If the regression curves are straight lines, we say t h a t the regression is linear; if not, then the regression is curvilinear. To begin with we confine ourselves to considering the case where regression is linear. Clearly, although the set of points (xi, yi) tend to he on a straight line, they do not do so exactly and our problem is to find t h a t line about which they cluster most closely. Assume
MORE
VARIATES
THAN
ONE
IO105
the line t o be y, = Axi + B, where A and B are constants to be determined accordingly. To do this we use the Method of Least Squares. The value of f i corresponding to Xi should be Axi + B. The diSerence between t h e actual value of fi and this estimated value is fi — Ax, — B. Now to each Xi there corresponds ft. values of y. To allow for this fact we form the sum S2V = S ( . ( f , - Ax, - BY
.
.
2
(6.6.1) 2
Now since all the terms making up Sy are positive, Sy - = 0 if and only if all the means of the y-arrays lie on this line. Thus Sy2 would appear to be a satisfactory measure of the overall discrepancy between the set of points fi) and this theoretical straight line. The " best " line we can draw, using this criterion—there are others—so t h a t these points cluster most closely about it will, then, be that line for which Sy2 is a minimum. Now Sy2 is a function of the two quantities A and B. To find the values of A and B which minimise Sy2, we equate to zero the two partial derivatives of Sy2 (see Abbott, Teach Yourself Calculus, Chapter XVIII, for an introduction to partial differentiation), with respect to A and B. Thus we have QJ (Sy2) = - 2 S fi. (ji - Ax, - B)Xi = 0 (6.6.2) and
^
(Sy2) = - 2S/,. (y, — Ax, — B) = 0
(6.6.3)
The latter equation gives us—remembering t h a t fi. = S S fifli - A S £ fijXi - B S S fa= 0. i j i j I i Dividing through by N = S S fa, we have i J y = Ax + B . . . .
Jlfy— i (6.6.4)
(6.6.5)
showing t h a t t h e mean of the sample (x, y) lies on the line yt = Ax( + B. Likewise (6.6.2) may be written S S fijXiyj - A S S fax? - B S S fyxt = 0 i I > ; i j and, again dividing by N, this is m u ' — Amm'
— Bx = 0 .
.
.
(6.6.6)
(6.6.7)
104
STATISTICS
Solving for A between (6.6.5) and (6.6.7), we have m
n ~ xy m20' — x2
= ia . sx*
miB
.
(6.6.8)
Finally, subtracting (6.6.5) from y< = Axi + B, we have fi -
y =
(sXylsx*)(Xi
-
x).
.
.
(6.6.9)
as t h e line about which the set of means of the y-arrays cluster most closely, the line of regression of y on x. Thus, f r o m 6.6.9, (x{, y() lies on the line (y - y)/sy = (SzylSzSy)(x — x)jsx If, therefore, we sample distribution from t h e mean in variate as unit, i.e.,
transfer our origin t o t h e mean of t h e and measure t h e deviation of each variate terms of t h e standard deviation of t h a t if we p u t
Y = (y — y) jsy and X = (x — x) /sx t h e equation of our regression line of y on x is Y = (s*j,/sssy)X.
.
.
.
(6.6.10)
•
(6.6.11)
a n d this m a y be further simplified by putting T
:
- SrylSxSy
Thus Y = rX
(6.6.12)
a n d in this form the regression line of Y on X gives us a measure, r, of t h e change in Y for unit change in X. The line of regression of x on y is immediately obtained by interchanging x and y, thus (x — x)lsx = (sxylsysx)(y or
— y)lsy
X = rY, i.e., Y = (l/r)X
. .
(6.6.9a) (6.6.12a)
now have our two regression lines, one with gradient r, • the other with gradient 1 /r, passing through the mean of t h e sample distribution (see Fig. 6.6). The angle between them, 0, is obtained b y using a well-known formula of trigonometry (see Abbott, Teach Yourself Trigonometry, p. 101), viz., ii
i >
tan
— t a n d>»
I n this case, t a n 0 = (I/r — r)j( 1 + ( l / r ) . r) = (1 - f ! )/2r
(6.6.13)
MORE
VARIATES
THAN
O N E IO105
1
We see immediately t h a t if r = 1, i.e., r — ± 1, 0 = 0 and the two lines coincide. This means t h a t in t h e sample the two variates x and y are functionally related by a linear relationship and in the population which has been sampled it m a y also be the case. We say, then, t h a t the two variates are perfectly correlated, positively if r = + 1 and negatively if r = — 1. On the other hand, if r = 0, 8 = 90°, and there is no functional relationship between the variates in the sample and hence, probably, little or none in the parent population : the variates are uncorrelated. I t is natural, therefore, t h a t , when regression is linear or assumed to be linear, we should regard r as a measure
of the degree t o which the variates are related in t h e sample by a linear functional relationship. We accordingly call r the sample coefficient of product-moment correlation of x and y or, briefly, the sample correlation-coefficient. The gradient of the line given by (6.6.9), t h e regression line of y on x, is (s xy ls x a ) or, as it is often written, cov (x, y) /var (x). This quantity, called the sample coefficient of regression of y on x, is denoted by byx; similiarly, s,,//sl,2 or cov (x, y)/var (y) is the coefficient of regression of x on y and is denoted by b x r I t follows t h a t b,x • = s^js*. s„* = . . (6.6.14) It should be noted t h a t regression is a relation of dependence and is not symmetrical, while correlation is one of interdependence and is symmetrical.
106
STATISTICS
6.7. Standard Error of Estimate. We must now examines r in a little more detail so as t o substantiate our statement that! it is a measure of the degree to which the association between M and y in the sample does tend toward a linear functional? relationship. Taking the mean of the distribution as our origin of co-; ordinates, we may write the equation of the line of regression of y on x is the form y = byzx. We now form the sum of the squared deviations not of the array-means from the corresponding values predicted from this equation, but of all the points (xi, y3) from the points on the l i n e y = byxx corresponding to x — Xi, i.e., from (Xi, byxXi). Remembering t h a t the frequency of (xt, yj) isfy, the total sum of square deviations will be S £ fy (y3 - byxXi)'* = NSy2, say. i
.
(6.7.1)
Then NSV* = S 2 fyyf - 2b yx S S fi3Xiy} + byx* E E f(jx? I
j
i
= NSy1 - 2byxNsxy
j
I
j
+ Nby^s**
Since byx — s^lsx*, we have NSya = Nsf or or
(1 - sx/ls^sy*) 2
= AV(1
S, = V ( 1 - r * ) r* = 1 - S y s /s„ 2
- r*)
. . . . . . . .
(6.7.2) (6.7.3)
Sy is called the Standard Error of Estimate of y from (6.6.9). Likewise, Sx, where Sx2 - s*2(l — r 2 ), is called the Standard Error of Estimate of x. Since Sj,2 is the mean of a sum of squares, it can never be negative. Therefore (6.7.2) shows t h a t r* cannot be greater than 1. When r — ±_ 1, Sy2 = 0 and so every deviation, (y3 — byxXi) = 0; this means t h a t every point representing an observed value, every (xi, yj), lies on the regression line of y upon x. But if r = ± 1. the line of regression of y on x and t h a t of x on y coincide. Consequently, all the points representing the different observed values (xu yj), and not just the points representing the array-means, lie on a single line. There is then a straight-line relationship between the variates in the sample and the correlation in the sample is perfect. We see then t h a t the nearer r approaches unity, the more closely the observed values cluster about the regression lines and the closer these lines lie to each other, r is therefore a measure of the extent to which any relationship there may be between the variates tends towards linearity.
MORE
VARIATES
THAN
ONE
IO105
Exercise : Show that the line of regression of y on x as defined above is also the line of minimum mean square deviation for the set of points (x{, yj). 6.8. Worked Example. The marks, x and y, gained by 1,000 students for theory and laboratory work respectively, are grouped with common class intervals of 5 marks for each variable, the frequencies for the various classes being shown in the correlation-table below. The values of x and y indicated are the mid-values of the classes. Show that the coefficient of correlation is 0-68 and the regression equation of y on x is y = 29-7 + 0-656* (Weatherburn.) x
y 52 57 62 67 72 77 82 87
42 47
52
57
62
3 9 9 26 10 38 4 20 — i
19 37 74 59 30 7
4 25 45 96 54 18 2
6 19 54 74 31 5
—
TOTALS
—
—
—
—
—
-
26
97
—
226
244
72
77 82
—
—
—
—
—
—
—
7 9 19 15 5
—
6 23 43 50 13 2
—
189
67
137 55
TOTALS
3 2
35 103 192 263 214 130 46 17
5
1,000
—
5 8 8
—
21
Treatment: (1) To simplify working we select a working mean and new units. We take our working mean at (57, 67) and since the class-interval for both variates is 5 marks, we take new variates : X = (x - 51)15', Y = (y - 67)/5 What will be the effect of these changes on our calculations ? Consider a general transformation, x = a + cX; y = b + dY\ then x = a + cX] i> = b + dY We have : = is/sto 2
l
*i
2
- X)*
s* = c sx* and, likewise, s„ = dlsT*
or Also =
- x)' = J i,MX,
2
- x)(y, r =
.
.
(6.8.1)
S / y f X , - X)(Y, - 7) = cdsXT m
g L =
SxS,
cdssr CSxdSy
( 6 8 3)
SXSY
'
108
108 S T A T I S T I C S
Again b
vx = sztlszs = "fc xy /c»s x 2 = (djc)bYX
and, similarly, K, = (cIWxy (6-8-4) We conclude, therefore, that such a transformation of variates does not affect our method of calculating r, while, if, as in the present case, the new units are equal (here c = d = 5), the regression coefficients may also be calculated directly from the new correlation table. (2) We now set out the table on page 109. (3) (i) X = 0-239; Y — 0-177. Consequently, x - 57 + 5 x 0-239 = 58-195, and y = 67 + 5 x 0-177 = 67-885. (ii) IS S ftsXf
= 2-541.
sz*=±
S S fitX,» - X*
= 2-541 - (0-239)2 = 2-484, and sx = 1-576. Consequently, although we do not require it here, st = 5 x 1-576 = 7-880. (iii) - i s s f t l Y * = 2-339. sr' = 2-338 - (0-177)2 = 2-308 w i j and sr = 1-519. Consequently, s ( = 5 x 1-519 = 7-595. (iv) I s X f ^ Y , = 1-671.
sz}. = I S S f ( j X { Y , -
XY
= 1-671 - (0-239) (0-177) = 1-629, giving s (not here required) = 5 s x 1-629 - 40-725. (v) The correlation coefficient, r, is given by r - sxrjsx . sT = 1-629/1-576 x 1-519 = 0-679. (vi) The line of regression of y on x is given by y ~ 9 = - *)• But 6 „ = sTXlsz* = 1-629/2-484 = 0-656. Hence the required equation is y = 0-656* + 29-696. Exercise: For the data of the above example, find the regression equation of x on y and the angle between the two regression lines. Use 6.6.13 and the value of tan 6 to find r. Check and Alternative Method of Finding the Product-moment: It will be noticed that along each diagonal line running from top right to bottom left of the correlation table, the value of X + Y is constant. Thus along the line from X = 4, Y = — 3 to X= — 3, V = 4, X + y = 1. Likewise, along each diagonal line from topleft to bottom-right, the value of X — Y is constant. For the line running from X = — 3, Y = — 3 to X — 4, Y = 4, for example, X - Y = 0.
I T
* W -
00 eo
O N M
O>
00
O
03
®
•OI
eo
a> CO
>O O>
M
K.
CI
00
IN IN
05 •O CO
•0< 00 eo
CO ICO
CO
O
S3
o
S3
7
1
IN
C|
CS CO
•>(1
eo
IN
•01 •01
1,671
LOG
CO
•01 IN
W-
"C £
\
-01
W-
1
i—(
-0< i—1
-0<
eo O
a>
CO IN
IN
1
1
1
1
1
1
1
1
1
1
1
8
IN
©
+ + IN+ eo
•<11
O eo
eo IN +
CO
CO •OI
l>
CO
II
+ S
t1-
IN
1o +
+
as
eo
t-
+
1
1
CO
M +
1
1
CO
+
1
l> >o
O
CO
Til
os I—1 lO
t-
en
CO
eo
•<(1
>a
eo
•OI
•OI
o o
eo
1
•OI
eo 1
o>
eo
OV lO
>o
IN
CO CO
o
•01
•O >o
IN
o©
CO
CO
CO
+
00
lO
00
•a
eo
l>
(N
•a CM
>o
Ieo
•O<
r-
CO
«
a>
00
CO
o
+
i—L
>o
CO
•01 00 +
Ol 00
©
1 CO OS
lO
CT1 o eo o
00
o
IN
IN
•0< "0<
1
t-
1
1
1
1
1
1
1
1
1
•OI
l> i—i
>o
+
IN
CO IN
tco_
1 »o
•0<
a>
CO CO
II 8
05
00
a
8
CO IN IN
CO IN
eo
00 OJ
ao
CO
CO CO
eo 00
o
IN
IN +
IN
vH 1
lO OH
®
CO
i-H O
iH lO •01
CO
t-
O
7
•01
tCO
•01 1
CO
CO IN
CO
rH vH 1
CO
IN IN
IN (N
eo
C—
1
—
M CO 1
IN 1
O
7
+
IN +
eo
+
>T
•01 +
•<5
^
•O
t-
>O
IN CO
CO
r-
CT-
IN 00
ft. W -
TQ
/
>T
£
j
— H O H < J o )
M S3
—
110
110 S T A T I S T I C S
Now + Y,)> = E £ fi)X{2 + ZS/^Y/ i i i i
SS il
+ 2 2 2 / , ^ , ij
and s s / A - y,)* = s s f ( j x ? + s s / „ y / - 2 s s / l j j : 1 y ( *j t i ii if If then, as in the present case, the entries in the table cluster around the leading diagonal, we may use the second of these identities to find the product-moment of X and Y. Tabulating, we have •
-3
-2
-1
0
1
2
3
•
17
85
209
346
217
103
23
(X, - Y,)' .
9
4
1
0
1
4
9
153
340
209
0
217
412
207
Xt-Yi h
•
MX,
- Yt) *
2 2 h,X? = 2,541, 2 2 ftJYi2= 2,339, i j it From the table, S S / „ ( X , - Y,y = 1,538. Therefore, S I f,JC,Y, i I i> = i(2,541 + 2,339 - 1,538) = 1,671. If the entries cluster about the other diagonal, the first of the above two identities is the more convenient with which to work. 6.9. Rank Correlation. Suppose we have n individuals which, in virtue of some selected characteristic A, may be arranged in order, so t h a t to each individual a different ordinal number is assigned. The n individuals are then said t o be ranked according to the characteristic A, and the ordinal number assigned to an individual is its rank. For example, " seeded " entries for the Wimbledon lawn-tennis championships are ranked : they are " seeded " 1, 2, 3 (first, second and third) and so on. The concept of rank is useful in the following ways : (a) We may reduce the arithmetic involved in investigating the correlation between two variates if, for each variate separately, we first rank t h e given values and then calculate the product-moment correlation coefficient from these rank-values. In this way we have an approximation to the correlation coefficient, r. (b) We m a y wish to estimate how good a judge of some characteristic of a number of objects a man is by asking
MORE
VARIATES
THAN
io111
O N E IO105
him t o rank them and then comparing his ranking with some known objective standard. (c) We may wish to investigate the degree of agreement between two judges of the relative merits of a number of objects each possessing some characteristic for which there is no known objective standard of measurement (e.g., two judges at a " Beauty Competition ".) Assume a set of n individuals, al, a2, . . . a„, each possessing two characteristics x and y, such t h a t the n individuals may be ranked according to x and, separately, according to y. Let the rankings be as follows : Individual
«1 «! • • • « ( • • • »,
*-Rank
XT
y-Rank
. . .
X,
. . .
*
N
yi y% • • • y< • • • y.
Here xt, x2 . . . Xi . . . xn are the numbers 1, 2, 3 . . . n in some order without repetitions or gaps. Likewise the y's. Now if there were perfect correlation between the rankings we should have, for all i, *< = y<. If this is not so, write "i — yi =di. Then S d? = E (Xi — y^ = S Xi1 + Sy<2 - 2 S xtyi {
i
i
a
But S *i = S y ^ = l <
a
i
i
+ 2» + 3» + . . . + n \ the sum of
i
the squares of the first n natural numbers. Consequently, = S y^ = n(n + 1)(2« + l)/6. Therefore. i
i
2xiyi = i
+ Sy<») i
i
d? i
= n(n + 1)(2n + l)/0 -
JSd?. i
Now cov (x, y) — - S x , y i — xy; n i and But
v a r * = v a r y = - (I s + 2» + . . . + «»)— y»; n ^ = y = ( l + 2 + 3 + . . . + n)/n = (» + l)/2.
112
STATISTICS
Therefore, — (« + l) 2 /4 i = («a - 1)/12 - ( S ^ a ) / 2 « and ' var x = (n + 1)(2« + l)/6 - (* + l)*/4 = (n« - 1)/12 cov (*, y) = (n + 1)(2n + l)/6 -
cov (x, y) /(var x . var y) ! = cov (x, y) /var *
So
= 1 - 6 ( iS d i * ) l ( n 3 - n ) . . . (6.9.1) This is Spearman's coefficient of rank correlation, R. If X di2 = 0, J? = 1, and there is perfect correlation by rank i (since di = 0 for all i, and, so, for all i, Xi — yi). W h a t happens, however, if the two rankings are exactly t h e reverse of one another ? This is t h e most unfavourable case and, consequently, £ d * is a maximum, while, since for all t, t x, + y, will be equal to n + 1, £ (x{ + y,)2 = n(n + I) 2 . We have,
then,
1'Zxiyl = n(n -f l')» - 2n(n + 1)(2» + l)/6
Xxiyi = n(n + 1 )(n + 2)/6.
Cov (x, y)
or
2
is
- (n — 1)/12
2
and var (.x) is + (n - 1)/12. R = — 1. Thus R varies between the two limits ± 1Worked Example 1: The figures in the following table give the number of criminal convictions (in thousands) and the numbers unemployed (in millions) for the years 1924-33. Find the coefficient of rankcorrelation. Year
1924
1925
1926
1927
1928
1929
1980
1931
1932
1933
Number convicted of crime
7-88
8-12
7-86
7-25
7-44
7-22
8-28
8-83
10-64
9-46
N u m b e r of unemployed .
1-26
1-24
1-43
1-19
1-33
1-34
2-5
2-67
2-78
2-26
Treatment: We rank the data, thus :
Number ployed
dt'
.
1928
1929
1930
1931
1932
1933
5
7
9
8
10
4
3
1
2
8
9
6
10
7
6
3
2
1
4
4
16
4
1
1
16
1
1
0
4
1924 1925 1926
Year Number convicted
6
unem-
1927
MORE
VARIATES
THAN
ONE
IO105
We have S< d? = 48, and, since n = 10, na — » = 990. quently, R = I - 6 x ^ = 0-709
Conse-
Exercise : Find r, the product-moment correlation coefficient, for the above data. Worked Example 2 : Two judges in a baby-competition rank the 12 entries as follows : X
1
2
3
4
5
6
7
8
9
10
Y
12
9
6
10
3
5
4
7
8
2
11 12 11
1
What degree of agreement is there between the judges ? Treatment: Here we have no objective information about the babies, but the coefficient of rank correlation will tell us something about the judges. We have Srf,2 = 416* »3 - n = 1,716 i Thus, R = — 0-455, indicating that the judges have fairly strongly divergent likes and dislikes where babies are concerned ! 6.10. Kendall's Coefficient. A second coefficient of rank correlation has been suggested by M. G. Kendall. Consider the rankings in Example 2 above. If we take t h e first number of the second ranking, 12, with each succeeding number, we shall have 11 number-pairs. To each pair, (a, 6), say, allot the score 1 if a < b, and the score — 1 if a > b. Thus for the 11 pairs {a, b), where a — 12, the total score is — 11. Now consider t h e 10 pairs (a, b), where a = 9, viz., (9, 6), (9, 10), (9, 3), etc. The total score for this set is — 1 + 1 — 1 — 1 — 1 — 1 — 1 — 1 + 1 — 1 = — 6. Continuing in this way, we obtain the 11 following scores, — 11,— 6, — 1,— 6, 3, 0, 1, 0, — 1, 0, — 1, totalling 22. H a d the numbers been in their natural order, as in the upper ranking, t h e total score would have been 11 + 10 + 9 + 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 66. The Kendall rank correlation coefficient, T, is the ratio of the actual to the maximum score, i.e., in this case, T = — 6i6 i = — 3 A The Spearman coefficient, R, for the same d a t a was — 0-451.
114
114 S T A T I S T I C S
Generally, if there are n individuals in the ranking, the Kendall coefficient is x = 2 S j n ( n - 1) . . . (6.10.1) where S is the actual score calculated according to the method used above. A shorter method of calculating S is : In the second ranking, the figure 1 has 0 numbers to its right and 11 to its left. Allot the score 0— 11 = — 11 and cross out the 1. 2 has 1 number to its right and 9 numbers to its l e f t : the score, therefore, allotted is 1 — 9 = 8; cross out 2. 3 has 5 numbers to its right and 4 to its left; score is 5 — 4 = I ; cross out 3. Continue in this way, obtaining the set of scores : — 11, — 8, 1, — 2, — 1, 2, — 1, — 2, 1, 0, — 1, the total of which is 5 = - 22. Alternatively, we may set down the two rankings, one of which is in natural order, one above the other, and join 1 to 1, 2 to 2, 3 to 3 and so on. Then if we count the number of intersections (care must be taken not to allow any two such intersections to coincide), N, say, S will be given by S = n(n - l)/2 - 2N and, therefore, t = 2S/m(« — 1) = 1 - 4 N / n ( n - 1) .
(6.10.2)
Like Spearman's R, Kendall's t is + 1 when the correspondence between the rankings is perfect, and — 1 only if one is the inverse of the other. When n is large t is about 2i?/3. Worked Example : Show that the values of t between the natural order 1, 2, . . . 10 and the following rankings are — 0-24 and 0-60: 7, 10, 4, 1, 6, 8, 9, 5, 2, 3 10, 1, 2, 3, 4, 5, 6, 7, 8, 9 Find also t between the two rankings as they stand. (Modified from M. G. Kendall, Advanced Theory of Statistics, Vol. I, p. 437.) Treatment: (1) Consider the first ranking. Using the short method of calculating S, we have S = (6 - 3) + (1 - 7) -f (0 - 7) + (4 - 2) + (0 - 5) + (2 - 2) + (3 - 0) + (1 - 1) + (0 - 1) 11 Hence T = - 11/45 = - 0-24.
MORE
VARIATES
THAN
O N E IO105
Io3
(2) Using the alternative short method for the second ranking we have : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 10, 1, 2, 3, 4, 5, 6, 7, 8. 9 and, obviously, N = 9, the number of inversions of the natural order. Thus T = 1 - 4 x 9/90 = 3/5 = 0-60 (3) To find x between the two rankings, rearrange both so that one is in the natural order. Here it is easier to put the second in that order: 10, 4, 1, 6, 8, 9, 5, 2, 3, 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 Then
S = — 5 and T = — ^ = — a.
Exercise : Show that R between the natural order 1, 2, . . . 10 and the above two rankings has the values —0-37 and 0-45 respectively and that between the two rankings as they stand R = — 0-19. 6.11. Coefficient of Concordance. Frequently we need to investigate the degree of concordance between more than two rankings. Suppose, for example, we have the following 3 rankings : ^ 1 2 3 4 5 6 7 8 9 10 Y 7 10 4 1 6 8 9 5 2 3 Z 9 6 10 3 5 4 7 8 2 1 Summing t h e columns, we have the sums 17 18 17 8 16 18 23 21 13 14 Had there been perfect concordance, we should have had 3
6
9
12
15
18
21
24
27
30
and the variance of these numbers would then have been a maximum. B u t when, as in the present case, there is little concordance, the variance is small. I t is reasonable, therefore, to take the ratio of the variance of the actual sums to the variance in the case of perfect concordance as a measure of rank-concordance. The mean of each ranking is (n + 1) /2 in the general case; therefore, if there are m rankings, the mean of the sums will be m(n + l)/2. W i t h perfect concordance, these sums will be m, 2m, 3m, . . . nm and their variance, then, m 2 (l 2 + 22 + 3 a + . . . + m2)/« - m*(n + l) 2 /4 = m2(ws - 1)/12
116
STATISTICS
Let S be the sum of the squared deviations of the actual sums from their mean, m(n -f l)/2. We define the coefficient of concordance, W, between the rankings, by W = {S/n)/m*{n* - 1)/12 = 12S/m*n(n* - 1) . (6.11.1) Clearly, W varies between 0 and 1. In the case of the three rankings given, m = 3, n = 10 and W = 12 X 158-5/9 X 990 = 0-2134 I t may be shown (see Kendall, Advanced Theory of Statistics, vol. 1, p. 411) t h a t if i?av. denote the average of Spearman's R between all possible pairs of rankings, i?aT. = (mW - 1 )/(m - 1) . . (6.11.2) Exercise : Verify that (6.11.2) holds in the case of the three rankings given at the beginning of this section. 6.12. Polynomial Regression. So far we have limited our discussion of regression to bivariate distributions where the regression curves were straight lines. Such distributions are, however, t h e exception rather t h a n t h e rule, although they are important exceptions. If, using the notation of 6.6, we plot yi against Xi (or Xj against yj), t h e line about which these points tend to cluster most closely is usually curved rather than straight. When this is the case, the coefficient of correlation, f, which, it will be recalled, is a measure of the extent to which any relationship between the variates tends towards linearity, is no longer a suitable measure of correlation. The simplest type of non-linear equation is t h a t in which one of t h e variates is a polynomial function of the other, viz., y — a0 + axx -j- a^x* + . . . + Orxr + . . . + a^xk = S a /
.
(6.12.1)
r= 0
where the coefficients ar, (r = 0, 1, 2, . . . ft), are not all zero. If the regression equation of y on x (or x on y) is of this form, we have polynomial regression. Once we have decided upon the degree, ft, of the polynomial, we again use the Method of Least Squares to determine the coefficients, Or, (r = 0, 1, 2, . . . ft). Referring to Table 6.2.2, let y, be the actual mean of t h e x(-array, and let Y, be the calculated, or predicted, value when * = xt is substituted in (6.12.1). (If the data are not grouped
MORE
VARIATES
THAN
Io3
O N E IO105
and only one value of y corresponds to a given x, t h a t value of y is, of course, itself t h e mean.) The sum of t h e squared residuals, S 2 , is then given b y : S2 = £/,(y< <
Y,)* = S/,(y,
- L a,x,y
I
•
r
(6.12.2)
S 2 is thus a function of the A + 1 quantities, Or,(r = 0, 1, . . . k). To find the values of these quantities which minimise S 2 , we differentiate S 2 partially with respect to each of these quantities and equate each partial derivative, 8SildaT, to zero. This gives us K + 1 simultaneous equations in the o's, the normal equations, dS2ldOr = 0 (r = 0, 1, . . . k), from which the required coefficients may be determined. The following example illustrates the method when k = 2. Worked Example : The profits, f y , of a certain company in the xth year of its life are given by : X
1
2
3
4
5
y
1,250
1,400
1,650
1,950
2,300
Find the parabolic regression of y on x. {Weatherburn.) Treatment: Put u = x — 3 ; v = (y — 1,650) /50. u.
Then—
M4.
y-
v.
vu.
vu*.
1 2
-2 -1
4 1
-8 -1
16 1
1,250 1,400
- 8 — 5
16 5
-32 - 5
3
0
0
0
0
1,650
-13
0
-37
4 5
1 2
1 4
1 8
1 16
1,950 2,300
6 13
6 26
6 52
0
10
0
34
6
53
21
—
For parabolic regression2 of v on u, v = a + bu + cul and, so, S* = 2 (a + bu, + cu(' - i/i) . dS'/da = 2 S (a + bu, + cu,* — v,) i = 2(na + 6 S «, + c L u,*- 2 v,) i i i
118
STATISTICS
8S*l8b = 2 S (a + bu, + cu,2 — v,)u, i = 2(aS«, + 6 S « i , + c S t ( , l l - S v,u,) < < i »• dS'ldc = 2 S (a + bu, + cu? - v,)u,' = 2 (a E «,2 + 6 S M<s + c S w,4 - £ v,u,2) i > i i In the present example, the normal equations 0S2/Sa, SS'jSb, dS'/dc = 0, are 5a + 10c - 6 = 0; 106 — 53 = 0; 10a + 34c - 21 = 0 giving a = — 0 086, b = 5-3, c = 0-643. The regression equation of v on u is, therefore, v = - 0-086 + 5-3« + 0-643»2. Changing back to our old variates, the required regression equation is y = 1,140 + 72* + 32-15*2 6.13. Least Squares and Moments. If we differentiate (6.12.2) partially with respect t o a,, we have 8S*/8ar = Zft,(y, - £a,*/) . ( - 2x() i
r
and, equating to zero, 2 f.x^y, = 2 Jt.xiY,, for all r, i i showing t h a t — The process of fitting a polynomial curve of degree k to a set of data by the method of Least Squares is equivalent to equating the moments of order 0, 1, 2 . . . k oi the polynomial to those of the data. 6.14. Correlation Ratios. The correlation table being (6.2.2), let the regression equation of y on x be y = y(x). Then Yi = y(xi). If, then, S y l is the standard error of estimate of y from this equation, i.e., the mean square deviation of the y's from the regression curve, NS/
= 2 i j = 2 ^My,
-
Yi)" = 2 2 fijiy, - yl + y( - ?«)" i j - yi)2 + 2 2 - yi)(yt - F<) + 2 2 fi0i i j
Yi)*.
MORE
VARIATES
THAN
ONE
IO105
Let fH = fi sa S fij, the total frequency in the *
NSy2 =
+ S«i(y< i
Yi)* .
(6.14.1)
i
It follows t h a t if all the means of the ^-arrays lie on y = y(x), i.e., (yi — Yi) = 0, for all i, Sv2 is the mean value of the variance of y in each #-array, taken over all such arrays. Consequently, if all the variances are equal, each is equal to Sy2. When this is the case, the regression of y on * is said to be homoscedastic (equally scattered). If the regression is also linear, Sy2 = Sj,2 (1 — r*) and so the standard deviation of each array is s„ (1 — r2)i. Now let Sy'2 be the mean square deviation of the y's from the mean of their respective arrays. Then NS,'2 = S Y.fii(yt i
- y,)* = S Z f o ? - 2 £ £
j
= £ •Zf.y i
But
S
t
2
i
+ SS/(jy(« i j + S £ ftiy?
- 2 £ mzfvy,)]
i
i
fofi,
J
i
i
i
£ f i f f j = yi £ fij, and, therefore, i
J
NSy'2 = £ Xfijyf
-
> >
£ £ Mi2 • j
- Nm0i'
-
£ rnyi2 = Nsv2 + Ny2 — £ run* i
= Nsy* -
i
(£ «iy,:a — Ny2) i
But y is the mean of the array-means, yi; therefore the expression in brackets is N times the variance of the means of the arrays, which we shall denote by So S,'2 = s, 2 - sg2 . By analogy with (6.7.2), we write this
.
.
.
(6.14.2)
S„' ! = s„ 2 (l - «***) •
•
•
(6.14.3)
where eyx = SylSy (6.14.4) ana is called the correlation ratio of y on x. Likewise e-cy = SxJsr. is the correlation ratio of x on y. Since both S y ' 2 and
STATISTICS
120
Sy* are positive, (6.14.3) shows t h a t e,,x* < 1. Moreover, since the mean-square deviation of a set of quantities from their mean is a minimum, 0 < S„' 2 < S„2
or
0 < 1 - evx2 < 1 - r\ 2 i.e., r < eyx2 < 1
0 < eyx* - r* < 1 - r 2 .
or
.
.
(6.14.5)
Since we regard r as a measure of the degree to which a n y association between the variates tends towards linearity and since the residual dispersion, 1 — r2, is 0 when r" = 1, and 1 when r = 0, a non-zero value of eyx2 — r" may be regarded tentatively as a measure of the degree to which the regression departs from linearity. (But see 10.10.) That eyx is a correlation measure will be appreciated by noting that, when eyx2 = 1, S / 2 = 0, and, so, all the points (*<, yj) lie on the curve of means, the regression curve, y = y(x), i.e., there is an exact functional relationship between the variates. 6.15. Worked Example. Calculate the correlation ratio evx for the data of 6.8. Treatment: (1)
V " = V / V = (2
- Xy*)INsS .
(6.15.1)
Let T((y) be the sum of the y's in the Xjth array. Then T{(y) = n,f, and £ T{(y) = Ny. If T(y) be the sum of the y's in the distribution, i T(y) = S T,(y).
Then Ny* = [T{y)]*/N and S n&?=
.
Consequently
From the definition, e^f = s j 2 /s„ 2 . we see that, since deviation from the mean is unchanged by change of origin, e^ is unchanged thereby. Furthermore, since both numerator and denominator involve only squares, they are changed in the same ratio by any change of unit employed. Hence e^ is unchanged by 'change of unit and origin, and we may, therefore, work with the variables X and Y in our table of 6.8(2). From that table, then, we see that T(y) is the total of row F, and s^E^Cy)] j t j j e
sum
Qf
those quantities each of
which is the square of a term in row F divided by the corresponding term in row E.
MORE
VARIATES
THAN
ONE
IO105
(2) We have then : T(y) = sum of row F — 177 and N = 1,000, giving [T(y)]'jN = 31-329. Also 37" 1132 161* U^ 120^ 184' : 26 + 97 226 + 244 + 189 + 137 + I \ n< ) 112s 66* 172 + + + 55 21 5 = 1,117 (to 4 significant figures). 1
Thus
or
1 1 7
01
~ 1,000 x 2-308
<'••
V = 2-308 from 6-8(3))
= 0-471 e^ = 0-686, approximately. Since e yr 2 — r* = 0-009, the departure from linearity is small.
6.16. Multivariate Regression. When we have more t h a n two correlated variates, two m a j o r problems present themselves : (1) we m a y wish t o examine the influence on one of the variates of t h e others of the set—this is t h e problem of multivariate regression; or (2) we m a y be interested in assessing t h e interdependence of two of t h e variates, after eliminating the influence of all the others—this is t h e problem of partial correlation. Here we confine ourselves to the case of three variates, \> x2> xz> measured f r o m their means, with variances.^ 2 , s 2 2 , s 3 2 respectively. Since the variates are measured f r o m their means, let t h e regression equations be x
= 12-3*2 + &13-2*3 • *2 = &23-l*3 + &21-3*X •
• •
• •
(6.16.1) (6.16.2)
.
.
.
(6.16.3)
6
*3
=
&31-2*1 +
&32-l*2
We shall determine t h e b's b y t h e m e t h o d of Least Squares. Consider (6.16.1). The sum of t h e squared deviations of observed values of xl f r o m the estimated values is given b y S* = 2 (*, - bu.3xj
- &13.2*,)a
•
(6.16.4)
The normal equations are : b12.3 2
+ b13.2 2 *2*, = 2
6,2.3 2 *2*3 + 6 13 . 2 2 * 3 2 = 2 xxx3
.
(6.16.5)
.
(6.16.6)
122
S T A T I S T I C S
Solving these, we have
^So PL " 1 ~- yni 2,»23 8 ]J
•
.
(6.16.8)
Here r12, r23, r3l are total correlations : rl2, for instance, is the correlation between xt and x2, formed, by ignoring the values of x3, in the usual manner. The coefficients bu.3 and fc13.2 are called partial regression coefficients. Whereas ry = r,,-, is not in general equal t o bji-tThe reader familiar with determinant notation 1 will realise t h a t we may simplify these expressions. Let R denote the determinant where r x l = r 22 = *-33 = 1 and n j = r-ti for i, j = 1,2, 3, but i
j.
Then, if Rij denotes the cofactor of rij in R, we have r
23i'
~
^12
=
r
r
13r23i
lS
~
Ria —
r
l3
r
l2r31-
also „i?13 = R'1 'li-Ru + ' 12^12 + 'ls-Ris = -R] f-Rn + * 12^12 + * is' ,3R y 21 i?n + r22Rl2 + r23R13 = 0 >oH r2lRxl + Rl2 + r23 l l3 = 0 fr»iRu + r32R12 + r33R13 = 0 J l f 1 3 f l n + r23Ru + • .R 1 3 = 0 J (6.16.9) The regression equations become : (1) Regression of xx on x2 and x3 : + ^
+ ^
= 0 .
(2) Regression of x2 on x3 and
(6.16.10(a))
:
+ s =0 . + s2 s3 1 (3) Regression of x% on xtl and x% :
(6.16.10(6))
1 ^ + ^ 3 2 ^ = 0 . (6.16.10(c)) + s3 Sj 62 In the space of the three variates, these equations are represented by planes. These regression planes should not be 1 See Mathematical Note at end of Chapter.
MORE
VARIATES
THAN
ONE
IO105
confused with the regression lines : the regression line of xl on %% being xx = (r11sllsl)x2, for instance. 6.17. Multiple Correlation. D E F I N I T I O N : The coefficient of multiple correlation of xx with x2 and x3, denoted by is the coefficient of product-moment correlation of xl and its estimate from the regression equation of xx on x2 and x3. That this is a natural definition is clear if we recall t h a t if x x lies everywhere on the regression plane, there is an exact functional relationship between the three variates. From the definition,
=
cov (*„ b x2 + W , ) [var xx . var (&12.s*2 + b13.2x3)]i (i) cov to, b12.3x2 + bl3.2x3) = S(bli.3x1xi + £>13.2*1*3). since xx = *2 = x3 = 0 = bl^.38(xlx2) + bl3.t£(xxx3) = 6 l a . s covO*,*,) + 6 1 3 . 2 COV ( X j X 3 ) sj S j I? j 3 = r y 12SlS2 — p i3SlS3 i7" pn 2 3 11 2 ll S = - ^ [''12^12 + ' i s ^ i d
Now, using the first equation of (6.16.9), ji cov (xr, &i2.3*2 + &18-2*3> = - £1 p ^ [R — r l l ^ l l ] R = Sl*[l - i?/i?u], ••• rH = (ii) var ( x j = s ^ (iii) var (b12.3x2 + bl3.2x3) = 6 12 . 3 2 s 2 2 + & 13 ., 2 s 3 ! + 2&12.3&la.2»'23s,!ss (see 7.7.4) =
[i?12a + R13* +
2R12R13r23]
= Ki l l [i?i2(i?12 + r23R13) + + n Using the second and third equations of (6.16.9), var (fc12.3*2 + &13.2*3) = -
=
s2 W~i
R
•"11
+
i 2 R n r i 2 ~ KuRii'iai = ^'t1
-
124
STATISTICS
Consequently, r , . , , = [1 - R I R ^ i = ^
+
(6-I7.D
Likewise ' m i = [1 - jR/R^li and = [1 R/R3S]i 6.18. Worked Example. Calculate the multiple correlation coefficient of xt on xt and x}, r i.t>- from the following data : *1-
xs.
5 3 2 4 3 1 8
2 4 2 2 3 2 4
21 21 15 17 20 13 32
Find also the regression equation of Treatment: * a - =3 5 3
+1 -1
2
- 2 0
0
-1 -3 +4
1 9 16
-2
32
4 3 1 8
-
1 1 4
*3
1 1 1 1
21 21
+ + 15 17 -
1 1 5 3
1 1 25 9
0
20
0
0
0 0
-1 +1
1 1
13 - 7 32 + 12
49 144
+3 +4
+ 7 + 21 + 12 + 4 8
-2
6
-
229
+7
+27 + 79
-1
4 2 2 3
+ 1
4
=
***
2
2
on xt and x.
-1 -I 0
*3-20
1
-1 -1 +2
+ + +
1 + 1 1 - 1 5 +10 0 3 0
0
X a = - 4 = - 0-286; X% = - i = - 0-286; = - ^ -0-143. A S * , 8 = 4-571; ±E.X2» = 0-857; A S = 32-714; = 1; = = 3-857; = 11-286. cov (*,*„) = i £ X ^ j -
= l i - ( - 0-286) ( - 0-286) = 1 - 0-082 = 0-918 '
cov (x^,) = A S XtX, -
= 3-857 - ( - 0-286) ( - 0-143) = 3-857 - 0-041 = 3-816 = 11-286 - ( - 0-286)(— 0-143) = 11-286 - 0-041 = 11-245
cov(*,*,) = + S X j X , -
more
variates
than
one
io105
var (*,) = J. £
- X t ' = 4-571 - ( - 0-286)» = 4-571 - 0-082 = 4-489 var (*,) = A _ x , 2 = 0-857 - ( - 0-286)2 = 0-857 - 0-082 = 0-775 var (*,) = { S j ; , ' - X * = 32-714 - ( - 0-143)2 = 32-714 - 0-020 = 32-694 918 ° ' „ « 3-816 12 23 = [0-775 x 32-694]i [4-489 x 0-775]i = 0-492;' r23 = °-758: - [32-694 x445489]i = ° ' 9 2 7 I1 0-492 0-927 R = 0-492 1 0-758 = 0-0584 | 0-927 0-758 1 Ru = [1 - (0-758)2] = 0-4254; R12 = - [0-492 - 0-758 x 0-927] = 0-211; .Rls = [0-492 x 0-758 - 0-927] = - 0-5541 T, * "1* f~i 0-0584-1 i . 98 •• r ' - » = L 1 ~ « r J = L 1 " 0*4254J = Regression equation of x1 on xt and x3 is (6.16.10(a)):— _ ^ + _ Xi) + 5 _ = o, 1 3 where xx = X + 4 = 3-714, x, = X, + 3 = 2-714, x, = X, + 20 = 19-857, and st = (4-489)* = 2-11873, s, = (0-775)* = 0-88034, s, = (32-694)* = 5-71787. Hence the required regression equation is 0-20072(*! - 3-714) + 0-24024(*„ - 2-714) - 0-09708(x, - 19-857) - 0 or 20-1*, + 24-0*2 - 9-7*3 + 53-0 = 0 A
Exercise: Calculate t-2.31 and r3.12 and find the other two regression equations. 6.19. Partial Correlation. In many situations where we have three or more associated variates, it is useful to obtain some measure of the correlation between two of them when the influence of t h e others has been eliminated. Suppose we have the three variates, xv x2, x3, with regression equations (6.16.1), (6.16.2) and (6.16.3). Let x3 be held constant, then at this value of x3, the two partial regression lines of xl on xs and of x2 on xt will have regression coefficients i 1 2 . 3 and 621.3. In line with (6.6.14), we therefore define the partial correlation of xt and x2 to be given by ' i s s 2 = (&x.-3 X V . )
(6-19.1)
126
statistics
Likewise r
28-ia = (623-1 * 632-1) and y sl .2 a = (631-2 x 6J3.2) • I t follows t h a t -^12 x _ s2 R21 _ r 2 _ _ 12 S S 2-RH 1-^22 •^11-^22 y ( 12 ^13^23)^ (1 - r M «)(l - »-312) i.e.,
~ r i* r ** . [(1 - ^ ( l - ' „ • ) ] *
f 12.3 = 1
.
.
(6.19.2)
In practice it is seldom found t h a t »-12.3 is independent of x3, and we therefore regard the value of »-12.3 given by (6.19.2) as a rough average over the varying values of x3. Using the data of 6.18, we have y,-.. 128 =
005533 T = 0-226. [0-4254 X 0-1407]*
Mathematical
N o t e
to
Chapter
Six
If a, b, c, d are any four numbers we denote the quantity by I a I. Thus 1 ^ 1 = 1 x 7 - 3 x 5 = - 8 . J \ c a \ | 5 7 I Such a function of its four elements is called a determinant of order 2, having 2 x 2 elements. A determinant of order three has 3 x 3 elements and is written I all «12 «18 j i a 21 a 22 a 23 ' ! ^31 ^82 ^88 I the suffixes of an element indicating the row and column it occupies in the determinant. Suppose we select any element, a 2 1 say, and rule out the row and column in which it occurs. We are left with the 4 ad-be
elements I Ua l 2 a I 38 S» multiplied by (— the determinant.
I. The determinant of these four numbers I 2 1) +* = — 1 is called the cofactor of a 2 i in The cofactor of a33 is
1112 1)3+3 I J — — | a H a l2 | 1 #21 #22 t ' ^21 #22 I n general, we denote the cofactor of an element
(
by AQ.
more
variates
than
o n e io105
127
The value, A, of the determinant
may be obtained by forming the sum of the products of the elements of any row (or column) and their respective cofactors. Thus 21 + «22A a2 + = anA 11 + a21A
21
+ aalA
sl
= a 12-^12 ~t~ ^22-^22 "1" ®S%A jjg = a
1 -2 3 7 4 - 1 2 - 3 2 -1 —1 = 1 1 + 3 3 2 z i + ( - 1) X ( - 2) (8 - 3) + 2(14 + 2) + 3(- 21 _ 8) = — 50.
For instance
Suppose now the elements of any row (or column) are proportional to those of another row (or column). Say, for example, we have a i2 "13 12 a3 3 Xa12 Xa ls — aa 11 Xa + i2^ a 13 a 31 T »1S«
a 11
Then, clearly, atlA
21
+ altA
22
+ aisA
2l
= 0.
In fact, if we form the sum of the products of the elements of any row (or column) and the cofactors of the corresponding elements of another row (or column), t h a t sum is zero. (The reader will find a useful introduction to determinants in C. A. B. Smith's Biomathematics (Griffin); a more detailed discussion is t o be found in Professor A. C. Aitken's Determinants and Matrices (Oliver & Boyd).)
128
statistics
EXERCISES ON CHAPTER SIX 1. Calculate the means of the following values of x and y : X
.
y •
0-25
1-00
2-25
4-00
6-25
012
0-90
213
3-84
6-07
The corresponding values of x and y satisfy approximately the equation y = mx + c. By the method of least squares obtain the best values of the constants m, c, assuming that there is error in the y values only. 2. Daily Newspapers (London and Provincial) 1930-40 : Year
1930
1931
1932
1933
1934
148 18-2
Number Average circulation (millions)
169
164
156
157
147
17-9
17-6
17-9
18-2
18-0
Year
1936
1937
1938
1939
1940
Number Average circulation (millions)
148
145
142
141
131
18-5
191
19-2
19-5
18-9
1935
Fit a straight line to each of these series by the method of least squares, represent graphically and comment on the fit. (L.U.) 3. Calculate the coefficient of correlation between the continuous variables x and y from the data of the following table : X.
y.
- 3 to - 2 - 2 to - 1 — 1 to 0 0 to 1 1 to 2 2 to 3 Total
- 4 to —3 to —2 to — 1 to 0 to -3. -2. -1. 0. 1. 150 20 10 10 24
40 60 40 30 16 38
20 90 60 36 30 48
214
224
284
—
10 20 50 42 ' 20 34 176
1 to 2 to 2. 3.
10 20 20 16 6
—
—
—
—
72
22
16 6 —
6 2 —
8
Total.
220 200 180 150 100 150 1,000 (I.A.)
more
variates
than
io105
o n e io105
4. Calculate the correlation coefficient for the following U.S. data-. Index of income payments 114 137 172 211 230 239 Index of retail food prices 97 105 124 139 136 139 (L.U.) 5. An ordinary pack of 52 cards is dealt to four whist players. If one player has r hearts, what is the average number held by his partner ? Deduce that the correlation coefficient between the number of hearts in the two hands is — A. (R.S.S.) 6. In the table below, verify that the means of the ^-arrays are collinear, and also those of the y-arrays, and deduce that the correlation coefficient is —0-535. x 0 1 2 3 4 0 y\
3 1
12 12
18 54 18
4 36 36 4
3 9 3
(R.S.S.) 7. The ranks of the same 15 students in Mathematics and Latin were as follows, the two numbers within brackets denoting the ranks of the same student: (1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1), (9, 11), (10, 16), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13). Show that the rank correlation coefficient is 0-51. (Weatherburn.) 8. From the table below compute the correlation ratio of y on x and the correlation coefficient: Values of x 0-5-1-5. 1-5-2-5. 2-5-3-5. 3-5-4-5. 4-5-5-5. Total. Number of cases Mean y .
20 11-3
30 12-7
35 14-5
The standard deviation of y is 3-1. (Hint:
25 16-5
15 19-1
125 —
Use (6.15.1).)
(L.U.) 9. The three variates xlr *„ jrs are measured from their means. = 1; s, = 1-3; s3 = 1-9; »•„ = 0-370; r u = — 0-641; = — 0-736. Calculate »-„.,. If xt — xx + xt, obtain r4>, ra and Verify that the two partial correlation coefficients are equal and explain this result. (L.U.) Solutions 1. m = 0-988(4); c = —0-105. 3. r = 0-334. 4. 0-972. 8. e„ = 0-77(4); r = 0-76(4). 9. rn.t = —0-586; r„ - 0-874; r „ = 0-836; r,,., = —0-586.
c h a p t e r
s e v e n
SAMPLE AND POPULATION I : SOME FUNDAMENTALS O F SAMPLING
THEORY
7.1. Inferences and Significance. So far we have been concerned with problems of descriptive statistics : we have concentrated on describing distributions, summarising their main properties mathematically and establishing certain general principles exemplified by them. We have not as yet used these summaries and general principles for other purposes. This we must now start t o do. For one of the fundamental problems of statistics is : How, and with what accuracy, may we draw inferences about the nature of a population when we have only the evidence of samples of t h a t population to go on ? Suppose, for example, that we wish to find whether among males in the British Isles belonging to some specified age-group there is any correlation between height and weight. In practice, we cannot weigh and measure every individual belonging to this " p o p u l a t i o n " . We therefore resort t o . sampling. Common sense tells us, first, that, other things j being equal, the larger t h e sample, the better any estimate we i base on our examination of t h a t sample; and, secondly, that, ! whatever the size of the sample, t h a t sample must be a representative one. Assuming, for the time being, t h a t we have settled on t h e size of the sample or samples we shall take, how do we make sure t h a t the sample will be representative, a random sample ? This is our first problem. Suppose, however, t h a t we are satisfied t h a t our method of sampling is of a kind t o ensure random samples : we take our; samples, measure the height and weight of each individual in a sample, and calculate a value for r, the correlation coefficient, j based on the sample size, N. Immediately a host of new doubts and misgivings arise : How do we know t h a t the value obtained for r is really significant ? Could it not have arisen by chance ? Can we be reasonably sure that, although the variate-values obtained from a sample show a certain degree of correla130
sample
and
population.
i i 131
tion, in the population as a whole the variates are correlated ? Suppose we obtain from a second sample of AT a different value for r ; or suppose t h a t with a different value of N we obtain yet another value for r. Which, if any, of these values shall we use as the best estimate of p, t h e correlationcoefficient in the population ? Clearly, unless we can establish some general rules of guidance on such matters, all our descriptive analysis will be of little use. This is one of the main tasks of t h a t branch of statistics usually termed Sampling Theory. Before starting a more detailed discussion, let us set down what appear to be a few of the main types of problem with which the necessity of making statistical inference—inference from sample to population based on probabilities—confronts us: (а) There are those problems involved in the concept of randomness and in devising methods of obtaining random samples. (б) There are those problems which arise from t h e variation, from sample to sample of the same population, of the various sample statistics—problems concerned with the distribution of sample statistics. (c) There are those problems connected with how to estimate population parameters from sample statistics and with the degree of trustworthiness of such estimates. And, lastly, (d) There are those problems which arise when we seek t o test a hypothesis about a population or set of populations in the light of evidence afforded by sampling, problems, broadly, of significance. 7.2. What Do We Mean by " Random "f Unfortunately it is not possible here to enter into a detailed discussion of the difficulties involved in the concept of randomness. The " dictionary definition " of random sample is usually something along the following lines : A sample obtained by selection of items of a population is a random sample from t h a t population if each item in the population has an equal chance of being selected. Like most dictionary definitions, this one is not really very satisfactory, for, as the reader will realise, it has the air of trying
132
statistics
desperately to disguise something t h a t looks suspiciously like circularity. Nevertheless, we must reconcile ourselves, here a t least, t o using it. I t has, however, this virtue, t h a t it brings out the fact t h a t the adjective random applies to the method of selection rather than to any characteristic of the sample detected after it has been drawn. In this connection, two other, related, points must be made : (1) W h a t we are out t o get when we sample a population is information about t h a t particular population in respect of some specified characteristic, or set of characteristics, of the items of t h a t population. When sampling, we should keep asking ourselves, " What precisely are we trying to find out about what population ? " (2) A method t h a t ensures random selection from one population need not necessarily do so when used t o sample another population. W h a t are the main types of population we may sample ? In the first place, there are those populations which actually exist and are finite. Because all measurement entails approximation, the distribution of any variate in such a population is necessarily discrete. There are two ways of sampling such a population : after selecting an item, we may either replace it or we may not. Sampling without replacement will eventually exhaust a finite population and automatically, after each selection, the probability of a n y item being selected is altered. Sampling with replacement, however, can never exhaust even a finite population, and is thus equivalent to sampling from a hypothetical infinite population. If the probability of any item in the population being chosen is constant throughout the sampling process, we call the sampling simple. Thus, with a stable population, sampling with replacement is simple sampling. I t may happen, however, t h a t a population is so large t h a t even sampling without replacement does not materially alter the probability of an item being selected. In such a case, sampling without replacement approximates to simple sampling. The second type of population we are likely to encounter are theoretical or conceptual populations: The difference between an actual population and a conceptual one is illustrated when we compare a truck-load of granite chips with, say, the population of all real numbers between 0 and 1. Conceptual populations may be finite or infinite, but any infinite population is necessarily conceptual; so is any population in which t h e variate is continuous. Apart from their intrinsic interest, con-
sample
and
p o p u l a t i o n .ii133
I3I
ceptual populations are important because they can be used as models of actual populations or arise in the solution of problems concerned with actual populations. Finally, there are " populations " such as t h a t of " all possible throws of a die '' or t h a t of '' all possible measurements of this steel rod ''. These are certainly not existing populations like a truck-load of granite chips, nor are they anything like as definite as the population of " all real numbers between 0 and 1 " , which is, mathematically, precise. There are many difficulties with such "populations". Can we, for instance, regard t h e result of six throws of an actual die as a random sample of some hypothetical population of " a l l possible throws " ? And, since there is no selection, no choice, can they be regarded as constituting a random sample ? And in what way do we conceive essentially imaginary members of such a " population " as having the same probability of being selected as those members which, in Kendall's phrase, " assume the mantle of reality " ? Perhaps all we can say at t h e moment is t h a t such " populations " receive their ultimate justification in the empirical fact t h a t some events do happen as if they are random samples of such " populations ". 7.3. Random Sampling. We come then to the very much more practical question of how to draw random samples from given populations for specific purposes. Certain general principles should be borne in mind. To begin with, successful sampling demands specialised knowledge of the type of population to be sampled. For example, successful sampling of the varieties of birds visiting a given 10 acres of common land during a certain period of the year requires t h a t the sampling scheme be drawn up with the assistance of ornithologists intimately acquainted with the habits of possible visitors. On the other hand, the method of selection must be independent of the property or variate in which we are interested. If we wish to sample a truck-load of granite chips, just arrived in the siding, for chip-size, it would be fatal t o assume that, since the chips have been thoroughly shaken up on the journey, any shovelful will provide us with a random sample. A moment's reflection will convince us t h a t there will have been a t least a tendency for t h e more massive chips t o gravitate towards the bottom of t h e truck, while the lighter and smaller tend to come to the top. However, had we been interested in sampling a given number of churns of milk for f a t content, an adequate sampling scheme would have been to select a number of the churns a t random and, then, having thoroughly stirred their
134
statistics
contents, t o ladle out a given quantity from each. B u t how can we select a number of churns a t random ? Can we rely on " haphazard " human choice ? The answer is " No " . Indeed, we should seek to eliminate the human factor as far as possible. For experience tells us t h a t human choice is certainly not random in accordance with the definition we have here adopted. Even in cases where, at first sight, bias would seem hardly likely, as, for instance, in choosing the final digit in a set of four digit numbers, bias is most definitely operative. So to eliminate this factor, we resort t o a number of methods, of which but two can be mentioned here. 7.4. Ticket Sampling. The first method is ticket sampling. Let us assume t h a t we have a finite population of N items. We construct a model of this population as follows : On N similar cards we write down the relevant features of each member of t h e population, shuffle the cards thoroughly and draw n cards, say, representing a sample of n from the actual population. This is a fairly reliable method, but, if the population is large, involves much work preparing the model. Moreover, to ensure t h a t shuffling is really thorough is by no means as simple as it sounds. 7.5. Random Sampling Numbers. The second method is the method of using random sampling numbers. Given a finite population, we assign to each item of this population an ordinal number 1, 2, 3, . . . N. This set of numbers is virtually a conceptual model of the actual population. Suppose we wish to draw a sample of n. We use a table of random numbers. Among t h e best known are : L. H. C. Tippett, Tracts for Computers, No. 15, giving 10,400 four-figure numbers, composed of 41,600 digits. M. G. Kendall and B. Babington Smith, Tracts for Computers, No. 24, giving 100,000 digits grouped in twos and fours and in 100 separate thousands. R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural and Medical Research, giving 15,000 digits arranged in twos. A Million Random Digits, published by the Rand Corporation, Santa Monica, California, giving five-figure numbers (see Table 7.1). We do not " pick out " numbers haphazardly from such tables as these. Indeed, it is essential not to do so, for it is extremely likely t h a t if this is done the bias of number-
sample
and
I3I
p o p u l a t i o n .ii135
preference, which we seek t o eliminate by using will operate once more. Instead, having stated numbers used and having indicated which section should work systematically through t h a t section. will make the procedure clear.
such tables, t h e table of is taken, we An example
Example: Draw a random sample of 20 from the " population " given in the table below. Calculate the mean of the sample and compare it with the population mean of 67-98. Treatment: we number the items in the table as follows: Length (cm) 59 and under 606162636465666768697071727374757677 and over
Frequency. 23 169 439 1030 2116 3947 5965 8012 9089 8763 7132 5314 3320 1884 876 383 153 63 25 58703
Sampling Number. 1-23 24-192 193-631 632-1161 1162-3777 3778-7724 7725-13689 13690-21701 21702-30790 30791-39553 39554-46685 46686-51999 52000-55319 55320-57203 57204-58079 58080-58462 58463-58615 58616-58678 58679-58703 —
We now read off from Table 7.1 20 successive five-figure numbers less than 58704, ignoring those greater than 58703. We thus obtain : 23780 28391 05940 55583 45325 05490 11186 15367 11370 42789 29511 55968 17264 37119 08853 44155 44236 10089 44373 21149 Our sample of 20 is consequently made up as follows : 2 items from the 64- class; 4 items from the 65- class; 3 items from the 66- class; 3 items from the 67- class; 1 item from the 68- class; 5 items from the 69- class; 2 items from the 72- class.
136
statistics
Taking the mid-value of the classes as 59-5, 60-5, 61-6, etc., the mean value of the sample is immediately found to be 1354/20 = 67-70 cm, as compared with 67-98 cm, the population mean. Exercise : It is desired to obtain random samples from the following : (i) a truck-load of granite chips; (it) a forest of mixed hardwood and softwood; (Hi) the population of London; (iv) all the cattle in Oxfordshire; (v) the varieties of bird_s visiting a given area of common land; (vij plants in a very large area of the Scottish Highlands. Explain the principles which would guide you in collecting such samples. (L.U.) Table
7.1.
Random
Numbers
(From A Million Random Digits, published for the Band Corporation by the Free Press (Glencoe, Illinois), previously published in the Journal of the American Statistical Association, Vol. 48, No. 264, December 1953.) 23780 88240 97523 80274 64971 67286 14262 39483 70908 94963
28391 92457 17264 79932 49055 28749 09513 62469 21506 22581
05940 89200 82840 44236 95091 81905 25728 30935 16269 17882
55583 94696 59556 10089 08367 15038 52539 79270 54558 83558
81256 11370 37119 44373 28381 38338 86806 91986 18395 31960
45325 42789 08853 82805 03606 65670 57375 51206 69944 99286
05490 69758 59083 21149 46497 72111 85062 65749 65036 45236
65974 79701 95137 03425 28626 91884 89178 11885 63213 47427
11186 29511 76538 17594 87297 66762 08791 49789 56631 74321
15357 55968 44155 31427 36568 11428 39342 97081 88862 67351
N o t e : This table gives 5 0 0 random digits, grouped for convenience, in five-digit numbers. Suppose six random numbers less than 161 are required, read off successive groups of three successive digits, rejecting those greater than 160. The result is :
(237) (802) (839) 105 (940) (555) (838) 125 (645) (325) 054 (906) (597) (411) (186) 153 (578) (824) 092 (457) (892) 009 The numbers in brackets are those greater than 160 and are rejected. The six random numbers less than 161 so obtained are therefore : 105
125
54
153
92 and 9
7.6. The Distribution of Sample Statistics. In the example of t h e previous section, we saw t h a t our sample mean differed somewhat f r o m the m e a n of t h e population. A different sample of 20 would have yielded a different value, also differing f r o m t h e population mean. W e r e we t o t a k e a large n u m b e r of samples of 20 we should h a v e w h a t in fact would be a frequency distribution of t h e m e a n of so m a n y samples of 20.
sample
and
I3I
p o p u l a t i o n .ii137
Suppose we drew every possible sample of 20 from the population of 58703. This would give us, for sampling with replacement, the enormous number of 58703 20 /20 ! such samples. (How?) If we drew the relative frequency polygon for this distribution, it would approximate closely to a continuous probability curve, with its own mean, variance and moments of higher order. Likewise, other sample statistics, the sample variance, for instance, also have their distributions for samples of a given size. So the question arises : What do we know about the distribution of sample statistics when : (a) the population sampled is not specified, and (b) the population is specified ? 7.7. The Distribution of the Sample Mean. We begin by recalling our definition of a random sample (7.2) and then, to make it of practical, mathematical use, reformulate it as follows : Definition: Random Sample. If x<, (i = 1, 2, 3, . . . n) is a set of n statistically independent variates, each distributed with the same probability density
X =
+ a2x3 - ) - . . . +
OiXi +
. . . +
OnXn
= 2 a{Xi i= 1 (7.7.1)
where the a's are arbitrary constants, not all zero.
We have
£{X) =£( \ i S= 1OixA/ = t S= i ai£(Xi) or
^'(X) -
S Oi^'iXi) i« 1
(7.7.2)
If we put at = 1 /n, for all i, and subject the to the condition that they all have the same probability density, we have £•(*!> = £(X2) = . • • = £(xn) = £(x) = (x, and X
becomes (*, + * 2 + . . . + xn) jn, the mean of a
138
statistics
sample of n froih a population <j>{x) a n d mean (x. becomes 6(x) = I S €(xi) = i . n£(x) = =i n
.
So (7.7.2) (7.7.3)
Thus the mean value of the m e a n of all possible samples of n is the mean of the population sampled: or the mean of the sampling distribution of x is the population mean. If n = 1, t h a t this is so is obvious, for, taking all possible samples of 1, the distribution of t h e sample mean is identical with t h e distribution of t h e individual items in the population. (7.7.3) is also t h e justification of t h e common-sense view t h a t t h e average of a number of measurements of the " length " of a rod, say, is a better estimate of t h e " length " of t h e rod t h a n a n y one measurement. W h a t of t h e variance of x ? We write jx,- for t h e mean of Xi, (xx for t h a t of X, a n d a,-2 a n d aj2 for t h e variances of a n d X, respectively. Also let pg be the coefficient of t h e correlation between a n d Xj (i ^ /), assuming such correlation t o exist. , Then (X - (x x ) 2 = (.ZMxi
- Hi)) .
or, if
n n n = 2 ai*(xi - (x,)2 + 2 2
i * j, Oi
\n)(Xj
-
w
)
i= 1
So
- (x^)2] =
2 afE^Xi — (X()2] *=1
n
+
n
2
2
0{(lj£[(Xi
- (Xi)(*f — W)]
But £{(Xi
- |X<)«] =
Also, when i
£(Xi>
- 2(Xi*< + (Xi2) =
e(Xi')
- |Xi2 = <Ji2
j,
£[{*• - V-i)(Xj - w)] = £(xtx, — [nXj - \ijXi + (XiW) =
€{XiXj)
=
cov
' fax,)
(Xi W =
oflfa
Hence ox* = 2 ai'm* +
2
2 a a p w j p f t . (i # j)
(7.7.4)
sample
and
p o p u l a t i o n .ii139
This is an important formula in its own right. If, however, the variates are independent, then py = 0, for all i, j, and a* 2 = W
+ « 2 V + . . . + O n V = S chW i= l (7.7.4 (a)) Again putting a{ = 1 /«, for all i, and subjecting the x's to the condition t h a t they all have the same probability density
or
0j
a
= -4= VI
•
(7.7.5)
Thus : The variance of the distribution of the sample mean is 1 /nth that of the variance of the parent population, n being the size of the sample. In other words, the larger the sample, the more closely the sample means cluster about the population mean value. The standard deviation of the sampling distribution of x is usually called the Standard Error of the Mean, and, in general, the standard deviation of the sampling distribution of any statistic is called the Standard Error of t h a t statistic. The above results hold for any population, no m a t t e r how the variate is distributed. However, it is known that, whatever the population sampled, as the sample size increases, the distribution of x tends towards normality; while, even for relatively small values of n, there is evidence t h a t the ^-distribution is approximately normal. 7.8. The Distribution of x when the Population Sampled is Normal. Consider a normal population defined by >(x) =
exp [ - (* -
Here, we recall, [/. is the population mean and <j!, t h e population variance. The mean-moment generating function for a normal distribution of variance o a is Mm(t) = exp (|a a < 2 ), (5.4.2), and the function generating the moments about the origin is M(t) ^ Mm(t) exp (n<) = exp (ytt + &H*). Now, remembering t h a t the m.g.f. for a n y distribution is M(t) = f ( e x p xt), the m.g.f. of the mean, x, of the n independent
140
statistics
variates X{, (t = ' 1, 2, 3, . . . n), each with probability function 4>(x), is n , n i £ ( e x p xt) — £ (exp S Xit/n) = £ ( II exp Xit/n) «=l „ / = II [£ (exp Xit/n)] i= l But, since (xv x2, . . . xn) is a sample from
= £ (exp Xt) = £ (exp ( S OiXit) j = £ ( II (exp «(*<<)).
B u t the *'s are independent, and, so, Mx(t)
= n £ (exp OiXit) = n £ (exp x((ait)) = U M i i o i t ) = exp
S^'c'j/'J
which is t h e m.g.f. of a normal distribution with variance cz* =
S ofa* . . . .
(7.8.2)
sample
and
p o p u l a t i o n .ii141
i3i
Consequently, (i) if n — 2 and = a 2 = 1, the distribution of the sum, *i + *2> the two normal variates xlt x2, is normal about the common mean with variance ctj 2 + a 2 l ; and (ii) if n — 2 and a x = 1, a 2 = — 1, the distribution of the difference, — x2, of the two normal variates, xlt x 2 , is normal about the common mean with variance <ji2 +
0 where
fit)
= V2w JL exp ( -
if 2 ).
I t must be emphasised, however, t h a t here we are sampling a population whose variance is assumed known. When this is not t h e case, the problem is complicated somewhat and will be dealt with later. 7.9. Worked Examples. 1. The net weight of i~kg boxes of chocolates has a mean of 0-51 kg and a standard deviation of 0-02 kg. The chocolates are despatched from manufacturer to wholesaler in consignments of 2,500 boxes. What proportion of these consignments can be expected to weigh more than 1,276 kg net? What proportion will weigh between 1,273 and 1,277 kg net? Treatment: We axe here drawing samples of 2,500 from an assumed infinite population. The mean net weight of boxes in a consignment weighing 1,276 kg will be 1,276/2,500 kg = 0-5104 kg. The standard error of the mean net weight for samples of 2,500 will be 0-02/V^500 = 0-0004 kg. Thus in this case t = 0-0004/ 0-0004 = 1. The probability that the sample mean will deviate
142
statistics
from the populati6n mean by more than this amount is P(t > 1) = 0-5 — P(0 < t sj 1), for this is a " one-tail " problem. P(t ^ 1) = 0-3413. Therefore P(t > 1) = 0-1587. Therefore just under 16% of the consignments of 2,500 boxes will weigh more than 1,276 kg. If a consignment weighs 1,273 kg, the mean weight of a box in that consignment is 0-5092. The deviation from the mean is then — 0 0008, or, in standardised units, —2. If the consignment weighs 1,277 kg, the corresponding mean weight is 0-5108 kg, a deviation from the population mean of + 2 standard errors. The probability that a consignment will weigh between these two limits is then—this being a " two-tail " problem— P ( - 2 < t < 2) = 2P(0 < l < 2 ) = 2 x 0-4772 = 0-9544 In other words, just over 95% of the batches of 2,500 boxes will lie between the given net weights. 2. The " guaranteed " average life of a certain type of electric light bulb is 1,000 hours with a standard deviation of 125 hours. It is decided to sample the output so as to ensure that 90% of the bulbs do not fall short of the guaranteed average by more than 2-5%. What must be the minimum sample size ? Treatment: Let n be the size of a sample that the conditions may be fulfilled. Then the standard error of the mean is 125/V «. Also the deviation from the 1,000-hour mean must not be more than 25 hours, or, in standardised units, not more than 25/(125/V») = Vnj5. This is a " one-tail" problem, for we do not worry about those bulbs whose life is longer than the guaranteed average. P(t > «i/5) = 0-1 and, so, P(0 < t < = 0-4 Using Table 5-4, we find that t = 1-281. Therefore, »i/5 = 1-281 or n = 40-96. Consequently, the required minimum sample size is 41. 3. The means of simple samples of 1,000 and 2,000 are 67-5 and 68-0 respectively. Can the samples be regarded as drawn from a single population of standard deviation 2-5 ? Treatment: Just as the sample mean has its own distribution, so, too, does the difference between the means of two samples of a given size. If x1 and x% are two independent variates distributed about mean /i„ /i, with variancesCT12,CT22irespectively, let X = x1 — x%. Then X = /x, — ft2. and, by 7.7.4 (a), aj = or,2 +
sample
population.
and
ii
143
2.5. This, on the assumption that X is approximately normally distributed, is most unlikely. We therefore reject the hypothesis that the samples are from a single population of standard deviation 2.5. 7.10. Sampling Distribution of the Mean when Sampling is without Replacement from a Finite Population. Let the population sampled consist of N items. Let the sample size be n. The number of samples of n t h a t can be drawn without replacement from N is ^ j . one value of the variate,
say,
I n these samples any figures
j j times, for
if Xi is chosen, there are but ^ ^ j ways of forming samples of n. Let the population mean be zero. Then 1v 2 X{ = 0. If m, is the mean of the / t h sample, let m be the < —1 mean value of all possible values of ntj. Denoting the sum of the items in the yth sample by ( L , we have \ i- 1 >) t " M W , =
\
nV"/
/N\ and
(,~i*V,
Consequently,
H W . n
,
/N — l\
s
(») fN — 1\ since each Xi occurs I _ j ) times and so n 2 mj = 0, i.e., m = 0. Thus the mean of the means of all possible samples is the mean of the parent population. If o 2 is t h e variance of the population, we have, taking the s population mean as origin, Na 2 = S <- l Moreover, if <jm* is the variance of the sample mean, /N\
V"/ 1 r/N
1V"/ / — 1\
11
B
\1 /N — 2\
-I
144
statistics
But
I 2 xA \«=j /
=
A
and, since
A
2 t= 1
or
2 Xi" + 2 2 (*<*,), (i ^ j), <=] < j
= 0, 2 2 (x&j) = — 2 t ; I 1
<s* N - n N\ ~~ n ' iV - 1 ' n! (N - n)l „2 AT — « . . . (7.10.i)
CTm2=_.__
If we let AT—^-qo , o m 2 — a * in, showing that when a, population is very large, sampling without replacement approximates to simple sampling. 7.11. Distribution of the Sample Variance. Let s 2 be the variance of a sample of n and let x be the sample mean. Then in 1 n l n /« \2 s a = - 2 (*,• - x)* = - 2 Xi* - X* = - 2 Xi* - ( 2 Xiln) L
i
"<<=] =
'
- n^
s i=1 Consequently n
sis')
=
S
i=l
^
'
-
^n S
S
t j
(XiXj),
(i *
j)
- i71 s 2 (XiXj), (i * j) i i
2 ^ )
-
2 iXiXj)], (i * j)
But there are n ways of chosing Xi from n values and, once xt is chosen, there are (n — 1) ways of chosing Xj so that Xj xt. Also, since and Xj are independent e ( x z (XiXj)) = 2 2 [£(xixj)] = 2 2 [£(Xi) . 6(Xj)] \ i j ' i j i J = 2 2 (£(#))* = 0 • i since 6(x) = (jlx' = 0. Therefore
ji,', [i s ', and [j.2 = a 2 , being population parameters.
sample
and
population.
ii
145
Thus the mean value of the variance of all possible samples of n is (n — 1) /n times the variance of the parent population; and, as we should expect, £(s2) - > a 2 , as the sample size increases indefinitely. Thus if we draw a single sample of n from a population and calculate the variance, s a , of t h a t sample, we shall have £(nsi/(n
-
1)) = £((xi - *)*/(» -
1)) = a 2 .
(7.11.2)
In other words : If we calculate I (Xi — x)*/(n — 1), instead of the usual i 2 {Xi — x)"ln, we have an unbiased estimate of the populai
tion variance, <j*. Of course, the actual figure calculated from the data of a single sample will, in general, differ from the actual value of 2 a . But, if we continue sampling and calculating ns*l(n — 1), we shall obtain a set of values, the mean value of which will tend to the actual value of a 2 as the number, AT, of samples of n drawn increases. A function of x and n which in this way yields, as its values, unbiased estimates of a population parameter is called an unbiased estimator of t h a t parameter. Thus if 6 is a population parameter and 0 (read " theta A cap ") is an estimator of 0, 0 is an unbiased estimator if £($) = 0. m / 33 L Xijn is an unbiased estimator of (jl, the population mean, i since £(m.i) = jx; on the other hand, s a = S (Xi — x)21n is a i
biased estimator of or2. For this reason, some writers define the variance of a sample to be S f i ( x t — x)2/(n — 1), S f i = n; i i with this definition, the sample variance is an unbiased estimate of the population variance. Although we have not adopted this definition here, we are introduced by it to an important notion-—that of the degrees of freedom of a sample. Let x be the mean of a sample of n. Then nx = S xu and, i for a given x, there are only n — 1 independent for when we have selected n — 1 items of the sample, the n t h is necessarily determined. S = nx is a linear equation of constraint on the sample. ' If there are p( < n, of course) linear equations of constraint on the sample, the number of degrees of freedom of the sample is reduced by p. We must now ascertain the standard error of the sample
146
statistics
variance. Once- again we confine our attention to the case when the population sampled is normal with zero mean and variance a2. What is the probability that a sample (xu x2, . . . xn) from such a population is such that its mean lies between x ± and a standard deviation lying between s ± ids ? Since the n x's are independent yet are drawn from the same population, the probability that the n values of x shall lie simultaneously between x1 ± \dxx, x2 ± idxt, dp =
exp ( - ( V + V
. .
± idx„ is
+ . . . + *„*))2csi)dx1dx2 . . . dxn (7.11.3)
Now think of (xlt x2, . . . xn) as the co-ordinates of a point P in a space of n-dimensions. Then dxxdx2 . . . dx„ is an element of volume in that space. Call it dv. Then dp is the probability that the point P shall lie within this volume element. If now we choose this volume element dv in such a way that any point P, lying within it, represents a sample of n with mean lying between x ± \dx and a standard deviation between s ± -Jds, dp will be the probability that our sample has a mean and standard deviation lying between these limits. Our problem therefore is to find an appropriate dv. Now we have the two equations, n n £ Xi = nx and 2 — x)2 — ns2. t=i Each of these equations represents a locus in our w-dimensional space.
If n were equal to 3, the equation
n
2 X{ 1=1 «= nx may be written (xl — x) + ( z — x) + (xa — x) = 0 and represents a plane through the point (x, x, x). Moreover, the length of the perpendicular from the origin on to this plane is 3^/3i = x . 3i and, so, the perpendicular distance between this plane and the parallel plane through (x + dx, i + dx, x -(- dx) is dx . 3i. In the w-dimensional case, the n . equation 2 Xi = nx represents a " hyperplane ", as it is i-1 called, and the " distance " between this " plane " and the " parallel plane " is dx . x
n
Again, if « = 3, the equation
2 (Xi — x) 1 = ns' becomes <-1
sample
and
population.
i
147
(#! — x)2 -)- (x2 — x)2 + (x, — x)2 = 3s2, and thus represents a sphere, with centre (x, x, x) and radius s . 3i. The plane B 2 Xi = 3x passes through the centre of this sphere. The i= 1 section will therefore be a circle of radius s . 31, whose area is proportional to s3. If s increases from s — %ds to s 4- J\ds, the increase in the area of section is proportional to d(s2). So the volume, dv, enclosed by the two neighbouring spheres and the two neighbouring planes will be proportional to dx . d(s2). In the w-dimensional case, instead of a sphere, we have a " hypersphere " of " radius " s . ni and this " hypersphere " is cut by our " hyperplane " in a section which now has a " volume " proportional to s n _ 1 . So, in this case, dv is proportional to dx . d{sn~x) = ksn~2dsdx, say. Consequently, the probability that our sample of n will lie within this volume is given by d p
= (1S^STi
ex
P { - i , f ,
X i
* }
sn
~2dsd* '
<7'
1L3(a))
n
2 (xt — x)2 = ns2 may be written i=1 2 X{2 — n(s2 -f x2) and, therefore, i= 1 dp = kt exp (— nx2/2a2)dx x X k2 exp ( - ns2l2v2)(s2)C-3)l2d(s2) . (7.11.4) where k1 and kt are constants. 1 1 Determination of k1: Since r + °o kl exp (— nx2 ji^dx = 1 But the
equation
c
we have immediately (5.4(e) footnote) Aj = (2iro2/n)~i. Determination of Aa : s'* varies from 0 to 00 ; therefore
r
Aa exp ( - «s2/2a2)(sa)<"~3"2d(s2) "0 Put «sa/2<72 = x\ then ht f ( 2 o a / « ) ( " - e x p ( - x)x^-W2-1dx = 1 0 But since, by definition (see Mathematical Note to Chapter Three),
I'
exp ( - x)x<— W- 1 ** = r « » - l)/2) A, = (»/2os)<—1>'2/r((n - l)/2).
i48
s t a t i s t i c s
We see imme'diately that, when the population sampled is normal: (i) the mean x and the variance s 2 of a sample of n are distributed independently; (ii) the sample mean is distributed normally about the population mean—taken here at the origin—with variance a2In-, and (iii) the sample variance s 2 is not normally distributed. The moment-generating function for the ^-distribution is M{t) s £ (exp ts*). Thus
Af(<) = k2j
exp (— «s a /2a a ) exp (fe2)(s2)("-3)/2cZ(s2)
we have exp ( - ^ h W " ^ ) " 0 But, by 3.A.3., exp ( - X * ) . ( X * ) ( ~ ) ~ l d ( X * ) = xf
I
...
=
.
.
1
^
2
)
(7.11.5)
The coefficient of t in the expansion of this function is the first moment of s 2 about the origin : (n — 1)<j2/m, as already established. The coefficient of <2/2 ! is the second moment of s 2 about the origin: («a — 1 )o4/«2. Hence var (s2) = (w2 - l)o*/wa - (n - l ) V / « 2 = 2(n - l ^ / w 2 For large samples, therefore, var (s2) === 2a* jn. In other words, the standard error of the sample variance for a normal parent population is approximately
^ for large samples.
sample
and
population.
ii
149
7.12. Worked Example. If s, 2 and s22 are the variances in two independent samples of the same size taken from a common normal population, determine the distribution of Sj2 + s s 2 . (L.U.) Treatment: The moment-generating function of s 2 for samples of n from a normal population (0, a) is (7.11.5) M(t) = (1 - 2o>»)-<"-»>'* Hence £(exp tsts) = £(exp tst2) = M(t) But, since the samples are independent, £ (exp t(st2 + s22)) = £[(exp is,2) (exp te22)] = [Af(f)]2 Hence the m.g.f. for the sum of the two sample variances is (I - 2o«//»i)-<»-»). Expanding this in powers of t, we find that the mean value of 2 2 2 s, + sj is 2(n — l)cr /«—the coefficient of til I—and var(s, 2 + j, 2 ) = 4n(n - l)a*ln2 — 4(n - 1 ) V / » 2 = 4(n - 1)CT4/«2. which, for large n, is approximately equal to 4a'/«. The probability differential for s, 2 + s s 2 is dp = '^^"i)'
exp ( - n(s* + s12)l2o2)(s12 + st2)»-2d(si2
+ st')
EXERCISES ON CHAPTER SEVEN 1. Using the table of random numbers given in the text, draw a random sample of 35 from the " population " in Table 5.1. Calculate the sample mean and compare it with the result obtained in 7.5. 2. A bowl contains a very large number of black and white balls. The probability of drawing a black ball in a single draw is p and that of drawing a white ball, therefore, 1 — p. A sample of m balls is drawn at random, and the number of black balls in the sample is counted and marked as the score for that draw. A second sample of m balls is drawn, and the number of white balls in this sample is the corresponding score. What is the expected combined score, and show that the variance of the combined score is 2mp(\ — p). 3. Out of a batch of 1,000 kg of chestnuts from a large shipment, t is found that there are 200 kg of bad nuts. Estimate the limits between which the percentage of bad nuts in the shipment is almost certain to lie. 4. A sample of 400 items is drawn from a normal population whose mean is 5 and whose variance 4. If the sample mean is 4-45, can the sample be regarded as a truly random sample ? 5. A sample of 400 items has a mean of 1-13; a sample of 900 items has a mean of 1-01. Can the samples be regarded as having been drawn at random from a common population of standard deviation 0-1 ?
150 s t a t i s t i c s
6. A random variate x is known to have the distribution p(x) = c(l + xja)m-1 exp (— mxja), - a < x Find the constant c and the first four moments of x. Derive the linear relation between /J, and /S2 of this distribution. (L.U ) 7. Pairs of values of two variables * and y are given. The variances of x, y and (x — y) are
I p(x)dx = 1. Transform by using substitution J-a 1 + xja = tjm. c — mme-mlaV{m); mean-moment generating function is t~"\ 1 - - J ; = a- ^j,2! = a a /2m\ m/3! - a»/3m»; 2 M„/4! = a*(m + 2)/8m ; 2JS, - 3/3, = 6.
c h a p t e r
e i g h t
SAMPLE AND POPULATION I I : t, z, AND F 8.1. The ^-distribution. We have seen t h a t if x is the mean of a sample of n from a normal population (|x, a) the variate t ss (x —
(i)/a/Vn
is normally distributed about zero mean with unit variance. But what if the variance of the parent population is unknown and we wish to test whether a given sample can be considered to be a random sample from t h a t population ? The best we can do is to use an unbiased estimate of a based on our sample of n, i.e., s{n/(n — l)}i, s being the sample variance. But if we do this, we cannot assume that t = (x - y.)(n -
1 )*/s
.
.
.
(8.1.1)
is a normal variate. In fact, it is not. W h a t is the distribution of this t (called Student's t) ? Let the population mean be taken as origin, then t =(n-
1 )*x/s
.
.
.
(8.1.1(a))
Since we m a y write t = (n — l)i{(x/a)/(s/a)}, t, and, therefore, the /-distribution, is independent of the population variance— a most convenient consequence which contributes greatly t o the importance of the distribution. If we hold s constant we have sdt = (« — 1 )*dx. Now x is normally distributed about zero with variance n~l (since t is independent of a, we may take a = 1) and, as we showed in the last chapter, x and s3 are statistically independent, when t h e parent population is normal. Thus dp(x) = (»/2?t)* exp (— nx*!2)dx Consequently, the probability differential of t for a constant s 1 may be obtained from dp(x) by using 8.1.1(a) and the relation sdt — (n — 1 )idx; we have dp{t, constant s1) = [n/2n(n — l)]*s exp [— nsH*/2(n — l)]
152
statistics
If now we multiply this by the probability differential of j* a n d integrate with respect to s 2 from 0 to <x>, we obtain t h e probability differential of t for all s2. By (7.11.4), 6Xp =
dp(t) = dt f
[nJ2n{n — 1)]} exp [ - ns2t2/2(n — 1)] s x
0
(w/2)l»-i)/2 r [ ( n - l)/2] e x p _ (n/2)(»- M[n/2n(n - l)]t V[{n - l)/2] X
™*/2)(s2)(»-Wd(s2) X
X dt f (sa)<» - 2)/2 e x p [ - ms2{1 + t2/(n 'o Putting s 2 = 2w/w[l + t2l{n — 1)], we have
l}/2]d(s ! )
f [ ( * - l)/2] X (2/w)"' 2 [1 + t2l(n - 1 )]-nl2dt I V e x p 0 r(«/2)
(-
" VTC(w — l)*r[(w — l)/2] • f 1 +
<*'
-
^since I j / n , 2 > _ 1 exp ( 0
1 ] B/
>"
= r(w/2)j
(since vtc = r(^)) or <#(<) =
1
p { ~ 1t \ - 2 T V
B
.
Vn
— 1
+ t*/(n - l)]-»/2dt (8.1.3)
We see at once t h a t t is not normally distributed. If, for instance, n = 2, we have dp(t) = (1 /tc) (1 + t2)-1, which defines what is known as the Cauchy distribution, a distribution departing very considerably from normality, the variance, for example, being infinite. However, using Stirling's approximation (5.2.7), it can be shown, as the reader should verify for himself, t h a t B[(n - 1 )/2, 1/2] . (n - 1)* tends to Vsfc a s n —y oo ; a t the same time, since [1 + t2j(n — l)]-"/ 2 m a y be written [(1 + <*/(» - ! ) ] - « " - D x (1 + t2j(n - !))]-*, while
sample
and
population.
m
ii
s
(1 + x/m) —> exp x as m —> ®, (1 + / /(» — exp ( - <s/2).
153
~
1}
—>
Thus the t-distribution approaches normality f \ exp (—t'/2)) as n increases. w 2 t c It is customary to put v = (n — 1), the number of degrees of freedom of the sample, and to write the probability function of t for v degrees of freedom thus
In his Statistical Methods for Research Workers, Sir R. A. Fisher gives a table of the values of | 11 for given v which will be exceeded, in random sampling, with certain probabilities (•P)•
Fisher's P is related to Fv(t) by P = 1 — 2 J
Fv(t)dt.
8.2. Worked Example. A random sample of 16 values from a normal population is found to have a mean of 41-5 and a standard deviation of 2-795. On this information is there any reason to reject the hypothesis that the population mean is 43 ? Treatment: t = 1-5 x 15*/2-795 = 2-078 for 15 degrees of freedom. Entering Table 8.1 at v = 15, we find that the probability of / > 1 -75 is 0-10 and of t > 2-13 is 0-05. Thus the probability that the population mean is 43 is over 0-05. On the information provided by the sample there is then no reason for rejecting the hypothesis. 8.3. Confidence Limits. Suppose, in the above example, we had wanted to find, from the sample data, the limits within which the population mean will lie with a probability of 0-95. We call these limits the 95% confidence limits of the population mean for the sample in question. To find these limits, we put 1
X 15 2^795 Entering Table 8.1 at v = 15, we find t h a t the value of | 11 which will be exceeded with a probability of 0-05 is 2-13. Hence L1
% 9 5 J i J ><
<
2 13
"
or 39-9 < (x < 43-1 Exercise : Show that the 98% confidence limits are 39-62 and 43-38.
154
statistics
Table
Values of \t \ for Degrees of Freedom Exceeded with Probability P in Random Sampling
8.1.
(Abridged, by permission of the author, Sir R. A. Fisher, and thi publishers, Messrs. Oliver and Boyd, from Statistical Methods frn Research Workers.) p.
0-50
0-10
0-05
0-02
0-01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1-000 0-816 0-765 0-741 0-727 0-718 0-711 0-706 0-703 0-700 0-697 0-695 0-694 0-692 0-691 0-690 0-689 0-688 0-688 0-687
6-34 2-92 2-35 2-13 2-02 1-94 1-90 1-86 1-83 1-81 1-80 1-78 1-77 1-76 1-75 1-75 1-74 1-73 1-73 1-72
12-71 4-30 3-18 2-78 2-57 2-45 2-36 2-31 2-26 2-23 2-20 2-18 2-16 2-14 2-13 2-12 2-11 2-10 2-09 2-09
31-82 6-96 4-54 3-75 3-36 3-14 3-00 2-90 2-82 2-76 2-72 2-68 2-65 2-62 2-60 2-58 2-57 2-55 2-54 2-53
63-66 9-92 5-84 4-60 4-03 3-71 3-50 3-36 3-25 3-17 3-11 3-06 3-01 2-98 2-95 2-92 2-90 2-88 2-86 2-84
25 30 35 40 45 50 60
0-684 0-683 0-682 0-681 0-680 0-679 0-678
1-71 1-70 1-69 1-68 1-68 1-68 1-67
2-06 2-04 2-03 2-02 2-02 2-01 2-00
2-48 2-46 2-44 2-42 2-41 2-40 2-39
2-79 2-75 2-72 2-71 2-69 2-68 2-66
00
0-674
1-64
1-96
2-33
2-58
v.
In general, for a sample of n with mean x and variance s%. 1*1 1 1
=
l£rufj. s
v
i
And if tp is the value of t with a probability P of being exceeded
sample
and
population.
ii
155
for v degrees of freedom, then the (1 — P)100% confidence limits for (x are: x — stp/v* < (Jt < x + stp/vi . . . (8.3.1) 8.4. Other Applications of the {-distribution. shown by Sir R. A. Fisher t h a t :
I t has been
If t is a variate which is a fraction, the numerator of which is a normally distributed statistic and the denominator the square root of an independently distributed and unbiased estimate of the variance of the numerator with v degrees of freedom, then t is distributed with probability function Fv(t). Problem : Given two independent samples of n1 and «2 values with means x and X, how can we test whether they are drawn from the same normal population ? We begin by setting up the hypothesis that the samples are from the same population. Let xlt (i = 1, 2, . . . «,), and Xt, (j = 1, n. _ 2, . . . n2), be the two samples. Then x = S x(jn1 and X n,
«=1
= E Xjjn 2 , while the sample variances are respectively j=i s, 2 = S (*, - ^)2/«! and saa = S (X, t = 1
X)'jna
3 = 1
These give unbiased estimates of the population variance, "i s i 2 /(»i — !) a n d »!S2a/(«a — !)• Now since £(nlSl* + n2ss2) = (», - l)a2 + (», - l)a2 = (», + n, - 2)o" a2 = (KiS,2 + « 2 5 s 2 )/(« l + « , - 2) . . (8.4.1) gives an unbiased estimate ofCT2based on the two samples, with " = n 1 + wJ — 2 degrees of freedom. If our hypothesis is true—if,_that is, our samples are from the same normal population, x and X are normally distributed about ft, the population mean, with variances a1jn1 and atjnt respectively. Therefore (7.9 Example 3), since the samples are independent, the difference, x — X, of their means is normally distributed with variance <j2(l /n 1 + 1 /» s ). It follows that <j2(l /«j + 1 /»8) is an unbiased estimate of the variance of the normally distributed statistic x — X, and therefore, in accordance with the opening statement of this section, [«i»,/(». +
•
•
• (8-4.2)
is distributed like t with * = « , + « , — 2 degrees of freedom.
156
statistics
8.5. Worked'Example. 1. Ten soldiers visit the rifle range two weeks running. first week their scores were 67, 24, 57, 55, 63, 54, 56, 68, 33, 43 The second week they score, in the same order :
Thi ;
70, 38, 58, 58, 56, 67, 68, 77, 42, 38
j i
Is there any significant improvement? How would the test bt affected if the scores were not shown in the same order each time f (A.I.S.) Treatment: 1st week score (x).
2nd week score (X). 70 38 58 58 56 67 68 77 42 38
67 24 57 55 63 54 56 68 33 43 520 (10*)
572 (10X)
X%
X - x
(X - *)»
3 14 1 3 -7 13 12 9 9 -5
9 196 1 9 49 169 144 81 81 25
4,489 576 3,249 3,025 3,969 2,916 3,136 4,624 1,089 1,849
4,900 1,444 3,364 3,364 3,136 4,489 4,624 5,929 1,764 1,444
52
764
28,922
34,458
(1) We assume there is no significant improvement, that, consei quently, both X and x are drawn from the same normal population and that, therefore, X — x is normally distributed about zero Then, regarding the 10 values of J? — x as our sample, we have s» s var (X - *) = S (X - *)»/» - (X - x)* - 76-4 - 27 04 = 49-36; and, therefore, s = 7-026. Hence t = (X - x)(n - 1 )i/s = 5-2 x 3/7-026 = 2-22 Entering Table 8.1 at v = 9 we find that the probability with whicl t = 2-26 is exceeded is 0-05, while the probability with which t = 1-83 is exceeded is 0-10. Therefore the result, while significant at th< 10% level, is not significant at the 5% level. We conclude, there fore, that there is some small evidence of improvement. (2) Had the scores not been given in the same order, we shouk have had to rely on the difference between the mean scores. W<
sample
and
population.
ii
157
again suppose that there has been no significant improvement and use the variate _ ' =
X
7 " • (njizl(nx
+
nx))i,
where = + n ^ 2 ) / ^ + nz - 2). In the present case nx = nx = 10, and we have 10s*2 = E* 2 - 10*2 and 10s z 2 = 10.X2. 2 2 10(5/ + sx ) = S* + S X' - 10(x* + X*) = 28,922 + 34,458 - 10(52» + 57-22) = 3,622. /. a 2 = 10(s/ + sj 2 )/18 - 201-2 or a = 14-18. Consequently,
t=
x (100/20) i = 0-82 for v = 18 d.f.
Entering Table 8.1 at v = 18, we find that there is a 0-5 probability that t will exceed 0-688 and a probability of 0-10 that t will exceed 1-73. Consequently, the result is not significant at the 10% level and there is no reason to reject the hypothesis that there has been no significant improvement. 2. In an ordnance factory two different methods of shell-filling are compared. The average and standard deviation of weights in a sample of 96 shells filled by one process are 1-26 kg and 0-013 kg, and a sample of 72 shells filled by the second process gave a mean of 1 -28 kg and a standard deviation of 0-011 kg. Is the difference in weights significant? (Brookes and Dick.) Treatment: Assuming that there is no significance in the difference of weights, _ 96 x (0-013)2 + 72 x (0-011)2 0 96 + 72 - 2 or 0 = 0-0125; \x - X | = 0-02 and (nxnxl(nx + nz)i = (96 x 72/168)* = 6-43. .-. | * | = 0-020 x 6-43/0-0125 = 10-29 for 166 degrees of freedom. Since v is so large in this case, we may assume that t is normally distributed about zero mean with unit variance. Then | t \ > 1 0 standard deviations and is, therefore, highly unlikely by chance alone. The difference in the weights is, then, highly significant. 8.6. The Variance-ratio, F. We now discuss a test of significance of t h e difference between t h e variances of two samples from the same population. Actually, if t h e sample variances axe such t h a t the two samples cannot have been drawn from t h e same population, it is useless t o apply the 2-test t o ascertain whether the difference between t h e means is significant, for we assume in establishing t h a t test t h a t t h e
158
statistics
samples are in' fact from the same population. Thus, the present problem is logically the more fundamental. Problem : A standard cell, whose voltage is known to be 1-10 volts, was used to test the accuracy of two voltmeters, A and B. Ten independent readings of the voltage of the cell were taken with each voltmeter. The results were : A . M l 115 114 110 109 1 11 112 115 113 114 B . 1 12 1 06 1 02 1 08 1 11 1 05 1 06 1 03 1 05 1 08 Is there evidence of bias in either voltmeter, and is there any evidence that one voltmeter is more consistent than the other? (R.S.S.) We already know how to tackle the first part of the problem (see at the end of this section), but what about the second part ? The consistency of either meter will be measured by the variance of the population of all possible readings of the voltmeter, and this variance will be estimated from the ten sample readings given. Thus we have to devise a test to compare the two estimates. This has been done by Sir R. A. Fisher, whose test is : If u* and v2 are unbiased estimates of a population variance based on nx — 1 and n2 — 1 degrees of freedom respectively (where rii and n2 are the respective sample sizes), then by calculating z = i log, (u'/v') and using the appropriate tables given in Statistical Methods for Research Workers, we can decide whether the value of this variance ratio, u*lvs, is likely to result from random sampling from the same population. Let xu (i = 1, 2, 3, . . . n x ), and Xj, ( j = 1, 2, 3, . . . n?), be two independent samples with means x and X respectively.* Unbiased estimates of the population variance are : u*
=
m 1 s 1 2 / ( m 1 — 1),
and v2 = n^s^Kn^ —
1),
2
where s ^ and s 2 are the respective sample variances. If V 1 = n l — 1. V | S « , - 1 , V + 1) a n d = v2t/»/(v2 + 1) . (8.6.1) Now the sample variance, s s , has the probability differential (7.11.4) n-l <*p(s*) =
(w/2j2)
2
exp ( - ns'/2a>) ( s ' ) ~ d(s')
s a m p l e
a n d
p o p u l a t i o n .
ii
159
Substituting from (8.6.1), the probability differential of m2 is -.-a dp(u2) = [ K /2<ja)">/21r(vj/2)] (tta) 2 exp ( ^u'l2a2)d(u2) (8.6.2)
But m and v are independent; therefore t h e joint probability differential of u2 and v2 is
X exp [ - (viK2 + v2v2)/2e2]d(u*)d(v2)
.
(8.6.3)
.
(8.6.5)
Now let 2 s l n («/y) = J i n (M2/«2) 2
2
. 2
2
Then u — v exp (2z), and, for a given v, d(u ) = 2v exp (2z)dz. Therefore
X exp [ - (vj«2* + v2)u2/2a2](t>2)
v, + v, - 2 2 d(v2)dz .
(8.6.6)
To find t h e probability differential of z, we integrate this with respect to v2 between 0 and oo, obtaining 2(v 1 /2qy' 2 (v 1 /2
/
x
n, -t- y, - 2 22
a
a
2
u exp[— (vjtf + v2)^ /2CT ](y )
2
<2(y2)
Recalling t h a t r T(n) — / xn'1exp 0
(—
we put
(vtefc 4- v2)
1 x = ^ (v^ 22 -f v2)v2.
2
160
statistics
This defines Fisher's ^-distribution, which, it should be noted, is independent of the variance of the parent population. How do we use it ? The probability, P(z Z), t h a t z will be not greater than some
(Z
given value Z for v t , v2 degrees of freedom is j dp(z). Then o z P{z > Z) = 1 - P(z < Z) = 1 — j dp(z). In his book o Fisher gives tables setting down the values of Z exceeded with probabilities 0-05 and 0-01 for given v t and v2. He calls these values, Zo.,,5 and Z0.01, the " 5% and 1% points " of z. To obviate the necessity of using logarithms, Snedecor (Statistical Methods, Collegiate Press, Inc., Ames, Iowa) tabulated the 5% and 1% points of the variance ratio, u2/v2, which he denotes by F, in honour of Fisher, instead of z = -J log,, F. Substituting F = u2 lv2, where u2 is the larger of the two estimates of the population variance, or F = exp (2z), in (8^6.7), we have v,"'V«/2
JPC-! - 2)/2
(ViF + v2) 2 I n the F-table, Table 8.6, the d.f. vx> of the larger estimate, give the column required, and v2, the d.f. of the smaller estimate, the row required. At t h e intersection of the appropriate column and row we find two figures : the upper figure is t h a t value of F exceeded with a probability of 0-05, t h e 5% point; the lower figure is t h a t value exceeded with a probability of 0-01, the 1% point. We may now return to t h e problem at the beginning of this section. 1
Writing
x = nFl("ip
+ vt)
°r
F
= ^/"ifi
-
x
)
we have
Hence
=
jf V - < 1
-
and the integral is B I (v,/2, v,/2). Thus P(x ^ X) is the Incomplete B-function Ratio, lz(v1j2, i/,/2) and can be found from the appropriate tables (see Mathematical Note to Chapter Three, D).
sample T a b l e 8.6.
5%
and
population.
ii
161
and 1 % Points for the Distribution of the Variance Ratio, F
(Adapted, by permission of the author and publishers, from Table 10.5.3 of Statistical Methods by G. W. Snedecor (5th Edition, 1956, pp. 246-249). 1
2
3
4
5
6
8
12
24
1
161 4052
200 4999
216 5403
225 5625
230 5764
234 5859
239 5981
244 6106
249 6234
254 6366
2
18-51 98-49
19-00 99-01
19-16 99-17
19-25 99-25
19-30 99-30
19-33 99-33
19-37 99-36
19-41 99-42
19-45 99-46
19-50 99-60
3
10-13 34-12
9-55 30-81
9-28 29-46
9-12 28-71
9-01 28-24
8-94 27-91
8-84 27-49
8-74 27-05
8-64 26-60
8-63 26-12
4
7-71 21-20
6-94 18-00
6-59 16-69
6-39 15-98
6-26 15-52
6-16 15-21
6-04 14-80
5-91 14-37
5-77 13-93
563 13-46
5
6-61 16-26
5-79 13-27
6-41 12-06
5-19 11-39
5-05 10-97
4-95 10-67
4-82 10-27
4-70 9-89
4-63 9-47
4-36 9-02
6
5-99 13-74
5-14 10-92
4-76 9-78
4-53 9-15
4-39 8-75
4-28 8-47
4-15 8-10
4-00 7-79
3-84 7-31
3-67 6-88
7
6-59 12-25
4-74 9-55
4-35 8-45
4-12 '•85
3-97 7-46
3-87 7-19
3-73 6-84
3-57 6-47
3-41 6-07
3-23 5-65
8
6-32 11-26
4-46 8-65
4-07 7-59
3-84 7-01
3-69 6-63
3-68 6-37
3-44 6-03
3-28 5-67
3-12 5-28
2-93 4-86
9
5-12 10-56
4-26 8-02
3-86 6-99
3-63 6-42
3-48 6-06
3-37 5-80
3-23 6-47
3-07 5-11
2-90 4-73
2-71 4-31
10
4-96 10-04
4-10 7-56
3-71 6-55
3-48 5-99
3-33 5-64
3-22 5-39
3-07 5-06
2-91 4-71
2-74 4-33
2-54 4-31
11
4-84 9-65
3-98 7-20
3-59 6-22
3-36 5-67
3-20 5-32
3-09 5-07
2-95 4-74
2-79 4-40
2-61 4-02
2-40 3-60
12
4-75 9-33
3-88 6-93
3-49 5-95
3-26 6-41
3-11 5-06
3-00 4-82
2-85 4-50
2-69 4-16
2-60 3-78
2-30 3-36
13
4-67 9-07
3-80 6-70
3-41 5-74
3-18 5-20
3 02. 4-86
2-92 4-62
2-77 4-30
2-60 3-96
2-42 3-59
2-21 3-16
14
4-60 8-86
3-74 6-61
3-34 5-56
3-11 5-03
2-96 469
2-85 4-46
2-70 4-14
2-53 3-80
2-35 3-43
2-13 3-00
15
4-54 8-68
3-68 6-36
3-29 5-42
3-06 4-89
2-90 4-56
2-79 4-32
2-64 4-00
2-48 3-67
2-29 3-29
2-07 2-87
16
4-49 8-53
3-63 6-23
3-24 5-29
3-01 4-77
2-85 4-44
2-74 4-20
2-59 3-89
2-42 3-55
2-24 3-18
2-01 2-75
17
4-45 8-40
3-59 6-11
3-20 5-18
2-96 4-67
2-81 4-34
2-70 4-10
2-55 3-79
2-38 3-45
2-19 3-08
1-96 2-65
18
4-41 8-28
3-55 6-01
3-16 5-09
2-93 4-58
2-77 4-25
2-66 4-01
2-51 3-71
2-34 3-37
2-15 3-00
1-92 2-57
19
4-38 8-18
3-52 5-93
3-13 5-01
2-90 4-50
2-74 417
2-63 3-94
248 3-63
2-31 3-30
2-11 2-92
1-88 2-49
\
00
162
statistics Table
8.6.
Continued
1
2
3
20
4-35 810
3-49 5-85
3-10 4-94
2-87 4-43
2-71 4-10
2-60 3-87
2-45 3-56
2-28 3-23
2-08 2-86
1-84 2-42
21
4-32 8-02
3-47 6-78
3-07 4-87
2-84 4-37
2-68 4-04
2-57 3-81
2-42 3-51
2-25 3-17
2-05 2-80
1-81 2-36
22
4-30 7-94
3-44 6-72
3-05 4-82
2-82 4-31
2-66 3-99
2-55 3-76
2-40 3-45
2-23 3-12
2-03 2-75
1-78 2-31
23
4-28 7-88
3-42 5-66
3-03 4-76
2-80 4-26
2-64 3-94
2-63 3-71
2-38 3-41
2-20 3-07
2-00 2-70
1-76 2-26
24
4-26 7-82
3-40 6-61
301 4-72
2-78 4-22
2-62 3-90
2-61 3-67
2-36 3-36
2-18 3-03
1-98 2-66
1-73 2-21
30
4-17 7-56
3-32 5-39
2-92 4-51
2-69 4-02
2-53 3-70
2-42 3-47
2-27 317
2-09 2-84
1-89 2-47
1-62 201
40
4-08 7-31
3-23 5-18
2-84 4-31
2-61 3-83
2-45 3-61
2-34 3-29
2-18 2-99
2-00 2-66
1-79 2-29
1-61 1-80
60
4-00 7-08
315 4-98
2-76 4-13
2-52 3-66
2-37 3-34
2-25 3-12
2-10 2-82
1-92 2-50
1-70 2-12
1-39 1-60
120
3-92 6-85
3-07 4-79
2-68 3-95
2-46 3-48
2-29 3 17
2-17 2-96
2-02 2-66
1-83 2-34
1-61 1-95
1-25 1-38
00
3-84 6-64
2-99 4-60
2-60 3-78
2-37 3-32
2-21 302
2-09 2-80
1-94 2-51
1-76 2-18
1-52 1-79
1-00 1-00
"l-
\
4
Note
5
to
Table
6
8
12
24
00
8.6
(1) To find the 5% and 1% points for values of Vi or vt not given in the above table, when Vj > 8 and v, > 24 we proceed as illustrated below: (а) To find the 5% point of F when = 200, v, = 18, enter the Table at v, = 18. The 8% point for = 24 is 2-16; the 5% point for y, = °o is 1-92. Divide 24/24 = I; divide 24/ao = 0; divide 24/200 = 0-12. The difference between the two given 6% points is 0-23. 0-12 of this difference, 0-12 x 0-23 = 0-0276. We add this to 1-92, obtaining 1-95, correct to two decimal places. (б) To find the 1% point of F when = 11, y, = 21, enter the Table at v, = 21. The 1% point when », = 8 is 3-51 and when i>l = 12 is 3-17. 24/8 = 3; 24/12 = 2; 24/11 = 2-18. The difference between the two known 1% points is 3-61 - 3 17 = 0-34. 0-18 x 0-34 = 0-06. Hence the required 1% point is 3-17 + 0-06 = 3-23. (c) To find the 8% point of F when k, = 4, •>, = 65, enter the Table at * i = 4. The 5% point for i>t = 40 is 2-61; the 6% point for i>, = 60 is 2-82. 120/40 = 3; 120/60 = 2; 120/58 = 2-18. 2-61 - 2-52 = 0-09. 0-18 X 0-09 = 0-016. The required 5% point is 2-52 + 0-016 = 2-54 correct to two decimal places. (1) To find the 1% point of F when v1 = 12, y, = 500, enter the Table at >•, = 12. The 1% for v, = 120 is 2-34; the 1% point for >•, = oo is 2-18. 120/120 = 1; 120/oo = 0; 120/600 = 0-24. 2-34 - 2-18 = 0-16. 0-24 X 0-16 = 0-038. The required 1% point is 2-18 + 0-038 = 2-22, correct to two decimal places. (2) If we make the substitution F = t1 in (8.6.8), simultaneously putting f l = 1 and i, = v, we find that the probability differential of F transforms into that for 1. Thus we may use the F-tables to find the 5% and 1% points of /. They are, in fact, the square roots of the 5% and 1% points of F for v, •= 1. (See also 10.3).
sample
and
population.
ii 163
We tabulate the working as follows : Reading of voltmeter A (X).
(x - 1-10).
Ill 1-15 114 1-10 1-09 1-11 1-12 115 1-13 1-14
(x - 1-10)2. 0-0001 0-0025 0-0016
0-01 0-05 0-04 —
x = = sx2 = = sx =
—
-0-01 0-01 0-02 0-05 0-03 0-04
0-0001 0-0001 0-0004 0-0025 0-0009 0-0016
0-24
0-0098
110 + 0-24/10 1-124 2 (x - l-10)a/10 - (x - MO)1 0-000404 0-0201
Reading of voltmeter B (X).
(X - 1-10).
(X - 1-10)1.
1-12 1-06 1-02 1-08 111 1-05 1-06 1-03 1-05 1-08
0-02 -0-04 -0-08 -0-02 0-01 -0-05 -0-04 -0-07 -0-05 -0-02
0-0004 0-0016 0-0064 0-0004 0-0001 0-0025 0-0016 0-0049 0-0025 0-0004
-0-34
0-0208
X = 1-10 - 0-034 =
Sj.»
1-066
= 0-00208 - (0-034)2 = 0-000924 = 0-0304
164
statistics
For voltmeter A: \t | = 0-024 x 9^/0-0201 = 3-68. Entering Table 8.1 at v = 9 we find that the value of t exceeded with a probability of 0-01 is 3-25. The result is therefore significant at the 1% level. Since the value of t here is positive, the voltmeter A definitely reads high. For voltmeter B: | t | = 0-0340 x 9*/0-0304 = 3-36. Once again the value of t is significant at the 1% level and we conclude that, since t is here negative, the voltmeter reads low. To test whether there is evidence that one voltmeter is more consistent than the other, we set up the null hypothesis that there is no difference in consistency. In other words, we assume that the samples are from populations of the same variance. F = u2jv2, where u2 > v2 and u2 and v2 are unbiased estimates of the population variance based on, in this case, the same number of degrees of freedom, 9. Since the samples are of equal size, we have F = sx2js2 = 0-000924/0-000404 = 2-29. Entering Table 8.6 at v2 — 9, we read that the 6% point of F for = 8 is 3-23, while that for = 12 is 3-07. 24/8 = 3; 24/12 = 2; 24/9 = 1-67. 3-23 - 3-07 = 0-16. 0-16 x 0-67 = 0-107. Therefore the 5% point of F for Vl = va = 9 is 3-177 = 3-18, correct to two decimal places. The value of F obtained, 2-29, is, therefore, not significant at the 5% level, and we have no reason to reject the hypothesis that there is no difference in consistency between the two voltmeters. EXERCISES ON CHAPTER EIGHT 1. A sample of 14 eggs of a particular species of wild bird collected in a given area is found to have a mean length of 0-89 cm and a standard deviation of 0-154 cm. Is this compatible with the hypothesis that the mean length of the eggs of this bird is 0-99 cm? 2. A group of 8 psychology students were tested for their ability to remember certain material, and their scores (number of items remembered) were as follows : A B C D E F G H 19 14 13 16 19 18 16 17 They were then given special training purporting to improve memory and were retested after a month. Scores then : A B C D E F G H 26 20 17 21 23 24 21 18 A control group of 7 students was tested and retested after a month, but was given no special training. Scotes in two tests : 21 19 16 22 18 20 19 21 23 16 24 17 17 16 Compare the change in the two groups by calculating t and test whether there is significant evidence to show the value of the special training. Do you consider that the experiment was properly designed? (R.S.S.)
sample
and
population.
ii
165
3. A sample of 6 values from an unknown normal population : 20, 25, 24, 28, 22, 26. Another sample of 5 values : 21, 24, 27, 26, 25. Show that there is no good reason to suppose that the samples are not from the same population. 4. Two marksmen, P and Q, on 25 targets each, obtained the scores tabulated below. Ascertain whether one marksman may be regarded as the more consistent shot. Score . . 93 94 95 96 97 98 99 100 Total Freouencv( P 2 1 4 0 5 5 2 6 25 frequency-^ q 0 2 2 3 3 8 5 2 25 (I.A.) 5. Latter has given the following data for the length in mm of cuckoo's eggs which were found in nests belonging to the hedgesparrow (A), reed-warbler (B) and wren (C): Host. A 22-0, 23-9, 20-9, 23-8, 25-0, 24-0, 21-7, 23-8, 22-8, 23-1, 23-1, 23-5, 23-0, 23 0 B 23-2, 22-0, 22-2, 21-2, 21-6, 21-6, 21-9, 22-0, 22-9, 22-8 C 19-8, 22-1, 21-5, 20-9, 22 0, 21-0, 22-3, 21-0, 20-3, 20-9, 22 0, 20-0, 20-8, 21-2, 21-0
Is there any evidence from these data that the cuckoo can adapt the size of its egg to the size of the nest of the host? Solutions 1. Not significant at 0-02 level. 2. Evidence of improvement in test group highly significant; that of control group highly insignificant. Initial scores in control group too high for control to be useful. 4. F not significant at 0 05 point, i.e., although there is evidence that Q is more consistent this could arise from random variation alone.
CHAPTER
NINE
ANALYSIS O F VARIANCE 9.1. The Problem Stated. The test of significance of the variance-ratio, F, described in the previous chapter encourages us to embark on a much wider investigation, t h a t of the analysis . of variance. This important statistical technique has been defined by its originator, Sir R. A. Fisher, as " The separation of the variance ascribable to one group; of causes from the variance ascribable t o other groups " (Statistical Methods for Research Workers, Eleventh Edition, ; 1950, p. 211). Suppose t h a t from a large herd of cows we pick fifty animals a t random and record the milk-yield of each over a given period. These fifty amounts (in litres, say) are a sample of fifty values of 6ur variate. Now t h e herd may consist, perhaps, of five different breeds—Ayrshire, Jersey, etc.—and we want to find an answer to the following problem, using the evidence provided by our sample : Does milk-yield vary with the breed of cow ? other words, are milk-yield and breed connected ?
Or, in
As a first step towards answering this question, it would be. reasonable to divide our sample into five sub-samples or] classes, according t o breed. Then if we could split up the total variance of our sample into two components—-that due? to variation between t h e mean milk-yields of different breeds and t h a t due t o variation of yield within breeds, we could! subject these two components to further scrutiny. ! To do this we first set up the null hypothesis that the factorj according to which we classify the population and, therefore, ths\ sample values of the variate, has no effect on the value of the variate, i.e., in our present case t h a t breed (the factor of classification); does not influence milk-yield. If,' indeed, this is the case, each class into which we divide our sample will itself be a random sample from one and the same population. Con-; sequently any unbiased estimates we may make of the population variance on the basis of these sub-samples should ba! compatible and, on the further assumption t h a t the populations sampled is normal, these estimates should not differ significantly 166
analysis
of
167
variance
when subjected to a variance-ratio test. However, should they be found to differ significantly, we should have t o conclude t h a t our sub-samples are not random samples from one homogeneous population, but are in fact drawn from several different populations brought into being as it were by our method of classification. We should have to conclude, in short, t h a t our null hypothesis was untenable and t h a t milk-yield and breed are connected. In practice, of course, the problem is seldom as simple as this and m a y involve more than one criterion of classification. We may, for instance, have to analyse, not merely the influence of breed on milk-yield, but also t h a t of different varieties of feeding-stuffs. This would present us with a problem of analysis of variance with two criteria of classification. Problems arising from three or more criteria are also common. Although the general principle underlying t h e treatment of all such problems is the same, each presents its own particular problems. 9.2. One Criterion of Classification. Consider a random sample of N values of a given variate x. Let us classify these N values into m classes according t o some criterion of classificam
tion and let the ith class have M; members.
Then S «,• = N. i =1 Also, let the yth member of the i t h class be Xi,. The sample values m a y then be set out as follows : Class 1.
*11
Class 2.
xtl
Class i
x
Class m
x
n
X11
*<2
*1,
Xl„l
*2f
X2ns
x
Xint
a
X
mi
ml
Xm
"m
I t is frequently the case that, for all i, rn = n, i.e., each class has n members and, consequently, N = mn. Let the mean of the ith class be x%. and the general mean of the N values be x.. Then, for all i, S (x^ - Xi.) = 0 j= i
. . .
(9.2.1)
168
168 s t a t i s t i c s
Consequently, nt ti( 2 2 2 (Xij — x..) = 2 (Xij — xt. + Xi.— x..) i=1 ni i =l m = 2 (Xi, - Xi.)2 + m(xi. - x..f + 2(xt. - x..) 2 (x,j j= l
xt.)
j = i
rn = 2 (xn — Xi.)2 + m(xt. — x..)1, (in virtue of (9.2.1)) i=i Hence m ni 2
2 (Xij -
t = 1j=1
x..)2 m
m
rn
= 2 «<(*,-. - *..)2 + 2 2 (9.2.2) i= 1 i=lj'=l The left-hand member of this equation is t h e t o t a l sum of t h e squared deviations of t h e sample values of t h e variate f r o m t h e general m e a n ; it is a measure of t h e " t o t a l variation " . The right-hand side of t h e equation shows t h a t this " t o t a l variation " m a y be resolved i n t o two components : one, measured b y t h e first t e r m on the right-hand side, is t h e variation which would have resulted h a d there been no variation within t h e classes; this is easily seen b y p u t t i n g x^ - -- xt. for all i; it is therefore t h e variation between classes; the other, measured b y t h e second t e r m on the right-hand side, is t h e residual variation within classes, a f t e r t h e variation between classes has been separated out f r o m the total variation. In short, Total
variation =
Variation
between classes + Variation within
classes
Assuming t h a t t h e classifying factor does not influence variate-values, each class into which t h e sample is divided by this factor will be a r a n d o m sample f r o m the parent population. Taking expected values of t h e terms in (9.2.2) :
where . e ( \i
e ( 2 2 (Xij- *..)2>l ={N - 1)0*. \i=ij-i / a 2 is t h e variance of t h e parent population. Also m iii \ nt p / *H \-i 2 2 (Xij- Xi.)A = 2 \£ 2 (Xi, - Xi.)*) =Ij =i / f = lL \ j = i /J m 2 = 2 (m - l)o = (N - m)a2. t= i
analysis
of
v a r i a n c e 169
Consequently, . m . e ( S m(xt. - X„)2J = (N - 1)<J2 - (N - m)a2 = (m -
l)a2
Thus, providing our null hypothesis stands, (9.2.2) leads us to two unbiased estimates of a 2 , viz., m S fii(x{. - *..)2/(m - 1), based on m — 1 degrees of freedom, and m rti 2 S (x{j - x{.)*l(N - m), i=lj=l based on N — m degrees of freedom. So far all we have said has been true for any population. We now impose the restriction that the population sampled be normal. With this restriction, the two unbiased estimates are also independent, and all the conditions for applying either Fisher's ^-test or Snedecor's version of t h a t test are fulfilled. We now draw up the following analysis of variance table : Analysis of Variance for One Criterion of Classification Source of variation.
Sum of squares.
Degrees of freedom.
Estimate of variance.
Between classes
S n{(x,. — x..)'
m — 1
2 »,(*<• - x..)*l <-i (m - 1)
N —m
2 2 (xtl - *,.)•/ f-n-i (N-m)
Within classes Total
m
i-i m
ni
S
S (x{, - *,.)»
m ni 2 2 (x„ - x..)» (-U-1
N - 1
m
ni
—
Since, from the conditions of the problem, both N and m are greater than 1, the estimate of o2 from the variation within classes must, of necessity, be based upon more degrees of freedom than t h a t of ct2 from the variation between classes. I t is reasonable, therefore, to take the estimate of o 2 from
170
statistics
the variation within classes as the more reliable estimate, even if the null hypothesis is untenable. If, then, the other estimate of a 2 , t h a t from the variation between classes, is smaller than this, but not considerably so, we m a y straightaway conclude t h a t there is no evidence upon which t o reject the hypothesis. If, however, it is greater, we m a y test whether it is significantly greater by means of a variance-ratio test. 9.3. Worked Example. Six machines produce steel wire. The following data give the diameters at ten positions along the wire for each machine. Examine whether the machine means can be regarded as constant. Machine. Diameters in m x 10~5 A 12 13 13 16 16 14 15 15 16 17 16 16 18 17 B 12 14 14 19 20 18 14 21 17 14 C 19 18 17 17 16 15 D 23 27 25 21 26 24 27 24 20 21 E 16 13 17 16 15 15 12 14 13 14 F 16 17 16 15 16 16 13 18 13 17 (Paradine and Rivett) Treatment: (1) We set up the hypothesis that the machine means are constant. (2) We note t h a t : (а) since the variance of a set of values is independent of the origin, a shift of origin does not affect variance-calculations. We may therefore choose any convenient origin which will reduce the arithmetic involved. We take our new origin at * =
20;
(б) since we are concerned here only with the ratio of two variances, any change of scale will not affect the value of this ratio. This also contributes to reducing the arithmetic, but in the present example is unnecessary. (3) We may further reduce the work of calculation as follows : Let T = 2 2 xt, and T, = li i Then (a) 2 X ( x t j - *)2 - S S V - Nx"- = ij ij ii
-
T2/N
(6) S S (*„ - *,)2 = s ( s (*«, - i,) 2 ) = S ( s * „ 2 - r42/»<) = s s v - s i w - *)2 = S (77/n,) - T'lN ' ' ' i 1 (4) In our example, «< = 10, for all t, and m = 0. Therefore N = nm = 60. We set up the following table : (c)
analysis
Machines (m = 6).
Positions
of
171
variance
(» = 10) Tt
1.
2.
4.
3.
5.
6.
7.
8.
A
12 13 13 16 16 -8 -7 -7 -4 -4 64 49 49 16 16
B
12 14 14 16 16 18 17 19 -8 -6 -6 -4 -4 -2 -3 -1 64 36 36 16 16 4 1 9
C
14 -6 36
21
17
14
1
9
36
23 3 9
27 7 49
25 5 25
21 1
9.
Ti'tn !
10.
15 15 16 17 - 5 3 -5 -5 -4 -3 9 36 25 25 16 14
280-9
305
20 18 - 3 6 0 -2 0 4
129-6
186
19
18
17
17
16
15 - 3 2
102-4
146
26 6 36
24 4 16
27 7 49
24 4 16
20 0 0
21 + 38 1 1
144-4
202
E
12 14 13 16 13 17 16 15 15 14 - 5 5 -8 -6 -7 -4 -7 -3 -4 -5 -6 -6 64 36 49 16 49 9 16 25 25 36
302-5
325
F
13 18 13 16 17 15 15 16 16 17 - 4 4 -7 -2 -7 -4 -3 -5 -5 -4 -4 -3 49 9 25 25 16 16 9 4 49 16
193-6
218
D
1 -3 -6
1
T'jN
-1 -2 -3 -3 -4 -5 1 4 9 16 25 9
T => s< Ti'im J -183 = 1153-4 =< 1382
= 182»/60 == 552-07
The analysis of variance table is, accordingly— Source of variation.
Sum of squares.
Between machines
1,153-4 - 552-1 = 601-3
5
Within machines
1,382-0 - 1,153-4 = 228-6
6x9 = 54
829-8
59
Total
Degrees of Estimate of freedom. variance. 120-3 4-2 —
F. 28-6 —
—
Entering Table 8.6 at = 5, va = 54, we find that this value of F is significant at the 1% point. We, therefore, reject the hypothesis that there is no difference between the machine-means. In other words, there is a significant variation, from machine to machine, of the diameter of the wire produced. 9.4. Two Criteria of Classification. Now let us consider the case where there are two criteria of classification. Suppose, for instance, we classify our cows not only according to breed but also according to the variety of fodder given them. To examine
172
statistics
where milk-yield' varies significantly with breed a n d with diet, we m u s t analyse t h e variation in yield into three components, t h a t due t o breed, t h a t due t o variety of fodder, and, finally, t h e residual variation, due t o unspecified or unknown causes, assumed t o be normal. Let our sample be of N values of x, the variate, such t h a t N = nm, a n d let us classify it according t o some factor A into m classes, and, according t o another factor B, into n classes. Let the sample variate-value in the i t h A-class a n d jth Bclass be Xij. The reader should verify, using t h e m e t h o d of t h e last section, t h a t m
n
m
E E (Xij - *..)a = S n(x(. - x..y + i=l )= 1 i= 1
n
£
1
m(x.j — x..)* +
-f £ £ {Xij — Xi. — x.j + (9.4.1) i-lj-1 The first term on t h e right-hand side of this equation is t h e sum of t h e squared deviations f r o m t h e general mean if all variation within the A-classes is eliminated, i.e., if each item in a n A-class is replaced b y t h e mean value of t h a t class; t h e second t e r m is the sum of squared deviations from the general m e a n if all variation within B-classes is eliminated; while t h e third term, t h e residual term, measures t h e variation in x remaining a f t e r the variation due to t h a t between A-classes and t h a t between B-classes has been separated o u t ; once again we assume it t o be normal and d u e t o unspecified or unknown causes. Since it has been shown (A. T . Craig, " On t h e Difference between Two Sample Variates ", National Mathematical Magazine, vol. I I , (1937), pp. 259-262) t h a t t h e three t e r m s on the right are independent, we have
( ( m
n
\
Z
2 (Xij - *..)') = (mn »'
l)o a ;
i = l jm= ]
£ n(xi. - x,.)2j
and m
(
n
= (m -
l)o 2 ;
e { l m ( x , - x . . r ) = ( n - I)*; \
S S (Xij - x.j + x..)A i-= l j - 1 ' 2 - (mn — l)o — (n — l)a a — (m — l)tj 2 = (m - !)(« - l)o a .
( 1 0 4 2 )
analysis
of
variance
173
2
Here o is, of course, the variance of t h e parent population, assumed homogeneous with respect to t h e factors of classification, A and B. The analysis of variance table is, therefore : Analysis
of Variance for Two Criteria of Classification
Source of variation.
Sum of squares.
Degrees of freedom.
Estimate of variance.
Between Aclasses
S n(x<. - x..)2
m —1
m 2 n(x,. - f..)2/(w» - 1) i-1
Between B- Sn mix.. — x..)' classes i-i
n - 1
JLm(x.) - x..)2/{n - 1)
Residual (A x B)
Total
m n S S (xtj - x,. (m - 1) 2 2 (X,j — Xi- — X.j i-ij-i (n - 1) - x.s + *..)» "+ ij)l(m - 1 )(n- 1) m n S S (xtJ - x..)2 mn — 1 — i -1j-1
Let us call the three resulting estimates of a 2 , QA, QB and QA x B respectively. The test procedure is, then, as follows : (а) Test QAIQA X B for m — 1 and (m — 1 )(n — 1) degrees of freedom using the F-table; and (б) test QBIQA x i for m - 1 and (m — l)(n — 1) degrees of freedom using the F-table. 9.5. Three Criteria of Classification. We now assume t h a t we have N = Imn sample values of a given normal variate, classified according to three criteria, A, B, C, into I groups of m rows and n columns. Let be t h e value of the variate in the7th row and Ath column of the ith group (see Table 9.5). i takes the values 1, 2, 3, . . . j, the values 1, 2, 3, . . . m\ and k, the values 1, 2, 3, . . . n. We shall use the following notation : = general mean ; %
= ( S Xijh\ /n = mean of values in i t h group, 7th row
174
statistics
= ^S^Xijtj
jm =
x.jk = ^ £ Xiji^ II ( m
n
£
\<=1
, i x..ic = ( £
„
„
ithgroup,Athcolumn;
„
„
„
yth row, /Hh c o l u m n ;
\
£ xijic) !mn — mean of «'th group ;
;=.li=l
, I x.j. = ( £
=
„
/
n » £ x^) /In = mean of jth row ;
k=1
>
m \ £ Xijic) flm — mean of ftth column.
Table 9.5 will m a k e t h i s notation clear. The identity u p o n which t h e analysis of variance table is based is : l £
m £
n I £ (xijk - x...Y = mn £ (*<„ - *...)* m n + kl £ (x.j. — x...Y + Im £ (x..k - x...)* l l m
l + £
n
+ I £ £ (Xijlc — X.j. — X..i + X...)a 3 = 1 i= 1 Z n + W £ £ (*« — Xi.. — x..!c + X...)* t=l* =1 I m + n £ £ (% — — % + x...)2 t=ij=i m n £ £ (Xijk — x.jic — — Xij. + Xi.. + x.j. + x..k — X...)*;
i=li=Ur=l
(9.5.1)
T h e resulting analysis of variance table is given on page 175. , 9.6. The Meaning of " Interaction ". I n the analysis of| variance for three criteria of classification, we have spoken of " interactions " between t h e factors A, B a n d C. W h a t d o we' mean b y this t e r m ? Our null hypothesis is t h a t no one of t h e factors A, B, CI separately influences t h e variate-values. B u t it is conceivable ; t h a t a n y two of these factors acting together m a y do so.! When, therefore, we h a v e three or more criteria of classifica-1
analysis
of
variance
175
tion, it is necessary to test the estimates of a 2 based on the " interaction " terms before testing the variations due to the individual factors separately. If none of these " interactions ", Sample for Three Criteria of Classification
T a b l e 9.5.
Columns (Factor C) k = 1 to n . . . k . .
Means ~
1 . 1 0 • II ' < H i O -w a! P* to P, 3
\ ,*1
SS 1 O O• «
V
1
Xlll
.
*W •
• •
OS pq- m Xlml • Means
1
"0 ® *
V
j
Xl-l
• Un
• Xi mi .
•
• Xx.k .
• xUn
X\"
XiU •
•
• Xnt .
• Xa„
Xll-
x
xi}l
• ljh •
X,.i
•
• i-k •
8£ "S2 !1» ««
Xlll
•
. xm
Ximl
3 ' - •
Means c Rows
M e a n s of Columns
• imn
x
X
x
X
x
Xll. %
• lmlc •
X
• i>n • lln
Xljl
• ut •
• Xlln
x
• Xlmk •
• Xlmn
x
lml
%
m
X
.
X
tmi-
*tm-
•
•
• Uk •
• l-n
X
.ii
•
•
• x.u
.
• X.ln
X.l.
X.,l
.
•
•
•
Means
1
x
•
Means
• m
m*
•
.
1 •
^lmn
•
'
i
Xll.
x
• X1}t .
m
O *
• xlln
• X1U •
x
X.ml • X..x
•
•
x
-mk •
• x..t . , «
X
• X
l~
X
-1,
• '«m
X.: J X
-m-
X...
when tested against the residual estimate of a 2 , is significant, we may pool the corresponding sums of squares, form a revised estimate of a 2 , based on more degrees of freedom, and then proceed to test the estimates due to the individual factors; if, however, one, or more, of the estimates due to interactions is
SunTof squares. S^l
Ifyi
n
s
(*••* ~~
(*... — x..,)2 j-i '
m
mT S l
n — 1 „ — 1
m — 1
hi
~
$
Residual (A x B x
S S +
i-lj-lt-I
o ~ _ x..k + w
2
"
m
2 (xijk - x.Jt - X(.k - xq. {I - I){m - 1) 2 E E {xm - x - x{.k - x{ ' (ft 1) i-H-lt-l 1 +%+ ' +*,.. +X.J. + X..kJ»/ (l~ 1)(« - 1) (M - 1)
«
. _ .„ , J™ " , . 1W n - - + *_)' {»-!)(«-1) {m _ 1){n _ ^ S (x.jt _ x..k + x...)2
"
•» v v ( / - l)(w -
Interaction between •» columns and ' £ groups (C x A)
n 1}
_ . . - „ „(l ,1)( w„ ~ *<" ~ + ~ " ~
Interaction between ' » rows and columns Vx* (B X C)
h ~ *<" 5 - % + *...)»
-
tnfi ' l — l -. 1 r S (x{„ — x.,,)2
Estimate of variance.
I S , . . . , - n, / > ,»„ n n ' * . ~ *<•• ~ ** + *-)' (* ~ ^ ~ « - l)(m - 1)
Im 2 (x..k — *.„)a
n
i mn S (x,, — i-1 m 2 kl 2 (x... — 1 x...) i-i
Interaction between groups and rows
Between columns
Between rows
Between groups
p
Analysis of Variance for Three Criteria of Classification
analysis
of
177
variance
found to be significant, we have reason for suspecting the null hypothesis. Significant interactions, however, do not n e c e s s a r i l y imply real interactions between the factors concerned. The working scientist and technologist are only too aware that, from time to time, experiment may give rise t o what, at first sight, would appear to be a sensational result, but t h a t closer scrutiny may well reveal t h a t it is due to chance heterogeneity in the data and is no sensation after all. This is just another reason why the closest co-operation between statistician and technologist is always essential, but never more so than when the statistician finds something apparently of " significance ". 9.7. Worked Examples. 1. Five doctors each test five treatments for a certain disease and observe the number of days each patient takes to recover. The results are as follows (recovery time in days) :
1.
2.
Treatment. 3.
4.
5.
10 11 9 8 12
14 15 12 13 15
23 24 20 17 19
18 17 16 17 15
20
Doctor. A B C D E
21
19 20 22
Discuss the difference between : (a) doctors, and (b) treatments. (A.I.S.) Treatment: This is a problem involving two criteria of classification, each criterion grouping the data into five classes. Thus w, (the number of items in row i) = nt (the number of items in column ») = 5, and N = 25. Transfer to a new origin at 16. With the suffix » denoting row and the suffix j denoting column, let
r = SL*„; r 4 - S * „ ;
=
then (cf. 9.3 Treatment (3)) : - x)* = SSV - Nx* = SSV - T*IN; i j i i i i *)' ' £«<(*< - *)' = S(77/«,) - T'/N•, • i i 1 S S (*, - *)» = S n,(g, -*)' = £ (Tfln,) - T*IN < 1 1 1
178
s t a t i s t i c s
We now draw up the following table : Treatment. 2. 3. 4.
Doctors
1.
T,
5.
-6(36) -5(25) -7(49) -8(64) -4(16)
-2(4) -1(1) -4(16) -3(9) -1(1)
7(49) 2(4) 8(64) 1(1) 4(16) 0(0) 1(1) 1(1) 3(9) -1(1)
4(16) 5(25) 3(9) 4(16) 6(36)
-30
-11
23
3
22
900
121
529
9
484
190
31
139
7
102
Tt.
77.
V-
5 8 -4 -5 3
25 64 16 25 9
109 116 90 91 63
T = 7
2 77 = 139 469
ET,'
V
J2043
22 V
469
Consequently, (i) Total sum of squared deviations, S S V — T-jN = 469 - 7*/25 = 469 - 1-96 = 467-04. ' ' (ii) Sum of squares for treatments, 2 {Tfln,} — T*IN = 2043/5 - 1-96 = 406-64. > (iii) Sum of squares for doctors, 2 (T^/nJ - T*/N = 130/5 - 1-96 = 25-84. ' (iv) Residual sum of squares = 467-04 — 406-64 — 25-84 = 34-56. The analysis of variance is, then : Source of variation.
Sum of squares.
Degrees of freedom.
Estimate of variance.
Between treatments .
406-64
4
101-66
Between doctors
25-84
4
6-46
2-99
Residual.
34-56
16
216
—
467-04
24
—
—
Total
.
F.
47-00**
Entering Table 8.6 at = 4, v, = 16, we find the 5% and 1% points of F to be 3-01 and 4-77 respectively. We conclude, there-1
analysis
of
179
variance
fore, that the difference between doctors is hardly significant (at the 5% level), while that between treatments is highly so (highly significant at the 1% level). 2. P. R. Rider (An Introduction to Modern Statistical Methods, John Wiley, New York) quotes the following Western Electric Co. data on porosity readings of 3 lots of condenser paper. There are 3 readings on each of 9 rolls from each lot. Porosity Readings on Condenser Paper Lot number.
Reading number.
1.
2.
3.
Roll number. 4. 5. 6. 7.
8.
9.
I
1 2 3
1-5 1-5 2-7 3-0 3-4 2-1 2-0 3-0 5 1 1-7 1-6 1-9 2-4 5-6 4-1 2-5 2-0 5-0 1-6 1-7 2-0 2-6 5-6 4-6 2-8 1-9 4-0
II
1 2 3
1-9 2-3 1-8 1-9 2-0 3-0 2-4 1-7 2-6 1-5 2-4 2-9 3-5 1-9 2-6 2-0 1-5 4-3 2-1 2-4 4-7 2-8 2-1 3-5 2-1 2-0 2-4
III
1 2 3
2-5 3-2 1-4 7-8 3-2 1-9 2-0 1-1 2-1 2-9 5-5 1-5 5-2 2-5 2-2 2-4 1-4 2-5 3-3 7-1 3-4 5-0 4-0 3-1 3-7 4 1 1-9
We shall carry out the appropriate analysis of variance assuming, for the time being, t h a t we have here three criteria of classification—the roll number dividing the data into nine classes, the lot number and the reading number each dividing the data into three classes. This will illustrate the method employed in such a case. Then, less artificially, we shall regard the data as classified by two criteria (roll and lot) with three values of the variate, instead of one, given for each Lot x Roll. First Treatment: (1) We draw up the table as shown at the top of page 180. Thus E 2 S V = (1-5* + 1-5' + 2-7* + . . . + 3-7! + 4-l« + 1-9') <<* = 812.41 and T = 23M, iV = 3 x 3 X 9 = 81 give T*/N = 659-35 The total sum of square deviations from the mean, £ £ £ xijt> - T'lN = 812-41 - 659-35 = 153-06.
180
Lot.
statistics
Reading.
I
1 2 3 Totals
1 2 3
II
Totals
III
1
2 3 Totals Total (Rolls) Total (Readings)
1.
2. 3. 4.
RoU. 8. 6.
Totals.
1-5 1-5 2-7 3-0 3-4 2-1 2-0 3-0 5-1 1-7 1-6 1-9 2-4 6-6 4-1 2-5 2-0 5-0 1-6 1-7 2-0 2-8 6-6 4-6 2-8 1-9 4-0
24-3 26-8 26-8
4-8 4-8 6-6 8-0 14-6 10-8 7-3 6-9 14-1
77-9
1-9 2-3 1-8 J-9 2-0 3-0 2-4 1-7 2-6 1-5 2-4 2-9 3-5 1-9 2-6 2-0 1-8 4-3 2-1 2-4 4-7 2-8 2-1 3-8 2-1 2-0 2-4
19-6 22-6 24-1
6-6 7-1 9-4 8-2 6-0 9-1 6-8 8-2 9-3
66-3
2-6 3-2 1-4 7-8 3-2 1-9 2-0 1-1 2-1 2-9 5-5 1-8 8-2 2-8 2-2 2-4 1-4 2-8 3-3 7-1 3-4 8-0 4-0 3-1 2-7 4-1 1-9
28-2 26-1 38-6
8-7 18-8 6-3 18-0 9-7 7-2 8-1 6-6 6-8
86-9
19-0 27-7 22-3 34-2 30-3 27-1 21-9 18-7 29-9
231-1
24-3 + 19-6 + 2 6 - 2 26-8 + 22-6 + 26-1 26-8 + 24-1 + 35-6
69-1 78-8
1 2 3
(2) We draw up the following l o t Lot. 1. 1 II
III Total
8. 9.
7.
(Rolls)
2.
4-8 4-8 8-8 7 1 8-7 18-8
3.
4.
88-8
x roll
Roll. 8.
6.
6-8 8-0 14-8 10-8 9-4 8-2 6 0 9-1 6-3 18 0 9-7 7-2
T o t a l ( L o t 1}
Total
(Lot
T o t a l (Lot
Grand
II)
III)
Total
| 1231-1
J
table :
7.
8.
9.
7-3 6-8 8-1
6-9 14-1 8-2 9-3 6-6 6-6
Total (Lots).
77-9 66-3 38-6
19-0 27-7 22-3 34-2 30-3 27-1 21-9 18-7 29-9
The sum of squares for l o t x r o l l classification (i.e., a two criteria classification) is thus (4-8* + 4-8* + 6-6* + . . . + 8-12 + 6-62 + 6-52) = 2,280-37, and the sum of the squared deviations from the mean is 2,280-37/3 - 659-35 = 100-77. Note that we divide 2,280-37 by 3 because each entry in the body of this l o t x r o l l table is the sum of three readings. The sum of squared deviations for r o l l s is (19-0* + 27-72 + . . . + 18-72 + 29-92)/9 - 659-35 = 26-31 (why do we here divide by 9 ?), while the sum of squared deviations for l o t s is (77-9* + 66-31 + 35-6a)/27 - 659-35 = 7-90 (why do we, this time, divide by 27?). Finally, the residual sum of squared deviations, now called Interaction (Lot x Roll) is found by subtracting from the total sum of squared deviations for this classification the sum of that for Rolls and that for Lots, i.e., 100-77 - 26-31 - 7-90 = 66-56.
a n a l v s i s
(3) The
reading
x
of
1.
2.
3.
4.
8.
6.
5-9 6-1 7-0
7-0 9-5 11-2
5-9 6-3 10-1
12-7 11-1 10-4
8-6 10-0 11-7
19-0
27-7
22-3
34-2
30-3
1 2 3 (Rolls)
l8l
9.
Total (Reading).
5-8 4-9 8-0
9-8 11-8 8-3
69-1 75-5 86-5
18-7
29-9
Roll.
Reading.
Total
v a r i a n c e
table is:
roll
7.
8.
7-0 8-9 11-2
6-4 6-9 8-6
27-1
21-9
Here the sum of squares, 5-9* + 7-0a + 5-9* + . . . + 8-6' + 8-0a + 8-3J = 2,107-73. The sum of squared deviations from the mean is, then, 2,107-73/3 - 659-35 = 43-23. The sum of squared deviations for r e a d i n g s is (69-1* + 75-52 + 86-5a)/27 - 659-35 = 5-73. We have the corresponding sum for r o l l s , already : 26-31. Interaction (Rolls x Reading) is, then, 43-23 — 5-73 — 26-31 = 10-19. (4) The l o t x r e a d i n g table is : Lot. I II III
1.
Reading. 2.
3.
24-3 19-6 25-2
26-8 22-6 26-1
26-8 24-1 35-6
The sum of squares in this case is 24-3* + 26-82 + . . . + 26-la + 35-6a = 6,086-25. Hence the sum of squared deviations for l o t x r e a d i n g is 6,086-25/9 - 659-35 = 16-90. We already have the sums of squared deviations for l o t s ( = 7-90) and that for r e a d i n g s ( = 5-73). Consequently, Interaction (Lot x Reading) is 16-90 - 7-90 - 5-73 = 3-27. (5) The analysis of variance table is shown on the top of page 182. We see at once that the Interactions, l o t s x r e a d i n g s and r e a d i n g s x r o l l s are not significant; nor, for that matter, are they significantly small. However, Interaction r o l l s x l o t s is significant at the 1% level and so is the variation between r o l l s , while that between l o t s is significant at the 5% level. Since the two Interactions, l o t s x r e a d i n g s and r e a d i n g s x R o l l s are not significant, we may combine the corresponding sums of squares with that for r e s i d u a l to obtain a more accurate estimate of the assumed population variance; this is 3-27 + 10 19 + 33-10 _ . e Q 4 + 1 6 + 32 °'89' We find, as the reader will confirm for himself, that the levels of significance are unaltered when this new estimate of o1 is used, and
182
statistics
Source of variation. Between rolls
Sum of squares.
Degrees Estimate of of freedom. variance.
F.
26-31
8
3-29
3-14**
Between lots
7-90
2
3-95
3-81 *
Between readings
5-73
2
2-87
2-76
66-56
16
4-16
4-00 **
Interaction Lots) .
(Rolls x
Interaction (Lots x Readings)
3-27
4
0-82
—
Interaction (Readings x Rolls)
10-19
16
0-66
—
Residual .
33-10
32
1-04
—
153-06
80
TOTAL
—
—
we conclude that the variation of rolls within lots and that between rolls are highly significant, while that between lots is significant. Second Treatment : The conclusion we have reached justifies the view we suggested originally that it is less artificial to regard the data as classified by two criteria only (ROLL and LOT), with three values of the variate, instead of one, being taken. This being the case, the situation is summarised by the table given in Step II of our first treatment of the problem. The corresponding analysis of variance is: Source of variation. Between rolls . Between lots Interaction Lots) . Residual . TOTAL
Sum of squares.
Degrees Estimate of of freedom. variance.
F.
26-31
8
3-29
3-39 **
7-90
2
3-95
4-07 *
66-56
16
416
4-29 **
52-29
54
0-97
153-06
80
(Rolls x
—
—
—
Quite clearly, our null hypothesis, of the homogeneity of the condenser paper with respect to these two factors of classification, lots and rolls, breaks down.
analysis
of
variance
183
9.8. Latin Squares. When, in the case of three-factor classification, each criterion results in the same number of classes, n, say, some simplification may be effected in our analysis of variance by the use of an arrangement known as a Latin Square. Essentially, this device aims at isolating the separate variations due to simultaneously operating causal factors. Let us suppose t h a t we wish to investigate the yield per acre of five variates of a certain crop, when subjected to treatment by two types of fertiliser, each of five different strengths. We divide the plot of land used for the experiment into 5 s sub-plots, considered t o be schematically arranged in five parallel rows and five parallel columns. Each of the five columns is treated with one of the five strengths of one of t h e fertilisers (call it Fertiliser A); each of the rows is likewise treated with one of the five strengths of the second fertiliser (B). Then the five varieties of the crop under investigation are sown at random in the sub-plots, but in such a way t h a t any one variety occurs but once in any row or column. Denoting the crop-varieties by A, B, C, D, E, we shall have some such arrangement as : Strengths
1
Fertiliser A 2 3 4
5
M 1
A E B D C
B C D E A
E D A C B
fe 2
a s t 4 OK H. 5
C A E B D
D B C A E
Now assume that the following figures (fictitious) for yield per acre are obtained : A
B 3-2
E
C 2-0 A
C 1-8
20 D
B D
1-2
2-4 A
C 2-6
2-0
2-6
3-6 C.
A 2-6 E 20
2-2 A
C
D 2-2
2-4
1-6 B
1-8 D
B
E
E
1-8
2-8
1-6
30
E
D 2-2
2-4 B
1-4
2-8
184
statistics
Transferring our origin t o 2-2 and multiplying each entry by 5, we have : Fertiliser A
Strength
r®
h
v)
3
4
5
(Varieties).
(H*
A
17
289
B A 5(25) - 1 ( 1 )
C 0(0)
D -2(4)
E -2(4)
0
34
0
B
9
81
2
E c - I P ) -2(4)
A 3(9)
B 1(1)
D 0(0)
1
15
1
C
0
0
3
B D 4(16) - 3 ( 9 )
E C -3(9) -1(1)
A 7(49)
4
84
16
D
- 6
25
4
D 1(1)
E -5(25)
B 2(4)
c
1
35
1
E - 1 5 225
1(1)
C 2(4)
A 0(0)
0
30
n
—
11
-11
1
198
18
47
39
23
121
121
1
e
3
2
Totals
(B)
1
m
M
1
Totals
£
A 2(4)
E D -1(1) -4(16)
B 3(9) 9
6
26
63
198
16
81
340
-4
6 620
/
With T = 6 and N = n1 = 25, T2/N = 1-44. The sum total of squares is 198 and so the total sum of squared deviations is 196-56. The sum of squared deviations for Fertiliser A is 340/5 — 1-44 = 66-56. The sum of squared deviations for Fertiliser B is 18/5 — 1-44 = 2-16, and the sum of squared deviations for Varieties is 620/5 — 1-44 = 122-56. The analysis of variance table is then as shown on page 185. Both the variation between varieties and t h a t due t o the different strengths of Fertiliser A are significant a t the 1% level. By pooling the sum of squares for Fertiliser B and the Residual sum of squares, we may obtain a more accurate estimate of t h e assumed population variance : (2-16 + 5-28)/ (4 + 12) = 0-465. This is the estimated variance of t h e yield of a single sub-plot. The estimated variance of the mean of five such sub-plots is, therefore, 0-465/5 = 0-093. That of the difference of the means of any two independent samples of 5 sub-plots is 2 x 0-093 = 0-186. The Standard
analysis
Source of variation.
of
Sum of squares.
Between varieties
variance
185
Degrees Estimate of of freedom. variance.
F.
122-56
4
30-64
69-6 *»
Between strengths of Fertiliser A
66-56
4
16-64
37-8**
Between strengths of Fertiliser B .
2-16
4
0-54
1-2
Residual .
5-28
12
0-44
—
196-56
24
—
—
TOTAL
Error of the difference of the means of two such samples is, consequently, (0-186)* = 0-43. For 16 degrees of freedom the least difference, m, between the means of any two samples of 5 t h a t is significant at the 5% level is given by : m/0-43 = 2-12 or m = 0-9116. I t will be seen therefore t h a t all the five varieties differ significantly, t h a t only strengths 1 and 5 of Fertiliser A do not 'liffer significantly, while none of the strengths of Fertiliser B differs significantly a t the 5% level. 9.9. Making Latin Squares. If a Latin Square has « rows and n columns it is said to be a square of order n. The number of possible squares of order n increases very rapidly with n. There are, for instance, 576 squares of order 4; 161,280 of order 5; 373,248,000 of order 6; and 61,428,210,278 of order 7. (See R. A. Fisher and F. Yates, Statistical Tables for Use in Biological, Agricultural and Medical Research.) When the letters of the first row and first column are in correct alphabetical order, the square is a standard square. Thus A B C B C A C A B is a standard square of order three—indeed, it is the only standard square of that order, as the reader will easily realise. The standard squares of order 4 are : A B C D
B A D C
C D B A
D C A B
A B C D
B D A C
C A D B
D C B A
A B C D
B C D A
C D A B
D A B C
A B C D
B A D C
C D D C A B B A
statistics
186
From these standard squares all the remaining, essentially different, non-standard squares may be derived. I t is important to understand, however, t h a t what is required when deriving such non-standard squares is a new pattern or lay-out. Merely interchanging letters does not suffice. For example, the two squares A B C D
B C D A D C D B A C A B
and
D C B C D A B A C A B D
A B D C
present the same pattern and are, therefore, no different. If, then, we require to derive a non-standard square from a given standard square, we must permute all columns and all rows except the first. We thereby obtain a total of 12 possible squares of order 3 (12 = 3! x 2!) and 576 of order 4 (in this case each standard square, of which there are four, yields 4 ! different column arrangements and 3 ! different row arrangements : 4 x 4! X 3! = 576). When all t h e standard squares of a given order have been set down, we say t h a t the standard squares have been enumerated. This has been done for n less than 8, although for t i ^ 8 a considerable number have been listed. To choose a Latin square of given order, then, we select a standard square at random from those enumerated, and permute at random, using, both for selection and permutation, a table of random numbers. EXERCISES ON CHAPTER NINE 1. The following results were obtained in four independent samplings : 2 6 14 12 6 5 (1) 6 19 10 17 19 16 (2) 11 19 23 8 17 11 (3) 14 2 29 16 20 (4) 19 Carry out an analysis of variance on these data. (L.U.) 2. Four breeds of cattle B„ B„ B„ B t were fed on three different rations, R 1( R s , R». Gains in weight in kilogrammes over a given •eriod were recorded. B, B« Bt B, 62 41 45 46-5 Ri 22 47-5 41-5 31-5 R. 25-5 28-5 50 40 R, Is there a significant difference : (a) between breeds; (6) between rations ?
analysis
of
variance
187
3. Passenger Traffic Receipts of Main-Line Railways and L.P.T.B. (Weekly averages, ^000)
1944 1945 1946 1947
First Quarter.
Second Quarter.
Third Quarter.
Fourth Quarter.
3,376 3,320 3,310 2,991
3,548 4,072 3,884 3,752
4,120 4,898 4,758 4,556
3,836 3,872 3,611 3,703
Carry out an analysis of variance on these data. Is the betweenyears difference significant ? 4. A chemical purification process is carried out in a particular plant with four solvents (i — 1, 2, 3, 4) at three different, equidistant temperatures. For every one of the 4 x 3 = 12 combinations of solvents with temperatures the process is repeated four times and the resulting 48 test measurements are shown below (a low value indicates a high degree of purity). Solvent t = 3 i = 1 i = 2 i = 4 = 1 66-9 68-3 71-2 70-3 66-2 64-6 79-1 66-6 68-6 70-0 71-8 71-8 70-1 69-9 66-2 71-1 = 2 63-4 63-9 70-7 69-0 64-9 62-7 65-9 64-9 67-2 71-2 69-0 69-3 69-5 66-9 66-2 72-0 = 3 66-4 64-1 67-5 62-7 71-6 70-8 68-9 68-8 66-2 67-0 64-0 62-4 73-6 70-4 70-5 72-8 Carry out the appropriate analysis of variance. Are there differences between solvents and temperatures, taken as a whole ? Is there any interaction between solvents and temperatures ? (L.U.) 5. The atmosphere in 4 different districts of a large town was sampled, the samples being taken at 4 different heights. Four different tests for the presence of a certain chemical were made on the samples. The arrangement is shown in the following table with the % by weight of the chemical as determined by the tests. Letters denote the different tests. Districts 2 4 1 3 1 {2 2 !> 3 3 a 4
A 8 D 6-8 B 6-3 C 5-7
B 5-3 A 4-9 C 4-7 D 3-3
C 4-1 B 4-1 D 4-0 A 40
D 5 C 3-2 A 5 B 4-2
188 s t a t i s t i c s
Is there evidence of significant variation from district to district and between heights in the percentage of the chemical present in the atmosphere ? Can it be said that there is a decided difference between the sensitivity of the tests ? Solutions 1. Variation between samples significant at 5% point but not at 1% point. 2. No significant difference between breeds or between rations. 3. No significant variation between years but that between quarters is highly significant. 4. Variation between temperatures is not quite significant at 5% point; that between solvents is significant a t that point; interaction between solvents and temperatures significant at 1% point. 5. Variation between districts significant at 5% point but not at 1% point; no significant variation between heights or between sensitivity of tests.
CHAPTER
TEN
TESTING REGRESSION AND CORRELATION 10.1. The Correlation Coefficient Again. We now return to the problems raised at the beginning of Chapter Seven : How do we know t h a t a value of r, the sample correlation coefficient, calculated from a sample of N from a bivariate normal population is really significant ? Is there a way of deciding whether such a value of r could have arisen b y chance as a result of random sampling from an uncorr e c t e d parent population ? Linked closely with this problem are several others : If a sample of N yields a value of r = r0, how can we test whether it can have been drawn from a population known to have a given p ? Again, how shall we test whether two values of r obtained from different samples are consistent with the hypothesis of random sampling from a common parent population ? Finally, given a number of independent estimates of a population correlation coefficient, how may we combine them to obtain an improved estimate ? We start to tackle these problems by what may at first appear to be a rather indirect method. For we shall use t h e technique of analysis of variance (or, more correctly, in this case, analysis of covariance) to test the significance of a linear regression coefficient calculated from a sample drawn from a bivariate normal population. But this, as we shall soon see, is equivalent to testing the significance of a value of r, or, what is the same thing, to testing the hypothesis t h a t in the parent population, p = 0. 10.2. Testing a Regression Coefficient. Let (#,-, yt), (t = 1, 2, . . . N), be a sample of N pairs from what we assume to be an uncorrelated bivariate normal population. Taking the sample mean as origin (x = 0 = y), the regression equation of y on x is y = bj/xX, where byT = s^/s* 2 . (10.2.1) Let Yi be the value of the ordinate at x = X{ on this 189
190
statistics
regression line." Then t h e sum of the squared deviations of y t h e y ' s f r o m t h e sample mean ( y = 0) is simply S y, 2 . B u t i=i £ yi2 = £ (y, 1=1
>= 1
Yi + Yi)'' N
A
= 2 (yf <=i
Y<)a + 2 S t=i
A
-
y,)Y 4 + S Y«a «= I
However, S «=i
-
Y<) Yj = i ( y t t=i
byxXi)K^> = byX
since Hence
byx — sxyjsx2
A = £ j=i
S y, a = E (y< i=) i=l
S Xiyi - byx S x^ j = 0, \< = i i=i / A S i=i
Yi)2 + E Y«2. i=i
.
(10.2.2)
Thus : The sum of squared deviations of observed values of y from t h e sample mean = sum of squared deviations of observed values of y f r o m the regression line of y on x + sum of squared deviations of t h e corresponding points on t h e regression line from the sample mean. or VARIATION ABOUT MEAN = VARIATION ABOUT REGRESSION LINE + VARIATION OF REGRESSION LINE ABOUT MEAN,
Now (10.2.2) m a y be re-written : S yi» = S (y< - bvxXi)* + byx2 E i =» 1 i=l i=l. N A A = E yt" - 2byx E Xtyt + byx1 E x? + V
i.e.,
y 2
i= 1 i= 1 i=1 i=l = Ns,* - 2Nsxy*ls** + Nsxy'lsx* + Nsxyt/Sx* = iVs„ a (l - S ^ / S x V ) + Nsy* (SxyVSxa . V ) s E y,-2 = Nsy%(l - r*) + Nsv*r* . (10.2.3.)
i= l
now seen as a n obvious algebraic identity. 27ww sample variation of y about the regression line of y on x is measured by Nsyl(\ — j-1), while the variation of, the regression about the sample mean is measured by Nsy*r%.\
testing
regression a
To the term
and
correlation
igi
S y? there correspond N — 1 degrees of «= 1
freedom, since the y's are subject only to the restriction t h a t their mean is given (in our treatment, y — 0); corresponding x to the sum 2 (y; — Yi)2 there is one additional restraint— «= l that the regression coefficient of y on x shall be byx- Thus corresponding to iV%s( 1 — r2) we have N — 2 degrees of N freedom. Consequently 2 Y<2 = Ns v l r 2 has but one degree of freedom. «= 1 Now suppose that the parent population is uncorrelated (p = 0); then the sample variation of the regression line about the mean and the random variation of the y's about the regression line should yield estimates of the corresponding population parameter not significantly different. If, on t h e other hand, the regression coefficient is significant, i.e., if there is in fact an association between the variates in the population of the kind indicated by the regression equation, the estimate N
provided by S V,2 /1 = N s f r * should be significantly greater <= l than t h a t provided by S (yi - Y()*/(N - 2) = Nsv*(l - r*)/(N - 2). t= i We may therefore set out an analysis of covariance table as follows: Sum of squares.
Source of variation.
Degrees of freedom.
Mean square.
Of regression line about Nst*r* mean
1
Ns/r>
Residual (of variate Ns*(l - r>) about regression line)
N - 2
Ns„*( 1 - ••»)/ (AT-2)
Ns,»
TOTAL
N - 1
—
If then t h e sample d a t a does indicate a significant association of the variates in the form suggested by the regression equation, the value of z given by , . r\N '
=
*
l n
i _
- 2) y
»
, =
l n
fN - 2~\i "ITZTFJ
(10-2-4>
192
statistics
will be significant at least at the 5% point. Alternatively, the value of F = r*(N - 2) /(I - ra) . . (10.2.4 (a)) can be tested using Table 8.6. A significant value of either z or F requires the rejection of the null hypothesis that, in the parent population, p = 0. Thus when we test the significance of a regression coefficient we are also testing the significance of r, the correlation coefficient. If neither z- nor i^-tables are available, we m a y use /-tables, for, as we shall now show, r[{N — 2)/(I — r2)]i is actually distributed like t with N — 2 degrees of freedom. 10.3. Relation between the t- and z-distributions. We have (8.6.7) g 2 /2 exp (vtz) = 2vi '', V' • B(Vi/2
Vs/2)
(v! exp 2z + vj) 2
Now p u t z = | l n < , Vj = 1 and v a = v. 2v"/2 = w k r ) -
t
1
2
Then
2v-i = w k T v
/ (1
t"\ 9
>•+1 2
+ in-t* * (i» + v) 2 However, since z ranges from 0 to while t ranges from — co to -f- co, we must remove the factor 2, with the result that dp{t)
= v»f?(v/2, 1) " ( 1 +
dt
•
( s e e 8 L4
- >
In other words the distribution of t, like t h a t of F, is a special case of t h a t of z. Consequently, r[(N — 2)/(l — r2)]i is distributed like t with N — 2 degrees of freedom. 10.4. Worked Example: In a sample of N — 16 pairs of values drawn from a bivariate population, the observed correlation coefficient between the variates is 0-5. Is this value significant ? Find the minimum value of r for a sample of this size which is significant at the 5% level. Treatment: N = 16, r = 0-5 *= 1, K, = 14. Then F = 0-25 x 14/(1 - 0-25) = 4-667 for 1 and 14 degrees of freedom; or t = 2-16 for 14 degrees of freedom. The 5% and 1% points of F for vj = 1 and v, = 14 degrees of freedom are 4-60 and 8-86; the value of / significant at the 5% level
testing
regression
and
correlation
193
is 2-14 (using Tables 8-6 and 8-1 respectively). We conclude, therefore, that the observed value of r, 0-5, is just significant at the 5% level: there is less than 1 chance in 20 but more than 1 chance in 100 that this value should arise by chance in random sampling of an uncorrelated population. The required minimum value of r is given by r* X 14/(1- r') = 4-60 or r = 0-497 10.5. The Distribution of r. So far we have assumed t h a t the population we have been sampling is uncorrelated. We must now consider the problem of testing the significance of an observed value of r when p 0. The distribution of r for random samples of N pairs of values from a bivariate normal population in which p ^ 0 is by no means normal, and in the neighbourhood of p = ± 1 it is extremely skew even for large N. I t was for this reason t h a t Fisher introduced the important transformation z = | In [(1 + r)/( 1 - r)] = t a n h - 1 r . (10.5.1) The importance of this transformation lies in the fact t h a t — z is approximately normally distributed with mean \ In [(1 + p)/(l - p)] = tanh' 1 p and variance 1/{N - 3) and, as N increases, this distribution tends to normality quite rapidly. (а) To decide whether a value of r calculated from a sample of N pairs of values from a bivariate normal distribution is consistent with a known value of t h e population correlation coefficient p, we put z = i In [(1 + r)/( 1 - r)] = tanh" 1 r; Z = i In [(1 + P )/(l - p)] = tanh" 1 p. Then (z — Z) /(N — 3)~* is approximately normally distributed with unit variance. Now t h e value of such a variate which is exceeded with a probability of 5% is 1-96 (see Table 5.4). Therefore for r to differ significantly from the given value of p at the 5% level, we must have (* - Z)(N - 3)4 > 1-96 (б) Now assume t h a t a sample of N t pairs yields a value of r = rt and a second sample of N z pairs a value r — r2. If the sampling is strictly random from the same population or from two equivalent populations, rx and r2 will not differ significantly. Should there be a significant difference, however, we should
194
statistics
have reason to suspect either t h a t the sampling had not been strictly random or t h a t the two samples had been drawn from different populations. Let zx be the ^-transform of r1 and z2 t h a t of r2. On the hypothesis t h a t the two samples are random samples from the same population (or equivalent populations), we have (p. 141), var
— z2) = var zt + var z2 = l/(Af t — 3) + 1 /{N2 — 3)
Hence the standard error of zt — z2 is V i V 7 ^ 3 +
*
n d
<*>-
'
•
>
/
'
+
will be approximately normally distributed with unit variance. Consequently if 1*1 - * t l / V i v 7 ^ " 3
+
<
1-96
there is no significant difference between rl and r2 and we have no grounds for rejecting the hypothesis t h a t t h e samples have been drawn at random from the same population; if, however,
we have grounds for suspecting t h a t they have been drawn from different populations or that, if they have been drawn from one population, they are not random samples. 10.6. Worked Example: A sample of 19 pairs drawn at random from a bivariate normal population shows a correlation coefficient of 0-65. (a) Is this consistent with an assumed population correlation, p = 0-40 ? (6) What are the 95% confidence limits for p in the light of the information provided by this sample ? (c) If a second sample of 23 pairs shows a correlation, r = 0-40, can this have been drawn from the same parent population ? Treatment: (a) z = i log, (1-65/0-35) = 0-7753 Z = i log, (1-40/0-60) = 0-4236 (z — Z) (N — 3)* is normally distributed about zero mean with unit variance. In the present case (z - Z)(N ~ 3)* = 1-4068, which is less than 1-96. Consequently the value r = 0-65 from a sample of 19 pairs is compatible with an assumed population correlation of 0-40.
testing
regression
and
correlation
195
(6) To find the 95% confidence limits for p on the basis of the information provided by the present sample, we put | Z - Z | x 4 < 1-96 or | z - Z | < 0-49. Consequently 0-7753 - 0-49 < Z < 0-7753 + 0-49 giving 0-2853 < Z < 1-2653 or 0-2775 < p < 0-8524 and these are the required 95% confidence limits for p. (e) The ^-transforms of rt = 0-65 and of »-s = 0-40 are respectively zt = 0-7753 and = 0-4236. On the assumption that the samples are from the same normal population (or from equivalent normal populations), the variance of their difference is equal to the sum of their variances. The standard error of — z t is then (1/16 + 1/20) i = 0-3354. Thus — 2a)/0-3354 is distributed normally about zero mean with unit variance, and since (0-7753 - 0-4236)/0-3354 = 1-044 < 1-96, we conclude that there is no ground to reject the hypothesis that the two samples have been drawn from the same population (or from equivalent populations). 10.7. Combining Estimates of p. Let samples of Nu N t , . . . N t be drawn from a population and let t h e corresponding values of r be rv r2, • • • ?k- How shall we combine these k estimates of p, the population correlation, to obtain a better estimate of t h a t parameter ? Let the 2-transforms of n, (i = 1, 2, . . . A) be Zi, {i = 1, 2, . . . k). Then these k values are values of variates which are approximately normally distributed with variances (Ni — 3)- 1 , (i = 1, 2, . . . k) about a common mean Z = tanh _ 1 p. If we " weight " these k values with weights tm, (i — 1, 2, . . . k), the weighted mean is k
t
k
k
£ niiZil £ mi = S MiZi i=l 1 t=1
where
M,: = w,-/ S m,. t'=l
k
If the variance of zi is a,-2, t h a t of S MiZi, a 2 , is given by i= 1 k
k
k
a 2 = 2 M M 3 = 2 mi2o,'2/( 2 m,)». t=i i=i t=i Now a 2 is a function of t h e k quantities nu. Let us choose these ft quantities in such a way t h a t a 2 is a minimum. The necessary condition t h a t this should be so is t h a t for all i.
da" 18mi
-0
196
s t a t i s t i c s
i.e., for all i, 2 m^j mm* — ( . 2 m S a f ) ( . 2 m ^ m^a,'2 =
i.e., for all i,
k
=0
k
2 m^c^j 1n I
2 m%, a constant ; 1= 1
nti oc 1 /<Tia, lor all i.
i.e.
The minimum-variance estimate of Z is then k 2
k
f'e]
(Ni
-
3)zil
2
/ = 1(Ni
-
3)
and the required combined estimate of p is p = t a n h ^ 2 (Ni - 3 ) * / S ^ (iV< -
3) J
.
(10.7.1)
10.8. Worked Example : Samples of 20, 30, 40 and 50 are drawn from the same parent population, yielding values of r, the sample correlation coefficient, of 0-41, 0-60, 0-51, 0-48 respectively. Use these values of r to obtain a combined estimate of the population correlation coefficient. Treatment: We form the following table :
0-41 0-60 0-51 0-48 TOTALS
z,.
N( - 3.
(Nt - 3)z<.
0-436 0-693 0-563 0-523
17 27 37 47
7-412 18-711 20-831 24-581
128
71-535
—
£(.V(-S)», (-i giving
' p = tanh 0-5589 = 0-507
N O T E : (1) To save work tables of the inverse hyperbolic functions should be used to find the ^-transforms of r. (2) The weighted mean of z obtained in the previous section and used here is approximately normally distributed with variance
TESTING
REGRESSION
AND
CORRELATION
197
1/ E (Nt — 3). The accuracy of our estimate is then that to be expected from a sample of £ E (N, — 3) + 3 J pairs. In the present example this variance is 1/128 = 0-0078, i.e., the standard error of Z is 0-0883. Thus we may expect p to lie between 0-368 and 0-624. The value of Z we have obtained may be treated as an individual value calculated from a single sample of 131 pairs. 10.9. Testing a Correlation Ratio. When the regression of y on x in a sample of N pairs of values from a bivariate population is curvilinear, the correlation ratio of y on x for the sample, eyx, is defined by (6.14.4) eyx1 = V / V . where s s is the variance of t h e means of the .^-arrays and s„ 2 is the sample variance of y. Moreover, eyX2 — r* may, provisionally, be taken as a measure of the degree t o which the regression departs from linearity (but see below, 10.10). We now have to devise a method of testing whether a given evx is significant, i.e., whether such a value of could have arisen by chance in random sampling. We take Table 6.2.2 to be our correlation table and assume our origin t o be taken at the sample mean (x = 0 = y). Then the sum of the squared deviations of the y's from this mean is a
S V f y y f =
i j
S
-
i j
yt +
vi)\
where yi is t h e mean of t h e y's in the *jth array. the right-hand side, 2 i
X f y y f j
= 2 2My i j
}
The cross-product
Expanding
- y'02 + 2 S/yyi 2 + 2 2 i
vanishes
i
because
if
2 f y U V i ~ i j
Vi)
2 f% = «<, i
the
frequency of the y's in the #
ZfiWiVi i
~
yi)
=
2 (y,- 2 2 f y ) ' j
S(y,-2fyy}) «' j
= 2 («iy<2 - n,y,2) = 0. 1
Consequently
2 2 f a y ? = 2 2 f i } ( y i - y,-)2 + 2 2 fry?, which, by 6.7, i j
i
i
i
i
= Nsy*(\ - eyx*) + Ns/efX* . . (10.9.1) If now there are p ^-arrays, the p values of yi are subject t o the single restriction 2 2 fyyj = 2 M;yt- = Ny{ = 0, with our pre< j
sent origin).
t
Thus the term Nsv2eyxs
has p — 1 degrees of
198 s t a t i s t i c s
freedom. Likewise, the Ny/s are subject to the same restriction and, so, for the term £ fay j2 there are N — 1 degrees of i j freedom. Hence the term 1 — eyxl) involves (N — 1) — (p — 1) = N — p degrees of freedom. On the null hypothesis that there is no association between the variates in the population, each array may be regarded as a random sample from the population of y's. 2 Jjfy{yj — yi)3 «j is the variation sum of squares within arrays, while the term 2 2 fijfi2 is the variation sum of squares between arrays. On « 3 the null hypothesis, then, each of these sums divided by the appropriate number of degrees of freedom should give unbiased estimates of a s 2 . The corresponding analysis of covariance is, then : Source of variation.
Sum of squares.
Degrees of freedom.
Between arrays
Nsszeyxz
Within arrays .
Ns„*(l - V ) N - p
TOTALS
Nst*
.
PN -
1
1
Estimate of at*. NsSe„x*l(p - 1) NsJ{l - e^)l(N - p) —
If, then, the value of
is found to be significant, we must reject the hypothesis that there is no association of the kind indicated by the regression function in the population. In other words, the value of eyx obtained from the sample data is significant. 10.10. Linear or Non-linear Regression ? In 10.2 we assumed that the regression of y on x was linear ; in 10.9 we assumed it to be non-linear. We must now complete this set of tests with; one which will indicate whether, on the sample data, regression: is linear or non-linear, a test which is in fact logically prior to the other two. To do this we return to equation 2 Z f a y j * = 2 Xfyly, - y<)2 + 2 2 f y t f . « j >3 ' 3 Let byx be the coefficient of regression of y on x; then, with our assumption that x = 0 = y, the regression line is y = byzx.
testing
regression
and
correlation
199
Let Yi be the estimate of y obtained from this equation when x — xi- Then we may write £ £fyyf= £ - y«)a + £ £fy(yi - Y, + K<)a i j i j i j = £ £My, - yi)» + £ -Lfi0i - Yi)'1 + £ ZfyYS, i j i j i i the cross-product again vanishing (why ?) The first term on the right-hand side of this equation still represents the variation of y within arrays; the second term represents the variation of array-means from the regression line; and the third term represents the variation of the regression line about the sample mean. We have already seen t h a t the term s Zfyiyi
- yiY ^ A V ( i - «,•yx
1
has N — p degrees of freedom.
Furthermore,
S S f y ( V i - Yi)* + s S f y Y i * = Nsv%x* i j j with p — 1 degrees of freedom. Now we may write
•
Z Z UjY? = byx2 S S fax? = byX* E mx?. i
j
i
)
i
But S niXi2 is independent of the regression and, so, the variai tion it represents depends only on byx and to it, therefore, corresponds but one degree of freedom. Moreover, £ niXi2 = (sayVsx1) . Nsx2 = Nr2sv2. t Consequently the term £ — Yi)2 = Nsy2(eyx2 — r2), i j with p — 2 degrees of freedom. On the hypothesis t h a t regression is linear, the mean square deviation of array-means from the regression line should not be significantly greater than that of y within arrays. The analysis is shown in the table on page 200. We may thus test V
(1
e„x)(p
2)
(10.10.1)
If this value of F is significant, the hypothesis of linear regression must be rejected. I t follows t h a t it is not sufficient to regard eyx2 — r2 by itself as a measure of departure from
200
statistics
Source of variation.
Sum of squares.
Of array means about Ns.He,/ regression line
- r«)
Degrees of freedom. p-
2
Mean square. W
Of regression line Nr\» about sample mean
1
Nr's*
Within arrays
Ns,»(l - «„»)
N - p
Ns,*(1
TOTALS
Ns„>
N - 1
- '*)/ ( p - 2)
(.N-p) —
linearity, for F depends also on eyxz, N and p. If t h e value of F is not significant, there is no reason t o reject t h e hypothesis and analysis m a y proceed accordingly. 10.11. Worked Example: Test for non-linearity of regression the data of 6-8 and 6-15. Treatment: We have e,x* — 0-471; eyI2 — r* = 0-009; N = 1000 and p = 9. 0-009 x 991 „ „„, OAnn. F = 0-529 X 7 = 2 - 4 0 9 f ° r "» = 7 ' ~ Using Table 9-6, we find that the 1% and 5% points of F for „, = 7, v2 = 991 are 2-66 and 2-02. The value of F is, therefore, significant at the 5% level, but not at the 1% level. There is some ground for believing that the regression is non-linear. EXERCISES ON CHAPTER TEN 1. Test for significance the value of r found in Exercise 4 to Chapter Six. 2. A sample of 140 pairs is drawn at random from a bivariate normal population. Grouped in 14 arrays, the data yielded r = 0-35 and em = 0-45. Are these values consistent with the assumption that the regression of y on x is linear ? 3. Test the values of eyl and r found in Exercise 8 to Chapter Six. Is there reason to believe that the regressidn of y on x is non-Unear ? 4. Random samples of 10, 15 and 20 are drawn from a bivariate normal population, yielding r — 0-3, 0-4, 0-49 respectively. Form a combined estimate of p. Solutions 2. Yes.
3. No.
4. p = 0-43 (2 d.p.).
CHAPTER CHI-SQUARE
ELEVEN
A N D
ITS
USES
11.1. Curve-fitting. W h a t we are actually trying to do when we " fit " a continuous curve to an observed frequency distribution is to find a curve such t h a t the given frequency distribution is t h a t of a random sample from the (hypothetical) population defined by the curve we ultimately choose. Suppose that the observed frequencies of the values Xi of the variate x k are (i = 1, 2, . . . k), where S = N. The value Xi will in fact be the mid-value of a class-interval, Xi ± -J/j,-, say. Let
xi —i M
say. The question we now ask i s : How well does this theoretical curve fit the observed d a t a ? On the hypothesis t h a t t h e fitted curve does in fact represent the (hypothetical) population from which the set of observed values of x is a random sample, the divergence of observed from theoretical frequencies must result from random sampling fluctuations only. If, however, this total divergence is greater than t h a t which, t o some specified degree of probability, is likely t o result from random sampling, we shall be forced t o conclude that, at this level of significance, the fitted curve does not adequately represent the population of which the observed x's have been regarded as a sample. Crudely, the " fit " is not good. 11.2. The Chi-Square Distribution. Of a sample of N values of a variate x, let x take the value xx on n1 occasions out of t h e N, the value x t on » 2 occasions, and, finally, the value xt on k nk occasions, so that 2 «,• = N. If now the probability of * taking the value X{, (i = 1 , 2 , . . . k) be pi, (i = 1 , 2 , . . . k), the probability of x taking the value xx on occasions, the 201
202
statistics
value on n2 occasions and so on, regardless of order, will be (2.10) t AT! P = -J: . n pfH . . . (11.2.1) n «i! < = 1 7= 1 For sufficiently large values of approximation to n !, viz.,
we may use Stirling's
n ! ^ (27i:)iwC'! + i) . exp (— n) then P
~
(2-K)iW + iexp{n
N)'
_£
n
[{2K)in. t + i exp (— «<)]
2
(2tt)
I
exp
* \ t=J E mj
t But S iti = N, and,I, therefore i=3 Ar— 1 [ ( S 1f («< + « ] - \T*~*
P -
—'
^
rn-, * (2tc) 2 n p d >• =
2
£
• n [piln^ +i 4=1
]
1 (2ttN)
pi7H
i=l
. n n
1= 1
•
(11.2.2)
pi*
k Now the expression (27tiV)Cfc~1)/2 II £,4 is independent of the i=1 Hi's, and, for a given N and a given set of theoretical probabilities, pi, is constant. Therefore, putting 1/C for this expression, we have, for sufficiently large rii, In P =2= In C +
2 (m + -i) In »= l
(Npt/tn).
Write = (w, - Npi)/(Np,)i,
i.e.,
n, = Npt +
Xi{Npt)i.
chi-square
and
its
uses
203
Then, since k k k 2 m = N = S Npi, 2 Xi(Npi)i = 0 •= 1 1=1 i s= 1 indicating that only k — 1 of the Xi's are independent. follows t h a t In P — In C + I [Np, + X((Npi)i + X 1 X In [Np,/(Npi — InC -
£ [AT/., + Xi(Np()i 1
+
It
X,(Np,)i)]
+ i ] In [1 +
X,(Np()-i]
If none of t h e m ' s are of the order 2V-1, we have, on expanding the logarithm in a power series (see Abbott, Teach Yourself k
Calculus, p. 332) and using the fact t h a t
2 Xi{Np()i
= 0,
i = 1
In P ^ In C - i
2 Xt* - i 2 Xi{Npi)-i 4=1 i=1 or, since N is large and terms in N~i and higher powers of N~i may be expected, In P
or
In C - i
2 X? »•= l
P znz C exp (— i
Z Xi'). t=i
.
(11.2.3)
Now the are integers and since w; = Npi + Xi(Npi)i, to a change of unity in m, the corresponding change in Xi, AXi, will be given by A Xi = (Npi)-$. Remembering t h a t only k — 1 of the Xi's are independent, we have p ^
1 (
(2,rA0 *_ (k r
2
(2K) - >i
exp ( - * 2 Xi 2 )
_
1)/2
V
n pii i=1 1
1=1
/
' k
\
exp ( - * 2
n (Npi)i.
pki
v
i=1
'
1 = 1
^ {teF-Wpt
exp
(- i j , ^
2
) ^ . ^
• • • A*.-,
statistics
204
or, as N tends to infinity, the probability of w ^ ' s , n^x^s, etc., P, is given approximately by the probability differential of a continuous distribution defined by P — dp = B exp
^Xi^dX^dXi
. . . dX^x
(11.2.4)
Let us now consider the variate X, in more detail. We begin k by recalling t h a t 2 = jV; if then we allow N, the sample t=1 size, to vary, we have on the assumption t h a t the frequencies are independent, k
2 CT„i2 = Ojy2 . t-1
.
.
.
(11.2.5)
k
or
var N = 2 var m i=]
.
(11.2.5(a))
Again, if we treat the event of x taking t h e particular value X\ as a success in N trials, the frequency of success will be distributed binomially with mean Npi and variance Npi( 1 — pi). Now put zi = ni — Npi. When N is treated as constant, var (zi) — NPi(l — pi). If, however, we write ni — zi + Npi and allow N to vary, var (ni) — var (zi) + pi* var (N), or var (ni) = Npi(l — pi) + pi2 var (N). Summing over the i's, we have k k k var (N) = S var (ni) = N S pi( 1 - pi) + var (N) . 2 pi2 i=l t=l i=l k or, since T, pi = I, var (N) = N. i= 1
Therefore, var (m) = Npi(\
- pi) + Np? = Npi
(11.2.6)
When, then, N is sufficiently large, the variates Xi = (m — Npi) l(Npi)i are approximately normally distributed about zero mean with unit variance. I t remains, therefore, to find the distribution of the sum of the squares of k standardised normal variates, subject to the restriction t h a t only ft — 1 of them are independent. I.et P = (Xlt X2, . . . Xk) be a point in a ft-dimensional Euclidean space and put
chi-square
and
its
k subject to the restriction E Xi(Npi)t
uses
= 0.
205
An element of
i= 1
volume in the X-space now corresponds (see 7.11) to the k element of volume between the two hyperspheres 2 Xi 2 = / a k k *=^ and 2 Xi2 = (x + subject to 2 Xi(Npt)i = 0. Using i=l i=l arguments parallel to those used in 7.11, we find t h a t the probability t h a t of the N values of the variate x, m are Xi, (i = 1, 2, . . . ft), is approximately equal to the probability t h a t x = ^ S Xi 2 ^ * lies between /
an
d X + <^X v i z -.
d p = A exp ( - }x 2 )x 4 " 2 rfx •
•
(H-2.7)
where, since the probability of x taking some value between 0 and oo is unity, 1 = A j
exp ( -
ixV-^X
o giving
IjA = 2(*- 3 )' 2 r[(ft - l)/2] .
.
(11.2.8)
This defines a continuous distribution, which although actually t h a t of x is sometimes called the x a ~distribution. Since of the kXt's only ft — 1 are independent, we may say t h a t x has ft — 1 degrees of freedom. Putting, conventionally, v = ft — 1, we have dp = , _ 2 1
X - 1 exp ( - } z a Wx
(H.2.9)
T(v/2)
I t can be shown, however, t h a t if instead of one equation of constraint, there are p such linear equations and the number of degrees of freedom for x is, consequently, k — p, (11.2.9) still holds. When v = 1, dp = (2 Mi exp ( - }X2WX. which is the normal distribution with probability density doubled—due to the fact t h a t x ^ 0, whereas in the normal distribution the variate takes negative and positive values. The x 2 -distribution proper is obtained by writing (11.2.9) in the form dp = — ^ exp ( - i X 2 ) (X2)"'2 " W ) 2 ' r(v/2)
(11.2.10)
206
s t a t i s t i c s
If now we write /
2
= S, 1
exp (— ^ S ) S v / 2 _ 1 ^ S 2V/2 r(v/2)
(12.2.10(a))
Then t h e probability t h a t x 2 will not exceed a given value
P l y ' < Zo2) = 2
— f r( v /2) •'o
=Z
° exp ( -
i S ^ - ' d S
The right-hand side is, in fact, an Incomplete r-function, and the above equation may be written in Karl Pearson's notation P
(X* < Xo2) =
1
( v ^ f ^ T
v/2
) •
<»•«•»>
Tables of this function are given in Tables of the Incomplete T-function, edited by Pearson and published by the Biometrika 03
V= 2 0-2 V=4
01 V=8
t > • 5 „ _ i
10 e ^ r - x V a X x
r(v/a)
F i g . 11.2.
15 2
) ^
1
chi-square
a n d
its
uses
207
Office, University College, London. With them evaluate P(x 2 < Xo2) f o r v < 3 0 For many practical purposes, however, we may formulas which given an approximate value of y 2 with a probability of 0-05. The approximate 0-05 X2 for v ^ 10 is l'55(v + 2),
we can use two exceeded point of
while that for 35 ^ v > 10 is l-25(w + 5). Tables of x 2 for various values of v ^ 30 are readily available. They include those in : (1) Statistical Tables for use in Biological, Agricultural and Medical Research, by Sir R. A. Fisher and F. Yates; (2) Statistical Methods for Research Workers, by Sir R. A. Fisher; and (3) Cambridge Elementary Statistical Tables, by D. V. Lindley and J. C. P. Miller. Our Table 11.2 is reproduced, by permission of the author and publisher, from (2). TABLE 11.2.
\p. V.
\ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 IS
Values of x 2 with Probability P of Being Exceeded in Random Sampling
0-99
0-95
0-08
0-01
0-0002 0-020 0-115 0-30 0-55 0-87 1-24 1-65 2-09 2-56 3-05 3-57 4-11 4-66 5-23
0-004 0-103 0-35 0-71 1-14 1-64 2-17 2-73 3-32 3-94 4-58 5-23 5-89 6-57 7-26
3-81 5-90 7-82 9-49 11-07 12-59 14-07 15-51 16-92 18-31 19-68 21-03 22-36 23-68 25-00
6-64 9-21 11-34 13-28 15-09 16-81 18-48 20-09 21-67 23-21 24-72 26-22 27-69 29-14 30-58
P. V. 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0-99
0-95
0-05
0-01
5-81 6-41 7-02 7-63 8-26 8-90 9-54 10-20 10-86 11-52 12-20 12-88 13-56 14-26 14-95
7-96 8-67 9-39 10-12 10-85 11-59 12-34 13-09 13-85 14-61 15-38 16-15 16-93 17-71 18-49
26-30 27-59 28-87 30-14 31-41 32-67 33-92 35-17 36-42 37-65 38-88 40-11 41-34 42-56 43-77
32-00 33-41 34-80 36-19 37-57 38-93 40-29 41-64 42-98 44-31 45-64 46-96 48-28 49-59 50-89
NOTE : (1) The value of y* obtained from a sample may be significantly small. Fig. 11.2 will make this clear. When the value of P obtained from the table is greater than 0*95, the probability of a smaller value of is less than 5% and this value must, therefore, be regarded as significantly small. The " fit " is too good to be regarded without suspicion ! (see also W a r n i n g in 11.4). {2) When v > 30, Fisher has shown that V 2 * 1 is approximately normally distributed about mean V i v — 1 with unit variance. Thus V2x* — V 2 r — 1 may be considered as a standardised nonnal variate for values of v > 30.
STATISTICS
208
11.3. More Properties of Chi-Squares. The moment-generating function of the -//-distribution is M(t) s £(«*"') = €(e&) 1 f - 2-'2r(v/2) / e x P ( - iS)S»l2~ 1 exp (St)dS
=
Putting M(t) =
2W2r(v/2l /
5 W 2 _ 1 ex
f S ( l — it) — v, dS = (1 - 2Q--/2 j T(v/2)
exp (— v)v"!2- 1dv
M(f) = (1 - 2t)-"l2
Expanding [1 — 2 w e M(t) = 1 +
~ %t)}dS
Y~r~2tdv
=
i.e.,
P
(1 — tzl 20--Is . r(v/2) w r(v/2) ' .
.
.
(11.3.1)
have + v(v + 2) ^ + . . .
Hence = v; fx2' = v(v + 2) and, consequently, (i2 = jz,' - n,'» = 2v. The mean of the ^-distribution for v degrees of freedom is, therefore, v and the variance 2v. The mean-moment-generating function in standardised units is, then, Mi Hence
2 vt—I- -v f 21 + 51 . 4i . , „ ,~1 1 -s- +, • •• • ,higher 5 F powers of v V2v 2LV2v 2 2v T J
=
t2 = ^ + higher powers of v-1 Thus
Mm(t) -> exp (-J/2) as v-> to .
.
(11.3.3)
chi-square
and
its
209
uses
and, comparing with (5.4.2), we see that the distribution tends to normality about mean v with unit variance. The mode of the /^distribution is given by 3g[S»/*-iexp(-iS)] i.e.,
=0
(v/2 - 1)S"/ 2 - 2 exp ( - -JS) - ^ S * / 2 - i e x p ( - }S) = 0
or
S =
2 Z
=v-2.
.
.
.
(11.3.4)
Using Karl Pearson's measure of skewness, (mean-mode) / standard deviation, the skewness of x 2 is given by
Now consider ^>x 2 - v a r i a t e s . Xi2> X22. • • • Xp2 with Vj, v 2 ,. . . vp degrees of freedom respectively. If S< = Xi2 (i = 1. 2, 3 , . . . p), the moment-generating-function of S< with respect to the origin is [1 — 2*]-W». P But the moment generating function of S S, is, by definition i= 1 f ^ e x p ( 2 S
n ^(exp Sit) t-i
since the S's are independent. Consequently, the m.g.f. of I Si = n (1 - 2 t y m * i=l t=1 P
= (1 - 2i)
=» (1 — 2<)-"/2, where v = £ v«
Hence we have the important theorem : If the independent positive variates *<,(«' «= 1 , 2 , . . . p), are each distributed like x 2 with w degrees of freedom, v (1 = 2, . . . p), then 2 Xi is distributed like x 2 with p
v =
i o l
Z n degrees of freedom. »=i I t follows at once that if the sum of two independent positive variates x1 and xt is distributed like x2 with v degrees of
210
statistics
freedom, and i f ' * ! is distributed like x 2 with vx degrees of freedom, x 2 is distributed like x 2 with v2 = v — degrees of freedom. Some consequences of this additive property are : (1) Suppose we conduct a set of n similar experiments to test a hypothesis. Let the values of y2 corresponding to these experiments be x*2 (» = 1, 2, . . . n) for v,(i = 1, 2, . . . m) degrees of freedom. If then we write X2 = S x»'2. this value of x 2 will i n fact be the value obtained from pooling the data of the n experiments and will correspond to v = 2w degrees of freedom. For example— Three experiments designed to test a certain hypothesis yielded 2 = 9-00 for v, = 5; X22 = 13-2 for v2 = 10; Xl X3 2 =
19-1
f
o r v3 =
15.
None of these on its own is significant at the 10% point, as the reader may verify. Their sum x* = 41-3 for v = 30 d.f. is, however, significant at the 10% point. Thus we see t h a t the data of the three experiments, when pooled, give us less reason for con-] fidence in our hypothesis than do those of any one of the experiments taken singly. (2) Next assume t h a t a number of tests of significance (three, say) have yielded probabilities plt p2, ps. We know nothing more about the tests than this, yet—in view of t h e experience of (1)—we require to obtain some over-all probability corresponding to the pooled data. Now a n y probability may be translated into a value of x2 f ° r a n arbitrarily chosen v. B u t when v = 2, In p = — This is very convenient, for if Xi2. \ 2 a are X2 . Xa the values of x 2 for v = 2 corresponding t o Pi> P2> Ps respectively, the pooled value of x 2 is — 2(ln pr + In p 2 + In £«,) for v = 2 + 2 + 2 = 6 d.f. and the; required pooled probability is obtained. For example:] p1 = 0-150 In — 1-897 120 pt = 0-250 In p% = - 1-386 294 p3 = 0-350 In ps= — 1 049 812 - 4-333 226 X 2 8-666 452
chi-square
and
its
211
uses
The 5% point of x 2 for v = 6 is 12-592 „ 10% „ „ v = 6 „ 10-645 We see then t h a t t h e pooled probability is slightly less than 0-10. To find this more accurately, we notice t h a t the pooled value of x 2 exceeds the 10% point by 0-021, while the 5% point exceeds the 10% by 1-947. Now 1 log 10 0-10 = — 1 and log 10 0-05 = g-69897 = - 1-30103. The difference between these is 0-30103. Therefore the pooled value of -/_2 corresponds t o -
1-
X 0-30103 = - 1-0325 = 2-9675.
The antilog of 2-9675 is 0-0944. Interpolating thus, then, we find the required pooled probability t o be 0-094, to three decimal places. 1 1 A Some Examples of the Application of x a : (A) Theoretical Probabilities Given Twelve dice were thrown 26,306 times and a 5 or a 6 was counted as a success. The number of successes in each throw was noted, with the following results (Weldon's data) : Number of successes.
Frequency
Number of successes.
Frequency.
0 1 2 3 4 5
185 1,149 3,265 5,475 6,114 5,194
6 7 8 9 10
3,067 1,331 403 105 18
Total
26,306
Is there evidence that the dice are biased ? Treatment: We set up the hypothesis that the dice are unbiased. This means that the probability of throwing a 5 or a 6, a success, is -J = i and the probability of a failure, f. The theoretical frequency generating function on this hypothesis is then 26,306 a + 1 Either natural logarithms (base e) or common logarithms (base 10) may be used, since In x = In 10 X log10 x.
212
statistics
The estimated frequencies are then found to be (correct to the nearest integer) : Number of successes
0
Observed frequency (o) .
185
1,149 3,265 5,425 6,114 5,194 3,067
Theoretical frequency (
1,217 3,345 5,576 6,273 5,018 2,927
1
2
4
3
6
5
Number of successes
7
8
9
10
Totals
Observed frequency (o) .
1,331
403
105
18
26,306
Theoretical frequency (e)
1,254
392
87
14
26,306
Xs == 2
Then
^ = 38-2.
There are 11 classes and one restriction, namely, the size of the sample total, has been imposed. The number of degrees of freedom is thus 10. The 1% level of for v = 10 is 23-31. The value of x 2 obtained is then highly significant and the hypothesis that the dice are unbiased is, therefore, rejected. (B) Theoretical Probabilities not Given The following table gives the distribution of the length, measured in cm., of 294 eggs of the Common Tern collected in one small coastal area: Length (central values).
Frequency.
3-5 3-6 3-7 3-8 3-9 40 41
1 1 6 20 35 53 69
Length (central values). • 4-2 4-3 4-4 4-5 4-6 4-7
Frequency. 54 34 12 6 1 2
chi-square
and
its
213
uses
Test whether these results are consistent with the hypothesis that egg-length is normally distributed. (L.U.) Treatment: We have first to fit a normal curve. This entails calculating the sample mean and sample variance. The sample mean is an unbiased estimate of the population mean. The sample variance may be taken to be an unbiased estimate of the population variance, since N = 294 is large. We find by the usual methods that x = 4-094; s = 0-184. Estimated frequencies are obtained by the method of 5-6 and we find Length
3-5
3-6
3-7
3-8
3-9
4-0
4-1
1
6
20
35
63
69
2-0
6-7
17-9
37-0
55-3
62-6
4-4
4-5
4-6
4-7
Total
8
Observed frequency (0)
1
Estimated frequency (c)
0-4 s
J
v
9-1 Length
4-2
4-3
9
Observed frequency (0)
54
34
12
6
1
2
294
Estimated frequency (c)
54-1
33-8
16-3
6-1
1-4
0-4
294
7-9
e
— = 2 — — 2 2 o + 2 e; but iV = 2o = 2 e, e .-. x2 = 2 t _ N
N O T E : (1) This form is preferable where the estimated frequencies are not integers and, consequently, o — e is not integral, since it removes the labour of squaring non-integers. (2) Because we derived the distribution of x* on the assumption that Stirling's approximation for n ! held, i.e., that the class frequencies were sufficiently large, we group together into one class the first 3 classes with theoretical frequencies < 10 and into another class the last 3 classes with frequencies < 10. This effectively reduces the number of classes to 9. There is some divergence of opinion as to what constituted a
statistics
214
" low frequency '' in this connection. Fisher (Statistical Methods for Research Workers) has used the criterion < 5; Aitken (Statistical Mathematics) prefers < 10; while Kendall (Advanced Theory of Statistics) favours < 20. The reader would do well to compromise with a somewhat elastic figure around 10. x2—
4.
4
.
i
i
2
4.
ii!
~ 9-1 + 17-9 + 37-0 + 55-3 + 62-6 + 54-1 W W W _ + T 33-8 T 16-3 ^ 7-9 = 296-5 - 294 = 2-5 We must now calculate the corresponding degrees of freedom. There are effectively 9 classes. One restriction results from the fact that the total observed and total estimated frequencies are made to agree. Also, from the sample data, we have estimated both the mean and variance of the theoretical parent population. We have then 3 constraints and, consequently there are 9 — 3 = 6 degrees of freedom. Entering the table at v = 6, we find that the chance of such a value of x2 being obtained at v — 6 lies between P = 0-95 and P = 0-50 at approximately P = 0-82. We conclude,: therefore, that the fit is good but not unnaturally good, and, consequently, there is good reason to believe that egg-length is normally distributed. W A R N I N G : It may happen that the fit is unnaturally good.\ Suppose the value of x2 obtained was such that its probability of occurrence was 0-999. Fisher has pointed out that in this case if the hypothesis were true, such a value would occur only once in a thousand trials. He adds :
" Generally such cases are demonstrably due to the use of inaccurate formulae, but occasionally small values of x2 beyond the expected range do occur. . . . In these cases the hypothesis, considered is as definitely disproved as if P had been 0-001" (Statistical Methods for Research Workers, 11th Edition, p. 81). 11.5. Independence, Homogeneity, Contingency Tables. When we use the //-distribution to test goodness of fit we are testing for agreement between expectation and observation. Tests of independence and homogeneity also come under this general heading. Suppose we have a sample of individuals which we can classify in two, or more, different ways. Very often we want to know whether these classifications are independent. For instance, we m a y wish to determine whether deficiency in a certain vitamin is a factor contributory t o t h e development of a certain disease. We take a sample of individuals and classify them in two ways : into those deficient in the vitamin and those not. If there is no link-up between the disease and the
215 vitamin deficiency (i.e., if the classifications are independent), then we calculate the expected number of individuals in each of the four sub-groups resulting from the classification : those deficient in the vitamin and diseased; those deficient in the vitamin and not diseased; those diseased but not deficient in the vitamin, and those neither diseased nor deficient in the vitamin. We thus obtain four observed frequencies and four expected frequencies. If the divergence between observation and expectation is greater than is probable (to some specified degree of probability) as a result of random sampling fluctuations alone, we shall have t o reject t h e hypothesis and conclude that there is a link-up between vitamin-deficiency and the disease. It is usual to set out the sample data in a contingency table (a table in which the frequencies are grouped according to some non-metrical criterion or criteria). In the present case, where we have two factors of classification, resulting in the division of the sample into two different ways, we have a 2 x 2 contingency table. Now suppose classification I divides t h e sample of N individuals into two classes, A and not-A, and classification 2 divides the sample into two classes, B and not-B. Let the observed frequency in the sub-class " A and B " be a ; t h a t in " not-A and B " be b; t h a t in " not-B a n d A " be c; and t h a t in " not-A and not-B " be d; we may display this in the 2 x 2 contingency table chi-square
and
its
uses
A.
Not-A.
Totals.
B
a
b
a + b
Not-B
c
d
c+ d
a+ c
b+d
TOTALS
a+b+c+d=N
On the assumption that the classifications are independent, we have, on the evidence of the sample data, and working from margin totals, the probability of being an A = (a -f c) /N; a not-A = {b + d)IN: aB = (a + b)IN; „ a not-B = (c + d)/N. Then the probability of being " A and B " will be
^
216
statistics
X
^ ^
an(j
the expected frequency for this sub-class in a
sample of N will be
Likewise, the probability
of being " B and not-A " is ^
and the corre-
sponding expected frequency ^
^^
^ • I n this way we
can set up a table of corresponding expected frequencies : A.
Not-A.
.
(a + c)(a + b) N
(6 + d)(a + b) N
Not-B
(a + c)(c + d) N
(b+d)(c + d) N
B
Providing the frequencies are not too small, we may then use the ^-distribution to test for agreement between observation and expectation, as we did in the goodness-of-flt tests, and this will, in fact, be testing our hypothesis of independence. Xs, in this case, will be given by a x
[" _ (a + c)(a + b) I" _ (b + d)(a + b) -|» L (a + fc + c + L (a + b + c +
(ad — be)2 f 1 N L(a + c)(a + 6)
+
+ (a , , +,
1 (6 + d)(a + b)
L
„ c)(c +, d)
+
1
1
(b + d)(c + d) J
_ (ad - be)* m ~ N " (a + b)(a + c)(e + d)(b + d) x
,, (ad - bcY(a + b + c + d) ~ (a + b)(a + c)(c + d)(b + d)
'
l1101>
chi-square
and
its
217
uses
I t remains to determine the appropriate number of degrees of freedom for this value of -/2. We recall t h a t the expected values are calculated from the marginal totals of t h e sample. Directly then we calculate one value, say t h a t for " A and B ", the others are fixed and may be written in by subtraction from the marginal totals. Thus the observed values can differ from the expected values by only 1 degree of freedom. Consequently for a 2x2 contingency table, the number of degrees of freedom for y2 is one. Worked Example : A certain type of surgical operation can be performed either with a local anessthetic or with a general anessthetic. Results are given below : Alive.
Dead.
511 173
24 21
Local General .
Test for any difference in the mortality rates associated with the different types of anmsthetic. (R.S.S.) Treatment: Our hypothesis is that there is no difference in the mortality rates associated with the two types of anaesthetic. The contingency table is : Alive.
Dead.
Totals.
Local General .
511 173
24 21
535 194
TOTALS
684
45
729
.
Using the marginal totals, the expected values are (correct to the nearest integer) : Alive. Local General TOTAL
Dead.
Totals.
33
535
182
12
194
684
45
729
218
statistics 2
Accordingly, x — 9-85 for v = 1 d.f. The probability of this value of x2 is 0-0017 approximately The value of x2 obtained is, therefore, highly significant, and we conclude that there is a difference between the mortality rates associated with the two types of anaesthetics. Suppose we were to regard the frequencies in the general 2 x 2 contingency table to represent two samples distributed according t o some factor of classification (being A or not-A) into 2 classes, thus :
Sample I Sample I I TOTALS .
A.
Not-A.
Totals.
a c
b d
a + b c+d
a +c
b+d
a + b+ c+
d(=N)
We now ask : " On the evidence provided by the data, can these samples be regarded as drawn from the same population ? " Assuming t h a t they are from the same population, the probability t h a t an individual falls into the class A, say, will be the same for both samples. Again basing our estimates on the marginal totals, this probability will be (a -f c) JN and the estimated or expected frequency on this hypothesis of the A's in the first sample will be (a + c) X (a + b)/N. In this way we calculate the expected frequencies in both classes for the two samples, and, if the divergence between expectation and observation is greater to some specified degree of probability than the hypothesis of homogeneity demands, it will be revealed by a test which is mathematically identical with t h a t for independence. 11.6. Homegeneity Test for 2 x k Table. The individuals in two very large populations can be classed into one or other of k categories. A random sample (small compared to the size of the population) is drawn from each population, and the following frequencies are observed in the categories : Category. Sample 1 Sample 2
1 «„ nal
2
. . • . t . .
nn . «s2 .
• «i< •
. k
Total.
•
n
•
««*
n W.
Devise a suitable form of the -/^-criterion for testing the hypothesis that the probability that an individual falls into the tih
chi-square
and
its
219
uses
category (t = 1, 2, . . . ft) is the same in the two populations. Derive the appropriate number of degrees of freedom for X2- (L.U.) This is a homogeneity problem, for the question could well be reformulated to ask whether the two samples could have come from t h e same population. On the assumption t h a t the probability that an individual falls into the tth class is the same for the two populations, we use the marginal totals, nlt + nn, (t = 1, 2, . . . k), to estimate these probabilities. Thus our estimate of the probability of an individual falling into t h e tth class, on this assumption, is («« + «2/)/(#i + M*)- The expected frequency, on this assumption, in the fth class of the 1st Sample is therefore Ni{nlt + nu)/(A'j -f N2) and t h a t for the same class of the 2nd Sample is Nt(nlt + nn)j(N1 + jV2). Consequently,
v
N1 + N, ' ,2 1 Mfi(Ni
_ N.n^ + N,)(nu + nu)
" '
a
'J /
Nx + Nt
J
[N 1w2i - N2nlt]* \ N2(N1 + N2)(nlt + n2l)J
+
or . _
|
X t= 1
N1N2(nlt
- M^n)* +n^)
=
N
1
N
| 2 (=1
~ nJN,)* (nu + na) (11.6.1)
How many degrees of freedom must be associated with this value of x 2 ? To construct the table of expected frequencies, we must know the grand total, one of the sample totals and k — 1 of the class totals. There are, therefore, 1 -f 1 + k — 1 = k + 1 equations of constraint, and, since there are 2k theoretical frequencies t o be calculated, v = 2ft — (k + 1) = k — 1 degrees of freedom. 11.7. h X k Table. In the case of a h x k table, we follow exactly the same principles, whether we are testing for independence or homogeneity. To calculate the appropriate number of degrees of freedom, we note t h a t there are h x k theoretical frequencies t o be calculated. Given the grand total, we require h — 1 and ft — 1 of t h e marginal totals to be given also. There are thus ft + ft — 1 equations of constraint, and consequently v = ftft — (h + ft — 1) = (ft — l)(ft — 1) degrees of freedom.
220
statistics
Worked Example : A Ministry of Labour Memorandum on Carbon Monoxide Poisoning (1945) gives the following data on accidents due to gassing by carbon monoxide : 1941.
1942.
1943.
Totals.
24 28 26
20 34 26
19 41 10
63 103 62
80 68
108 51
123 32
311 151
226
239
225
690
At At At In
blast furnaces gas producers gas ovens and works . distribution and use of gas . Miscellaneous sources
Is there significant association between the site of the accident and the year ? Treatment : On the assumption that there is no association between the origin of an accident and the year, the probability of an accident in any given class will be constant for that class. The probability of an accident at a blast furnace is estimated from the data to be (24 + 20 + 19) /(690) = 63/690. Hence the expected frequency of accidents for this source in a yearly total of 226 will be 63 X 226/690 = 20-64. Proceeding in this way, we set up the following table : 1941.
Blast furnaces Gas producers . Gas-works and coke ovens Gas use and distribution Miscellaneous
1942.
o.
e.
0.
24 28
20-64 33-74
20 34
21-82 35-68
1943. O.
E.
19 41
20-54 33-59
26
20-31
26
21-48
10
20-22
80 68
101-86 49-46
108 51
107-72 52-30
123 32
101-41 49-27
We find that x s — 34-22 for v = (5 1)(3 — 1) = 8 d.f. and the table shows that this is a highly significant value. We, therefore, reject our hypothesis of no association between source of accident and year, i.e., the probability of an accident at a given source is not constant through the years considered. 11.8. Correction for Continuity. The x 2 -distribution is derived from the multinomial distribution on the assumption
chi-square
and
its
uses
221
t h a t the expected frequencies in the cells are sufficiently large to justify the use of Stirling's approximation t o n ! When, in some cells or classes, these frequencies have fallen below 10, we have adjusted m a t t e r s b y pooling t h e classes with such low frequencies. If c such cells are pooled, t h e number of degrees of freedom for x 2 is reduced b y c — 1. If, however, we have a 2 x 2 table with low expected frequencies, no pooling is possible since v = 1 for a 2 x 2 table. We have, therefore, t o tackle t h e problem f r o m some other angle. I n fact, we m a y either modify t h e table a n d then apply t h e y_2-test or we m a y abandon approximate methods and calculate f r o m first principles the exact probability of any given set of frequencies in t h e cells for t h e given marginal totals. I n t h e present section we shall consider t h e method by which we " correct " t h e observed frequencies t o compensate somewhat for t h e fact t h a t , whereas the distribution of observed frequencies is necessarily discrete, t h a t of t h e //-distribution is essentially continuous. I n t h e course of t r e a t m e n t of t h e example in t h e next section we shall develop and illustrate t h e " exact " method. Suppose we toss an unbiased coin t e n times. The expected number of heads is J x 10 = 5 and t h e probability of obtaining just r heads is t h e coefficient of f in t h e expansion of + i)10. W e have : The probability of 10 heads, £(10H) = (J) 10 = 0-00099. This is also the probability of 0 heads (or 10 tails). Therefore t h e probability of either 10H or OH is 2 x 0-00099 = 0-00198. Using the x 2 -distribution, t h e value of x 2 for 10 heads or 0 heads is _ (10 - 6)« (0 - 5)' * 5 5 and this value is attained or exceeded for v = 1 with a probability of 0-00157. Half this, 0-000785, gives t h e x 2 -estimate of the probability of just 10 heads. The probability of 9 heads is 10 x 0-00099 = 0-0099 and hence t h e probability of 9 or more heads in 10 tosses is 0-00099 + 0-0099 = 0-01089, while the probability of 9 or more heads and 1 or less tails is 2 x 0-01089 = 0-02178. The corresponding value of x 2 — ^
g ^
+ ^ ~ ^
= 6-4
and, for v = 1, t h e probability of this value being attained or exceeded is 0-1141. Half this value, 0-05705, gives us t h e X ! -estimate of t h e probability of obtaining 9 or more heads. W e can see t h a t t h e x 2 -estimates are already beginning t o diverge quite considerably from t h e " e x a c t " values. The
222
statistics
problem is : can we improve matters by finding out why this should be ? We recall t h a t when v = 1, the //-distribution reduces t o the positive half of a normal distribution. The area of the tail of this distribution to the right of the ordinate corresponding to a given deviation, r, of observed from expected frequency, gives therefore a normal-distribution approximation to the probability of a deviation attaining or exceeding this given deviation, irrespective of sign. However, in the case we are considering, the symmetrical binomial histogram is composed of frequency cells based on unit class intervals, the central values of the intervals being the various values of r ; the sum of the areas
FIG.
11.8.
of the cells corresponding to the values r, r + 1, r + 2, etc., gives the exact probability of a deviation ^ + r. When, however, t h e frequencies in the tail are small, we are taking, for the continuous curve, the area t o t h e right of r, but for the histogram t h e area t o the right of r — -J (see Fig. 11.8). Clearly a closer approximation would be obtained if we calculated X2 for values not of r, the deviation of observed from expected frequency, but for values of ) r — ^ i.e., if we " correct " the observed frequencies by making them % nearer expectation, we shad 2 obtain a " better " value of x This is Yates' correction for continuity for small expected frequencies. I t s justification is based on the assumption t h a t the theoretical frequency distribution is a symetrical binomial distribution (p = q = £). If this is not so, the theoretical
chi-square
and
its
223
uses
distribution is skew, and no simple adjustment has been discovered as yet t o offset this. However, if p is near the correction should still be made when the expected frequencies 2 are small, for the resulting value of y yields a probability definitely closer to the " exact " value than t h a t we obtain when the correction is not made. This is brought out in the following example. 11.9. Worked Example. In experiments on the immunisation of cattle from tuberculosis, the following results were obtained : Died of T.B. or very seriously affected.
Unaffected or slightly affected.
Totals.
Inoculated with vaccine .
6
13
19
Not inoculated or inoculated with control media
8
3
11
14
16
30
Totals
Show that for this table, on the hypothesis that inoculation and susceptibility to tuberculosis are independent, x2 = 4-75, P = 0-029; with a correction for continuity, the corresponding probability is 0-072; and that by the exact method, P = 0-071. (Data from Report on the Sphalinger Experiments in Northern Ireland, 1931-1934. H.M.S.O., 1934, quoted in Kendall, Advanced Theory of Statistics, I). Treatment: (1) On the hypothesis of independence—i.e., that the probability of death is independent of inoculation—the probability of death is i j . Therefore the expected frequencies are : 14 x 19 = 8-87 10-13 30 5-13 5-87 Each observed frequency deviates from the expected frequency by ± 2-87. Hence X2
(2-87)2
corresponding
+ 5^7 + 8^7 + ioojtJ
=== 8-237 X 0-577 = 4-75 and for v = 1 the probability of x" attaining or exceeding this value is 0-029. This figure, it must be emphasised, is the probability of a proportion of deaths to unaffected cases of 6 : 13 or lower in a
sample of 19 inoculated animals and of a proportion of 8 : 3 or higher in a sample of 11 animals not inoculated, on the hypothesis of independence, i.e., on the assumption that the expected proportion for either sample is 14 : 16.

(2) The observed frequencies with the continuity correction applied are:

    6.5    12.5
    7.5     3.5

and, consequently, χ² = (2.37)² × 0.577 = 3.24, yielding, for ν = 1, P = 0.072.

(3) We must now discuss the method of finding the exact probability of any particular array of cell frequencies for a 2 × 2 table. Consider the table

    a          b          (a + b)
    c          d          (c + d)
    (a + c)    (b + d)    (a + b + c + d) = N, say.

First, we consider the number of ways in which such a table can be set up with the marginal totals given from a sample of N. From N items we can select a + c items in N!/[(a + c)!(b + d)!] ways, the b + d items remaining, while from N items we may select a + b items in N!/[(a + b)!(c + d)!] ways, with c + d items remaining. Therefore, the total number of ways of setting up such a table with the marginal totals as above is

    (N!)² / [(a + c)!(b + d)!(a + b)!(c + d)!] = n₁, say.

Secondly, we ask in how many ways we can complete the 4 cells in the body of the table with N items. Clearly this is the number of ways in which we can divide the N items into groups of a items of one kind, b items of a second kind, c items of a third kind and d items of a fourth kind, where N = a + b + c + d. But (2.10) we know this to be

    N! / (a! b! c! d!) = n₂, say.

Consequently the probability of any particular arrangement, P(a, b, c, d), will be given by

    P(a, b, c, d) = n₂/n₁ = (a + b)!(c + d)!(a + c)!(b + d)! / (N! a! b! c! d!)   . . . (11.9.1)
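Formula (11.9.1) is easily mechanised. As an illustrative aside (not part of the original text), the following Python sketch evaluates it through log-factorials, the computer analogue of the printed table of log factorials used in the treatment below; math.lgamma supplies log Γ(n + 1) = log n!.

    from math import lgamma, exp

    def log_fact(n):
        # log n! computed as log Gamma(n + 1)
        return lgamma(n + 1)

    def p_array(a, b, c, d):
        """Exact probability (11.9.1) of a 2 x 2 array with the given margins."""
        n = a + b + c + d
        log_p = (log_fact(a + b) + log_fact(c + d)
                 + log_fact(a + c) + log_fact(b + d)
                 - log_fact(n)
                 - log_fact(a) - log_fact(b) - log_fact(c) - log_fact(d))
        return exp(log_p)

    print(p_array(3, 16, 11, 0))   # ~0.00000666, as found below

Working in logarithms avoids overflow in quantities such as 30!, just as the printed tables did.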
How shall we use this result to solve our present problem? We are interested here, we emphasise, in the probability of obtaining a proportion of deaths to unaffected cases of 6 : 13 or lower in a
sample of 19 inoculated animals and of obtaining a proportion of deaths to unaffected cases of 8 : 3 or higher in a sample of 11 animals not inoculated. In other words, we are interested in the probability of each of the following arrays:

    | 6 13 |    | 5 14 |    | 4 15 |    | 3 16 |
    | 8  3 |    | 9  2 |    | 10 1 |    | 11 0 |

But it will be seen immediately that the probability of obtaining a 6 : 13 ratio among inoculated animals is also precisely that of obtaining a ratio of 8 : 3 among animals not inoculated. Hence the required probability will be twice the sum of the probabilities of these 4 arrays.

The probability of the array (3, 16, 11, 0) is, by (11.9.1),

    P(3, 16, 11, 0) = 19! 16! 14! 11! / (30! 16! 11! 3! 0!) = 19! 14! / (30! 3!)

We may evaluate this by means of a table of log factorials (e.g., Chambers' Shorter Six-Figure Mathematical Tables). We have

    log 19! = 17.085095        log 30! = 32.423660
    log 14! = 10.940408        log 3!  =  0.778151
              28.025503                  33.201811

    28.025503 − 33.201811 = −5.176308 (in bar notation, 6̄.823692)

and the antilog of this is P(3, 16, 11, 0) = 0.00000666. Since the marginal totals are fixed, each successive array probability follows from the last by a simple ratio of factorials:

    P(4, 15, 10, 1) = (16 × 11)/(4 × 1) × P(3, 16, 11, 0) = 44 × 0.00000666 = 0.00029304

Similarly,

    P(5, 14, 9, 2) = (15 × 10)/(5 × 2) × P(4, 15, 10, 1) = 15 × 0.00029304 = 0.00439560

Finally,

    P(6, 13, 8, 3) = (14 × 9)/(6 × 3) × P(5, 14, 9, 2) = 7 × 0.00439560 = 0.03076920

The sum of these four probabilities is 0.03546450, and so the required probability is 2 × 0.03546450 = 0.07092900.
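The three answers of this example, χ² without correction (P ≈ 0.029), χ² with Yates' correction (P ≈ 0.072) and the exact method (P ≈ 0.071), can be reproduced in a few lines. The sketch below is purely illustrative: it computes χ² directly from the 2 × 2 table and converts it to a probability using the fact that, for ν = 1, χ² is the square of a standard normal deviate, so that P(χ² ≥ c) = erfc(√(c/2)).

    from math import erfc, sqrt

    table = [[6, 13],   # inoculated: died / unaffected
             [8, 3]]    # not inoculated: died / unaffected

    def chi2_2x2(table, yates=False):
        """Chi-square for a 2 x 2 table, with optional continuity correction."""
        (a, b), (c, d) = table
        n = a + b + c + d
        rows, cols = (a + b, c + d), (a + c, b + d)
        chi2 = 0.0
        for i in range(2):
            for j in range(2):
                expected = rows[i] * cols[j] / n
                dev = abs(table[i][j] - expected)
                if yates:
                    dev -= 0.5   # each observed frequency moved 1/2 nearer expectation
                chi2 += dev * dev / expected
        return chi2

    def p_nu1(chi2):
        # For nu = 1, P(chi-square >= c) = erfc(sqrt(c / 2))
        return erfc(sqrt(chi2 / 2))

    print(p_nu1(chi2_2x2(table)))              # ~0.029, uncorrected
    print(p_nu1(chi2_2x2(table, yates=True)))  # ~0.072, against the exact 0.071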
11.10. χ²-determination of the Confidence Limits of the Variance of a Normal Population. We conclude with an example of the way the χ²-distribution may be used to give exact results when the observed data are not frequencies.

Let us draw a small sample of N (< 30) from a normal population. If NS² = Σᵢ₌₁ᴺ (xᵢ − x̄)² and σ² is the population
variance, NS²/σ² = Σᵢ₌₁ᴺ (xᵢ − x̄)²/σ² and is thus distributed like χ² with N − 1 degrees of freedom (the one constraint
being Σᵢ₌₁ᴺ xᵢ = Nx̄). Our problem is this: given N and S², to find the 95% confidence limits for σ².

Since NS²/σ² is distributed like χ², the value of NS²/σ² that will be exceeded with a probability of 0.05 will be the 0.05 point of the χ²-distribution for ν = N − 1; call it χ²(0.05). Then the lower 95% confidence limit required, on the basis of the sample information, will be NS²/χ²(0.05). Likewise the upper 95% confidence limit will be NS²/χ²(0.95), where χ²(0.95) is the 0.95 point of the χ²-distribution for ν = N − 1.

Worked Example: A sample of 8 from a normal population yields an unbiased estimate of the population variance of 4.4. Find the 95% confidence limits for σ.

Treatment: We have 4.4 = 8S²/(8 − 1), or 8S² = 30.8. The 0.95 and 0.05 points of the χ²-distribution for ν = 7 are 2.17 and 14.07 respectively. Therefore the lower and upper 95% confidence limits for σ² are, respectively, 30.8/14.07 = 2.19 and 30.8/2.17 = 14.19. The corresponding limits for σ are (2.19)^½ = 1.48 and (14.19)^½ = 3.77.
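Where a table of percentage points is not to hand, the same limits may be read off numerically. A minimal sketch, assuming SciPy is available; chi2.ppf gives percentage points from the lower tail, so the "0.05 point" of the text (the value exceeded with probability 0.05) is ppf(0.95).

    from scipy.stats import chi2

    N, var_hat = 8, 4.4            # sample size and unbiased variance estimate
    NS2 = var_hat * (N - 1)        # N*S^2 = 30.8
    nu = N - 1

    upper_point = chi2.ppf(0.95, nu)   # the text's 0.05 point: ~14.07
    lower_point = chi2.ppf(0.05, nu)   # the text's 0.95 point: ~2.17

    lo, hi = NS2 / upper_point, NS2 / lower_point
    print(lo, hi)                  # ~2.19 and ~14.2, limits for the variance
    print(lo ** 0.5, hi ** 0.5)    # ~1.48 and ~3.77, limits for sigma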

EXERCISES ON CHAPTER ELEVEN

1. The Registrars-General give the following estimates of children under five at mid-1947:

                England and Wales.    Scotland.      Total.
    Males           1,813,000          228,000     2,041,000
    Females         1,723,000          221,000     1,944,000
    TOTAL           3,536,000          449,000     3,985,000
On the assumption that there is no difference between the proportion of males to females in the two regions, calculate the probability that a child under five will be a girl. Hence find the expected number of girls under five in Scotland and say whether the proportion is significantly high. (L.U.)

2. The following data give N, the number of days on which rainfall exceeded R cm at a certain station over a period of a year:

    R  .  0.00    0.04    0.10    0.20    0.50    1.00
    N  .  296     246     187     119      30       3
Test by means of χ² whether the data are consistent with the law log₁₀ N = 2.47 − 1.98R. Is the "fit" too good? (R.S.S.)

3. The following information was obtained in a sample of 50 small general shops:

                       Shops in
                Urban Districts.    Rural Districts.    Total.
    Owned by men       17                 18              35
      „   „  women      3                 12              15
    TOTAL              20                 30              50
Can it be said that there are relatively more women owners of small general shops in rural than in urban districts? (L.U.)

4. A certain hypothesis is tested by three similar experiments. These gave χ² = 11.9 for ν = 6, χ² = 14.2 for ν = 8 and χ² = 18.3 for ν = 11. Show that the three experiments together provide more justification for rejecting the hypothesis than any one experiment alone.

5. Apply the χ² test of goodness of fit to the two theoretical distributions obtained in 4.7., p. 72.

Solutions

1. 219,000. Yes.
2. Far too good: χ² < 0.02 for ν = 5.
3. No.
APPENDIX : CONTINUOUS BIVARIATE DISTRIBUTIONS
Suppose that we have a sample of N value-pairs (xᵢ, yᵢ) from some continuous bivariate parent population. Across the scatter diagram draw the lines

    x = x + ½Δx,  x = x − ½Δx,  y = y + ½Δy  and  y = y − ½Δy

(Fig. A.1).

FIG. A.1.

Consider the rectangle ABCD of area ΔxΔy about
the point (x, y). Within this rectangle will fall all those points representing value-pairs (xᵢ, yⱼ) for which

    x − ½Δx < xᵢ < x + ½Δx  and  y − ½Δy < yⱼ < y + ½Δy

Let the number of these points be ΔN. The proportion of points inside the rectangle to the total number N in the diagram is then ΔN/N = Δp, say. Δp, the relative frequency of the value-pairs falling within the rectangle ABCD, will clearly
be ≤ 1. The average, or mean, relative frequency per unit area within this rectangle is Δp/ΔA, where ΔA = Δx·Δy. If we now increase N, the sample size, indefinitely and, simultaneously, reduce Δx and Δy, we may write

    Limit Δp/ΔA = dp/dA   (Δx → 0, Δy → 0)

which is now the relative-frequency density at (x, y) of the continuous parent population. In this parent population, however, the values of the variates are distributed according to some law, which may be expressed by saying that the relative-frequency density at (x, y) is a certain function of x and y,

    dp/dA = φ(x, y), i.e., dp = φ(x, y) dx dy   . . . (A.1)
Here dp is the relative frequency with which the variate x assumes a value between x ± ½dx while, simultaneously, the variate y assumes a value between y ± ½dy. But, since the relative frequency of an event E converges stochastically, as the number of occurrences of its context-event tends to infinity, to the probability of E's occurrence in a single occurrence of its context-event, we may say: dp is the probability that the variate x will assume a value between x ± ½dx while, simultaneously, the variate y assumes a value between y ± ½dy. Then φ(x, y) is the joint probability function of x, y, or the joint probability density of x, y.

Now let the range of possible values of x be a ≤ x ≤ b and that of y, c ≤ y ≤ d; then, since both x and y must each assume some value,

    ∫_a^b ∫_c^d φ(x, y) dx dy = 1

or, if we define φ(x, y) = 0 for values outside a ≤ x ≤ b, c ≤ y ≤ d,

    ∫∫ φ(x, y) dx dy = 1   . . . (A.2)

where here, and in what follows, unmarked integrals are taken from −∞ to +∞.
It follows that the probability that X₁ ≤ x ≤ X₂ and Y₁ ≤ y ≤ Y₂ is

    P(X₁ ≤ x ≤ X₂, Y₁ ≤ y ≤ Y₂) = ∫_{X₁}^{X₂} ∫_{Y₁}^{Y₂} φ(x, y) dy dx   . . . (A.3)
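For a concrete density, (A.2) and (A.3) are easily checked numerically. A sketch assuming SciPy; the density φ(x, y) = x + y on the unit square (zero elsewhere) is an invented example, not one from the text.

    from scipy.integrate import dblquad

    # Invented density for illustration: phi(x, y) = x + y on 0 <= x, y <= 1.
    phi = lambda y, x: x + y       # dblquad integrates over y first, then x

    total, _ = dblquad(phi, 0, 1, 0, 1)
    print(total)                   # 1.0: the density satisfies (A.2)

    # P(0 <= x <= 1/2, 0 <= y <= 1/2), as in (A.3):
    p, _ = dblquad(phi, 0, 0.5, 0, 0.5)
    print(p)                       # 0.125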
If x and y are statistically independent, i.e., if the probability, dp₁ = φ₁(x)dx, of x taking a value between x ± ½dx is independent of the value taken by y, and if the probability, dp₂ = φ₂(y)dy, of y taking a value between y ± ½dy is independent of the value taken by x, by the law of multiplication of probabilities we have

    φ(x, y) = φ₁(x) · φ₂(y)   . . . (A.4)

and all double integrals resolve into the product of two single integrals. For instance, (A.3) becomes

    P(X₁ ≤ x ≤ X₂, Y₁ ≤ y ≤ Y₂) = ∫_{X₁}^{X₂} φ₁(x) dx · ∫_{Y₁}^{Y₂} φ₂(y) dy
Clearly variates for which (A.4) holds are uncorrelated.

In Fig. A.2 let the rectangle ABCD in the xOy plane be formed by the lines x = x ± ½dx, y = y ± ½dy. At every point P, (x, y), in this plane for which φ(x, y) is defined, erect a perpendicular, PQ, of length z = φ(x, y). Then as x and y vary over the xOy plane, Q generates a surface z = φ(x, y), the probability surface of the distribution.

For the sample, the contribution of a class rectangle of area ΔA about (x, y) to the sample moment of order r, s about x = 0, y = 0 is xʳyˢ·ΔN/N = xʳyˢ·Δp. For the corresponding class rectangle of the continuous parent population we have, accordingly, xʳyˢ dp = xʳyˢ φ(x, y)dA, and so the moment of order r, s of the population about x = 0, y = 0 is

    μ′ᵣₛ = ∫∫ xʳyˢ φ(x, y) dx dy   . . . (A.5)

In particular:

    μ′₁₀ = x̄ = ∫∫ x φ(x, y) dx dy;   μ′₀₁ = ȳ = ∫∫ y φ(x, y) dx dy;
    μ′₂₀ = ∫∫ x² φ(x, y) dx dy;      μ′₀₂ = ∫∫ y² φ(x, y) dx dy   . . . (A.6)
The corresponding moments about the mean (x̄, ȳ) of the distribution are:

    μ₂₀ = σx² = ∫∫ (x − x̄)² φ(x, y) dx dy;
    μ₀₂ = σy² = ∫∫ (y − ȳ)² φ(x, y) dx dy;
    μ₁₁ = cov (x, y) = σxy = ∫∫ (x − x̄)(y − ȳ) φ(x, y) dx dy   . . . (A.7)

Also, since

    ∫∫ (x − x̄)² φ(x, y) dx dy = ∫∫ (x² − 2xx̄ + x̄²) φ(x, y) dx dy = μ′₂₀ − x̄²,

we have

    σx² = μ′₂₀ − μ′₁₀², and likewise σy² = μ′₀₂ − μ′₀₁²   . . . (A.8)
Finally,

    σxy = ∫∫ (xy − x̄y − xȳ + x̄ȳ) φ(x, y) dx dy
        = ∫∫ xy φ(x, y) dx dy − x̄ ∫∫ y φ(x, y) dx dy − ȳ ∫∫ x φ(x, y) dx dy + x̄ȳ
        = μ′₁₁ − x̄ȳ,

i.e.,

    σxy = μ′₁₁ − μ′₁₀μ′₀₁   . . . (A.9)
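Relations (A.8) and (A.9) can be verified symbolically for the invented density φ(x, y) = x + y on the unit square used above; a sketch assuming SymPy.

    from sympy import symbols, integrate

    x, y = symbols('x y')
    phi = x + y                    # the invented density on the unit square

    def raw_moment(r, s):
        # mu'_rs: double integral of x^r y^s phi(x, y) over the unit square
        return integrate(integrate(x**r * y**s * phi, (y, 0, 1)), (x, 0, 1))

    m10, m01 = raw_moment(1, 0), raw_moment(0, 1)   # both 7/12
    var_x = raw_moment(2, 0) - m10**2               # (A.8): 11/144
    cov_xy = raw_moment(1, 1) - m10 * m01           # (A.9): -1/144
    print(m10, m01, var_x, cov_xy)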
The moment-generating function, M(t₁, t₂), of a bivariate continuous distribution is defined to be

    M(t₁, t₂) = ℰ(e^{xt₁ + yt₂}) = ∫∫ exp (xt₁ + yt₂) φ(x, y) dx dy   . . . (A.10)
As the reader should verify, the moment of order r, s about x = 0, y = 0 is given by the coefficient of t₁ʳt₂ˢ/r!s! in the expansion of M(t₁, t₂).

Regression and Correlation. Since the probability, dp, that x lies between x ± ½dx when y lies between y ± ½dy is dp = φ(x, y)dxdy, the probability that x lies between x ± ½dx when y takes any value in its range is
    dx ∫ φ(x, y) dy = φ₁(x)dx, say.

Likewise,

    φ₂(y)dy = dy ∫ φ(x, y) dx

is the probability that y lies between y ± ½dy for any value of x, i.e., the relative frequency of the x's in the y-array. The mean of the y's in the x-array, ȳ_x, is, then, given by

    ȳ_x = ∫ y φ(x, y) dy / φ₁(x)   . . . (A.11)
Likewise,

    x̄_y = ∫ x φ(x, y) dx / φ₂(y)
Exercise: Show that the variance of the y's in the x-array, σ²_yx, is given by

    σ²_yx = ∫ (y − ȳ_x)² φ(x, y) dy / φ₁(x)
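With the same invented density, (A.11) yields the curve of means in closed form (again a SymPy sketch). Note that the resulting curve is not linear, which is why the linear case below needs a separate assumption.

    from sympy import symbols, integrate, simplify

    x, y = symbols('x y')
    phi = x + y                                     # invented density, as before

    phi1 = integrate(phi, (y, 0, 1))                # marginal of x: x + 1/2
    ybar_x = integrate(y * phi, (y, 0, 1)) / phi1   # (A.11)
    print(simplify(ybar_x))                         # equals (3x + 2)/(6x + 3): not linear in x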
Now the right-hand side of (A.11) is a function of x and, therefore,
the equation of the curve of means of the x-arrays, i.e., the curve of regression of y on x, is

    ȳ_x = ∫ y φ(x, y) dy / φ₁(x)   . . . (A.12)

while the curve of regression of x on y is

    x̄_y = ∫ x φ(x, y) dx / φ₂(y)   . . . (A.13)

If the regression of y on x is linear, we must have

    ȳ_x = ∫ y φ(x, y) dy / φ₁(x) = Ax + B   . . . (A.14)
Multiply this equation by φ₁(x) and integrate over the whole x-range; we have

    ∫∫ y φ(x, y) dx dy = A ∫ x φ₁(x) dx + B ∫ φ₁(x) dx

i.e.,

    μ′₀₁ = Aμ′₁₀ + B   . . . (A.15)

Now multiply (A.14) by xφ₁(x) and integrate, obtaining

    ∫∫ xy φ(x, y) dx dy = A ∫∫ x² φ(x, y) dx dy + B ∫∫ x φ(x, y) dx dy

i.e.,

    μ′₁₁ = Aμ′₂₀ + Bμ′₁₀   . . . (A.16)

Solving (A.15) and (A.16) for A and B, we have

    A = (μ′₁₁ − μ′₁₀μ′₀₁)/(μ′₂₀ − μ′₁₀²) = μ₁₁/μ₂₀ = σxy/σx²
    B = (μ′₀₁μ′₂₀ − μ′₁₀μ′₁₁)/(μ′₂₀ − μ′₁₀²) = ȳ − x̄σxy/σx²

Consequently, (A.14) becomes

    (y − ȳ)/σy = (σxy/σxσy) · (x − x̄)/σx   . . . (A.17)
and the correlation coefficient between x and y, ρ, is

    ρ = μ₁₁/σxσy = σxy/σxσy   . . . (A.18)
It should be noted, however, that if we define ρ by (A.18), this does not necessitate that the regression be linear.

Finally, consider the standard error of estimate of y from (A.17): since σxy/σx² = (σxy/σxσy) · σy/σx = ρσy/σx,

    Sy² = ∫∫ [y − ȳ − ρ(σy/σx)(x − x̄)]² φ(x, y) dx dy
        = ∫∫ (y − ȳ)² φ dx dy − 2ρ(σy/σx) ∫∫ (y − ȳ)(x − x̄) φ dx dy + ρ²(σy²/σx²) ∫∫ (x − x̄)² φ dx dy
        = σy² − 2ρ(σy/σx)σxy + ρ²σy² = σy² − 2ρ²σy² + ρ²σy²,

i.e.,

    Sy² = σy²(1 − ρ²)   . . . (A.19)
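A quick empirical check of (A.19), as an illustrative sketch assuming NumPy, with arbitrarily chosen parameter values:

    import numpy as np

    rng = np.random.default_rng(0)
    sx, sy, rho = 2.0, 3.0, 0.6          # arbitrary illustrative parameters
    cov = [[sx**2, rho * sx * sy],
           [rho * sx * sy, sy**2]]
    x, y = rng.multivariate_normal([0, 0], cov, size=200_000).T

    # Residuals about the regression line y = rho*(sy/sx)*x of (A.17):
    resid = y - rho * (sy / sx) * x
    print(resid.var())                   # close to sy^2 * (1 - rho^2) = 5.76
    print(sy**2 * (1 - rho**2))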
The Bivariate Normal Distribution. Consider the bivariate distribution whose probability density is

    φ(x, y) = C exp (− Z)   . . . (A.20)

where

    C = 1/[2πσxσy(1 − ρ²)^½]

and

    Z = {x²/σx² − 2ρxy/σxσy + y²/σy²}/2(1 − ρ²)

The distribution so defined is called the Bivariate Normal distribution. For the moment we shall regard σx, σy and ρ as undefined constants. We begin by finding what are usually called the marginal distributions, viz., the probability function of x for any value or all values of y, and the probability function of y for any or all x. The probability that x lies between x ± ½dx, whatever the value assumed by y, is given by

    φ₁(x)dx = dx ∫ φ(x, y) dy
i.e.,

    φ₁(x) = C ∫ exp {− [(ρ²x²/σx² − 2ρxy/σxσy + y²/σy²) + (1 − ρ²)x²/σx²] / 2(1 − ρ²)} dy
          = C exp (− x²/2σx²) ∫ exp {− (y/σy − ρx/σx)² / 2(1 − ρ²)} dy

Put

    V = y/σy − ρx/σx

Then dy = σy dV and, as y → ±∞, V → ±∞. Hence,

    φ₁(x) = Cσy exp (− x²/2σx²) ∫ exp [− V²/2(1 − ρ²)] dV

But (see footnote to 5.4 (e), page 82)

    ∫ exp [− V²/2(1 − ρ²)] dV = [2π(1 − ρ²)]^½
Therefore

    φ₁(x) = (1/σx√2π) exp (− x²/2σx²)   . . . (A.21)

and, likewise,

    φ₂(y) = (1/σy√2π) exp (− y²/2σy²)

Thus, marginally, both x and y are normally distributed with zero mean and variances σx² and σy² respectively. Moreover, if we put ρ = 0 in (A.20),

    φ(x, y) = (1/σx√2π) exp (− x²/2σx²) · (1/σy√2π) exp (− y²/2σy²) = φ₁(x) · φ₂(y),

so that, when ρ = 0, x and y are statistically independent.
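The marginal result (A.21) can be confirmed numerically by integrating the bivariate normal density over y; a sketch assuming SciPy, with arbitrary illustrative constants.

    from math import exp, pi, sqrt
    from scipy.integrate import quad

    sx, sy, rho = 1.5, 0.8, 0.4          # arbitrary illustrative constants

    def phi(x, y):
        # The bivariate normal density (A.20)
        C = 1 / (2 * pi * sx * sy * sqrt(1 - rho**2))
        Z = (x**2 / sx**2 - 2 * rho * x * y / (sx * sy) + y**2 / sy**2) / (2 * (1 - rho**2))
        return C * exp(-Z)

    x0 = 0.7
    marginal, _ = quad(lambda y: phi(x0, y), -20, 20)
    print(marginal)                                         # integral of phi over y
    print(exp(-x0**2 / (2 * sx**2)) / (sx * sqrt(2 * pi)))  # (A.21) gives the same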
The moment-generating function of the distribution is

    M(t₁, t₂) = C ∫∫ exp [xt₁ + yt₂ − {x²/σx² − 2ρxy/σxσy + y²/σy²}/2(1 − ρ²)] dx dy   . . . (A.22)
The reader will verify that, if we make the substitutions

    X = x − σx(σxt₁ + ρσyt₂),  Y = y − σy(σyt₂ + ρσxt₁),

(A.22) becomes

    M(t₁, t₂) = exp [½(σx²t₁² + 2ρσxσyt₁t₂ + σy²t₂²)] × C ∫∫ exp [− (X²/σx² − 2ρXY/σxσy + Y²/σy²)/2(1 − ρ²)] dX dY

But the second factor in the right-hand expression is equal to unity; therefore

    M(t₁, t₂) = exp [½(σx²t₁² + 2ρσxσyt₁t₂ + σy²t₂²)]   . . . (A.23)

Picking out the coefficients of t₁²/2!, t₂²/2! and t₁t₂ in the expansion of (A.23), we find

    μ₂₀ = σx²;  μ₀₂ = σy²;  μ₁₁ = ρσxσy,  or  ρ = μ₁₁/σxσy

Thus, although up to now we have not considered σx, σy and ρ to be other than undefined constants, they are in fact the standard deviations of x and y and the correlation coefficient of x and y respectively.
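Since the means are zero, μᵣₛ is the (r, s)-th partial derivative of M(t₁, t₂) at t₁ = t₂ = 0, so the coefficient-picking can be mechanised; a sketch assuming SymPy.

    from sympy import symbols, exp, diff

    t1, t2, sx, sy, rho = symbols('t1 t2 sigma_x sigma_y rho', positive=True)
    M = exp((sx**2 * t1**2 + 2 * rho * sx * sy * t1 * t2 + sy**2 * t2**2) / 2)  # (A.23)

    def moment(r, s):
        # mu_rs is the (r, s)-th partial derivative of M at t1 = t2 = 0
        return diff(M, t1, r, t2, s).subs({t1: 0, t2: 0})

    print(moment(2, 0))   # sigma_x**2
    print(moment(1, 1))   # rho*sigma_x*sigma_y
    print(moment(4, 0))   # 3*sigma_x**4, as in the exercise below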
Exercise: Show that: (i) μᵣₛ = 0 when r + s is odd; (ii) μ₄₀ = 3σx⁴; μ₃₁ = 3ρσx³σy; μ₂₂ = (1 + 2ρ²)σx²σy²; μ₁₃ = 3ρσxσy³; μ₀₄ = 3σy⁴.
Some other properties: (A.20) may be written

    φ(x, y) = [1/2πσxσy(1 − ρ²)^½] exp [− (y²/σy² − 2ρyx/σyσx + ρ²x²/σx²)/2(1 − ρ²)] · exp (− x²/2σx²)
            = (1/σx√2π) exp (− x²/2σx²) × [1/σy(1 − ρ²)^½ √2π] exp {− (y − ρσyx/σx)²/2σy²(1 − ρ²)}

But by (A.19), Sy² = σy²(1 − ρ²), and, therefore,

    φ(x, y) = φ₁(x) · (1/Sy√2π) exp {− (y − ρσyx/σx)²/2Sy²}

We see, then, that, if x is held constant, i.e., in any given x-array, the y's are distributed normally with mean ȳ_x = ρσyx/σx and variance Sy².
Thus the regression of y on x is linear, the regression equation being

    y = ρ(σy/σx)x   . . . (A.24)

and the variance in each array, being Sy², is constant. Consequently, regression for the bivariate normal distribution is homoscedastic (see 6.14).

SOME MATHEMATICAL SYMBOLS AND THEIR MEANINGS
exp x = eˣ, the exponential function, where e = 2.71828 . . . is the base of natural logarithms.
ln x = logₑ x, natural logarithm.
Δx = small increment of x.
Lt f(x) or Limit f(x) (x → a) = the limit of f(x) as x tends to a.
→ = tends to (a limit).
∞ = infinity.
n! = factorial n, n(n − 1)(n − 2) . . . 3.2.1.
(r s) = r!/s!(r − s)!, the binomial coefficient.
Σᵢ₌₁ⁿ xᵢ = x₁ + x₂ + . . . + xₙ; sum of . . .
Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ xᵢⱼ = x₁₁ + x₁₂ + . . . + x₁ₙ + x₂₁ + x₂₂ + . . . + x₂ₙ + . . . + xₘ₁ + . . . + xₘₙ
Πᵢ₌₁ⁿ xᵢ = x₁ · x₂ · x₃ . . . xₙ₋₁ · xₙ; product of . . .
≈ = approximately equal to.
> ; < = greater than; less than.
≥ ; ≤ = greater than or equal to; less than or equal to.
Population parameters are, in general, denoted by Greek letters; estimates of these parameters from a sample are denoted by the corresponding Roman letter.
SUGGESTIONS FOR FURTHER READING
B. C. Brookes and W. F. L. Dick, Introduction to Statistical Method (Heinemann).
K. A. Brownlee, Industrial Experimentation (H.M.S.O.).
F. N. David, A Statistical Primer (Griffin).
B. V. Gnedenko and A. Ya. Khinchin, An Elementary Introduction to the Theory of Probability (Freeman).
C. G. Lambe, Statistical Methods and Formulae (English Universities Press).
P. G. Moore, Principles of Statistical Techniques (Cambridge University Press).
M. J. M. Moroney, Facts from Figures (Penguin Books).
F. Mosteller, R. E. K. Rourke and G. B. Thomas Jr., Probability and Statistics (Addison-Wesley).
C. G. Paradine and B. H. P. Rivett, Statistical Methods for Technologists (English Universities Press).
M. H. Quenouille, Introductory Statistics (Pergamon).
L. H. C. Tippett, Statistics (Oxford University Press).
S. S. Wilks, Elementary Statistical Analysis (Princeton University Press).
G. Yule and M. G. Kendall, Introduction to the Theory of Statistics (Griffin).

More Advanced

A. C. Aitken, Statistical Mathematics (Oliver and Boyd).
F. N. David, Probability Theory for Statistical Methods (Cambridge University Press).
W. Feller, An Introduction to Probability Theory and its Applications (Wiley).
R. A. Fisher, Statistical Methods for Research Workers (Oliver and Boyd).
I. J. Good, Probability and the Weighing of Evidence (Griffin).
P. G. Hoel, Introduction to Mathematical Statistics (Wiley).
M. G. Kendall and A. Stuart, The Advanced Theory of Statistics (Vol. I of new three-volume edition) (Griffin).
M. G. Kendall, Exercises in Theoretical Statistics (Griffin).
A. M. Mood, Introduction to the Theory of Statistics (McGraw-Hill); Answers to Problems in Introduction (McGraw-Hill).
M. H. Quenouille, The Design and Analysis of Experiment (Griffin).
C. E. Weatherburn, A First Course in Mathematical Statistics (Cambridge University Press).

Also: M. G. Kendall and W. R. Buckland, A Dictionary of Statistical Terms (Oliver and Boyd).