Thiele Pioneer in Statistics
Thiele Pioneer in Statistics
Steffen L. Lauritzen Professor of Mathematics and Statistics Aalborg University, Denmark
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Bangkok Buenos Aires Cape Town Chennai Dar es Salaam Delhi Hong Kong Istanbul Karachi Kolkata Kuala Lumpur Madrid Melbourne Mexico City Mumbai Nairobi São Paulo Shanghai Taipei Tokyo Toronto
Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries
Published in the United States by Oxford University Press Inc., New York
© Oxford University Press, 2002
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2002
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above
You must not circulate this book in any other binding or cover and you must impose this same condition on any acquirer
A catalogue record for this title is available from the British Library
Library of Congress Cataloging in Publication Data (Data available)
ISBN 0 19 850972 3
10 9 8 7 6 5 4 3 2 1
Preface
In 1979, on the occasion of the 500th anniversary of the foundation of the University of Copenhagen, Anders Hald wrote an essay on the history of statistics at the University. Hald's fascination from reading Thiele's books and papers was contagious, and I too became infected with enthusiasm for Thiele. In this book I hope to share my fascination with others. For many years I have felt it unreasonable that Thiele's masterpiece of 1889 had not been published in a language more widely accessible than Danish, and that ‘someone’ ought to put it right and translate it. In 1997 it became obvious that I was that ‘someone’. Fortunately, I received a grant to stay at the wonderful Institution of San Cataldo in Southern Italy in the month of September 1998. In such an inspiring, beautiful and secluded place, miracles are abundant and the first draft translation of the 1889 book was consequently completed. Back in busy academic life it was difficult to complete the remaining part of the project. Two periods at Klitgaarden, the former summer residence of King Christian X of Denmark, gave me the necessary peace and quiet. Institutions such as San Cataldo and Klitgaarden are as important now as they ever were, and for me Thiele will always be associated with blue sea, olives and lemon trees, as well as brisk winds, rolling waves and blowing sand. And what experience could be more royal and inspiring than working at the desk of Queen Alexandrine, in the room of Frederik IX, in the house of Christian X, completing a book about the son of the librarian to Christian VIII, while studying a painting of him, pictured with Frederik VIII at a meeting at the Royal Danish Academy of Sciences and Letters? I am indebted to the International Statistical Institute for permission to reproduce material from International Statistical Review. In addition to the institutions mentioned, I am grateful to Anthony Edwards and Stephen Stigler for commenting on the early draft translation. Bjarne Toft deserves sincere thanks for collecting and sending me plenty of valuable material about and by Thiele, mostly from the files of the late N. E. Nørlund. My colleagues at Aalborg University should be praised for allowing me to go away and work on Thiele without even raising an eyebrow.
Finally, the existence of this book owes much to Anders Hald, for fascination, encouragement, and discussions, for allowing me to reproduce some of this work, and also for working carefully through my early draft manuscripts, supplying an uncountable number of valuable comments at many levels. Indeed, several of the footnotes supplied in this book stem directly from him.
S.L.L.
Klitgaarden, Skagen
October 2001
Contents
1 Introduction to Thiele (S. L. Lauritzen)
2 On the application of the method of least squares to some cases, in which a combination of certain types of inhomogeneous random sources of errors gives these a ‘systematic’ character (T. N. Thiele)
  1 Introduction
  2 Adjustment of a series of observations of an instrument-constant
  3 Determination of the units for the two types of weights
  4 Adjustment by linear functions of the unknown elements
  5 Numerical example
3 Time series analysis in 1880: a discussion of contributions made by T. N. Thiele (S. L. Lauritzen)
  1 Introduction
  2 The model
  3 Least squares prediction of the Brownian motion
  4 Recursive solution to the prediction problem
  5 Thiele's estimation of the error variances
  6 Discussion of procedures for estimation of the error variances
  7 An application
  8 Final comments
4 The general theory of observations: Calculus of probability and the method of least squares (T. N. Thiele)
  1 On the relation of the law of causality to observation
    1.1 Accidental and systematic errors
    1.2 Actual and theoretical error laws
  2 On actual error laws
    2.1 Actual error laws and relative frequency
    2.2 Actual error laws expressed by symmetric functions
    2.3 Actual error laws for functions of observed variables
  3 The hypothetical or theoretical error law of the method of observation. The law of large numbers
    3.1 Probability
    3.2 The error law of the method expressed by symmetric functions
    3.3 On adjustment
    3.4 The principle of adjustment by the method of least squares
    3.5 Adjustment by elements
    3.6 On systematic errors
  4 Tables and Figures
5 T. N. Thiele's contributions to statistics (A. Hald)
  1 The background
  2 Thorvald Nicolai Thiele, 1838–1910
  3 Skew distributions
  4 Cumulants
  5 Estimation methods and K statistics
  6 The linear model with normally distributed errors
  7 Analysis of variance
  8 A time series model combining Brownian motion and the linear model with normally distributed errors
6 On the halfinvariants in the theory of observations (T. N. Thiele)
7 The early history of cumulants and the Gram–Charlier series (A. Hald)
  1 Introduction
  2 The central limit theorem, the moments, and the Gram–Charlier series
  3 Least squares approximation by orthogonal polynomials
  4 Thiele's halfinvariants, their operational properties, and the Gram–Charlier series
8 Epilogue (S. L. Lauritzen)
Bibliography
Index
CHAPTER ONE
Introduction to Thiele
Fig. 1 Photograph of the young T. N. Thiele, reprinted from Nielsen (1965).
Thorvald Nicolai Thiele was a brilliant researcher and worked as an actuary, astronomer, mathematician, and statistician. This book collects translations into English of his best and most important publications on statistics, together with some articles written about Thiele which place his work in a modern as well as in a historical context. Thiele was born in Copenhagen on Christmas Eve, 24 December 1838, and grew up in a prominent family and a culturally and intellectually stimulating environment. His father, Just Mathias Thiele (1795–1874), was private librarian to King Christian VIII of Denmark and director of the Royal College of Prints.
In 1867 Thiele married Marie Martine Trolle (1841–89). They had six children, four of whom survived their first year (Brenner 1939). Thiele obtained his master's degree in astronomy from the University of Copenhagen in 1860 and his doctor's degree in 1866, based on a thesis about the orbits of double stars (Thiele 1866). In 1875 he became a professor of astronomy and director of the Astronomical Observatory at the University of Copenhagen, positions he kept until his retirement in 1907. He was the founder and Mathematical Director of the Danish insurance company Hafnia from 1872 until his death in Copenhagen on 26 September 1910. There are many articles on Thiele and his statistical work: see for example those by Burrau (1928), Schweder (1980), Johnson and Kotz (1997), Lauritzen (1999), and Norberg (2001). Thiele was visually impaired by severe astigmatism and could not carry out observational work as an astronomer (Gyldenkerne and Darnell 1990). He therefore turned his interests towards mathematical and computational work. In his later years he was almost blind. Hald (1981) (reprinted in this volume) contains a detailed discussion of his scientific achievements as a statistician, while more details can be found in Hald (2000a) (also reprinted here) and in Hald (1998, 2000b, 2001). Chapter 14 of Farebrother (1999) focuses on Thiele's contributions to the method of least squares. The most famous of Thiele's contributions to actuarial science is his differential equation for the net premium reserve Vt at time t for a life insurance:
    dV_t/dt = π + δV_t − μ_t(1 − V_t)

for a sum insured of 1. Here π is the premium per time unit, δ is the force of interest, and μt is the force of mortality. In fact he never published this equation, but showed it to his friend and colleague J. P. Gram, who described it in his obituary on Thiele (Gram 1910). We refer the reader to Hoem (1983) for further discussion of Thiele's contributions in this area. Other areas to which Thiele made significant contributions include the theory of interpolation (Thiele 1909). Nielsen (1910) contains an almost complete bibliography of Thiele's mathematical writings. The bibliography at the end of this book contains items which have been referenced here and in the notes provided to the translations. Some I have stumbled across and Nielsen has either missed them or classified them as outside the scope of his bibliography on Danish mathematics.
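Returning to the differential equation above: it is easy to explore numerically. The following minimal sketch integrates it with a plain Euler scheme; the premium, force of interest, and Gompertz-style mortality are illustrative assumptions of mine, not values from Thiele or his commentators.

```python
# A minimal numerical sketch of Thiele's differential equation for the
# net premium reserve, dV/dt = pi + delta*V - mu(t)*(1 - V), with sum
# insured 1. The parameter values and the mortality law are hypothetical.

def mu(t, age=40.0):
    # Gompertz-style force of mortality (illustrative assumption)
    return 0.0005 + 0.00007 * 1.09 ** (age + t)

def reserve(pi=0.01, delta=0.04, years=30, steps=30_000):
    V, dt = 0.0, years / steps
    for k in range(steps):
        t = k * dt
        V += (pi + delta * V - mu(t) * (1.0 - V)) * dt  # Euler step
    return V

print(f"reserve after 30 years: {reserve():.4f}")
```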
As an original thinker, Thiele and his ideas were often far reaching and ahead of his time, so much so that he was rarely understood by his students and contemporaries. It will appear from the collected texts here that his ideas have utmost importance even today, since they include such things as likelihood, Kalman filtering, the EM algorithm, nonparametric density estimation (via empirical cumulants), residual analysis, and the method of least squares in which he was a true virtuoso. That said, he himself may not always have grasped the full generality and scope of his own ideas. Although he clearly and distinctly expresses the idea of likelihood (page 123), and uses it to discuss estimation of a binomial proportion, he is not aware that this is a universal idea which extends to other statistical problems. When he derives the Kalman filter and smoother (Thiele 1880, page 16 of this volume), he is of course not aware of its potential for on-line computation, and although he describes what is an instance of the EM algorithm (page 25), he has apparently no idea of its wide applicability. After all, Thiele was also a man of his own time. However, his understanding of the halfinvariants, introduced in Thiele (1889) (page 82), was complete, or at least became so after publication of his 1899 paper (Thiele 1899, Chapter 6 of this volume) and predates R. A. Fisher's cumulants (Fisher 1929; Fisher and Wishart 1931). Fisher did not appreciate Thiele's contributions properly and also expressed his distaste for Thiele's original name: see for example the correspondence between R. A. Fisher and Arne Fisher, R. A. Fisher and Ragnar Frisch, and R. A. Fisher and Harold Hotelling, the last strongly exposing Fisher's ignorance about Thiele's work (Bennett 1990, pp. 310–20). Although few could fully understand and appreciate Thiele's ingenuity, he did have strong interactions with his colleagues, in particular L. H. F. Oppermann and J. P. Gram, and also a significant effect on the work of his successors, for example J. F. Steffensen, who wrote the major textbooks used for the degree course on Actuarial Science and Statistics (established in 1917) and an excellent textbook on statistics (Steffensen 1923a), which may have been the best one at the time. Steffensen was strongly influenced and inspired by Thiele, but found Thiele's books impossible for use as textbooks, partly because they were inaccessible to most students, and partly because Thiele had a habit of only writing original material so his books did not contain the basic material needed to understand the more advanced concepts (Hald 1983). Thiele had few students, but he was known as an inspiring teacher although he could sometimes be difficult to follow (Larsen 1954). The Danish-born actuary Arne Fisher, who emigrated to America at the age of 16, was strongly influenced by Thiele (Fisher 1922) and took the initiative to have the English version (Thiele 1903) of Thiele's elementary book (Thiele 1897) reprinted in the Annals of Mathematical Statistics in 1931. A student of Thiele, Kirstine Smith, went to England to study with Karl Pearson and published two papers in Biometrika
(Smith 1916, 1918), the second of which represents pioneering research in the optimal design of experiments (Atkinson and Bailey 2001). Kirstine Smith decided to give up life as a researcher and ended her career as a school master in Denmark (Plackett and Barnard 1990, p. 124). Having spent a considerable amount of time on Thiele's work, it appears to me that his interests and talents were strongest in an area which could have been called computational mathematics. When concerned with aspects of computation, his extraordinary abilities and immediate fascination shine through clearly, even when dealing with very complex procedures. This impression of mine conforms with the fact that he (unsuccessfully) worked hard to establish a Professorship of Computational Mathematics (Regnende Matematik) at the University of Copenhagen. Gram (1910) also emphasized these interests of Thiele as being most prominent and wrote that Thiele was the first in Denmark to buy and use a calculating machine. Thiele was indeed greatly interested in all aspects of numbers, including their axiomatization (Thiele 1886). He advocated the use of a number system with base 4 for beginner's arithmetic, base 16 for the advanced, and base 64 for grandmasters (Gram 1910; Larsen 1954). He also had the tiles on the floor of a large room in the building of the Hafnia insurance company laid according to number-theoretic patterns studied in Thiele (1874), see Fig. 2 (Larsen 1954). Incidentally, Thiele, with H. G. Zeuthen and H. Valentiner, arranged that Caspar Wessel's treatise on the geometric interpretation of complex numbers (Wessel 1798, 1999) was translated into French (Wessel 1897). This work had been ignored by contemporary mathematicians, partly because it was written in Danish but, perhaps more importantly, because the entire idea of using geometric insight on algebraic problems was considered unimportant among leading mathematicians at the time (Andersen 1999). Besides his professional interests, Thiele was engaged in aspects of social policy. In Thiele (1892a) he suggests consideration of the results of a vote as an outcome of a repeated experiment with fixed probability distribution, and taking into account the uncertainty due to sampling when evaluating whether the vote is decisive or not. He was interested in issues of insurance (cf. page 105) and old-age pensions from a social policy point of view, advocating what today is known as the Schreiber–Samuelsen model for old-age pensions (Petersen 1986; Thiele 1891). Thiele emphasized that old age is the destiny of us all, as opposed to disability or premature death, which is a random phenomenon. Therefore a pension system should be seen as a social contract between generations, with the State as mediator, rather than as an insurance.
Fig. 2 Patterns from Thiele (1874). The coloured squares encode complex right-hand sides for which certain algebraic equations have integer (complex) solutions. The figure should be viewed at some distance.
An important facet of Thiele's personality is represented in his activities as initiator: The Danish Mathematical Society was founded in 1873 on his initiative in cooperation with H. G. Zeuthen and J. Petersen (Branner 2000; Ramskou 2000). The Danish Actuarial Society was also founded in 1901 on his initiative (Burrau 1926). As mentioned, in 1872 Thiele founded the first Danish private insurance company, Hafnia, where he acted as its Mathematical Director from 1872 and as chairman of its board of managers from 1902 until his death in 1910. As a matter of curiosity we note that Thiele was an active member of the first Danish chess club, Københavns Skakforening (founded 1865), from the early days of its existence, and he was prominent enough to be mentioned in the hymn written by H. E. Gjersing on the occasion of the 50th anniversary of the club (Nielsen 1965). The core of this book consists of the translations of two of Thiele's most important papers (Thiele 1880, 1899) and his fundamental book (Thiele 1889). To help the reader appreciate Thiele's work, and to set things in a modern and a historical context, the translations have been supplemented with comments, also in the form of reprinted papers (Hald 1981, 2000b; Lauritzen 1981), discussing and describing the contents of the translated material. Thiele uses a relatively modern notation, so this has only been changed in very few places. The few obvious misprints have been corrected without specific notice and hopefully I have not made too many new ones. When seriously in doubt, I have made a note. So now it is time for the reader to enjoy Thiele's marvellous texts!
CHAPTER TWO
On the application of the method of least squares to some cases, in which a combination of certain types of inhomogeneous random sources of errors gives these a ‘systematic’ character
Presentation
This paper is Thiele's first on the method of least squares. It is a brilliant tour de force, but so far ahead of its time that only a few could appreciate the results. It sparkles with ingenuity and it initiated my own interest in Thiele's work in 1979. Thiele's recursive algorithm developed in this paper served as an important source of inspiration for Lauritzen and Spiegelhalter (1988). Enjoy the geometric construction of the Kalman filter on page 26. This may still be a novelty.
Fig. 1 Photograph of T. N. Thiele as a young man. From the archives of the Royal Danish Library.
Thiele must have considered this work important, since he simultaneously had it published in French. Strangely, its contents are only mentioned sporadically in Thiele (1889), and nowhere else in Thiele's writings. Was he disappointed because his paper was not properly understood? Or did he not fathom the potential of his own results?
S.L.L.
Fig. 2 The cover page of the original Danish version of Thiele (1880).
Fig. 3 The cover page of the original French version of Thiele (1880).
1 Introduction
Contemporary astronomy is, leaving other important phenomena aside, essentially dominated by its battle against systematic errors. The term ‘systematic error’ is used to contrast both with random errors and with errors which can be remedied by calculating corrections, because the nature and
effect of the sources of error are known. But this contrast is somewhat obscure. There are errors where one can be uncertain about their true origin long after one has identified the mathematical law describing them as a function of some external condition of the observations. And the delimitation towards random errors is uncertain because it is doubtful whether it is possible at all to justify fully that errors can be random. The technical term ‘random error’ is partly defined by the fact that one can not relate repeated observations to any prior identifiable external condition of the observations and specify a relation between the errors and such a condition; partly in positive form through demanding that errors be treated as random so that, when ordered according to size, the number of identical observations being sufficiently large, they should reveal an error law, identifying the frequency of errors between any arbitrarily chosen limits as a function of these limits. In the definition of random errors some would perhaps further demand that this error law should take a specific functional form, the exponential1
    (1/(m√(2π))) exp(−(x − x_0)²/(2m²)),

where x0 is the most likely value and m is the standard deviation. But even though one would not consider errors as being systematic when the error law is different from the exponential, which is a prerequisite for direct application of the method of least squares, there are cases where the boundary between the two types of error can seem blurred. In several areas, in particular in connection with the so-called personal errors which seem to have some physiological origin, and in connection with series of observations made to determine specific aspects of the settings or functioning of an instrument, to be denoted instrument-constants, as they would be constant had the instrument been perfect, cases occur where it is all too obviously impossible to characterize the errors by an error law and where, when observations are arranged chronologically, there seems to be a dependence between the time of observation and the size of error, so at least initially it would be justified to classify the errors as systematic. But when it has been attempted to correct such errors under the assumption that they were mathematical functions of the time of observation, the hope of ridding the observations of these embarrassing systematic errors has been perpetually frustrated. When believing that the appropriate correction formula
1 At the time, ‘the exponential error law’ was the standard name for the Gaussian or normal distribution. tr.
has been found, and trying it out on sufficiently prolonged series of observations, it has persistently appeared impossible to represent the errors as functions of time or at least to use the formula for prediction of personal observational errors or instrument-constants. It could be that more satisfactory results would be obtained in such cases through a more detailed investigation of the relation of these errors to random errors. If the errors are not directly random, it is at least possible that they could result from the combined effect of several sources of random error, because the well-known theorem that a combination of several sources of random error, each with its error law, can be described as a single source of random error, with its own error law depending on the error laws of the others, has an exception which does not always seem to have attracted sufficient attention. The theorem does not hold when some of the sources of error act in such a way that more than one of the otherwise independent observations are affected by every single error flowing from the source in question. Let us for example assume that a series of values o1, o2, o3, …, on that themselves could have been the proper object of observation, have been influenced by a random source of error which has displaced these by e′1, e′2, …, e′n, according to an error law which can be assumed exponential, and further that one is not considering the o as objects of observation, but their successive sums arranged according to their subscripts, i.e.
    o_1 + e′_1,  (o_1 + e′_1) + (o_2 + e′_2),  …,  (o_1 + e′_1) + (o_2 + e′_2) + ⋯ + (o_n + e′_n),

and that one in addition must consider the problem in this fashion because other significant sources of random error are affecting the sums with random errors e″1, e″2, …, e″n having a different error law which could also be assumed exponential, then the totality of errors (1.1)

    e′_1 + e′_2 + ⋯ + e′_a + e″_a,  (a = 1, 2, …, n),
will in general follow neither the exponential nor any other error law but will, when arranged after subscripts of e″, display phenomena very similar to what is seen
for chronological series of personal errors or instrument-constants as mentioned above. There are also rather good internal reasons to explain these phenomena in this or a similar fashion through a combination of random sources of error, and in this respect it is about the same whether it be the personal errors or the errors of instrument-constants, as the sensory organs could be conceived of as instruments. The errors would partly be such errors which would vary at random from one observation to the next as the errors e″ in our example, partly such errors which necessarily are determined by the random state of the instrument at the moment of observation, to the extent that this cannot be calculated from conditions such as temperature, pressure, etc. As regards the state of the instrument one should not, when the same instrument is used, ignore a dependence on the state of the preceding moment. Certainly it is true for many instruments that there is a normal state from which the instrument will never deviate more than what makes it stay within certain limits, and then it will most likely often have a higher probability of drifting towards this normal state than of a further deviation. Still I have a strong inclination to believe that such random variations of an instrument, making it fluctuate around a normal state as the most likely state at all times (making these errors combine with the errors of observation e″), would be rather rare, and at least not at all as common as variations where the influence of the normal state is negligible, and where the ‘status quo’ is the rule, so the random variations of the instrument are realized in such a way that the state at any given moment is the most likely state at the next moment, having as its consequence that the errors accumulate like e′.2 A third type of instrument variations should be mentioned in this context, namely variations where the actual state at any given moment is equal to the average of the state in the preceding moment and the most likely state in the next; such a relationship is difficult to imagine for instrument-constants, but for the clock it should play a dominant role and make the errors in the movements of the clock act through a double summation.3 I have attempted to develop the theory of errors which are generated by two sources of error as in (1.1) and hope thereby to have made a contribution towards reducing the domain of systematic errors. However, there would be a considerably greater benefit by learning how to calculate the corrections
2 This argument, together with the formula (1.1), essentially derives what is today known as the model for the Kalman filter, consisting of Brownian motion with additive independent noise. Clearly Thiele does not have the mathematical concept of a stochastic process at his disposal, but the idea is very clearly expressed. tr.
3 And here integrated Brownian motion is suggested as a model of clock errors! tr.
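The error structure Thiele describes is easy to simulate. A minimal sketch, with arbitrary illustrative variances: the cumulated component e′ follows the ‘status quo’ rule, and an independent error e″ is added at each observation, as in eqn (1.1).

```python
# Simulation of the error structure in eqn (1.1): a random-walk
# "instrument" component (the cumulated e') plus independent observation
# errors e''. All numerical values here are illustrative assumptions.
import random

random.seed(1)
n, sd_walk, sd_obs = 200, 0.3, 1.0
truth = 5.0                    # the instrument-constant being observed
state, observations = truth, []
for _ in range(n):
    state += random.gauss(0.0, sd_walk)                  # e': "status quo" drift
    observations.append(state + random.gauss(0.0, sd_obs))  # adds e''
print(observations[:5])
```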
which fully compensate the systematic errors than by referring them to the domain of random errors and make way for application of the method of least squares in cases where it hitherto was inapplicable; but not every evil can be pulled up by the roots and it is conceivable that occasionally one will deal with errors where the condition for revealing the acting sources of error and their laws exactly is to partition these into two: those which independently act on individual observations, and those which act through cumulation. That this is possible follows from the fact that there already exists a solution of the problem in a special case.4 If such a series of observations is partitioned into groups separated by long time intervals and if the observations in each group are made so quickly after one another that they can be considered simultaneous, then the problem is not difficult; the mean of observations within each group is calculated to yield the most likely value for the group, together with the standard deviations of a single observation and corresponding standard errors of the means; in addition one is gathering information about the most likely values in the intervals between groups of observations by interpolation, assuming a steady change from one group to the next; and the differences occurring in the long time intervals can be exploited to judge the stability of the instrument. Then the problem should also be solvable when the intervals are filled with scattered observations and where the number of repeated observations within groups is reduced. The only difference should be that it becomes somewhat more difficult to derive the desired results. The principle leading to a solution of these problems is not at all sophisticated. When formulating the problem in terms of equations, one should include every equation to give a complete description of the assumed relationships, and not permit any kind of elimination which is not prescribed by the method of least squares. Perhaps it should be noted here that the analytic tool used to implement the method, a continued fraction, is somewhat unusual in practical astronomical calculations, but is in no way inconvenient.
2 Adjustment of a series of observations of an instrument-constant
We assume an instrument to have a property which is expressed by the number x and which should have been constant, but in reality x is subject to changes, either continually or at known moments in time. We assume in addition that each of these
4 Thiele refers to the method of normal places, described for example in Thiele (1903, pp. 106–12); see also Farebrother (1999, pp. 214–16) and Hald (1998, p. 393). tr.
changes is completely random and follows an exponential error law determined by the current value of x being the most likely value of x at the next moment, whereas the mean square errors of x or rather of its momentaneous changes, μ, are known. If we then let x0, x1, …, xn denote the values taken by x at a series of times t0, t1, …, tn (either ta + 1 always later than ta or always earlier than ta), at which either observations of x are made or at which one may wish to know x, then by assumption 0 will be the most likely value of every difference xa + 1 − xa; the mean square error of this difference will be the integral of μ (between the limits a and a + 1), and the absolute weights5 ν_{a,a+1}, the reciprocals of these mean square errors, are thus assumed known. In this way we obtain n equations (2.1)

    0 = x_{a+1} − x_a,  (a = 0, 1, …, n − 1),
We further assume that observations of x made at times t0, t1, …, tn also are subject to random errors with exponential error laws such that for these observations z0, z1, …, zn we know the most likely values (before adjustment) and also know their mean square errors m0, m1, …, mn or their weights ν0, ν1, …, νn where in general νa = 1/ma. This gives n + 1 equations (2.2)

    z_a = x_a,  (a = 0, 1, …, n),
which together with (2.1) contain all the mutually independent conditions we have available to determine the n + 1 unknowns x0, x1, …, xn which are sought together with their mean square errors after adjustment. Note. To the extent that among the times t there are some where no real observation has been made, and at which we simply wish to know the most likely values of x, we should let the corresponding weights be equal to ν = 0, whereafter the values of z can be arbitrary (not infinite).6
5 Thiele has here identified the basic properties of Brownian motion as having increments which are independent and with variance proportional to the length of the time interval. tr.
6 This idea is in Thiele (1889) further developed into a general method of fictitious observations, an alternative to modern techniques for solving least squares problems using generalized inverse matrices, cf. page 171 of this volume. See also Hald (1981) and Hald (1998, p. 648). tr.
If we apply the method of least squares to (2.1) and (2.2) we obtain the n + 1 equations (2.3)

    ν_a z_a = −ν_{a−1,a} x_{a−1} + (ν_{a−1,a} + ν_a + ν_{a,a+1}) x_a − ν_{a,a+1} x_{a+1},  (a = 0, 1, …, n),

where ν_{−1,0} and ν_{n,n+1} are to be taken as zero.
Among these equations we should first eliminate those xa which have νa = 0, that is corresponding to fictitious observations which are only included to determine the values of x at other than proper observation times. But νa = 0 implies (2.4)

    x_a = (ν_{a−1,a} x_{a−1} + ν_{a,a+1} x_{a+1})/(ν_{a−1,a} + ν_{a,a+1}),
so, when first xa − 1 and xa + 1 have been found, xa is determined by simple interpolation between these values under the assumption that the changes in x are inversely proportional to the weights between the limits of the interval in question, or directly proportional to the mean square errors, or with the length of the time interval itself in so far as the prospects of displacement of the instrument have been constant over the time interval. If the first or last observation z0 or zn was fictitious, that is if the values of the instrument-constant before or after the period of observation are sought, then ν0 = 0 implies that x0 = x1 and νn = 0 that xn = xn − 1, so the x obtained for the first and last regular observation times are the most likely values for the past and future respectively. The equations (2.3) thus show what is also completely obvious: that the elimination of xa when νa = 0 does not change these equations in any other way than yielding the result which one would directly have obtained by omitting the
equations concerning the fictitious observation. One should then only let

    1/ν_{a−1,a+1} = 1/ν_{a−1,a} + 1/ν_{a,a+1}

or

    ν_{a−1,a+1} = ν_{a−1,a} ν_{a,a+1}/(ν_{a−1,a} + ν_{a,a+1})

and proceed directly from the equation

    ν_{a−1} z_{a−1} = −ν_{a−2,a−1} x_{a−2} + (ν_{a−2,a−1} + ν_{a−1} + ν_{a−1,a+1}) x_{a−1} − ν_{a−1,a+1} x_{a+1}

to

    ν_{a+1} z_{a+1} = −ν_{a−1,a+1} x_{a−1} + (ν_{a−1,a+1} + ν_{a+1} + ν_{a+1,a+2}) x_{a+1} − ν_{a+1,a+2} x_{a+2},
omitting the equation for νaza. When performing these calculations in real examples we could therefore directly assume that no fictitious observation is included, knowing now how to derive x at an arbitrary time from the x already found and corresponding to real observation times. Naturally the equations (2.3) must be further treated in the way typical of the method of least squares, but here this is made easier by the comparatively simple form of the equations. To take advantage of this and to represent the solution in the most suitable way, auxiliary variables must be introduced. Firstly, instead of the weights νa and νa, a + 1 we introduce two other series of numbers determined by the recursion formulae:7(2.5)
    u_a = ν_a + u_{a−1,a}

and (2.6)

    1/u_{a,a+1} = 1/u_a + 1/ν_{a,a+1}.

Under the assumption that u_{−1,0} = 0, so that u_0 = ν_0, the ua and ua, a + 1 are easily calculated using a table of reciprocals. (No greater accuracy is needed beyond what has real significance for the weights.)
7 Thiele is now deriving the recursive algorithm known today as the Kalman filter and smoother (Kalman and Bucy 1961): first a recursion for the weights, then a forward pass for filtering, and finally a reverse pass for smoothing. tr.
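Before following Thiele's recursive route, a modern reader may note that the normal equations (2.3) form a symmetric tridiagonal system and can also be solved directly. A minimal sketch (function and variable names are mine; `nu` holds the observation weights ν_a and `nu_step` the increment weights ν_{a,a+1}):

```python
# Direct solution of the tridiagonal normal equations (2.3) with the
# Thomas algorithm; a cross-check on the recursive scheme that follows.
def solve_tridiagonal(z, nu, nu_step):
    n = len(z)
    # diagonal, off-diagonal, and right-hand side of (2.3)
    diag = [nu[a]
            + (nu_step[a-1] if a > 0 else 0.0)
            + (nu_step[a] if a < n-1 else 0.0) for a in range(n)]
    off = [-nu_step[a] for a in range(n-1)]
    rhs = [nu[a]*z[a] for a in range(n)]
    # forward elimination
    for a in range(1, n):
        m = off[a-1]/diag[a-1]
        diag[a] -= m*off[a-1]
        rhs[a] -= m*rhs[a-1]
    # back substitution
    x = [0.0]*n
    x[-1] = rhs[-1]/diag[-1]
    for a in range(n-2, -1, -1):
        x[a] = (rhs[a] - off[a]*x[a+1])/diag[a]
    return x

print(solve_tridiagonal([1.0, 2.0, 1.5], [1.0, 1.0, 1.0], [2.0, 2.0]))
```

Since both procedures solve the same equations, Thiele's recursions below produce the same adjusted x, but in a form that also yields the auxiliary quantities needed later for the mean square errors.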
Next a sequence of values y0, y1, …, yn are calculated from the z which in the further calculation will replace the z and which in addition will turn out to represent the corresponding value of x under the assumption that all observations with higher subscripts are ignored; thus we have at least yn = xn. Such y are found by the recursion formula (for increasing subscripts)8 (2.7)

    u_a y_a = ν_a z_a + u_{a−1,a} y_{a−1}

or more conveniently (2.8)

    y_a = y_{a−1} + (ν_a/u_a)(z_a − y_{a−1})

or (2.9)

    y_a = z_a + (u_{a−1,a}/u_a)(y_{a−1} − z_a)

in combination with y0 = z0. Substituting these into (2.3) one finds that the x are determined by the recursion formula (for decreasing subscripts)9 (2.10)

    u_a y_a = (u_a + ν_{a,a+1}) x_a − ν_{a,a+1} x_{a+1}

or with more convenient calculation (2.11)

    x_a = y_a + (u_{a,a+1}/u_a)(x_{a+1} − y_a)

or (2.12)

    x_a = x_{a+1} + ((u_a − u_{a,a+1})/u_a)(y_a − x_{a+1})
in combination with xn = yn. That this is not just correct, but that (2.10) is indeed also the typical form of elimination according to the method of least squares, is proved by general induction. For a = 0 (2.10) coincides with the equation (2.3), and if we assume that (2.10) holds for all subscripts up to a, that is the elimination of x0, x1, …, xa − 1 has left a system of n − a + 1 equations, namely (2.10) and the n − a last equations
8 Here comes the forward filter. tr.
9 And this is the backward smoother. tr.
of (2.3), then from the two first of these equations

    u_a y_a = (u_a + ν_{a,a+1}) x_a − ν_{a,a+1} x_{a+1},
    ν_{a+1} z_{a+1} = −ν_{a,a+1} x_a + (ν_{a,a+1} + ν_{a+1} + ν_{a+1,a+2}) x_{a+1} − ν_{a+1,a+2} x_{a+2},

being the only equations in which xa is still occurring, the elimination of xa is (in the usual way) made through multiplying the first equation by

    ν_{a,a+1}/(u_a + ν_{a,a+1}) = u_{a,a+1}/u_a,

cf. (2.6), and adding the result to the second of these equations, whereby from (2.6) we obtain

    u_{a,a+1} y_a + ν_{a+1} z_{a+1} = (u_{a,a+1} + ν_{a+1} + ν_{a+1,a+2}) x_{a+1} − ν_{a+1,a+2} x_{a+2}

and thus according to (2.7) and (2.5)

    u_{a+1} y_{a+1} = (u_{a+1} + ν_{a+1,a+2}) x_{a+1} − ν_{a+1,a+2} x_{a+2},
which coincides with (2.10) and thus proves the correctness of the statement when a < n. Next, it is easily verified that the last elimination leads to unyn = unxn. If an explicit expression for un is expanded according to (2.5) and (2.6) one obtains the continued fraction10(2.13)
    u_n = ν_n + 1/(1/ν_{n−1,n} + 1/(ν_{n−1} + 1/(1/ν_{n−2,n−1} + 1/(ν_{n−2} + ⋯ + 1/(1/ν_{0,1} + 1/ν_0)))))

and if the first couple of terms are ignored, the final continued fractions will alternate between ua and 1/ua, a + 1.
10 This continued fraction representation of the recursion for the weights has seemingly not been given elsewhere. Thiele had a strong interest in continued fractions (Thiele 1869, 1870, 1879). tr.
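In modern terms the whole procedure is a two-pass algorithm. The following sketch is my restatement of the recursions (2.5)–(2.7) and (2.11), together with the mean square errors (2.18) as reconstructed further below; it is a paraphrase, not Thiele's own notation.

```python
# Thiele's recursions in compact form: weights (2.5)-(2.6), forward
# "filtering" pass (2.7), backward "smoothing" pass (2.11), and the
# mean square errors (2.18). nu are the observation weights nu_a,
# nu_step the increment weights nu_{a,a+1}.
def thiele_filter_smoother(z, nu, nu_step):
    n = len(z)
    u, us, y = [0.0]*n, [0.0]*(n-1), [0.0]*n
    u[0], y[0] = nu[0], z[0]
    for a in range(n-1):                                # forward pass
        us[a] = 1.0/(1.0/u[a] + 1.0/nu_step[a])         # (2.6)
        u[a+1] = nu[a+1] + us[a]                        # (2.5)
        y[a+1] = (nu[a+1]*z[a+1] + us[a]*y[a])/u[a+1]   # (2.7)
    x, M2 = y[:], [0.0]*n                               # x_n = y_n
    M2[-1] = 1.0/u[-1]
    for a in range(n-2, -1, -1):                        # backward pass
        x[a] = y[a] + (us[a]/u[a])*(x[a+1] - y[a])      # (2.11)
        M2[a] = (u[a]-us[a])/u[a]**2 + (us[a]/u[a])**2*M2[a+1]  # (2.18)
    return x, M2

xs, M2s = thiele_filter_smoother([1.0, 2.0, 1.5], [1.0, 1.0, 1.0], [2.0, 2.0])
```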
From the importance this continued fraction has to our problem it can be assured that simpler solutions will occur only when this continued fraction takes on a simpler form, and thus that approximative solutions with easier calculations are only possible in cases which are similar to these. One would first think of cases where the continued fraction was periodic, but this does not yield simplifications of the practical calculations. This is, however, the case when the continued fraction has been interrupted because a partial denominator becomes infinite, that is either νa = ∞ or νa, a + 1 = 0, and in the case where the continued fraction can be combined into fewer terms because a partial denominator vanishes, that is either νa = 0 or νa, a + 1 = ∞. When either an observation can be considered error free, νa = ∞, or the observational series is interrupted by a long time interval or by an interval in which it is feared that the instrument has been disturbed, νa, a + 1 = 0, then in both cases the continued fraction is interrupted or partitioned into two mutually independent continued fractions. This in turn partitions the computation of the x in the first case such that ua, a + 1 = νa, a + 1 and za = ya = xa, that is such that the two series are connected through the common value xa = za; in the second case such that ua + 1 = νa + 1 and xa = ya but ya + 1 = za + 1 so the parts become completely separated by at least one intermediate value of x which cannot be determined. Such partitionings would in many other cases imply great advantages to the practical calculations; here the benefit is not so great for one will almost always be able to partition the calculations arbitrarily without much disadvantage, other than a minor double calculation of the u, y, and x around the point of partitioning, which quickly will tie the calculations together so the new values turn out to be equal to those found through the partitioning. The present consideration just shows that, when there is a choice, one should partition such that a reliable value of z is included in both pieces and such that one is interrupting at a place where there is a large time interval between two consecutive observations of which the first terminates the first part and the second initiates the second part. For the cases in which the number of terms in the continued fraction can be reduced, it is enough to consider νa, a + 1 = ∞, as we have already considered the case ν = 0 from another point of view and seen that if the problem is to find x at a time where no observation has been made (or only observations so unreliable that their weights should be set to be vanishingly small), the most likely values should be calculated by simple interpolation between the closest previous and the closest subsequent time of observation, under the assumption of a steadily changing instrument. On the other hand, if νa, a + 1 = ∞, which is implied by the times of observation being so close to each other that the instrument cannot have changed
its state in the meantime, then it is obviously justifiable to contract consecutive observations to their appropriately weighted averages and thus consider
    (ν_a z_a + ν_{a+1} z_{a+1})/(ν_a + ν_{a+1}),  with weight ν_a + ν_{a+1},

as a single observation instead of za and za+1; this is also confirmed by our theory which for νa, a+1 = ∞ gives

    u_{a,a+1} = u_a,  and hence  x_a = x_{a+1},

and

    u_{a+1} y_{a+1} = ν_{a+1} z_{a+1} + u_a y_a.
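The contraction rule amounts to a one-line computation: replace the pair by its weighted mean and add the weights. A tiny sketch:

```python
# Contraction of two near-simultaneous observations (nu_{a,a+1} infinite):
# the weighted mean replaces the pair, and the weights are added.
def contract(z_a, nu_a, z_b, nu_b):
    return (nu_a*z_a + nu_b*z_b)/(nu_a + nu_b), nu_a + nu_b

z_pooled, nu_pooled = contract(10.2, 1.0, 10.4, 3.0)  # -> (10.35, 4.0)
```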
When I previously have been careful not to deviate from the typical process of elimination by the method of least squares, it was to take advantage of the important theorem which I only know by oral communication from Professor Oppermann (Science Meeting, Copenhagen 1873) that there exist systems of linear functions11 of observations, which in complete generality could be considered as mutually independent observations and as such could substitute the original observations, in particular during all calculations of mean square errors of the elements in the adjustment; and that the left-hand sides of the equations resulting from the process of elimination, here (2.10), represent such a system of functions that could be considered as independent observations. Thus in
11 In Thiele (1889), cf. page 144 of this volume, these are studied in detail and named free functions. They correspond to orthogonal linear combinations in modern terminology. tr.
    u_0 y_0 = (u_0 + ν_{0,1}) x_0 − ν_{0,1} x_1,  …,  u_a y_a = (u_a + ν_{a,a+1}) x_a − ν_{a,a+1} x_{a+1},  …,  u_n y_n = u_n x_n,

the products u0y0, …, uaya, …, unyn can be considered as independently observed values such that the mean square error of uaya is equal to

    M²(u_a y_a) = u_a²/(u_a − u_{a,a+1}),

and consequently the values of y could themselves be considered independent so that their mean square errors become (2.14)

    M²(y_a) = 1/(u_a − u_{a,a+1}),

with the single exception that (2.15)

    M²(y_n) = 1/u_n.
If we now express the x explicitly as linear functions of the y it becomes very easy to express the mean square error of any result from the adjustment which in advance has been given as a linear function of the x. But as
    x_n = y_n,  x_a = y_a + (u_{a,a+1}/u_a)(x_{a+1} − y_a),

the coefficients in the equations (2.16)

    x_a = p_{a,a} y_a + p_{a,a+1} y_{a+1} + ⋯ + p_{a,n} y_n

will be given as (2.17)

    p_{a,b} = (u_{a,a+1}/u_a)(u_{a+1,a+2}/u_{a+1}) ⋯ (u_{b−1,b}/u_{b−1}) · (u_b − u_{b,b+1})/u_b  for b < n,
    p_{a,n} = (u_{a,a+1}/u_a)(u_{a+1,a+2}/u_{a+1}) ⋯ (u_{n−1,n}/u_{n−1}).

One then obtains for the mean square errors

    M²(x_a) = p_{a,a}² M²(y_a) + p_{a,a+1}² M²(y_{a+1}) + ⋯ + p_{a,n}² M²(y_n)
and in general

    M²(r_0 x_0 + r_1 x_1 + ⋯ + r_n x_n) = Σ_b (r_0 p_{0,b} + r_1 p_{1,b} + ⋯ + r_n p_{n,b})² M²(y_b)

(with p_{a,b} = 0 for b < a). From the first of these expressions one derives the recursion formula (for decreasing subscripts) (2.18)

    M²(x_a) = (u_a − u_{a,a+1})/u_a² + (u_{a,a+1}/u_a)² M²(x_{a+1}),  M²(x_n) = 1/u_n.

We note in particular the special case of the general formula which gives the mean square error of a linear function of two consecutive x:

    M²(r x_a + s x_{a+1}) = r²(u_a − u_{a,a+1})/u_a² + (r u_{a,a+1}/u_a + s)² M²(x_{a+1}),

because xa+1, being a function of y of higher subscript only, can be considered an independent observation besides ya. Setting r = 1, s = −1 yields (2.19)

    M²(x_a − x_{a+1}) = (u_a − u_{a,a+1})/u_a² + ((u_a − u_{a,a+1})/u_a)² M²(x_{a+1}).
The equations (2.18) and (2.19) serve conveniently to calculate the mean square errors which are left after adjustment in the equations (2.1) and (2.2) emerging directly from the basic conditions of the problem. From these the more general equation for M2(rxa + sxa+1) can also be transformed to the following:
    M²(r x_a + s x_{a+1}) = (r + s)(r M²(x_a) + s M²(x_{a+1})) − r s M²(x_a − x_{a+1}).

Recalling that, according to (2.4), the x with no simultaneous proper observation can be calculated from the x corresponding to the closest preceding and subsequent observation as a linear function of these, say rxa+sxa+1 with r+s = 1, it would seem obvious to use this equation for M2(rxa + sxa+1) to calculate the mean square error for the x which are not determined by simultaneous observation. However, this would be a serious mistake of a special type of which one should beware in similar investigations. Because although xb = rxa + sxa+1, one should not let M2(xb) = M2(rxa + sxa+1) in cases where the equation xb = rxa + sxa+1 itself is a result of adjustment and therefore has no absolute validity, but only is an approximation with a reliability which could be measured by a mean square error of M2(xb) − M2(rxa + sxa+1).12 M2(xb) must be determined directly in the same
12 Thiele must give this rather complex explanation because he does not use different symbols for the true value x and its estimate x̂. In modern symbolism he warns against using the equation x̂b = rx̂a + sx̂a+1 as if it were true that xb = rxa + sxa+1. tr.
way as xb, namely by the insertion of a fictitious observation. The consequence of such an insertion is probably best explained through changing the original scheme of the ν, u, and x

    ν_a, ν_{a,a+1}, ν_{a+1};  u_a, u_{a,a+1}, u_{a+1};  x_a, x_{a+1}

to the following

    ν_a, ν_{a,b}, ν_b = 0, ν_{b,a+1}, ν_{a+1};  u_a, u_{a,b}, u_b, u_{b,a+1}, u_{a+1};  x_a, x_b, x_{a+1}

according to the relations

    1/ν_{a,b} + 1/ν_{b,a+1} = 1/ν_{a,a+1}

together with

    1/u_{a,b} = 1/u_a + 1/ν_{a,b},  u_b = u_{a,b},

where thus

    1/u_{b,a+1} = 1/u_b + 1/ν_{b,a+1} = 1/u_a + 1/ν_{a,a+1} = 1/u_{a,a+1},

whereas

    y_b = y_a.
As regards the y as independent observations, the only difference is thus that the original ya, having weight ua − ua,a+1, has been replaced with two, namely ya
with weight ua − ub and yb = ya with weight ub − ua, a+1, such that their values are identical and the sum of the weights equals the original weight. However small this difference would seem, it has genuine significance, but only in regard to the determination of M2(xb). From (2.18) one now obtains directly
    M²(x_b) = (u_b − u_{b,a+1})/u_b² + (u_{b,a+1}/u_b)² M²(x_{a+1});

double application of (2.18) further gives

    M²(x_a) = (u_a − u_{a,b})/u_a² + (u_{a,b}/u_a)² M²(x_b)

yielding, as ub = ua,b,

    M²(x_a) = (u_a − u_{a,a+1})/u_a² + (u_{a,a+1}/u_a)² M²(x_{a+1}),

which is unchanged from (2.18). One thus has

    x_b = p x_a + s x_{a+1},  p = ν_{a,a+1}/ν_{a,b},  s = ν_{a,a+1}/ν_{b,a+1},  p + s = 1,

and (2.20)

    M²(x_b) = p M²(x_a) + s M²(x_{a+1}) − p s (M²(x_a − x_{a+1}) − 1/ν_{a,a+1});

consequently the result is that

    M²(x_b) − M²(p x_a + s x_{a+1}) = p s/ν_{a,a+1}.
If in these formulae we let νa, a+1 = 0, the problem becomes to determine the x and their mean square error within a time interval embraced by the observations, or in an interval where the instrument may have been subjected to such a disturbance that the series of observations must be considered completely interrupted; then, according to (2.6), νa, a+1 = 0 implies that ua,a+1 = 0. Unless νa,b = 0 also, it
holds that p = 1, xb = xa, and M2(xb) = 1/ub = 1/ua + 1/νa, b which in itself is natural. Figure 4 on p. 26 shows a geometric construction of the adjustment.13
3 Determination of the units for the two types of weights
When any adjustment has been made, the question perpetually arises whether the errors left in the observations satisfy the assumptions upon which the adjustment rests; if one cannot or will not test this, the result must be considered hypothetical. Among other things one should attempt to answer the question whether the assumptions about mean square errors or weights on which adjustment by the method of least squares is based agree with what a posteriori can be derived from the table of residuals. Often this question takes the special form that one a priori does not know the weights of the single equations but only some ratios between them, which would be sufficient to determine the weights if in addition one knew a small number of units for the weights, one for each system. Then one must include these units among the unknowns of the problem and because of the complex nature of the problems, one is obliged to use the indirect method, beginning with completely hypothetical units for the weights, and calculate new and generally improved values for the units, based on the residuals after completion of the adjustment, whereafter the adjustment is calculated afresh until the posterior units from the last adjustment agree sufficiently accurately with those calculated prior to the same adjustment.14 Such recalculation is only unnecessary in one particular case, namely when only a single common unit for all the observations is unknown. In this case the absolute value of the weight unit has no influence on the adjustment and is itself determined by the formula
    E = (n − m)/Σ νf²,

where f is the residual in a single equation, n the number of equations, and m the number of elements, or possibly by
13 Thiele's original text has no reference to the figure so this sentence has been added by me. It is not clear why Thiele produces this diagram. Most likely he is simply enthused by the idea, as I am. To the best of my knowledge this geometric version of the Kalman filter and smoother has not otherwise been described in the literature. tr.
14 The procedure suggested, iterating eqn (3.1), is in effect an instance of the EM algorithm, cf. Lauritzen (1981). tr.
Fig. 4 Geometric construction of the preceding adjustment
(1) The points ζ0, ζ1, …, ζn are marked on an axis of abscissae in order of subscript such that everywhere ζa+1 − ζa = 1/νa, a+1, thus representing the times of observation if the mean displacement of the instrument has been constant over time intervals of equal length.
(2) After addition of a common constant to make all observations positive, the points za are marked vertically above ζa with ordinate za − ζa = observation za.
(3) Points υa are marked vertically below ζa with ordinates ζa − υa = 1/νa.
(4) On the same ordinates, a series of points φ1, …, φn are marked vertically below υ1, …, υn and on the axis of abscissae a corresponding series of points η0, η1, …, ηn, all of them before the ζ with identical subscript. The construction is made sequentially by increasing subscript and begins with η0 being determined by ζ0 − η0 = ζ0 − υ0 and subsequently everywhere such that υa − φa = ζa − ηa−1 and (ηaυa) ‖ (ηa−1φa).15
(5) The points y0, y1, …, yn are now marked vertically above the η with identical subscript and situated on a system of straight lines, each ya on the straight line between ya−1 and za, in particular y0 with an ordinate of the same size as z0, y0 − η0 = z0 − ζ0.
(6) The points x0, …, xn representing the adjusted values are then determined from the y such that every xa is situated on the same ordinate as the corresponding za (and ζa), xn − ζn = yn − xn, and subsequently xa on the straight line between xa+1 and ya whereby the line segment between xa and xa+1 represents the most likely movement of the instrument in the time interval between ta and ta+1.
15 Thiele uses the symbol ≠ for parallel rather than ‖. tr.
to the extent one should also include the weight unit among the unknowns of the problem, which, however, must be considered rather problematic. Examples of cases with more than one unknown weight unit are very common in astronomy: one just needs to think of the weights of the right ascension and declination under orbit determination. Examples of a correct treatment of cases with several unknown weight units are not common. In the special problems treated here it will most often be necessary to repeat the adjustment at least once with improved weights, because, while one would frequently know the ratios between each of
    ν_0, ν_1, …, ν_n

and between each of

    ν_{0,1}, ν_{1,2}, …, ν_{n−1,n}
separately, one would far from always know the units for the weights νa with sufficient accuracy and most often be in an even more unfavourable position concerning the weights νa, a + 1. And notably, because of a fact to be mentioned immediately, it would be unavoidable to repeat the adjustment, although there is reason to believe that the convergence in the determination of the units for the weights most often will be so rapid that the work would not easily become prohibitive. When an adjustment has been calculated on the basis of a hypothetical unit Eν for the weights ν0, ν1, …, νn and another, Ew, for the group ν0, 1, ν1, 2, …, νn − 1, n one should clearly attempt to determine Ev from the differences za − xa and Ew from the differences xa − xa + 1. There are n + 1 differences of the first type, n of the second type, and n + 1 elements x0, x1, …, xn have been determined by the 2n + 1 equations. Possibly one should add 2 to the number of elements (i.e. for Eν and Ew), but by the large values which n will always have in this type of investigations, this question becomes of quite secondary importance and I shall therefore ignore this possibility in the sequel and stick to n + 1 as the number of elements. For the desired units one can then write
    E_ν Σ_a ν_a (z_a − x_a)² = n + 1 − ξ,  E_w Σ_a ν_{a,a+1} (x_a − x_{a+1})² = n − η,

where at this point we only know about ξ and η that their sum is equal to the number of elements

    ξ + η = n + 1,
and we must determine what fraction of this number to allocate to each of the groups. In other problems, where the number of equations is large compared with the number of elements, this does not matter much and it will often be sufficient to distribute the number of elements evenly. But here the number of equations is never quite double the number of elements and, in addition, an example will easily show that the distribution sometimes should be very uneven. For if we imagine that an instrument-constant was only effectively observed at two different times, but each time with numerous repeated observations, one immediately after the other with no possibility of real change in the instrument, then it is easily seen that η would approach the limit n − 1 while ξ approaches the value 2. To be able to determine ξ and η in more complex cases and in general it must be investigated in detail how much each equation contributes to the number ξ or η which should be subtracted from the number of equations, or how much one should add to νa(za − xa)2 or νa, a + 1(xa − xa + 1)2 to be able to assume that the resulting sum on average should be equal to 1 for every equation. In the last formulation the question has an easy answer: because after adjustment xa and xa − xa + 1 are subject to uncertainty with mean square errors M2(xa) and M2(xa − xa + 1) respectively, see (2.18) and (2.19). Since this uncertainty, through the requirement that the sum of squared deviations times the weights of the equations should be driven to its minimum, is in general causing the sum of squares to be smaller than if the observations were compared with the true values, it is at least very plausible that the right value is found by adding the mean square error of the residual to each of the squared deviations. A similar procedure is used to show succinctly that the number of observations should be reduced with the number of elements in the formula for posterior determination of the weights. In our case it is shown without too much difficulty from (2.16) and (2.17) that

    Σ_a ν_a M²(x_a) + Σ_a ν_{a,a+1} M²(x_a − x_{a+1}) = n + 1.
For the two groups of equations we find in this fashion16

    ξ = Σ_a ν_a M²(x_a),  η = Σ_a ν_{a,a+1} M²(x_a − x_{a+1}),

and noting that these values of ξ and η are completely independent of any common factor of Eν and Ew and usually rather stable under changes of the ratio Ew/Eν, we
16 The quantities ξ and η are what Wahba (1983) named the ‘equivalent degrees of freedom’ for error and signal respectively. In the numerical example of the paper, Thiele uses the expression ‘excess equations’, see page 34. tr.
obtain for the determination of the units of weights (3.1)

    E_ν = (n + 1 − ξ)/Σ_a ν_a (z_a − x_a)²,  E_w = (n − η)/Σ_a ν_{a,a+1} (x_a − x_{a+1})².
That we have made the last adjustment correctly including the right-hand sides of (3.1) with correct weights is thus indicated by Eν = 1 and Ew = 1. At this point we note that while M2(xa) cannot be simplified advantageously by (2.5) and (2.18), we have according to (2.6) and (2.19) that
    ν_{a,a+1} M²(x_a − x_{a+1}) = (1 − (u_a − u_{a,a+1})/u_a)(1 + (u_a − u_{a,a+1}) M²(x_{a+1})),

which makes us realize that when νa, a + 1 is relatively large and consequently ua − ua, a + 1 small, νa, a + 1M2(xa − xa + 1) will typically only be a little less than 1, so in cases where the observational times ta and ta + 1 are only a little different from each other, the differences xa − xa + 1 will almost not contribute to the determination of Ew. This implies that in such cases the adjusted values xa and xa + 1 will not easily be much different and from this we infer the general rule that before an adjustment of this kind, one may and should contract all groups of observations which are close in time to their averages; with a little care one is not going too far with such a contraction, which obviously yields a considerable reduction of work. Of course one should not miss the chance to use the observations contracted to their averages in the process of determining the unit of weights Eν. In this way one would frequently obtain such a good determination of Eν that final adjustment will only modify it slightly. It is more difficult in advance to ensure a sufficient approximate determination of Ew. In this respect I recommend making a rough calculation after the principle that one first contracts consecutive observations to their means as far as possible and subsequently considers these as adjusted values for x, while totally ignoring scattered observations which could not be used in this process, thus obtaining a series of values

    x_A, x_B, x_C, ….
Correspondingly one must contract the weights νa, a + 1 tentatively specified with an arbitrary unit, such that

    1/ν_{A,B} = 1/ν_{a,a+1} + 1/ν_{a+1,a+2} + ⋯,  1/ν_{B,C} = ⋯,
etc. If then the unit of weights Ew is calculated as
    E_w = N/Σ ν_{A,B} (x_A − x_B)²,

where N + 1 is the number of means xA used, then such a determination will tend to be somewhat too large but still usable as a first approximation. But even when one has confirmed that the weights used in an adjustment are correct under the assumption that the error laws are exponential, there is still a need for much more criticism than usually applied. ‘Skew’ error laws and error laws with multiple maxima may rarely appear, but demand a different treatment when they do. It is not easy to get rid of systematic errors.
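The iteration Thiele proposes in this section, adjust with trial units, recompute Eν and Ew from the residuals together with ξ and η, and repeat until both are close to 1, can be sketched as follows. This is my own formulation (an EM-type scheme, cf. footnote 14), and it relies on the reconstructions of (2.18), (2.19), and (3.1) given above; all names are mine.

```python
# Iterative estimation of the two weight units around eqn (3.1).
# One pass of filtering/smoothing, returning the adjusted x, the
# smoothed mean square errors (2.18), and those of differences (2.19).
def adjustment_pass(z, nu, nu_step):
    n = len(z)
    u, us, y = [0.0]*n, [0.0]*(n-1), [0.0]*n
    u[0], y[0] = nu[0], z[0]
    for a in range(n-1):
        us[a] = 1.0/(1.0/u[a] + 1.0/nu_step[a])          # (2.6)
        u[a+1] = nu[a+1] + us[a]                         # (2.5)
        y[a+1] = (nu[a+1]*z[a+1] + us[a]*y[a])/u[a+1]    # (2.7)
    x, M2 = y[:], [0.0]*n
    M2[-1] = 1.0/u[-1]
    for a in range(n-2, -1, -1):
        x[a] = y[a] + (us[a]/u[a])*(x[a+1] - y[a])       # (2.11)
        M2[a] = (u[a]-us[a])/u[a]**2 + (us[a]/u[a])**2*M2[a+1]   # (2.18)
    M2d = [(u[a]-us[a])/u[a]**2 + ((u[a]-us[a])/u[a])**2*M2[a+1]  # (2.19)
           for a in range(n-1)]
    return x, M2, M2d

def estimate_units(z, nu, nu_step, sweeps=20, tol=1e-8):
    nu, nu_step = nu[:], nu_step[:]
    for _ in range(sweeps):
        x, M2, M2d = adjustment_pass(z, nu, nu_step)
        n = len(z)
        xi  = sum(nu[a]*M2[a] for a in range(n))          # "excess equations"
        eta = sum(nu_step[a]*M2d[a] for a in range(n-1))
        E_nu = (n - xi) / sum(nu[a]*(z[a]-x[a])**2 for a in range(n))
        E_w  = (n - 1 - eta) / sum(nu_step[a]*(x[a]-x[a+1])**2
                                   for a in range(n-1))
        nu      = [E_nu*v for v in nu]                    # rescale and repeat
        nu_step = [E_w*v for v in nu_step]
        if abs(E_nu - 1) < tol and abs(E_w - 1) < tol:    # done when E = 1
            break
    return nu, nu_step

nu_hat, nu_step_hat = estimate_units([1.0, 2.0, 1.5, 2.2, 1.8],
                                     [1.0]*5, [1.0]*4)
```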
4 Adjustment by linear functions of the unknown elements
When an adjustment of a series of observations subject to a combination of two sources of random errors with different effects has been completed according to the rules laid out here, one must naturally still further investigate the two tables of errors. It could be that the results show deficiencies which indicate that the combination has been of a different kind than the relatively simple one treated here. It could also be the case that, after distinguishing the two types of error, signs of proper systematic errors are revealed in one or both of the tables of residuals. These errors might have been hidden by the combination between the regular errors of observation and the displacements of the instrument, but after this complication has been resolved, they may show so unmistakably that there is hope to remove them by calculating corrections based on some side condition of the observations. In addition we have until now only considered the case where the observations subjected to complicated errors relate to a phenomenon which itself could be assumed constant. It should thus still be shown briefly what to do when the observed phenomenon varies according to laws which could be assumed to be linear functions with a number of unknown coefficients and simultaneously is subject to errors of the complicated type treated here. It suffices here to treat the case where the law for the observed phenomenon is ζa = pfa + qga, where ζa, the true value for the observation made at time ta, depends on two unknown elements p and q, while fa and ga are known numbers depending on ta. The observations are assumed subject to two types of error: partly such xa (corresponding to ta) which owing to the changing state of the instrument would be the same for every observation made with this instrument at time ta, but vary at random from one
moment to the next so that one has equations (4.1)
with corresponding weights νa,a+1 as above; and partly such errors which in a random and mutually independent way affect the observations, such that the combined action of the two yields za rather than ζa as observed values, that is equations (4.2)
with weight νa, corresponding to the equations (2.2) in the simpler problem. With n + 1 observations (indexed from 0 to n) we thus have 2n + 1 equations to determine n + 3 unknowns x0, x1, …, xn, p, and q. By the method of least squares we obtain17(4.3)
where generally [νrs] = ν0r0s0 + ν1r1s1 + ⋯ + νnrnsn. When eliminating in these equations we still use ua and ua,a+1 given by (2.5) and (2.6); that is, the continued fraction (2.13) also plays a major role here. Also we use the same auxiliary variables as those denoted by y above, although here we use another symbol for these, namely {z}; that is, {z0} = z0 and generally
and further two sets of values, {f} and {g}, calculated in an analogous way from f and g: {f0} = f0, {g0} = g0, and generally
17 Compared with Thiele's original, an apparently redundant ‘⋯ +’ between νa,a+1xa + and νafap has been removed in this formula. tr.
and in addition some more complex albeit analogous auxiliary variables
all five of them being calculated by the same general recursion formula
beginning from {r0s0} = r0s0. Using these auxiliary variables the first n + 1 of the final equations become (4.4)
leading to two equations for determining p and q (4.5)
analogous to those one would have found in the usual way if za = pfa + qga had been subject only to ordinary random errors, that is
Here one can derive simplified versions of the recursion formulae for the auxiliary variables. For {ra+1} one obtains in analogy with (2.8) and (2.9) (4.6)
or (4.7) and for {ra + 1sa + 1} we have
or18
which by introducing yet another series of auxiliary variables can be used to obtain formulae which are adequate for practical calculations so that (RS)a is given by (4.8)
while using this to calculate {ra + 1sa + 1} as (4.9)
We finally note the formula
which can be of significance as a control of (4.8) and (4.9) as well as (4.6) and (4.7). After having derived the basic formulae for this problem, I should mention that when fa is constant and equal to f for all subscripts, it also holds that {fa} = f and {fara} = f{ra}, which implies that the constant term fp combines inseparably with the x so that xa + fp replaces xa in the equations (4.4) and the penultimate term vanishes, while the first of the equations in (4.5) remains unchanged and the second simplifies. Thus, in so far as the equation for the observed phenomenon contains a constant term, this cannot be separated from the errors x. Consequently the given solution applies to equations of the form ζa = 0 + pfa + qga without changes.
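Before turning to the numerical example, it may help to state what the adjustment of this section computes, stripped of the recursions. The following sketch is my own modern rendering, not Thiele's scheme: the names, data layout and use of a generic least-squares solver are all assumptions, and Thiele's recursions are of course far more economical. It simply stacks the weighted difference equations (4.1) and observation equations (4.2) and solves for x0, …, xn, p, q:

    import numpy as np

    # Minimize  sum_a nu_a (z_a - x_a - p f_a - q g_a)^2
    #         + sum_a nu_{a,a+1} (x_a - x_{a+1})^2
    # over x_0, ..., x_n, p, q with a generic least-squares solver.
    def adjust(z, f, g, nu, nu_pair):
        m = len(z)                       # m = n + 1 observations
        rows, rhs = [], []
        for a in range(m):               # observation equations, weight nu_a
            r = np.zeros(m + 2)
            r[a], r[m], r[m + 1] = 1.0, f[a], g[a]
            rows.append(np.sqrt(nu[a]) * r); rhs.append(np.sqrt(nu[a]) * z[a])
        for a in range(m - 1):           # difference equations, weight nu_{a,a+1}
            r = np.zeros(m + 2)
            r[a], r[a + 1] = 1.0, -1.0
            rows.append(np.sqrt(nu_pair[a]) * r); rhs.append(0.0)
        sol, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
        return sol[:m], sol[m], sol[m + 1]    # adjusted x_0..x_n, then p, q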
5 Numerical example
It would probably not be superfluous to illustrate the simplest case of this theory with a numerical example. For this purpose I choose a series of observations which have been taken during the determination of longitude between Lund and Copenhagen this summer to determine the index error of the micrometer on the meridian circle of Copenhagen. I have here used the same weight for all observations of the coincidence between a movable thread and a fixed thread, and I further
18 Two terms have been moved from the left-hand side to the right-hand side of this equation compared with Thiele's original. tr.
assume that the point of coincidence between the two threads, which appears not to have been constant, has varied in a completely random way from one moment to the next, and also that the mean change of the point of coincidence has been the same in each element of time, thus that νa,a+1 is inversely proportional to ta+1 − ta whether the micrometer has been used in the meantime or not; however, it will become apparent that I have attempted to test the last assumption by a special investigation. Of course I have assumed that both error laws are exponential and free of systematic errors, although it will turn out to be doubtful whether the latter has been quite justified. The observations are represented in units of … of a revolution = 0.2633″, and to save space I throw away what is common to all observations of tens, hundreds, etc., of this unit. In accordance with this, the units of weights Eν and Ew are determined such that Eν = 1 would mean that the standard deviation of a single coincidence was 0.2633″, while Ew = 1 would demand that the standard deviation of the displacement of the instrument in a single day was 0.2633″. From the observations contracted to averages, Eν was tentatively determined from 36 mutually independent equations, each with the weight of a single coincidence, having their sum of squared deviations equal to 2.902, yielding Eν = 12.4. From a tentative adjustment, based on assumptions which partly proved to be incorrect, I had found Eν = 12.5 and Ew = 24.8. These weights were used to calculate the values for νa and νa,a+1 which together with za form the basis for the following calculation. To ease the calculations I have in reality used the double weights everywhere, but for clarity I have reported the results of the calculations as if this were not done. An accuracy of two digits was sufficient. The fact that the sum ∑νaM²(xa) + ∑νa,a+1M²(xa − xa+1) is not exactly equal to 74 is essentially due to the uncertain determination of the terms in parentheses in the second part of the sum, corresponding to very small time intervals. For the determination of E′ν we thus have a sum of squares of 53.22 in 74 − 17.77 = 56.23 excess equations19 and further a sum of squares of 12.5 × 2.902 = 36.27 in 36 equations from the observations contracted to their averages; that is, in total 92.23 equations with a sum of squares equal to 89.46, whereby E′ν = 92.23/89.46 = 1.03. For E′w we have 73 − 56.23 = 16.77 excess equations and a sum of squares of 16.73, that is E′w = 1.00. The agreement is thus good enough; and thus the result is that the absolute weight of the observation of a single coincidence is equal to 12.9, corresponding to a standard deviation of ±0.278 = ±0.073″, while the stability of the micrometer in one day of stellar time is measured with the absolute
19 These are the ‘equivalent degrees of freedom’, see page 28. tr.
weight of 23.8, corresponding to a standard deviation in the same time period of ±0.205 = ±0.054″. Concerning our assumption that the weights νa,a+1 simply should have been inversely proportional to the lengths of the time intervals, it is seen from the following more specified table

    Interval                  excess equations    sum of squares    E′w
    less than 0.5 day         3.69                4.91              0.75
    between 0.5 and 1 day     7.08                6.15              1.15
    more than 1 day           6.00                5.67              1.06
that it seems the weights should have been set at somewhat smaller values in the short time intervals, where the micrometer has been heavily used; but this result is so uncertain, because of the modest number of equations, that it is not at all necessary to modify the assumption that the extent of usage of the micrometer has been without influence on its stability.

[Table: Thiele's adjustment of the 74 observations of July 1879, indexed a = 0, …, 73. Its columns give the date, the observation za, the weights νa and νa,a+1, the auxiliary quantities ua, ua,a+1 and ya, the adjusted values xa, the mean square errors M²(xa) and M²(xa − xa+1), the products νaM²(xa) and νa,a+1M²(xa − xa+1), the residuals za − xa and xa − xa+1, and the weighted squares νa(za − xa)² and νa,a+1(xa − xa+1)². The individual entries are too garbled in this extraction to be reproduced reliably.]

The question whether there are signs of systematic errors after this adjustment can for the same reason not be given a definitive answer. The phenomenon in the tables of residuals which could suggest a departure from the exponential error law, namely the modest number of sign changes in the differences xa − xa+1, cannot be said to prove such a departure; but on the other hand it is in itself quite probable that the changes in the micrometer are due to fluctuations in temperature and therefore associated with these. However, even if the necessary measurements of temperature did exist, it would probably not have been worth taking them into account. Finally, just to illustrate the use of eqn (2.20) with an example, I will compute the point of coincidence and its posterior standard error for the long interval between 21 and 26 July. For the intermediate times tb one gets
that is, the following determinations:

    t        x              likely limits
    21.79    2.51 ± 0.14    2.37 to 2.65
    22.00    2.54 ± 0.17    2.37 to 2.71
    23.00    2.70 ± 0.23    2.47 to 2.93
    24.00    2.86 ± 0.25    2.61 to 3.11
    25.00    3.02 ± 0.23    2.79 to 3.25
    26.00    3.18 ± 0.19    2.99 to 3.37
    26.66    3.29 ± 0.12    3.17 to 3.41
The uncertainty is maximal on 24.13 July with a standard deviation of ±0.246; that is, this determination of x is everywhere more accurate than what could have been obtained from a single observation of the coincidence.
CHAPTER THREE
Time series analysis in 1880: a discussion of contributions made by T. N. Thiele1
Presentation
This article, a reprint of Lauritzen (1981), tries to explain the basic contents of Thiele (1880) in modern terms. Twenty years have passed since its publication, and there are many things which I would have written differently today. Still, it may help the reader to appreciate Thiele's ingenious and original work. S.L.L.
Fig. 1 Photograph of Thiele at a mature age. From the archives of the Royal Danish Library.
1 Reprinted with permission from International Statistical Review.
1 Introduction
In the present paper we shall give a description and a discussion of a paper by T. N. Thiele (1880) on a particular time series model used by him in a problem of astronomical geodesy, more precisely in connection with the problem of determining the distance from Copenhagen to Lund (Sweden). Although the paper has been overlooked by today's statisticians, it contains remarkable results, results that are interesting even today and not just from a historical point of view. A short discussion of Thiele's model and method has survived in the sense that it is described in the textbook by Helmert (1907), which has been used as a basis for teaching statistics to geodesists until recent times. Thiele proposes a model consisting of a sum of a regression component, a Brownian motion and a white noise for his observations, although he does not use these terms himself. He solves the problem of estimating the regression coefficients and predicting the values of the Brownian motion by the method of least squares and gives an elegant recursive procedure for carrying out the calculations. The procedure is nowadays known as Kalman filtering (Kalman & Bucy, 1961). The iterative procedure used by Thiele to estimate the variances of the Brownian motion and the noise is related to the EM-algorithm described by Dempster et al. (1977) or, more precisely, identical to the algorithm given by Patterson & Thompson (1971) for variance component models. Thiele did not derive the distribution of his variance estimates, which is rather typical for statistical work at that time. In later work Thiele became interested in such problems, but in this particular paper they seem beyond his horizon. It is perhaps even more remarkable that this paper is the first paper written by Thiele on the method of least squares! There are obvious reasons for his paper to have been more or less neglected by other statisticians. Thiele is certainly not friendly to his readers and assumes these to have quite an exceptional knowledge and understanding of Gauss's method of least squares. His ideas seem to be so much ahead of his time (100 years) that his contemporaries did not have a chance to understand the paper and, maybe more important, to grasp the significance of the work. When later the time was ripe, the development of statistics was so much concentrated in England and the USA, where no one seemingly would dream of looking for essential contributions to statistics made by a Danish astronomer in 1880. My interest in this work arose partly from reading a version of the paper by Hald (1981) containing a short description of Thiele's paper, and partly because I had for some time been working with statistical description of hormone
concentrations in plasma during pregnancy, where the data seemed to be described perfectly by Thiele's model. In §7 we shall give a short description of the experiences in applying Thiele's procedure to that problem. The ‘quotations’ from Thiele's paper given here are not direct translations from the original paper, but made such as to convey the meaning, atmosphere and writing style of Thiele although modern notation and concepts are used.
2 The model
Before we proceed to discuss the statistical analysis performed by Thiele, we shall briefly sketch how he arrives at his model. Thiele wants to give a model describing the observation errors from a sequence of measurements obtained through time. He discusses first the empirical fact that such errors often appear as if they had a systematic component but emphasizes that this is not true since no procedure of correction seems to remove the phenomenon. Thus another explanation must be appropriate and he attributes the phenomenon to the fact that a (random) component of the error is accumulated through time. More precisely he considers measurements made by an instrument where part of the error is due to fluctuations of the position of the instrument itself. If X(t) is the position of the instrument at time t, the most likely position of the instrument at time t + Δt should be the position immediately before, that is X(t), and deviations from this should be governed by the normal distribution law. He then concludes that any sequence of instrument positions X(t0), …, X(tn), where t0, …, tn are consecutive time points, should have the property that the increments are independent, normally distributed with zero means and Var(X(ti+1) − X(ti)) = ∫ ω²(u) du, the integral being taken from ti to ti+1,
where ω²(u) is a function describing the average size of the square of the fluctuations at time u. In the special case where ω²(u) ≡ ω² we note that X(t) is what today is known as a Wiener process or Brownian motion. The ‘quasi-systematic’ variation in the errors is then supposed to be due to a process of the above type. The observations themselves are now supposed to be independent, normally distributed around the X-values. More precisely, for i = 0, 1, …, n let Z(ti) = X(ti) + ε(ti),
where ε(t0), ε(t1), …, ε(tn) are independent and independent of the X-process, normally distributed with expectation equal to zero and variances σi² (say).
I have deliberately not specified the joint distribution of all the variables completely (X(t0)'s distribution is unspecified). This is because Thiele does not either. We shall later see that, in fact, X(t0) plays the role of what we today would call a parameter in Thiele's statistical analysis, although Thiele does not make a clear distinction between a parameter and an unobserved random variable. Note the special case of the model obtained by assuming: (a) equidistant time points: ti = i; (b) constant variance in the time fluctuations: ω²(u) ≡ ω²; (c) constant measurement error: σi² ≡ σ². We then get for the differenced process Z(i) − Z(i − 1), i = 1, …, n,
that this is a stationary process with expectation equal to zero and covariance function γ(0) = ω² + 2σ², γ(±1) = −σ², and γ(h) = 0 for |h| ≥ 2,
which is a moving average process of order one. In other words, Z(0), Z(1), …, Z(n) is a sample from an ARIMA (0, 1, 1) process. An ARIMA (0, 1, 1) process even with missing observations is therefore a special case of Thiele's model and can be treated with Thiele's methods, to be described subsequently.
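To fix ideas, the model is easy to simulate. The following sketch is my own illustration, not part of the original paper; the parameter values, seed and sample size are arbitrary assumptions. It draws irregular observation times, a Brownian instrument path with variance rate ω², and adds white measurement noise with variance σ²:

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0.0, 30.0, size=75))   # irregular observation times
    omega2, sigma2 = 0.04, 0.01                    # illustrative variance rates

    # Brownian motion: independent normal increments with variance
    # omega2 * (t_{i+1} - t_i), started at an arbitrary level alpha.
    alpha = 3.0
    X = alpha + np.concatenate(
        ([0.0], np.cumsum(rng.normal(0.0, np.sqrt(omega2 * np.diff(t))))))

    # White measurement noise on top of the instrument position.
    Z = X + rng.normal(0.0, np.sqrt(sigma2), size=t.size)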
3 Least squares prediction of the Brownian motion
In Thiele's formulation the primary objective is to estimate the unknown values of X(t0), …, X(tn). As mentioned earlier he does not distinguish between a parameter and an unobserved random variable but treats these unknown quantities seemingly alike. Let us examine his procedure in some detail.
First, at this stage, Thiele considers the error variances known, X(t0), …, X(tn) unknown and Z(t0), …, Z(tn) known, i.e. observed. He then writes: ‘We get the following system of 2n + 1 equations with n + 1 unknowns: (3.1)
Solving these by the method of least squares leads to the system of n + 1 equations: (3.2)
which we shall now show how to solve’. Thiele's argument is as short as this, showing how he (of course) assumed the reader to be absolutely familiar with the method of least squares. Before we proceed to describe Thiele's recursive procedure we shall discuss in which sense the estimates X̂(ti) given by (3.2) (or predictions, as we would say today) are the ‘right’ ones. A rapid check will show that the values X̂(t0), …, X̂(tn) given by (3.2) minimize the quadratic form Q = ∑i (Z(ti) − X(ti))²/σi² + ∑i (X(ti) − X(ti−1))²/Var(X(ti) − X(ti−1)).
Apart from an additive constant we have Q = –2logf, where f is the joint density of X(t1), …, X(tn), Z(t0), …, Z(tn) where X(t0) is considered nonrandom, i.e. a parameter. For the sake of clarity we shall henceforth write X(t0) = α. The joint
density of X and Z where
can be factorized as
into the product of the marginal density of Z and the conditional of X given Z. The maximal value (for fixed α, Z) of the second factor does not depend on α whereby the entire expression can be maximized by letting (3.3)
where α^ is the maximum likelihood estimate of α based on Z(t0), …, Z(tn). Another argument, see e.g. Rao (1973, 4a.ll), shows that the ‘estimator’ of X(t0), …, X(tn) obtained by minimizing Q is least squares in the sense that for all linear combinations with coefficients λ0, …, λn
for all measurable functions k. It is worth noting that Thiele also considers the problem of ‘estimating’ the value of X(s) at a time s, where no observation has been made. He shows correctly that this is given as (3.5)
His method of obtaining this result is ingenious and elegant and also typical of his work. We shall therefore describe his argument. The situation where we have not observed X(s) must be equivalent to one where we introduce a fictitious observation and claim that we have an observation of X(s) with infinite variance. That is, we define Z(s) = z, where z is arbitrary but finite, and assume that
Calculate now X̂(s) = E(X(s) | Z(t0), …, Z(tn), Z(s)) using the usual system of equations (3.2), where of course σ−2 = 0 for the fictitious observation. This then leads to (3.5). This ‘method of the fictitious observation’ appears in different versions in other parts of his work as an elegant and useful trick (Hald, 1981).
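Since the system (3.2) is tridiagonal, the predictions can also be obtained by one direct linear solve. The sketch below is my own modern stand-in for Thiele's solution, under the simplifying assumption of constant noise variance σ² and increment variances ω²(ti − ti−1) as in the simulation above; the dense solver is used for clarity only:

    import numpy as np

    def smooth(t, z, sigma2, omega2):
        """Minimize Q = sum (z_i - x_i)^2 / sigma2
                       + sum (x_i - x_{i-1})^2 / (omega2 * (t_i - t_{i-1}))
        by solving the tridiagonal normal equations directly."""
        n = len(z)
        A = np.diag(np.full(n, 1.0 / sigma2))     # observation precisions
        k = 1.0 / (omega2 * np.diff(t))           # increment precisions
        for i in range(n - 1):
            A[i, i] += k[i];     A[i + 1, i + 1] += k[i]
            A[i, i + 1] -= k[i]; A[i + 1, i] -= k[i]
        return np.linalg.solve(A, z / sigma2)     # the predictions X-hat

Here X(t0) enters only through its own observation and the first increment, which matches the treatment of α = X(t0) as a parameter discussed above.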
4 Recursive solution to the prediction problem
With the computer capacity of 1880 it was of extreme importance to find a computationally simple way of solving the equations (3.2). In the numerical example treated by Thiele, 74 observations are recorded, and without a procedure utilizing the relatively simple structure in the equations, the computational work involved would be prohibitive. Thiele solves the problem by giving an elegant recursive procedure for the computations. The procedure consists of two parts.
Part I. Define the following set of coefficients: (4.1)
where i = 1, …, n. If we now let X*(ti) be the best predictor of X(ti) when only Z(t0), …, Z(ti) have been observed, i.e.
where α^(i) is the maximum likelihood estimate of α based on Z(t0), …, Z(ti), we have the following recursion formula: (4.2)
That this is correct is shown by Thiele by an induction argument demonstrating that X*(ti+1) fits into the equations (3.2) if X*(tj) for j ≤ i do. The computations are indeed very simple to carry out using just a table of reciprocals and a calculator with multiplication and addition. As (4.2) says, the new predictor is a weighted average of the old predictor and the new observation. Thiele does not directly give an intuitive ‘justification’ of formula (4.2) but from other parts of his paper it seems clear that his argument basically must be as follows. At each stage X*(ti) is the best possible measurement of the quantity X(ti) that can be obtained from Z(t0), …, Z(ti). Since the increments of the X-process have expectation equal to zero, X*(ti) is also a best measurement of X(ti+1) based on Z(t0), …, Z(ti). Z(ti+1) is also a measurement of X(ti+1) and the corresponding observation errors are independent. Thus the best way of combining these is to calculate the weighted average of X*(ti) and Z(ti+1) with the reciprocal variances as weights. Let now
then
and thus (4.3)
Further we get for the variance of the error of this average that
Combining this with the obvious fact that X*(t0) = Z(t0), we see that the coefficients given by (4.1) are just ui = 1/ψi and wi = 1/φi, and that (4.2) and (4.3) are equivalent.
This recursive procedure is the idea behind Kalman filtering (Kalman & Bucy, 1961). The idea goes back to Gauss (1823, Art. 35).
Part II. As a result of the first recursion we obtain for each value of i the best predictor of X(ti) based on Z(t0), …, Z(ti). We shall now perform a backwards recursion calculating X̂(ti) from these values as (4.4)
That this is correct is again shown by an induction argument. Again the form of (4.4) indicates that a heuristic argument of the same kind as in part I can be given although it gets slightly more complicated. Thiele now proceeds to give recursive procedures for calculating variances of the prediction errors. This is again done elegantly using the following important argument. It follows from (4.4) that X*(ti) are best estimates of the quantities (4.5)
The variance of the corresponding errors is thus given as
and for i < n
Further, the errors
are independent. This is typical for the way we have solved the linear equations. To calculate the variance of any prediction error of the type
we just have to express the linear combination in terms of X*(ti) such that
and thus also
whereby (4.6)
He attributes to personal communication with Professor Oppermann the observation that solving the normal equations by the given procedure always results in a system of functions that can be considered independent and as replacing the original observations, thus making the calculation of error variances etc. simple. Some years later Thiele spelled this idea out systematically by deriving the canonical form of the linear normal model and introducing the notion of a system of ‘free functions’, which is what today is called finding an orthonormal basis of a suitable type (Hald, 1981). Using (4.6) Thiele now gives a recursion for calculating
the variances of the prediction errors (4.7), where for later purposes it should be noted that both depend on the entire set of assumed variances, but not on α.
Finally it seems worth mentioning that Thiele gives a continued fraction representation of the solutions to the normal equations (3.2) as well as a description of how to obtain the values X^(ti) by geometrical construction.
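In modern notation, Parts I and II together form a forward filter followed by a backward smoothing pass. The sketch below is my own rendering with assumed names, for the same simplified constant-variance setting as before; it is the standard scalar Kalman filter with a Rauch-style smoother, and up to rounding its output reproduces the direct solution of (3.2):

    import numpy as np

    def filter_smooth(t, z, sigma2, omega2):
        n = len(z)
        xf = np.empty(n)   # X*(t_i): best predictor given Z(t_0), ..., Z(t_i)
        pf = np.empty(n)   # variance of the corresponding prediction error
        xf[0], pf[0] = z[0], sigma2
        for i in range(1, n):
            # predict: X*(t_{i-1}) also predicts X(t_i); its error variance
            # grows by the Brownian increment variance (Part I, step 1)
            p_pred = pf[i - 1] + omega2 * (t[i] - t[i - 1])
            # update: weighted average of prediction and new observation,
            # with the reciprocal variances as weights (Part I, step 2)
            w = p_pred / (p_pred + sigma2)
            xf[i] = xf[i - 1] + w * (z[i] - xf[i - 1])
            pf[i] = w * sigma2          # = 1 / (1/p_pred + 1/sigma2)
        # backwards recursion (Part II): combine each filtered value with
        # the smoothed value one step ahead
        xs = xf.copy()
        for i in range(n - 2, -1, -1):
            p_pred = pf[i] + omega2 * (t[i + 1] - t[i])
            g = pf[i] / p_pred
            xs[i] = xf[i] + g * (xs[i + 1] - xf[i])
        return xf, xs      # filtered X*(t_i) and smoothed X-hat(t_i)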
5 Thiele's estimation of the error variances
The estimation (prediction) described in the previous sections is based on the assumption that the error variances are completely known. Normally, however, the variances are unknown, and it will then often be of interest to consider the case where hi > 0, i = 0, …, n and ki > 0, i = 1, …, n are known and σ², ω² unknown. A typical situation could be hi = 1, ki = (ti − ti−1)−1. We are then faced with the problem of estimating σ² and ω². Thiele discusses this problem and gives a heuristic argument for his solution. We shall describe his method and argument and investigate it from a more exact point of view. First, Thiele claims that it seems appropriate to base estimates of σ² and ω² on the quadratic forms
where X̂(t0), …, X̂(tn) are calculated as described in the previous section from certain initial values of σ² and ω².
The first problem is to decide how the degrees of freedom should be allocated to the two quadratic forms. The total number of degrees of freedom must be n since we have used the method of least squares on 2n + 1 equations with n + 1 unknowns. It seems appropriate to let (5.1)
where f1 + f2 = n and (f1, f2) are chosen in a reasonable way. In analogy with usual least squares, where only a variance has to be estimated, it seems plausible that one should subtract from the number of terms the relative amount of variation due to the estimation of X(ti). More precisely let
where the quantities subtracted are computed from the prediction error variances in (4.7), based on the assumed values of σ² and ω². Note that in fact f1 + f2 = n, since the total amount subtracted is the trace of the matrix of a projection onto an (n + 1)-dimensional subspace of R2n+1 and thus equal to n + 1. Using (5.1) we obtain new values σ̃² and ω̃² of σ² and ω², and we then repeat the procedure in the sense that new estimates X̂(ti) are calculated, new values for Q1, Q2, f1 and f2, etc. The procedure is to be repeated until stable values of σ², ω² are reached. According to Thiele one has to do that three or four times. The final stable values of σ̃², ω̃² are then used as estimates of σ² and ω². Thiele writes that ‘it seems at least plausible’ that the procedure is correct. Formulating Thiele's estimation method more precisely, we see that his estimates X̂(t0), …, X̂(tn), σ̃², ω̃² satisfy the system of n + 3 equations obtained by taking (3.2), inserting into these the values σ̃² and ω̃², and supplementing with the equations (5.3)
where the sums are over i = 0, …, n for σ̃² and over 1, …, n for ω̃². Rearranging (5.3) we get (5.4)
where Δi = X̂(ti) − X̂(ti−1). The form (5.4) will be convenient in the next section. Thiele does not discuss the problems of existence and uniqueness of solutions to equations (5.4) and (3.2) combined, nor does he discuss convergence properties of the iterative procedure to solve these equations, beyond the remarks mentioned earlier that it has to be performed ‘three or four times’. We shall return to this problem later.
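The fixed-point character of the procedure is perhaps easiest to see in code. The sketch below is not Thiele's exact degrees-of-freedom bookkeeping but the closely related EM-style update discussed in the next section; the posterior variances are read off from a dense precision matrix for clarity only, and all names and the fixed number of sweeps are my own assumptions:

    import numpy as np

    def estimate_variances(t, z, sigma2, omega2, sweeps=50):
        n = len(z)
        for _ in range(sweeps):
            # precision matrix of X given Z, as in smooth() above
            A = np.diag(np.full(n, 1.0 / sigma2))
            k = 1.0 / (omega2 * np.diff(t))
            for i in range(n - 1):
                A[i, i] += k[i];     A[i + 1, i + 1] += k[i]
                A[i, i + 1] -= k[i]; A[i + 1, i] -= k[i]
            C = np.linalg.inv(A)          # posterior covariance of X
            x = C @ (z / sigma2)          # posterior mean, i.e. X-hat
            d = np.diff(x)
            # EM-flavoured updates: residual squares plus the posterior
            # variances of the corresponding residuals
            v_obs = np.diag(C)
            v_inc = v_obs[:-1] + v_obs[1:] - 2 * np.diag(C, 1)
            sigma2 = np.mean((z - x) ** 2 + v_obs)
            omega2 = np.mean((d ** 2 + v_inc) / np.diff(t))
            # (Thiele's variant instead divides by the 'equivalent
            #  degrees of freedom', giving the REML-type estimates.)
        return sigma2, omega2

Whether the iteration settles at sensible values or drifts towards the boundary σ² = 0 or ω² = 0 depends on the data, a point taken up in §§6–7.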
6 Discussion of procedures for estimation of the error variances
We shall first reformulate the problem slightly by forgetting the prediction problem and considering the problem of estimating the unknown values of α, σ², ω² based on
the observations
where (Z(t0),…, Z(tn)) has the joint normal distribution specified by Thiele's model. We shall treat the problem as a problem of estimation in an exponential family with incomplete observation (Sundberg, 1974). That is, we shall use that we can think of Z(ti) as
where Y(t) = X(t) − α is a Gaussian process with expectation equal to zero, independent increments with
Y(t0) = 0, and ε(ti) independent and independent of the Y's, with expectation equal to zero and the variances specified above.
Suppose that we had observed not only Z(ti) = zi, i = 0, …, n but also Y(ti) = yi for i = 0, …, n, where y0 = 0. The likelihood of α, ω2, σ2 would then be given as
Thus we see that we deal with an exponential family with canonical statistics
From general theory, Sundberg (1974), it now follows that we get the likelihood equations in the case where only Z has been observed by equating the expectation of these statistics to their conditional expectations given the observed value of Z. We get for the expectations (6.1)
and the conditional expectations need to be worked upon a bit: (6.2)
where we have let
where Xα(ti) = Eα(X(ti) | z), such that Thiele's X^(ti)-values are given as (6.3)
The term E((Ŷ(ti) − Y(ti))² | z) is just a conditional variance and does therefore not depend on z, so that we can proceed as (6.4)
Similarly we get for S2: (6.5)
and finally for S3: (6.6)
We now form the equations obtained by equating (6.1) to (6.4), (6.5) and (6.6) and get from the last of these that the solution α^ satisfies
Inserting this into the two first equations gives us the equations (6.7)
where we have made clear how the expectations depend on the parameters. To see that these equations are very similar to Thiele's (5.4) we use (6.3) and the fact that
and
whereby we see that the only difference between Thiele's equations (5.4) and the maximum likelihood equation (6.7) is the term
on the right-hand side of the first equation. In the limiting case with ki ≡ ∞, that is where the Brownian motion is vanishing and Z(t0), …, Z(tn) become independent we have
whereby (6.7) reduces to the equation (6.8)
whereas (5.4) becomes (6.9)
so that Thiele's equations correspond to the usual way of taking into account that a linear parameter has been estimated. By an argument similar to the above, Thiele's equations can be seen to be the maximum likelihood equations based on the marginal distribution of the n linear ‘contrasts’ Z(t1) − Z(t0), …, Z(tn) − Z(t0) (or any other set of n linearly independent contrasts).
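To make the contrast concrete in this iid limiting case (a standard computation, not spelled out in the original), with Z(t0), …, Z(tn) independent N(α, σ²) the two equations solve to

\[
\hat\sigma^2_{(6.8)} \;=\; \frac{1}{n+1}\sum_{i=0}^{n}\bigl(z_i-\bar z\bigr)^2,
\qquad
\tilde\sigma^2_{(6.9)} \;=\; \frac{1}{n}\sum_{i=0}^{n}\bigl(z_i-\bar z\bigr)^2,
\]

the latter being the familiar correction for the single estimated linear parameter α.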
Patterson & Thompson (1971) derived the same algorithm as Thiele for estimation of variance components and called the estimates ‘restricted maximum likelihood estimates’. Harville (1977) has discussed the properties of the algorithm, although on a heuristic and empirical basis. Since the algorithm is a special case of the more general EM-algorithm discussed by Dempster et al. (1977), one can say a bit more about its properties. If the starting values do not happen to be a saddle point of the likelihood function, it either converges to a local maximum of the likelihood function or diverges in the sense that either σ² → 0 or ω² → 0. Each step of the algorithm increases the likelihood. There seem to be no simple conditions for the equations to have a unique solution. Thiele seemed unaware of these problems, not even mentioning them. A guess would be that his long series of observations ensured that he never encountered the problem in practice. Also it seems as if he did not pay much attention at all to the variance estimation and only wanted to correct the initial values of (σ², ω²) as far as this gave a significant change in the values of X̂(ti) that were of primary interest to him. It is anyway quite typical for the time period that variances only had secondary importance. The last section of Thiele's paper contains a worked out example with a series of 74 astronomical measurements and here he has no problems.
7 An application
In the final part of Thiele's paper he extends the model by the inclusion of a linear regression term in the sense that he considers the problem
where f1, …, fk are known functions and α1, …, αk are unknown constants. He gives a similar recursive procedure for this problem although he leaves the proofs to the reader. Working with hormone concentrations in pregnancy some time ago (Lauritzen, 1976) I used the model of the form
for the logarithm of concentrations of progesterone in plasma. The data showed good fit, but indicated that a model of the type
would be more realistic and give a better fit, although the quality and amount of data prevented me from pursuing the issue further. Recently Mogens Christensen, Aalborg Hospital, has provided me with measurements of concentrations of the hormone HPL on 69 pregnant women, taken at various time points during pregnancy and with 4–11 observations for each woman. It seemed natural to try out Thiele's model and algorithm on this data set. Of course one could expect difficulties because of the very short observational series and this was indeed the case. The algorithm did not converge for 54 of the 69 series. Thus it looks as if the distribution of the solutions to the estimating equations has a lot of mass on the boundary, and the only way to get around this was to assume the variances of the Brownian motion and the white noise to be identical for all women. Under this assumption the algorithm converged in the sense that after 24 iterations the estimates for the variances did not change by more than 1 per cent, and after 42 iterations not by more than 0.1 per cent. The stable values of σ², ω² reached were quite sensible compared to other knowledge, and examination of residuals showed a good fit. On the other hand 42 iterations is a lot more than ‘three or four times’ as Thiele writes. Nowadays these iterations can be performed quickly and cheaply on a high speed computer, but in 1880 … ? Note that in this particular application it is quite important that one uses Thiele's ‘restricted maximum likelihood estimates’ rather than the maximum likelihood estimates themselves, since otherwise the different values of α and β for different women would create a nuisance parameter effect and give rise to useless estimates of σ² and ω².
8 Final comments
Even though Thiele did not fully discuss the difficulties concerning the estimation of the error variances, it is amazing that he got so far in the understanding of the time series model that he discussed. The most striking contrast between Thiele's ‘approach’ to time series analysis and time series analysis as of today seems to be the very detailed analysis of one particular model, as opposed to the more modern tendency of investigating large classes of models without really trying to understand or utilize the particular structure of each of them. Also the way Thiele establishes his model is closely
related to a particular practical problem where he takes very much into account how the observations in fact have been produced. Sometimes one can get the impression that model building in time series analysis today is made completely independent of how the data have been generated. Could Thiele's success in 1880 encourage modern statisticians to make detailed investigations of particular models?
Acknowledgements
I wish to thank A. Hald for several discussions on the topics in the present paper and for reading the manuscript and suggesting several important improvements.
Résumé
We discuss an article written by T. N. Thiele in 1880 in which he studies a time series model composed as a sum of a regression, a Brownian motion and a process of independent random variables. He derives a recursive method for the estimation of the regression and for the prediction of the Brownian motion; the method is known today as the Kalman filter. He derives an iterative method for the estimation of the unknown variances (essentially the EM algorithm). Finally we mention an application of the model concerning the production of hormones during normal human pregnancy.
CHAPTER FOUR
The general theory of observations: Calculus of probability and the method of least squares

Presentation
The following chapter contains the translation of Thiele's most fundamental book on statistics from 1889. At times it can be difficult to read, both because of the rather complex and old-fashioned Danish language used in the original, and because Thiele did not have Fisher's clear terminological distinctions, such as parameter, estimate, statistic, etc., at his disposal. However, Thiele definitely had these distinctions clear in his mind. Here and there I have supplemented the text with notes, attempting to assist the reader. It may still be hard work going through these pages. But do so, it is worth it!
Fig. 1 Thiele as a senior researcher. From the archives of the Royal Danish Library.
Fig. 2 Cover page of the original Danish version of Thiele (1889).
Contents of Thiele (1889)
1 On the relation of the law of causality to observation
  1.1 Accidental and systematic errors
  1.2 Actual and theoretical error laws
2 On actual error laws
  2.1 Actual error laws and relative frequency
    2.1.1 Relative frequency as a function of the observed value
    2.1.2 Error curves
    2.1.3 Grouping
    2.1.4 Geometric properties of error curves and suitable types of series expansions
    2.1.5 Series using binomial coefficients
    2.1.6 Series using the exponential error law
  2.2 Actual error laws expressed by symmetric functions
    2.2.1 Coefficients of equations
    2.2.2 Sums of powers
    2.2.3 Halfinvariants
    2.2.4 Examples
    2.2.5 Halfinvariants for the binomial law
    2.2.6 Halfinvariants for the exponential error law
    2.2.7 Halfinvariants for series using exponential error laws
  2.3 Actual error laws for functions of observed variables
    2.3.1 The univariate case
    2.3.2 The multivariate case
    2.3.3 Actual error laws for linear functions in terms of symmetric functions
    2.3.4 Examples
    2.3.5 On the tendency of error laws to approach the exponential form
3 The hypothetical or theoretical error law of the method of observation. The law of large numbers
  3.1 Probability
    3.1.1 Main theorems from the direct calculus of probability
    3.1.2 Examples
    3.1.3 Mathematical expectation
    3.1.4 Examples concerning life insurance and life annuity
    3.1.5 Indirect or posterior determination of probability from observed frequency
  3.2 The error law of the method expressed by symmetric functions
    3.2.1 The error law of the average
    3.2.2 The error law of the standard deviation
    3.2.3 The error laws of the higher halfinvariants
    3.2.4 The empirical determination of the error law of the method
  3.3 On adjustment
    3.3.1 Reflections on the possibility and the conditions for a general method of adjustment
    3.3.2 The conditions for the method of least squares
    3.3.3 Weights for adjustment; normal values
    3.3.4 Special rules for elimination and determination of the unknowns in linear equations for observations
    3.3.5 Theorems about systems of mutually free functions
  3.4 The principle of adjustment by the method of least squares
    3.4.1 Adjustment by correlates
    3.4.2 Determination of the adjusted means
    3.4.3 Determination of the standard deviation after adjustment
    3.4.4 Error criticism
    3.4.5 The method of least squares as a minimization problem
    3.4.6 An example
  3.5 Adjustment by elements
    3.5.1 The table for adjustment by elements: simultaneously eliminating and making the functions free
    3.5.2 Determining the adjusted values for the elements or for given functions of these
    3.5.3 The adjusted values for the observations
    3.5.4 Error criticism and the minimization theorem
    3.5.5 Undetermined equations, tests, and the method of fictitious observations
    3.5.6 Examples
  3.6 On systematic errors
4 Tables and Figures
1 On the relation of the law of causality to observation
Man can only acquire knowledge about Nature by assuming the general law of causation: that everything which happens is a necessary consequence of the preceding state. Isolated sensation is not sufficient for this knowledge; it is necessary to obtain a stable combination of several observations under circumstances which in various ways can be attributed to the relation between cause and effect. But observation is exactly the combination of sensations, between which there is or is conceived to be dependence. An observation would be perfect if it determined a phenomenon under specification of all the circumstances that have caused it; such observations would coincide with each other. Under a repetition with all causes being unchanged and the same, the determination of the phenomenon would have to be the same, since the law of causation implies that any discrepancy between the determinations must have a cause and thus be due to differences in one or more of the acting causes. Any understanding of what happens in the world must primarily be based on knowledge about which circumstances influence any given phenomenon and which do not; next it becomes important to know the laws which determine the effect of each of these circumstances on the observation. On the one hand such knowledge is necessary for the quality of the observations, on the other hand it is only possible to obtain this knowledge by observation. Because of this complicated relationship, hypotheses become necessary for all observation and for all knowledge about Nature and the world. At any real
observation, one must assume that the phenomenon has specific circumstances that are essential, and no other additional causes. If this hypothesis is incorrect—and this is not just most commonly so, but in fact almost always the case—then repeated observations will be subject to error. In repetitions without any change in the circumstances considered to be essential, the phenomenon under observation will yield differing results. So the hypothesis should be rejected. However, it is a basic rule in the endeavour to achieve perfection that one should not reject the imperfect before it can be replaced with something better. So when a hypothesis does not provide complete coincidence but still leads to deviations which stay within narrow limits, errors remain small; and when among the numerous circumstances that have been ignored by hypothesis as being inessential one cannot determine a few which affect the observation to any significant degree, then one should not obscure the observations by impenetrable specification of additional conditions based upon loose conjectures. It is much better provisionally to maintain the simple hypothesis and as a consequence accept that the observations are subject to error. But as the observations become subject to error, it is necessary to prepare for the consequences of conclusions made on the basis of imperfect observations and—more specifically—for the influence of the errors on calculations made with observed numbers. And in this way the problems arise which, to the extent that these can be dealt with by general methods, will occupy us in the general theory of observations and their errors. The methods of this theory demand that one should begin by treating cases where the observations are simple repetitions without any change in the essential circumstances, and only after these have been treated, at least to a certain degree, should one attack more complex problems where some of the essential circumstances of the observations have changed, and where through additional hypotheses one must seek to determine the change of the phenomenon as a function of its essential circumstances.
1.1 Accidental and systematic errors
When the phenomenon under observation can be described by a number o and similarly its acting causes by numbers v1, v2, …, vn for the essential2 circumstances and u1, u2, …, u∞ for its ignored, inessential circumstances, the law of
2 In the Danish language, ‘essential’ is ‘væsentlig’ and begins with a ‘v’. Similarly ‘inessential’ is ‘uvæsentlig’ and begins with a ‘u’. tr.
causality in its strong form can be expressed as3 (1.1)
but because of the actual, curtailed observation it is conceived as (1.2)
Thus, according to the basic hypothesis of observation, the inessential circumstances are assumed constant. The errors that occur as a consequence of this assumption are called accidental errors; they are unavoidable unless observations exist where all real circumstances are treated as essential. As the function F is unknown in general it cannot be avoided that the observations commonly be treated under a different and incorrect assumption (1.3)
about the shape of the function. Errors which are consequences of this have a less innocent character than accidental errors. They are distinguished as systematic errors. These should as far as possible be identified and corrected; when this has not been done successfully, the remaining systematic errors bear witness to the shortcomings of the way the problem has been handled. It is an important but difficult art to recognize these different types of error. This is most reliably done when studying simple repetitions, where v1, v2, …, vn have been constant, and therefore only accidental errors show up in the mutual differences between repetitions. Where v1, …, vn have been varying, systematic errors will reveal themselves by producing a trend in the errors, deviations that are seen to change as a function of one or more of the essential circumstances.
1.2 Actual and theoretical error laws
The theory of observation is a practical science and as such it must distance itself from pure mathematics and from the ideal determinations used in theoretical physics. This must be pointed out and emphasized because the language used, also within the theory of observation, may easily conceal this fact. An elaborate description of the circumstances associated with the observations is not desirable, so in the description of the observations one omits and tacitly assumes everything
3 This formal equation might be the first explicit general structural equation (Wright 1921; Haavelmo 1943). Modern notation would combine the inessential circumstances into a single noise variable ε (Pearl 1998). Thiele does not consider general systems of structural equations. tr.
which can possibly be omitted; however, in so doing, these descriptions appear to pretend to contain the ideal truth itself. This is certainly not the intention of any scientific observer; but one is not completely innocent when individuals with undeveloped judgement are appalled by one observer reporting the distance to the Sun as being a couple of million kilometres longer than another, in spite of this distance ideally being a definite number, or when dubious statisticians see an important change in any insignificant fluctuation of an observed probability. The human proneness to theorize is so rooted that I almost do not have to say that from the observations one can derive approximate determinations for the ideal numbers. There is a much stronger need to stress that the ideal truth itself is unreachable by observation, and one cannot give an experimental proof that the results of the observations can be considered an approximation to this. When an observation is repeated without any change in the circumstances considered to be essential, and the results still differ, then there is not a single true value for the result of all these repetitions such that the deviation of each observation from this value should be considered as experimental sins. Only imaginary results or manipulated reporting of circumstances and conditions for observations deserve this predicate. Each deviating observation is rather a necessarily correct answer to the question actually asked. The fact that observations do not coincide proves either that one has not completely known which question was asked of Nature, or that via a hypothesis one has shut one's eyes to one thing or another. Thus we have what we could call repeated observations when in
all the essential circumstances v1,…, vn have been constant, whereas the unknown u1,…, u∞ in general have changed from one repetition to another, and hence we do not find a single value, but a number of mutually different values, o′, o″, …. The basic hypothesis of observation that a single observation should be determined as
cannot give any answer to the question of how to calculate the true value o from o′, o″, …, even though one could derive the function f from F by inserting constant, but unknown, values for u1, …, u∞. Prevented from direct inspection of u1, …, u∞ by necessity and hypothesis, the theory of observation can at most achieve to give limits for the consequent undeterminedness and to describe the way in which the values have been and could be expected to be distributed within
these limits; this is what is called the error law.4 For the determination of this law no other material is available than a finite, more or less limited number of mutually deviating results of observations. Hence even this summary determination of the indeterminacy of the observations cannot be made without assumptions; a hypothetical law is necessary to allow deduction from previous experiences to expected ones. As it is important to be able to describe one's assumptions at any time, we shall in the sequel distinguish between actual error laws that without any assumptions only serve to describe the actual deviations in convenient ways, and theoretical error laws for future observations that must rest on assumptions for which no full and generally valid proof can be given. As a part of the error law a value could appear from each series of repetitions, which would have been the result, had these all been identical, and around which the observations were grouped in a way determined by other aspects of this error law. One would be very much inclined to substitute such ‘averages’ for the ideal values o that according to the theory should be equal to certain functions of v1,…, vn. But the theory of observation would not yield a general proof of the permissibility of such a substitution. It can only decide whether a hypothetical dependence between systems of such averages can be assumed to fit with the observations.
2 On actual error laws

2.1 Actual error laws and relative frequency

When a number of repeated observations are available, grouping and counting identical observations will be the simplest and most commonly useful way of representing the actual error law; and to increase clarity it is natural to think of replacing the numbers n1, n2,…, nr for the different observations with their total

n = n1 + n2 + ⋯ + nr

and their relative frequencies, namely the ratios n1/n, n2/n, …, nr/n. Reporting the total and the relative frequencies clearly gives a better insight into the peculiarities of the series of observations, and it is in particular favourable
[4] Thiele uses the term 'error law' for a distribution. An 'actual error law' is an empirical distribution in modern terms. tr.
when several collections of observations are to be compared. This remedy does, however, lead to the minor inconvenience that a relation will hold for the relative frequencies, (2.1)

n1/n + n2/n + ⋯ + nr/n = 1,

determining each of these frequencies in terms of the others. Usually one uses the word probability for what we have called relative frequencies here; this use of language is dubious and must be avoided, since the word probability, as it is normally used, in itself relies on assumptions that are unjustified when dealing with actual error laws; even the addition of the adjectives actual or posterior to the probability cannot eliminate the danger of unconscious import of theoretical assumptions. In cases where the observations cannot be represented by numbers, the listing of relative frequencies is the best way to specify the actual error law; this holds for certain statistical observations with more than one possibility, for example the frequencies of marriage of bachelor with girl, bachelor with widow, widower with girl, and widower with widow. But whenever the observations are represented by different values of a variable, and in particular by a real number, it is possible to obtain an even more favourable representation of the error law. A convenient but only little informative representation of an actual error law is to report the highest and lowest value among a series of repeated observations. Even when supplemented with the average, this gives only a rough summary of what can be learned from the whole series of observations, and especially one will miss a record of the relative frequency of the extreme values.
2.1.1 Relative frequency as a function of the observed value

To determine the actual error law exhaustively in terms of relative frequencies, one should determine the functional relationship between the values of the observations and their relative frequency. This can be done partly graphically, partly by computation of the constants in suitable series expansions; but in both cases it is not redundant to emphasize that although the numbers in some cases, in particular for counts, should be understood immediately as the observed value, in other cases, in particular for all observations of measurements, the numbers are rounded, and they should then be understood as an indication that the observation lies within certain limits, most often the number stated plus or minus half of the unit of the last decimal.
2.1.2 Error curves

For counts and other observations with exact numbers, in particular when these by the nature of the case could have occurred as all the terms in a (bounded) arithmetic series, it can be recommended to construct points with the observed value as the abscissa and the frequency as the ordinate. One would do well to accentuate these points by drawing a curve through them, as simple and visually attractive as possible: an error curve. But one should not forget that this curve has no meaning outside its intersections with the equidistant ordinates. If one dares to introduce assumptions concerning which aspects of the error law are essential and which inessential, the error curve may not be drawn exactly through the observed points, but such that with few and small deviations from these it assumes a rather simple and beautiful shape (graphical adjustment); but one should be careful, in particular at the beginner's stage, and rather get accustomed to drawing the curves with all the wiggles needed to obtain exact intersection, and, when looking at these, ignoring the associated exaggerations.
2.1.3 Grouping

If the individual values of the observations—the abscissae—have been irregularly distributed, or when some other easily understandable reason has resulted in gathering relatively many observations at particular abscissae, or relatively few on quite many abscissae, it is most correct to collect and group neighbouring observations. For example, when the statistics of duration of contracts in years are as follows:
For grouped observations the error curve is constructed such that the areas enclosed by the curve, the axis of abscissae, and the ordinates for the limits between the grouped observations are proportional to their number. One may begin this construction by drawing rectangles with these areas over the given parts of the axis of abscissae, and then easily transform them to others of the same area, bounded by line segments, such that the upper bounds have almost the same direction as expected for the tangents of the final error curve. Then one can easily draw a curve by hand that with sufficient accuracy fulfils the condition that the areas are proportional to the numbers of observations. If the limits between consecutive grouped observations have a constant difference, one can also directly construct a kind of error curve in the same way as for counts, by taking the grouped values as abscissae and their numbers or relative frequencies as ordinates; these two types of error curves are related in such a way that the first kind represents differential coefficients and the second kind finite differences of the same third curve, which has its ordinates proportional to the relative frequency of observations with smaller (or larger) values than indicated by the abscissa. Thus this becomes, if desired, a third kind of error curve.
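A minimal computational sketch of the area rule for grouped observations may be helpful; the contract-duration intervals and counts below are invented for illustration, since Thiele's table is not reproduced here. Each rectangle's height is chosen so that its area equals the relative frequency of the group:

bins = [(0, 1), (1, 2), (2, 5), (5, 10)]   # hypothetical grouping intervals (years)
counts = [40, 25, 20, 15]
total = sum(counts)
# height * width must be proportional to the number of observations in the group
heights = [c / (total * (b - a)) for (a, b), c in zip(bins, counts)]
# the enclosed areas then reproduce the relative frequencies exactly
assert all(abs(h * (b - a) - c / total) < 1e-12
           for (a, b), h, c in zip(bins, heights, counts))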
2.1.4 Geometric properties of error curves and suitable types of series expansions

As any relative frequency is indicated by a positive number, the curves of the first two kinds and the one for counts must lie totally on one side of the axis of abscissae; the third kind must lie between two parallel straight lines, touching one at −∞ and the other at +∞, and its ordinate must either always increase or always decrease. And although it is very wrong to assume that there is a general form of error law for all observations, another property can be established for all error curves: the variation of observations under simple repetition stays mostly within limits that seem to be a finite distance apart, as the frequency decreases rapidly towards these limits. This is partly due to the fact that the basic hypotheses, delineating essential from inessential circumstances, are not formulated casually, but based upon thorough previous investigations or at least on common sense, and it would be easy to discover that some circumstance has a genuine influence when its presence gives rise to very large deviations. Also, the world that we observe seems unmistakably to be such that, almost without exception, any circumstance which is not constant still has finite limits of variation or is even periodic. Nature has 'horror infiniti'. As a consequence, the error curves, both for counts and the first two kinds for rounded measurements, always have the property that only in a short interval do they deviate
from the axis of abscissae and otherwise follow it exactly towards the infinite or rapidly approach it as an asymptote at both ends. The third kind of error curve follows one of the lines in the same way until, with a short turn and at least one inflexion, it reaches the second line. Without establishing this as a law of Nature, it is a fact that must determine the choice of functions serving in the series expansions which we now have to set up for the algebraic treatment of the error laws. Series expansions that conveniently should give an algebraic expression for the error law and represent the frequency of repeated observations as a function of the observed values could not have the otherwise most commonly applied forms, expansions in terms of integer powers or periodic functions, as these functions do not vanish for large, not even infinitely large, values of the independent variable. On the other hand, one can favourably apply the binomial coefficients[5]
or functions that are powers with exponents equal to ∞ for x = ±∞, in particular
and functions derived from these. The first one is exactly equal to 0 for all integer values of x apart from the n + 1 values from 0 to n, and it is particularly suited to represent counts, where only the values for integer arguments matter. The other exponential form is preferable for measurements; it is not exactly equal to 0 for extremely large arguments, but its asymptotic approach to this value is so rapid that the deviations are harmless. By choosing appropriate values of their constants one can easily make from both functions such a variety that rather many types of series can be produced. Among these we will only deal with those that in usual cases converge most rapidly. To represent an actual error law with m different observational values with full accuracy, a series with m terms is clearly needed, but for approximate representation it is important that the series is chosen and ordered in such a way that the individual coefficients of each term can be separately calculated, and such that the first terms of the series immediately determine the essential features of the function, whereas the later terms lose in significance and ultimately only relate to trivial matters without significance for anything but a completely exact reproduction of the actual error law. However, at this stage we are not able to distinguish essential and inessential aspects of the error law, whence we must preserve the possibility of an exact reproduction of one using the other, and yet it is our duty to present
[5] Thiele uses βn(x) for the binomial coefficient and … for n!. tr.
the series immediately in the form which we later can show to have advantages. Since I partly owe the right choice to the work and remarks of Dr Gram, I will refer to his doctoral thesis 'Om Rækkeudviklinger',[6] Copenhagen 1879, and state that the series below are series for interpolation with weights inversely proportional to the leading term of the series, the binomial coefficients, and the exponential error function.[7]
2.1.5 Series using binomial coefficients

For the frequency ϑ(x) of counts of x we recommend the series[8] of binomial coefficients and their finite differences, (2.2)

ϑ(x) = b0 βn(x) + b1 Δβn−1(x) + b2 Δ²βn−2(x) + ⋯,

where[9]
The calculation of the coefficients from the number of repetitions and conversely is most easily made using the following tables, given in full for n = 1, 2, 3, 4 and only one-quarter of the others, since the symmetries mentioned below make it easy to obtain the remaining ones. If the tables are considered as tables with two arguments such that the number to the right of φs and just under Δrβn−r is considered a function of s and r,
[6] This is Gram (1879). tr.
[7] The exponential error function is what is now known as the normal distribution. tr.
[8] Hald (2000b) calls these the Thiele series of type B. tr.
[9] To make sense of the tables, I rather get Δφ(x) = φ(x) − φ(x − 1). tr.
the symmetry relation mentioned will be
To calculate the approximate and accurate values of the frequency of the observation x under the assumption that the coefficients br are known, the rule is
Conversely, to calculate the coefficients bs from the known frequencies of the observations, we have (2.3)
Thus the tables in this form serve equally well to solve the direct and the indirect task: to substitute an algebraic expression for the single empirical values, and again to calculate these from the given expression. As an example of this expansion of the error law, we consider the 500 observations mentioned in Section 2.2.4.10 Here we have value cases
7 3
8 7
9 35
From this we find the values for
10
10 101
11 89 :
In the section mentioned, Thiele refers back to this page! tr.
12 94
13 70
14 46
15 30
16 15
17 4
18 5
19 1
73
GENERAL THEORY OF OBSERVATIONS
and dividing these by 211 = 4096 we get the expansion
When recalculating the frequency of the individual values, such that one gradually includes subsequent terms compared with the previous calculation, the following numbers are obtained instead of the actual observations: value 1 term 2 3 4 5 6 7 8 9 10 11 12 all
7 0 0 1 1 2 1 2 2 3 3 3 3 3
8 2 4 9 10 12 11 11 10 8 6 7 7 7
9 8 20 36 38 39 40 37 35 36 39 37 36 35
10 27 57 81 82 78 82 82 88 94 93 96 99 101
11 60 106 115 110 115 103 112 115 106 99 98 93 89
12 97 133 104 97 99 92 92 81 77 86 83 87 94
13 113 113 62 62 70 70 59 59 73 73 78 78 70
14 97 60 31 39 41 48 48 59 55 46 43 39 46
15 60 14 24 29 25 26 35 32 23 30 29 34 30
16 27 −4 20 20 16 13 13 7 13 14 16 13 15
17 8 −4 12 9 10 8 5 7 8 5 3 5 4
18 2 −1 4 2 3 4 4 5 3 5 5 5 5
19 0 0 0 0 0 1 1 1 1 1 1 1 1
It is not to be expected that the error law in question can be expanded with rather few terms. As later analysis will confirm, its properties are far too unusual compared with the cases that this method is designed to deal with. Our error law in this case is no doubt skew, and the observed phenomenon would under frequent repetitions let all observations from 4 to 28 show up, whereas in the 500 repetitions above only the values from 7 to 19 are seen. The first two terms of the series could not be expected to suffice, but with the third term the essential feature already appears, and with the first six terms the quality of the expansion culminates, so that the following terms do not do much apart from forcing the strong, but obviously accidental, deviation in the frequencies for the values 10 and 11.

[11] The inequality sign has been reversed compared with Thiele's original. tr.
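Since the interpolation tables behind (2.2) and (2.3) are not reproduced here, the following Python sketch fits the coefficients of the binomial-coefficient series by ordinary least squares instead; the resulting b values need not coincide with Thiele's, and the choices n = 11 (suggested by the divisor 2^11 = 4096) and the alignment of the observed values 7,…,19 with the arguments 0,…,12 are assumptions of this sketch:

from math import comb
import numpy as np

values = np.arange(7, 20)
cases = np.array([3, 7, 35, 101, 89, 94, 70, 46, 30, 15, 4, 5, 1])

def term(r, n=11, shift=7):
    # r-th basis function Delta^r beta_{n-r} on the observed values; the
    # difference is taken as in footnote [9]: (Delta f)(x) = f(x) - f(x - 1)
    xs = np.arange(-r, len(values)) + (values[0] - shift)
    col = np.array([comb(n - r, x) if 0 <= x <= n - r else 0 for x in xs], float)
    for _ in range(r):
        col = col[1:] - col[:-1]
    return col

design = np.column_stack([term(r) for r in range(6)])   # first six terms
b, *_ = np.linalg.lstsq(design, cases, rcond=None)
print(np.round(design @ b, 1))   # compare with the six-term column of the table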
2.1.6 Series using the exponential error law

For measurements where the observations cannot be assumed to be rational, and where the irrationality forces rounding, it becomes important to use functions that can be integrated and differentiated. If F(x) denotes[12] the frequency of observations between −∞ and x, then F(x + p/2) − F(x − p/2) can be determined by counting the rounded observations which have denomination x, with p being the length of the interval of values rounded to x. But as there are no convenient forms for the function F(x), and also the difference F(x + p/2) − F(x − p/2) as a function of two variables is unmanageable, we must, as in the graphical representation of actual error laws, emphasize such error functions
with definite integrals indicating the frequency of corresponding rounded observations, whereas f(x) itself only indicates the exact frequency of x under the unattainable assumption of infinite series of repetitions. As the definite integral satisfies (2.4)
it would be favourable to expand f(x) in a series[13] of derivatives of a function ξ, (2.5)

f(x) = k0 ξ(x) + k1 Dξ(x) + k2 D²ξ(x) + ⋯,
implying that F(x + p/2) − F(x − p/2) then has the same form, just with other constants in the series expansion. When the constants k are known, it can be decided whether the interval of rounding p has been sufficiently small to allow the two types of error laws, with differences and derivatives, to be mixed up. The simple exponential error function has a number of remarkable properties that will be exploited here; in particular the function has its maximum for x = 0, inflexions for x = ±1, and (2.6)
[12] Thiele uses f for this cumulative distribution function, and φ(x) for the corresponding density function. tr.
[13] This expansion is known today as the Gram–Charlier type A series (Hald 1981, 1998, 2000b), apparently first used in Thiele (1873) (Hald 2001). tr.
In general
explicitly
and conversely (2.7)
and so on, with the same numerical coefficients as above. This allows the series
to be written in the convenient form (2.8)
and additionally to reduce any integral[14]

where g(x) is an entire[15] algebraic function, to an expression of the form
where both g1 and g2 are entire algebraic functions. Thus the integrations to be performed can all be reduced to calculating the specific integrals ∫ξdx. A table of this would be rather voluminous for ordinary interpolation to be adequate; however, since
one can use (2.6) and form a series expansion that converges rapidly for small u, the leading term of which,
gives a useful approximation formula with an error which is always less than …, such that one can calculate the integrals with high accuracy from a very compact table. Such a table is given at the back of this book.[16] The main table, which gives … with seven decimals for arguments in intervals of length 1/100, should be interpolated using the approximation formula mentioned, substituting u for …, when using the closest x such that u is between the limits ±0.005. The auxiliary table gives
and Log ξ is given with five digits for all the arguments of the main table. Since Log(−Log ξ) is easily calculated directly using the relation[17], it was unnecessary to take care of interpolation of the table for Log ξ. After the table mentioned, an expansion of the entire algebraic factors of the 14 first derivatives of ξ in products of the form x² − r is given. This enables easy calculation of their logarithms. Additionally we note that for m ≥ n we have (2.9)

which is easily proved by induction. An important consequence of this theorem is that ξ⁻¹DᵐξDⁿξ has a term m(m − 1) ⋯ 1 ξ only for m = n, but otherwise has only derivatives of ξ in all its terms. The function ξ will turn out to be closely related to the binomial coefficient function and leads, among other things, to determination of the coefficients in the series

in a way completely analogous to the corresponding problem for the series

but at our current stage we cannot take advantage of this (see Section 3.5.6).

[14] The expression ∫ⁿg(x)ξ dxⁿ denotes a repeated indefinite integral. tr.
[15] An entire algebraic function is nowadays called a polynomial function. tr.
[16] See page 186 ff. of this volume. tr.
[17] If Log is log10, as it seems to be, I get Log(−Log ξ) = 9.336 754 3n + 2 Log x, using the convention that 9.336 754 3n = 9.336 754 3 − 10. tr.

Concerning determination of the constants k0, k1,…, we must be content with indicating that for every frequency, given as a definite integral of f(x), one can set up a linear equation
where the factors n0, n1, n2 on the right-hand side can be calculated such that the problem reduces to solving m linear equations with m unknowns. But the right-hand side of these equations can be continued indefinitely, and the calculation of the coefficients k0, k1, … is therefore an undetermined problem, just as one can construct infinitely many error curves with the given frequency areas. Here one may freely choose as many terms or sums of terms in the series as the given frequencies permit to be calculated. And whether one should consider the terms with lowest index will depend on specific properties of the error law, in particular on a previous determination of the origin and scale for x in ξ. To avoid immediate introduction of hypotheses without obvious necessity or consequence when representing actual error laws, it is best to search for yet other
means. But in one case one must use what has been mentioned here: that is, when the available observations are grouped such that the interval p of grouping is not the same for all observations. The problem mentioned above must then be solved using adjustment by elements for the system of linear equations for k0, k1, ….
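In modern notation the expansion (2.5) is the Gram–Charlier type A series mentioned in footnote [13]; since D^r ξ(x) = (−1)^r He_r(x) ξ(x), with He_r the Hermite polynomials, a truncated series is easily evaluated. The sketch below uses invented coefficients k purely for illustration and is not Thiele's adjustment procedure:

import numpy as np
from numpy.polynomial.hermite_e import hermeval

def xi(x):
    # the simple exponential error function (the normal density, cf. footnote [7])
    return np.exp(-x * x / 2) / np.sqrt(2 * np.pi)

def series(x, k):
    # sum_r k_r D^r xi(x) = xi(x) * sum_r k_r (-1)^r He_r(x)
    signed = [kr * (-1) ** r for r, kr in enumerate(k)]
    return xi(x) * hermeval(x, signed)

x = np.linspace(-4, 4, 9)
print(series(x, [1.0, 0.0, 0.0, -0.1]))   # a slightly skew error law via the k3 term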
2.2 Actual error laws expressed by symmetric functions

Although the forms of actual error laws above were based on inspecting the frequency of the individual observations, the basic idea in the following is that the observations can be regarded as roots of an equation, with degree equal to the number of observations and with coefficients expressing the properties of the observations. The only thing necessary for this to be the case is that the individual observations should be regarded on a completely equal footing, and all essential circumstances of the observation, including the rounding, must have been the same for all observations.
2.2.1 Coefficients of equations

The equation (2.10)

(x − o1)(x − o2) ⋯ (x − on) = 0,

where o1, o2,…, on denote the observations, will through the coefficients a1, a2, …, an of the powers of x

give us an expression for the actual error law, and every actual error law of the kind we consider can uniquely be derived from and reduced to this form, which satisfies the main condition on a complete actual error law: namely, it should enable reconstruction of the original individual observations. But otherwise the coefficients a do not represent the error law in a suitable way. The similarity of two actual error laws with a different number of repetitions is not easily recognized from the values of the coefficients. One cannot imagine greater similarity than in the case where each observation in one law appears with a frequency that is a constant multiple m of the frequency of the same observation in the other law, but

has in its expanded forms coefficients that are rather composite functions of a1, a2,…, an and m.
We are in no way forced to choose these coefficients to represent the actual error law. They could be replaced by any system of entire symmetric functions of the observations, provided that one is of degree 1, one of degree 2, and so on, up to one of degree n. This ensures that the coefficients can be calculated uniquely from the other symmetric functions and conversely.
2.2.2 Sums of powers

The sums of powers[18]

sr = o1^r + o2^r + ⋯ + on^r

are undoubtedly the symmetric functions that are most easily calculated from the observations o. The coefficients a of the equations and the sums of powers s are related as (2.11)
which gives the explicit relations (2.12)
[18] These are the (empirical) moments. tr.
and (2.13)
In practice the implicit formulae (2.11) would be preferable, as one almost never needs to calculate just one of the terms in the series of a and s, but always also the previous terms of lower order. The sums of powers have the fortunate property that their ratios with s0 = n directly indicate similarities and dissimilarities between different error laws, as

s1/s0, s2/s0, s3/s0, …
maintain their values when the entire series of repetitions occurs with an m-fold frequency. But this property holds not only for the sums of powers, but also for any other system of symmetric functions, where all terms are functions of these ratios. Therefore it is possible to produce even more adequate expressions for error laws using symmetric functions. The sums of powers of any degree depend on both the origin and scale of the observations. It is necessary that one term in the error law must have both in common with the observations; yet another must depend on one of these, but could be independent of the first; all the others should, just as the shape of the error curve, preferably be independent of both. The arithmetic mean, μ1 = s1/s0, is the obvious candidate for the term that changes with both origin and scale, and in general to represent the observed phenomenon; apart from constant factors there is only one symmetric function of degree 1, and the mean is always larger than the smallest, and smaller than the largest value of the observations. The mean should necessarily be used as the main term of all forms of actual error laws in terms of symmetric functions; but the terms of higher degree are most easily made independent of the choice of origin by calculating the sums of powers from the observed values, using the mean as origin.
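A small Python sketch of the passage between the coefficients a and the sums of powers s may be useful here. Since eqns (2.10)-(2.13) are not reproduced in this extract, the sign convention below is an assumption: (x − o1)⋯(x − on) = x^n − a1·x^(n−1) + a2·x^(n−2) − ⋯, making a_k the k-th elementary symmetric function of the observations; the conversion then uses Newton's identities:

def power_sums(obs, n):
    return [sum(o ** r for o in obs) for r in range(n + 1)]

def coefficients_from_power_sums(s):
    # Newton's identities: r*a_r = s1*a_(r-1) - s2*a_(r-2) + ... + (-1)^(r-1)*s_r*a_0
    a = [1.0]
    for r in range(1, len(s)):
        a.append(sum((-1) ** (j - 1) * s[j] * a[r - j] for j in range(1, r + 1)) / r)
    return a

obs = [1.0, 2.0, 4.0]
s = power_sums(obs, 3)                   # s = [3, 7, 21, 73]
print(coefficients_from_power_sums(s))   # [1.0, 7.0, 14.0, 8.0] for (x-1)(x-2)(x-4)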
The sums of powers reduced[19] in this way thus become (2.14)

rk = (o1 − μ1)^k + (o2 − μ1)^k + ⋯ + (on − μ1)^k.
In general
and conversely
To the extent that one does not calculate the reduced sums of powers by raising the differences between values of observations and their mean to powers and adding them, but instead has calculated s0, s1, s2,…, sn and from these wants to calculate the entire system r1,…, rn, it is convenient to use the scheme (2.15)
and so on.
[19] These are the empirical central moments. tr.
The reduced sums of powers all depend on the scale for the observed values, but in such a simple fashion that this is hardly an inconvenience. The quantities
would be independent of both scale and origin.
2.2.3 Halfinvariants

One could, in many cases even favourably, stick to an expression for the actual error law based upon the reduced sums of powers:

the first four of these would in fact be commendable. Still, it is even better to use what we could call the halfinvariants[20] of the error law. These, including the mean, will be denoted by μt. They are given by the equations (2.16)
In general
which serve to compute the s uniquely from the μ and conversely. The halfinvariants are even easier to calculate from the reduced sums of powers (2.17)
as above, just with μ1 = 0 and s1 = 0.
[20] These fundamental quantities were reinvented by Fisher (1929), who named them cumulative moment functions because of their additive properties (2.45). The name was later abbreviated to cumulants (Fisher and Wishart 1931), apparently following a suggestion by Harold Hotelling (Stigler 1999). See Hald (2000a) for further details. tr.
We have explicitly (2.18)
and much simpler (2.19)

μ2 = r2/s0,
μ3 = r3/s0,
μ4 = r4/s0 − 3(r2/s0)²,
μ5 = r5/s0 − 10(r3/s0)(r2/s0),
μ6 = r6/s0 − 15(r4/s0)(r2/s0) − 10(r3/s0)² + 30(r2/s0)³,
Conversely (2.20)
and (2.21)
Further we have expressions for direct computation of the μ from the coefficients of the equations and conversely, although these are not simple; under the assumption that the mean is equal to zero, that is a1 = s1 = μ1 = 0, it holds that (2.22)
and (2.23)
For example, when a six-times repeated observation has given the error law μ1, μ2, μ3, μ4, μ5, μ6, the individually observed values must be roots of the equation (2.24)
We shall later become acquainted with certain favourable properties that distinguish the halfinvariants from the reduced sums of powers; here we only mention
that the halfinvariants most often will be numerically smaller than the others. Whereas the reduced sums of powers of even degree are always positive, all halfinvariants except μ2 may be negative just as well as positive, and therefore they will often have values that are not far from zero. Just like the reduced sums of powers, the halfinvariants are independent of the origin of the observations and depend on the scale in such a way that its rth power is a unit for μr, so that the ratios μr/μ2^(r/2) are independent of both origin and scale. We have already mentioned the significance of μ1 as the mean.[21] The term of second most importance for the error law, μ2, the squared standard deviation, serves as a summary measure of the quality of the series of observations; √μ2 measures the standard deviation in the scale of the observations. The higher terms in an actual error law given by the halfinvariants express finer properties of the distribution of the observations. The error law is skew (skew error curve) when the halfinvariants of odd index have values that differ substantially from 0, and μ3 indicates in particular the simplest skewness: that the observations cluster at one end, while rather few, but relatively large, deviations lie at the other end, at the positive side when μ3 is positive. A negative sign for μ4 indicates that the error curve has a flat or even hollow shape in the middle; a positive sign indicates that the error curve is peaked, with unusually many observations clustering in the middle, but also with a few large deviations to both sides. The halfinvariants of higher degree indicate less obvious properties that do not easily lend themselves to a brief description.

[21] In the following paragraphs, Thiele identifies the significance of the first four halfinvariants, describing the mean, variance, skewness, and kurtosis of the distribution, as introduced independently by Pearson (1895). tr.
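In modern terminology these are the first cumulants; a brief Python sketch of their computation from the reduced sums of powers, using today's standard formulas (which agree in content with eqn (2.19)), is:

def halfinvariants(obs):
    n = len(obs)
    mean = sum(obs) / n
    m = [sum((o - mean) ** k for o in obs) / n for k in range(7)]   # m_k = r_k / s0
    return (mean,
            m[2],                                    # mu2
            m[3],                                    # mu3
            m[4] - 3 * m[2] ** 2,                    # mu4
            m[5] - 10 * m[3] * m[2],                 # mu5
            m[6] - 15 * m[4] * m[2] - 10 * m[3] ** 2 + 30 * m[2] ** 3)  # mu6

# a small skew sample: mu3 comes out positive (clustering low, long tail upwards)
print(halfinvariants([0, 0, 0, 1, 1, 2, 5]))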
2.2.4 Examples

Example 1. The series of observations that was considered on page 72 is to be expanded in terms of symmetric functions. The calculations are most easily performed by reducing the arguments by 12 and first calculating the sums of powers using this as the origin.
Argument x   x′ = x − 12   Frequency      x′·f     x′²·f     x′³·f     x′⁴·f     x′⁵·f      x′⁶·f
    7           −5              3          −15        75      −375      1875      −9375      46875
    8           −4              7          −28       112      −448      1792      −7168      28672
    9           −3             35         −105       315      −945      2835      −8505      25515
   10           −2            101         −202       404      −808      1616      −3232       6464
   11           −1             89          −89        89       −89        89        −89         89
   12            0             94            0         0         0         0          0          0
   13            1             70           70        70        70        70         70         70
   14            2             46           92       184       368       736       1472       2944
   15            3             30           90       270       810      2430       7290      21870
   16            4             15           60       240       960      3840      15360      61440
   17            5              4           20       100       500      2500      12500      62500
   18            6              5           30       180      1080      6480      38880     233280
   19            7              1            7        49       343      2401      16807     117649

s0 = 500,  s1 = −70,  s2 = 2088,  s3 = 1466,  s4 = 26664,  s5 = 64010,  s6 = 607368,  μ1 = −0.14.

The successive reductions to the mean as origin, by the scheme (2.15), then give

              2078      1758     26869     67743    616329
r2 = 2078               2049     27115     71505    625813
r3 = 2340              27402     75301    635824
r4 = 27730             79137    646366
r5 = 83019            657445
r6 = 669068
Thus with the original origin we have μ1 = 11.86, and further

r2/s0 =    4.156     μ2 =    4.16
r3/s0 =    4.680     μ3 =    4.68     μ3/μ2^(3/2) =  0.55
r4/s0 =   55.460     μ4 =    3.63     μ4/μ2²      =  0.21
r5/s0 =  166.038     μ5 =  −28.50     μ5/μ2^(5/2) = −0.81
r6/s0 = 1338.136     μ6 = −184.53     μ6/μ2³      = −2.57
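The example can be checked numerically; the following sketch recomputes the halfinvariants of the 500 observations from the central moments and reproduces Thiele's values up to his rounding:

values = range(7, 20)
cases = [3, 7, 35, 101, 89, 94, 70, 46, 30, 15, 4, 5, 1]
n = sum(cases)
mean = sum(v * c for v, c in zip(values, cases)) / n          # 11.86
m = [sum(c * (v - mean) ** k for v, c in zip(values, cases)) / n for k in range(7)]
mu = {1: mean, 2: m[2], 3: m[3],
      4: m[4] - 3 * m[2] ** 2,
      5: m[5] - 10 * m[3] * m[2],
      6: m[6] - 15 * m[4] * m[2] - 10 * m[3] ** 2 + 30 * m[2] ** 3}
print({r: round(x, 2) for r, x in mu.items()})
# approx. mu1 = 11.86, mu2 = 4.16, mu3 = 4.68, mu4 = 3.63, mu5 = -28.51, mu6 = -184.5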
Example 2. Infinitely many observations are assumed to be uniformly distributed, with one at each number between −1/2 and 1/2 and no values outside these limits. (The error law for tables.) Here s0 = 1, s2 = 1/12, s4 = 1/80, s6 = 1/448, s1 = s3 = s5 = 0, and μ1 = μ3 = μ5 = 0, μ2 = 1/12, μ4 = −1/120, μ6 = 1/252, consequently … and ….
2.2.5 Halfinvariants for the binomial law

A number of (p + q)^n observations are assumed to have attained only the integer values 0, 1,…, n, each as frequently as indicated by the corresponding term in the expansion of (p + q)^n by the binomial formula, thus the value x in βn(x) p^x q^(n−x) cases. As
it holds that
and thus (2.25)
The general law for the halfinvariants is not quite so simple, and to derive it we must temporarily return to the easiest case, with n = 1, that is, when in p cases we have the observation 1 and in the remaining q cases 0 has been observed. In this case s0 = p + q, but otherwise s1 = s2 = ⋯ = sn = p. If here we let p/q = e^z, the definition of the μ in eqn (2.16) yields
and so on. Since
it is natural to conjecture that in general (2.26)
This equation is also easily shown by ordinary induction, by differentiating the equation
which, if the law (2.26) holds, leads to
If this is added to the first equation we get
and thus Dz μr = μr+1. The law (2.26) also holds for the halfinvariants derived above for the general error law given by (p + q)^n, which therefore should satisfy the rather simple relation (2.27)
This theorem is also true; but for the time being the proof breaks down owing to the fact that the expressions for the sums of powers are extremely unmanageable in the general case. Therefore we must be content with the incomplete inductive proof until in Section 2.3.3, eqn (2.45), we can prove a theorem that will allow derivation of the halfinvariants in the general case directly from the simplest case. If in (2.26) and (2.27) p and q are introduced instead of z, the recursion formula can be expressed in the form (2.28)
which makes it easy to derive the numerical coefficients of the higher halfinvariants (2.29)
and so on.
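The recursion is easily mechanized; the sketch below works in the normalized case p + q = 1 (an assumption of the sketch), where (2.28) reads μ(r+1) = pq·dμr/dp with μ1 = p, and where, by the additivity theorem (2.45), the binomial law simply multiplies every halfinvariant by n:

import sympy as sp

p = sp.symbols('p')
q = 1 - p
mu = p                      # mu_1 for a single 0/1 observation
for r in range(1, 5):
    print(sp.factor(mu))    # p, p*q, p*q*(q - p), p*q*(1 - 6*p*q), ...
    mu = sp.expand(p * q * sp.diff(mu, p))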
According to (2.27), the coefficients of the halfinvariants can be determined by termwise comparison in the expansions after powers of y in
For the pure binomial function

we get the halfinvariants by letting p = q in this relation; if we let y = ui we find for this case
where B2r+1 are the Bernoulli numbers, so that
and (2.30)
With the exception of those of lowest order, these halfinvariants could in fact be calculated with confidence using the approximation formula
2.2.6 Halfinvariants for the exponential error law

Infinitely many observations are assumed to be distributed according to the simple exponential error law, with the modification that the origin is m and the scale is n, such that the frequency is proportional to (2.31)

ξ((x − m)/n);

one immediately sets z = (x − m)/n, such that
To calculate the sums of powers we must integrate here rather than add
but according to the formulae (2.7), expressing z^r ξ(z) by ξ(z) and its derivatives, we get, using that ξ(±∞) = 0 and also D^i ξ(±∞) = 0 (see the table on page 192),
and so on, while s1 = s3 = ⋯ = 0, whence the sums of powers by themselves appear as reduced sums of powers. To determine the halfinvariants we get from their defining equations (2.16)
such that only μ2 = 1, while all the other μ vanish. Introducing the original origin and scale we thus have (2.32)

μ1 = m,   μ2 = n²,

and otherwise μr = 0. Thus the constants of this error law indicate directly the mean m and the standard deviation n. The error curve has its maximal ordinate for x = m and inflexions for m ± n.
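A quick numerical check of (2.32), assuming nothing beyond standard simulation tools: for simulated observations following the simple exponential error law (the normal distribution in modern terms) with origin m and scale n, the first two halfinvariants estimate m and n², while the higher ones stay near zero:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=200_000)   # m = 3, n = 2
d = x - x.mean()
m2, m3, m4 = (d ** 2).mean(), (d ** 3).mean(), (d ** 4).mean()
print(x.mean(), m2, m3, m4 - 3 * m2 ** 2)          # approx. 3, 4, 0, 0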
2.2.7 Halfinvariants for series using exponential error laws

An error law for infinitely many observations is assumed to be given by the previously mentioned general series expansion (2.5)

The halfinvariants are to be computed, and conversely the constants kr are to be found when the halfinvariants are given. Instead of the sums of powers it is most adequate here to form the integrals
According to the equation (2.9) and the equations (2.6) it then holds that
and so on, where the law for the numerical coefficients is the same as in the equations (2.6). If from these we derive equations for we get
and so on, again with the same coefficients, increasing the indices for all the s by 1. If the expressions for μ and s with lower index are inserted into the right-hand side
of the equations (2.16), defining the halfinvariants, and if we everywhere divide by …, we get (2.33)
and so on. But if the second term on the left-hand side is combined with the second term on the right-hand side, these equations again take essentially the same form as the defining equations for the μ from the sums of powers, where μ2 − 1 is replacing μ2 and kr is replacing sr. As the same error law f(x) after a change of origin and scale can be reduced to series expansions of the given form (2.5) but with different coefficients, whereas the μ determined by the k clearly only change in such a way that one gets the same result by returning to the original origin and scale, then the computation of the μ from the k and conversely is most easily performed by choosing the origin and scale to give k1 = 0 and k2 = 0, whereby μ1 = 0 and μ2 = 1 and the equations take the form (2.34)
In a theoretical sense these relations solve our problem when combined with the explicit relations (2.16) between sums of powers and halfinvariants, in both a direct and an indirect way; but in practice this only applies when the observations are so numerous that one can without hesitation replace our integrations with summations, and when their grouping, if present, can be seen as imperceptible because the intervals are small. In both respects much stronger demands
must be made when the μ and the k should be computed with higher indices than when only the first few terms must be taken into account. This also holds for our Section 2.2.6, considered as a special case of Section 2.2.7; but whereas one can apply intervals between the observations and their rounding which go as far as the standard deviation as the highest practical limit, one should probably not work with intervals that are larger than one-fourth of this when μ and k with indices 3 and 4 are investigated.
2.3 Actual error laws for functions of observed variables

When other observations are derived from observations by assumed or given dependencies and the inessential circumstances are ignored, these will be subject to error just as the direct observations; and were the direct observations all repetitions, the derived ones would also behave as mutually deviating repetitions and have their actual error laws. This then raises the question whether it is possible to deduce the error law[22] of the transformed observations from the error law of the direct observations.
2.3.1 The univariate case

It can be difficult or even impossible to decide with certainty whether what we have indicated to be a direct observation should not rather be thought of as derived indirectly from another observation, and consequently what dependency one should then assume. When
then
and ignoring the inessential circumstances one may just as well set up o or F(o) as determined by the essential circumstances, that is as an observation. By one method of measurement the radius of a circle may quite clearly be considered directly measured, by another its area; but we clearly feel that such choices could be delicate, for example when determining distances in the field, the light intensity of stars, examination marks, etc. The main issue is that during the observation one
[22] Here, and in the following, Thiele uses his halfinvariants as the basic representation of distributions, not density functions as we mostly do today. Thus by 'deducing an error law' he means identifying its halfinvariants. tr.
sticks to a particular view, however arbitrary it may be, and in a consistent way determines the corresponding error law. As the point of view changes, so does the error law. To enable the calculation of an actual error law for a derived observation from the actual error law of the determining observations, it is in general necessary that the result be uniquely determined. If the same observation can give different results, then a choice between them would be more or less arbitrary. When the result is uniquely determined from one observation, the actual error law of the result can be determined from the error law of the observation. If the error laws are expressed in terms of relative frequencies, and if the dependence is such that different observations cannot give the same result, then the determination is quite simple, as the relative frequencies should be transferred unchanged from each value of the observation to the corresponding value of the result; and if several observations could give the same result, its relative frequency becomes the sum of their relative frequencies. If the observations x and their results g(x) are expressed in numbers, and the error laws are thought of as error laws or error curves, this main rule still holds unmodified for counts. For rounded measurements one must transfer the frequency areas from the observations to their results in such a way that they can be constructed with unchanged areas over the changed abscissae, with the modification that areas or ordinates are added if the same result occurs from different observations. One must demand that the equality of areas be extended to the individual area elements. When the error law f(x) of the observation is given in functional form, and with y = g(x) one must find the form of the error law Φ(y) for the result, then, if y = g(x) after inversion determines x uniquely from y, it holds that

Φ(y) = f(x), where x is the value with g(x) = y,

whereas when several values x1, x2, … correspond to the same value y,

Φ(y) = f(x1) + f(x2) + ⋯.

Thus when y is a linear function y = ax + b of x,

Φ(y) = f((y − b)/a),

or, as the relative frequencies should be determined by division with the total frequency, the error law is unchanged.
Most often, difficulties will be encountered when attempting to express the error law of the function in a convenient form. If the error law of the observations is given in terms of a series of halfinvariants or other symmetric functions, then, as soon as the dependent values are uniquely determined, one can derive their symmetric functions from those of the independent variables. Also here one encounters complicated forms for the dependence of the error law as soon as one is dealing with other than the simplest possible functions. If y = ax + b and the error law of x is given by the series of halfinvariants μ1, μ2,…, whereas the error law to be determined for y has the halfinvariants M1, M2,…, it must hold that (2.35)

M1 = aμ1 + b,   Mr = a^r μr for r > 1.
If the dependence is assumed to have the form[23]

y = a + bx + cx²,
we get (2.36)
In particular, if the observations follow a simple exponential error law with
and a suitable choice of origin and scale had given μ1 = 0 and μ2 = 1, we get (2.37)
[23] This leads to what nowadays is known as the non-central χ²-distribution. tr.
Thus the error law for y cannot again become simple exponential, since this would demand that c = 0. Although for this reason alone one cannot assume, as some have been inclined to think, that any error law would with sufficiently many observations take the shape of the simple and convenient exponential form, a given actual error law can still be reduced to exponential form by considering the observed quantity as a suitable function of a fictitious observation. In particular, if it is the third halfinvariant that indicates departure from the simple exponential error law, whereas the higher halfinvariants are either unknown or accidentally conform with the assumption, then one can in y = a + bx + cx² consider y as the observed and x as the fictitious observation with the desired property of the error law. The constants a, b, and c must then be determined such that c is a root of
while
This demands that M4 = 12c(M3 − 2cM2).
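The transformation rules (2.35) for a linear dependence can be verified directly; in the sketch below, with an invented skew sample, the mean transforms affinely while the second and third halfinvariants are scaled by a² and a³:

import numpy as np

def first_halfinvariants(z):
    d = z - z.mean()
    return z.mean(), (d ** 2).mean(), (d ** 3).mean()

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)
a, b = -2.0, 5.0
mu = first_halfinvariants(x)
M = first_halfinvariants(a * x + b)
print(M[0], a * mu[0] + b)    # M1 = a*mu1 + b
print(M[1], a ** 2 * mu[1])   # M2 = a^2 * mu2
print(M[2], a ** 3 * mu[2])   # M3 = a^3 * mu3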
2.3.2 The multivariate case

In the case where the observation depends on several variables, the possibility of determining the actual error law of the observation from the actual error laws of its determining observations largely depends on the way in which the individual variables are assumed to work together in producing the repeated values of the observation. If all observations of every determining variable are completely on the same footing, also in the sense that none of them has a particular relation to some of the repetitions of the other determining variables, then the relationship is simple and analogous to what we found in the case of one observation; but it becomes different when there are special relationships between certain repetitions among the determining observations. For example, if the area of a triangle is to be determined by measurements of each of its three sides, each side being measured on its own and nothing being known about the relation between the measures, then every measurement of a must be combined with every combination of measurements of b and c, and the total number na·nb·nc of combinations must form the basis for the actual error law of the area. However, if the concern is about relative frequencies in a statistical report on marriages among bachelors and widowers with girls and widows, then it is far from making sense to relate each man to all girls and widows, since only the specific pairs actually combined matter, and in these it
may even be foreseen that the question of earlier marriages influences the choices. The consequence is that one can very well set up an actual error law for the distribution of marriages over the four possible cases, and also error laws for earlier marriages within each gender, but it is completely impossible to derive one from the other;[24] for the determination is in part based upon something beyond the two partitionings of each gender. A third example can be taken from the play of dice with the simultaneous use of two dice. Although no one would consider the face of one die as an essential circumstance for the simultaneous throw of the other, it is common to record only the sum of the eyes of the two dice as the observation of every double throw. An actual error law found in this way for 100 double throws could then not be recalculated even if two observers, each observing his own die, had simultaneously calculated the actual error laws of the individual dice by counting how frequently each number of eyes had appeared, for example as:

First die:          Second die:
12 times 1          19 times 1
17  –    2          15  –    2
14  –    3          16  –    3
19  –    4          20  –    4
16  –    5          18  –    5
22  –    6          12  –    6

The error law for double throws cannot be calculated from these numbers;[25] it could have had numerous forms between the two extremes below:

12 times  2         0 times  2
 7  –     3         0  –     3
10  –     4         0  –     4
 5  –     5         0  –     5
 9  –     6         0  –     6
 7  –     7         85 –     7
12  –     8         15 –     8
 8  –     9         0  –     9
 8  –    10         0  –    10
10  –    11         0  –    11
12  –    12         0  –    12

[24] This also holds for other than actual error laws. Thiele's footnote.
[25] Which would be the case for non-actual error laws. Thiele's footnote.

to give the error law of the first type mentioned. We therefore emphasize that among actual error laws for observations that depend on several determining observations we only consider those which freely combine the repetitions of every individual observation. Also, actual error laws for multivalued dependencies will not be considered. The determination by relative frequencies of an actual error law for an observation that depends on other observations is in principle quite a straightforward matter. If one of the observed conditions with r possibilities has shown one outcome n1 times, another n2 times,…, and the last possible outcome nr times during a total of n1 + n2 + ⋯ + nr = N repetitions, and another condition with μ possibilities has shown each of these m1, m2,…, mμ times during a total of m1 + m2 + ⋯ + mμ = M repetitions, and so on if there are more conditions, then under all combinations
the product of the corresponding relative frequencies,

(n1/N) · (m1/M) ⋯,

will be the relative frequency for a derived observation that could only occur at the specific combination of outcomes of the individual observations whose relative frequencies are the factors in this product, n1/N, m1/M, …. But if the same observation can occur as a consequence of different outcomes of one or several of the observed conditions, the relative frequency of the observation will be given as a sum of such products, the sum extending over those combinations of subscripts that give identical observations.
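For the dice example this product rule amounts to a convolution of the two recorded error laws; the sketch below computes the error law of the sum under free combination (which, as the text stresses, the 100 actually thrown pairs need not obey):

first = {1: 12, 2: 17, 3: 14, 4: 19, 5: 16, 6: 22}    # counts out of 100
second = {1: 19, 2: 15, 3: 16, 4: 20, 5: 18, 6: 12}
N, M = sum(first.values()), sum(second.values())
law = {}
for i, n_i in first.items():
    for j, m_j in second.items():
        law[i + j] = law.get(i + j, 0.0) + (n_i / N) * (m_j / M)
for s in sorted(law):
    print(s, round(law[s], 4))   # relative frequencies of the sums 2..12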
Example. The position of a point in a plane is determined by measurement of both of its orthogonal coordinates x and y. The error laws for these are assumed to be simple exponential error laws; the error law for one of the coordinates in another coordinate system is desired. The error law for x is assumed to be
where the integral between ±∞ is equal to 1,[26] so the expression gives the relative frequency, with v being the standard deviation and the mean of x taken to be zero; similarly it is assumed that
represents the error law of y with standard deviation equal to u. The relative frequency of the combinations of observations that position the point in the surface element (x, x + dx; y, y + dy) will then be (2.38)

(1/(2πuv)) e^(−x²/2v² − y²/2u²) dx dy.
If the coordinate x1 is given by

x1 = ax + by
and the error law of x1 is desired, the general relative frequencies must first be determined by integration in such a way that x1 is held constant. For both coordinates in the new system one can let
be the relations to the old system; this just demands that
[26] In the original, 2π appears in the numerator rather than the denominator. This has been corrected here and in the sequel. tr.
such that
If we now substitute x1 and y1 for x and y in the differential, we get
and its integral with respect to y1 between the limits ±∞ gives (2.39)
as the expression for the relative frequency of the coordinate x1; thus its error law is also simple exponential, and the standard deviation is equal to (2.40)

v1 = √(a²v² + b²u²).
If the scale for x1 is the same as that for x and y, such that the coordinate system has just been rotated and one thus can let a = cosθ, b = sinθ, then v1 will be the distance from the centre, and φ the angle with the positive direction of the x-axis, for a point on an ellipse with axes 2v and 2u centred at the point (0, 0), the ellipse of error, as θ is the eccentric anomaly corresponding to φ. The differential above,

indicating the relative frequency for the observed point being at the element dx dy, shows that this relative frequency is constant for (x, y) on the ellipse of error, and in general for points on ellipses that are concentric, similar, and homologous.
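A simulation check of (2.39) and (2.40), with arbitrarily chosen v, u, and rotation angle: the rotated coordinate again follows a simple exponential error law, with standard deviation √(a²v² + b²u²):

import numpy as np

rng = np.random.default_rng(2)
v, u, theta = 2.0, 0.5, 0.3
a, b = np.cos(theta), np.sin(theta)
x = rng.normal(0.0, v, 500_000)
y = rng.normal(0.0, u, 500_000)
x1 = a * x + b * y
print(x1.std(), np.sqrt(a ** 2 * v ** 2 + b ** 2 * u ** 2))   # nearly equal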
indicating the relative frequency for the observed point being at the element dx dy, shows that this relative frequency is constant for (x, y) on the ellipse of error and in general for points on ellipses that are concentric, similar, and homologous. The determination of actual error laws for functions of observations by symmetric functions is generally done through the computation of the sums of powers as a function of the sums of powers for the independent variables. In some cases one should stick here to the sums of powers, as the expressions for the halfinvariants could become too complicated; this is the case when considering a product u of
GENERAL THEORY OF OBSERVATIONS
two observations o and o′.
implies in general, denoting the sums of powers for o by sr, for o′ by s′r, and for u by Sr,(2.41)
which, with a similar notation for the halfinvariants, leads to (2.42)
with even more composite expressions for M with higher indices.
2.3.3 Actual error laws for linear functions in terms of symmetric functions

In the most important case, where the function is linear,
the halfinvariants are unconditionally advantageous. Here it holds that
in general (2.43)
where β(i, j) denotes the binomial coefficients
From this one can show for the halfinvariants that (2.44)
Because of the uniqueness in the defining equations (2.16)
it would be sufficient to show that inserting the expressions (2.43) and (2.44) into this last equation produces an identity. When i, j, and k = m denote three indices whose sum is constant,
we get
where ∑ ∑ denotes the double sum between ±∞ for two of the three indices i, j, or k, and where
It follows that
If we now let m = i + k, n − m = j in the first sum, but m + 1 = i, n − m = j + k + 1 in the second, we have
and if we recall that β(m, n − m) + β(m + 1, n − m + 1) = β(m + 1, n − m), we get as in (2.43)
As no other system of halfinvariants M1 … Mn can give the same sums of powers S0, S1,…, Sn + 1, we have in general that when
it holds that
More generally, if
it holds that[27] (2.45)
The mechanism in the apparently complicated proof above is easily understood by going through a special case, for example
[27] This property was the motivation for Fisher (1929) to use the term cumulative moment function for Thiele's halfinvariants. tr.
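The additivity theorem (2.45) can be illustrated on a toy example: for u = o + o′, with the n·n′ repetitions freely combined, each halfinvariant of u equals the sum of the corresponding halfinvariants. The data below are invented, and the check is done on the third halfinvariant:

import numpy as np

def mu3(z):
    d = z - z.mean()
    return (d ** 3).mean()

o = np.array([0.0, 1.0, 1.0, 4.0])
o_prime = np.array([2.0, 3.0, 7.0])
u = (o[:, None] + o_prime[None, :]).ravel()   # all n*n' freely combined sums
print(mu3(u), mu3(o) + mu3(o_prime))          # both equal 9.0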
2.3.4 Examples

Example 1. We have in Section 2.2.5 anticipated the important theorem (2.45) for the determination of the halfinvariants of the binomial error law using the halfinvariants of an error law with q observations equal to 0 and p observations equal to 1. If

u = o1 + o2 + ⋯ + or,

where o1, o2,…, or all had the latter actual error law, with the halfinvariants μ1 = p/(p + q), μ2 = pq/(p + q)²,…, then this theorem gives that the halfinvariants of the function u would be these quantities multiplied by r,

but if one considers all the individual values that u can obtain by summation of zeros and ones, then the values from 0 to r occur with frequencies that are exactly equal to the terms in the expansion of (p + q)^r.

Example 2. The error law for the logarithm of a product, where the individual logarithms are taken from a table, is, according to the second example in Section 2.2.4,
where s denotes the total number of the irrational logarithms taken from the table and found by interpolation; it is seen that for large numbers s both of
and so on, become rather insignificant.
2.3.5 On the tendency of error laws to approach the exponential form

A very important consequence of the theorem (2.45) appears when we assume that the function of the observations is their sum, such that u = ∑o corresponds to Mr = ∑μr. While μ2 is always positive, such that every new term in the sum leads to an increase of M2, this is by no means the case for the other halfinvariants, which, owing to differing signs of the individual terms μr, often have considerably smaller values than the terms themselves. Additionally, the properties of the error law do not depend directly on the complete set of values of the halfinvariants, but on the ratio between Mr and M2^(r/2), and as the latter term always increases,
the more so as r gets larger, it will hold that for any sum, and more generally for any linear function, the error law of the function will deviate less from the simple exponential form[28] than did the individual term that deviated most with respect to some halfinvariant. Only when a single term has been far from satisfying μr = 0 for r > 2 and has also dominated all the other terms by the size of its standard deviation can the error law of the sum show a similar departure from the exponential form. Thus, to find examples of error laws that clearly deviate from the exponential form, one must primarily consider observations that depend on only a single circumstance which is ignored as being inessential. For if
using Taylor's series, is replaced by
the combined effect of several of these could produce effects similar to those for linear functions of several variables. It would not be untimely to recall repeatedly that when functions of observed variables are considered, and one has not used the repeated values of the independent observations indiscriminately but, as in the above-mentioned game of dice, has calculated every value of the function with particular values of the independent variables, then one loses the right to derive the error law of the function as an actual law from the actual error laws of the independent observations. But as one can also in this case directly determine the error law of the function, it is in no way proved that this should always be different from the error law for the function which would follow from the rules described here. On the contrary, although this demands relaxing the actuality of the error law, this will quite often be possible. We will emphasize this here as we have now concluded our treatment of the actual error laws, and the greater freedom we can take in the sequel will particularly benefit us in this respect.
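The shrinking of the standardized halfinvariants under summation is easily seen by simulation. In the sketch below, a deliberately skew distribution (the waiting-time exponential, not Thiele's 'exponential error function') is summed k-fold; the standardized third halfinvariant M3/M2^(3/2) then behaves like 2/√k:

import numpy as np

rng = np.random.default_rng(3)
for k in (1, 4, 16, 64):
    u = rng.exponential(1.0, size=(200_000, k)).sum(axis=1)
    d = u - u.mean()
    m2, m3 = (d ** 2).mean(), (d ** 3).mean()
    print(k, m3 / m2 ** 1.5)   # approx. 2, 1, 0.5, 0.25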
[28] This is the central limit theorem. However, Thiele's proof demands that all moments exist, and that the distribution in question is determined by its moments, which are non-trivial additional assumptions and in fact unnecessary for the central limit theorem to hold. tr.

3 The hypothetical or theoretical error law of the method of observation. The law of large numbers

Thus an actual error law describes only the given—real or conceptual—number of observations. While only with difficulty and by repeatedly suppressing a strong
inclination could we refrain from regarding the individual observations as being correct answers to what we exactly intended with our questions, it becomes almost impossible to avoid seeing a determination[29] of the method of observation itself in the actual error law based upon all available observations. Exactly because we have used, and have had to use, all the available observations when calculating the actual error law, this not only has meaning for us as the combined expression for all the investigations made, but also includes everything that the observations teach us about the matter in question, and about the quality and special properties of the method applied. It is true that when we continue observing after such a conclusion, and again provide just as many observations, and from these by themselves derive a new actual error law, then it cannot surprise us if it deviates from the first. Thus, when there are repeated determinations of the error law of the method, which thus itself appears as being observed, it makes sense to speak about the actual error law of each of the determinations of the actual error law. The deviations among the actual error laws must be considered a proof of the fact that, had we wanted to follow exactly the same method in each series of observations, we could not have achieved this with full accuracy—unless we admitted to denying that the method exists. One dares not in any way ignore the fact that when, in numerous cases during the collection of large numbers of observations, one gradually recalculates over and over again the error law first calculated, modified with the recent observations, the error law will take a shape that changes less and less, and finally, for the first and most necessary determinations, practically does not change at all with new observations. Such stability is partly a frequent phenomenon, but in no way a rule without exception, and partly even a complete coincidence of all previously made observations cannot give us decisive certainty that the next, yet unmade, observation will also conform, and an approximate firmness in the actual error law can even less give us a corresponding justification. But we could not at all abstain from making daring conclusions; mighty forces drive us to act and think, also in an imperfect way. First the instinct of self-preservation, which can make the most pedantic sceptic conclude about future events after a single experience which has nothing more in common with the matter of concern than a remote similarity in some of the circumstances. And as it happens with this instinct, so it happens with inessential changes and the derived
[29] Thiele expresses the view that the empirical distribution can be seen as a measurement of the theoretical distribution. tr.
urge of people to construct theoretical conclusions and religious assumptions. When firm ground is absent, one must build upon the loose; for build one must and will do. And one gets the right to do so when acknowledging that a complete and certain knowledge is in general unattainable. As there is nothing that we could know, strictly speaking, we obtain the right to assume much based upon uncertainty, and a poor justification is better than none. The first part of the old phrase errare humanum est, sed non in errore manere is quite fitting, as all our research, also the best, never goes beyond what is uncertain and our movements from a single observation to its repetitions are quite similar to a delusive, erratic wandering around; but it is not enough to state that one dare not stay under the delusion; it is not quite human to move from one delusion to a bigger one, just to avoid not moving; it is demanded that we should constantly reduce the delusion and the uncertainty, and that we therefore in every conclusion should always strive to measure the magnitude of the uncertainty, if at all possible. If one were to demand strictly valid proofs or unconditional certainty about the assumptions, then there would be no true science. But the characteristics of science relative to loose opinion and dogmatism are the everywhere complete attempt to account for the validity of every conclusion and for the magnitude and type of uncertainty. And the error law is a most excellent instrument for this purpose. But this instrument must be taken beyond the region of the actual; one must, even though this can only be done by hypothesis with a validity that in itself is uncertain, establish error laws for the method of observation. But it is not feasible directly to announce the actual error law as being the error law of the method. The error law of the method must be independent of the number of observations, but it is obvious for some of our error laws that they depend strongly and in a complicated way on this number, and even with the better forms, it is not clear whether they might depend on this number. To overcome this difficulty one could consider the solution of arbitrarily choosing a fixed number of observations and letting the error law of the method be identical to the actual error law determined exactly by this number of observations. But it is easy to see the disadvantage of this solution; whatever finite number one would choose, one may obtain observations in numbers several times larger and thus observe mutually conflicting actual error laws, all having the same right to be the error law of the method. Consequently only an infinitely large number of observations can completely justify the transition from actual error law to error law of the method. This implies in particular that one would never be able to obtain an exact representation of the error law of a method, but our demands are also not
reaching that far; we would be content with approximations that could gradually be improved. But in this connection we must not overlook that our definition of the error law of the method as the limiting form which the actual error law approaches when the number of observations grows towards infinity contains an unprovable assumption: that one and only one such limiting form for the error law exists; that is, it does not oscillate or diverge in the infinite. This is the main axiom of the theory of observation—the law of large numbers. To a certain degree we could regard this law of large numbers as a definition, distinguishing good methods of observation, for which the law is valid or plausible, from such unacceptable methods of observation that do not have a particular error law. However, such a distinction can never be fully justified as we never have infinite series of observations before us and we can therefore be mistaken both when accepting a method of observation because it shows stability over a large, but still finite number of repeated observations, and when we reject another one because with large, finite numbers of observations it is still in permanent oscillation, whereas it is at least possible that this instability would be reduced at an even larger number and vanish in the infinite. Even to obtain a rather definite rule for acceptance or rejection of the methods of observation, we need to base the investigation on the assumption of the law of large numbers, to conclude how a finite number of observations should behave to indicate stability at infinity. And when in certain cases we apply the conception of this law as a definition to pass the sentence of rejection, then the matter is not over. The rejection will, also when it is final and definite, unavoidably raise a new question concerning the reason that this particular observation is a failure while others can pass. We are forced to search for the shortcomings of the series of observations that have caused the bad result, to search for modifications of the method applied that could lead to better results. But as we embark on this, we apparently read more into the law of large numbers than is compatible with considering it a simple definition; we assume its validity also to mean the normal and the good. It becomes a law of Nature that must hold everywhere where we have not ourselves compromised the matter by choosing incorrect procedures or acting against our own prescriptions. And as we criticize the methods of observation and try to improve those that have given useless results, we use the law of large numbers not only to show that there is something wrong with our methods, but also to indicate the magnitudes
and properties of the deviations for us, thereby showing us what has been wrong and what can lead to an improved result. When we test a dubious or unacceptable method of observation, we turn our attention to its relation to our assumptions about it, and in particular whether what we have treated as repeated observations also really have been what we have agreed to understand by this; whether all the essential circumstances really have been constant during the observation and there has not been an unconscious or hidden confusion of essential and inessential circumstances. When such an investigation is guided by the attentive study of changes during new additional observations, it has good chances of succeeding. It happens not infrequently that although the time of observation has been considered inessential, it has, separating the new observations from the old, become an essential circumstance and in some way caused a dependence between the individual observations that makes it unjustified to consider them as mutually independently observed, and places them in a mutual relationship in a changing way, such that they should not have been considered as repetitions of the same observation. If such criticism does not succeed, nobody would think of placing the blame elsewhere than on one's own imperfection or missing information. But whenever it succeeds, the already strong confidence in the law of large numbers grows; that for every orderly and well-performed method of observation, there exists one and only one error law towards which the actual error laws for observations tend, albeit approximately and to some extent indirectly. The law of large numbers thus has a remarkable character which it is important to realize. We have therefore initially briefly introduced it as an axiom. From our rather unprejudiced point of view it appears as a definition, and formally it should be considered as such, and thus strictly speaking inapplicable to our limitation to the finite. Our urge to use the observations drives us to give a different meaning to the law and it becomes a dogma about harmony in Nature and between this and human thought, a firm conviction about something invisible. And thus this law is closely related to the simple religious belief of a God that reveals himself in Nature, one of the ways in which this belief forms a basis for the sciences.
3.1 Probability

When an observable relationship, the outcome of an experiment, has or is assumed to have an error law of method, then we use the word probability instead of relative frequency for actual error laws, to denote the relationship between the frequency of a particular outcome and all possible outcomes. Unless one is directly
considering an infinite number of observations, when using the word probability one must consider the different outcomes, successes, and failures, regularly placed among each other in such a way that also a finite number would allow that they occurred in relative frequencies according to their probabilities or as closely to these as possible. Based on theories or hypotheses, according to which one probability depends on the other or on the nature of the experiment, one can—often even quite well—determine probabilities for events without the direct availability of observations, that is a priori. The probability for an impossible event is 0, and 1 if the occurrence of an event is certain. If one knows or thinks to know that among the causes there is balance between those that act for and against the event, the probability must be set to 1/2; but if one is completely ignorant about any of the essential circumstances for the event, one must of course acknowledge that the probability is unknown. One must earnestly warn against a certain tendency to interpret the probability of 1/2 as a symbol for complete ignorance about the causal relationships concerning the event, a tendency that has gained access to the calculus of probability through incautious statements by many recognized authors, as they have confused complete ignorance with lack of exact knowledge about whether the remaining favourable or unfavourable circumstances dominate; the latter can justify assignment of the probability 1/2, but only until more reliable information and experience becomes available. In general one should assign a larger, smaller, or the same probability to one outcome of the same experiment than to another according to whether knowledge of the circumstances justifies assuming the first to be favoured by the acting causes to a degree which is larger, the same, or smaller. And all prior judgement of the probabilities should be based on this. For example, the throw of a die is judged to depend partly on the construction of the die, partly on the force, direction, and falling height of the throw, etc. Concerning the latter of these circumstances, the judgement must be indifference, not because I do not know the way in which these circumstances act, but because it is known that the result can change and change again by very small changes of these circumstances, so small that my accuracy and that of ordinary people are far from sufficing to accomplish an accommodation to obtain a desired result. Therefore such circumstances could be considered inessential, a category that one does not dare to establish without motivation when probability is to be determined a priori. If the throw was to be performed by a famous conjurer, or by an accurate machine, one would have to give up determining the probability a priori. As regards the construction of the die, its regular shape and the assumed homogeneity of material
favour any of the sides equally, leading to the conclusion that the probability of all outcomes must be equal and thus equal to 1/6. But dice can be false; it is not redundant to measure the sides and determine the centre of gravity to justify the expectation that the prior probabilities would coincide with those to be found in a very large number of actual throws.
3.1.1 Main theorems from the direct calculus of probability

Probabilities of events that depend on other events whose probabilities are known obey the same rules as relative frequencies for actual error laws, with the modification that a certain limitation that was to be respected for actual relative frequencies can be dropped concerning probability. When the different outcomes of an event have probabilities p1, p2, …, pn and when the outcomes themselves are not the primary concern but rather other events that are uniquely determined by these, and when among these outcomes some, for example p1 … pm, lead to the same event, then the probability of this event is

p = p1 + p2 + ⋯ + pm.   (3.1)
In particular, if all the individual outcomes are favourable for the final event, then this event is certain: p1 + ⋯ + pm = 1. The correctness of this theorem follows immediately from the definition of probability as the ratio of the number of favourable cases to the number of possible cases. The probability, v, for an outcome of an event that is solely determined by particular outcomes of a series of conditioning events, each of which has probability v1, v2, …, vn, will be equal to a product of these,

v = v1 v2 ⋯ vn,   (3.2)
if none of these conditioning events influences any of the others. For if we rewrite the probabilities as ratios vr = pr/qr which by the assumption of uniform distribution indicate that out of qr cases, pr are favourable, all possible combinations would correspond to the denominator Q = q1q2 … qn, whereas the favourable result only occurs in P = p1p2 … pn of these cases. As the probability only has direct meaning for infinitely many observations, it does not matter whether one is pretending here that each of the individual events is repeated for every final result, or one is obtaining the final results by combining indiscriminately from the individual observations. If, for example, v1, …, vn denote the probabilities that certain people would be dead by a certain point in time, the probability that they would all be dead by then is in general equal to the product V. But if some of the persons by living
close to each other are likely to be attacked by the same contagious diseases, then V > v1v2 ⋯ vn, and V cannot be calculated in this way, but only by decomposing the total probabilities of death into probabilities of dying from certain diseases and probabilities of lethal infection. When the determining outcomes do not directly influence each other, but outcomes of previous events yet (assuming a particular succession) change the probabilities of the following, the probability of the final outcome can still be calculated as the product of the modified probabilities. For example, when drawing n cards from a deck of cards with n red and n black cards, what is the probability that the draws would give a particular but arbitrary sequence of draws with a total of p black and n − p red cards? Here the conditioning probabilities become ratios, with denominators descending in steps of size 1 from 2n to n + 1, and the numerators form two series with differences 1 mixed among each other, in the black case from n to n − p + 1, and in the red case from n to p + 1, such that the probability becomes equal to

(n!/(n − p)!) · (n!/p!) / ((2n)!/n!).

If a result from an event that depends on others can occur from different combinations of outcomes for the individual determining events, the probability for such a result is calculated as the sum of probabilities for the different combinations
where

v = v′ + v″ + v‴ + ⋯;   (3.3)
the proof of this is again done by expressing the probabilities as ratios of the number of outcomes that represent their numerators and denominators, and then adding up the number of combinations that lead to the final outcome. Since this is only possible when it is the same series of experiences that with mutually different combinations of the outcomes lead to the common result, we warn against what could otherwise easily be forgotten: that one must ensure that the individual terms in our formula should represent probabilities for outcomes that exclude each other just as certainly as mutually contradictory outcomes of a single experience.
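As a mechanical check on the card-drawing product above, the following sketch computes the step-by-step conditional probabilities exactly and compares the result with a simulated frequency; the deck size n = 5 and p = 2 are our own illustrative choices, not Thiele's:

import random
from fractions import Fraction

def sequence_prob(n, p):
    """Probability of one specific order of p black and n - p red cards when
    n cards are drawn from a deck of n black and n red cards; the product of
    conditional probabilities described in the text."""
    prob = Fraction(1)
    black, red, total = n, n, 2 * n
    # assume the specific order 'all p black cards first, then the reds';
    # any other fixed order of the same colours gives the same product
    for _ in range(p):
        prob *= Fraction(black, total); black -= 1; total -= 1
    for _ in range(n - p):
        prob *= Fraction(red, total); red -= 1; total -= 1
    return prob

def simulated(n, p, trials=200_000, seed=1):
    """Monte Carlo frequency of that same specific sequence."""
    rng = random.Random(seed)
    deck = ['B'] * n + ['R'] * n
    target = ['B'] * p + ['R'] * (n - p)
    hits = 0
    for _ in range(trials):
        rng.shuffle(deck)
        if deck[:n] == target:
            hits += 1
    return hits / trials

n, p = 5, 2
print(sequence_prob(n, p), float(sequence_prob(n, p)))   # 5/126 ~ 0.0397
print(simulated(n, p))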
3.1.2 Examples

Example 1. If I get tempted to play in two tombolas with 14 blanks out of 17 and 9 blanks out of 11, it will obviously be incorrect to apply the theorem (3.1) and compute the probability to 3/17 + 2/11 = 67/187, since winning or losing in one tombola does not exclude winning or losing in the other. If I could judge the forces of temptation that attract to each tombola and then determine the
probabilities of choosing each of them, then the problem would have been brought within the region of validity of theorem (3.3) and even though I cannot determine the individual probabilities of the tombolas, I can, by calculating with unknown numbers p/(p + q) and q/(p + q) for these, learn about the probabilities of winning at all in the tombolas. For this has the form

(p/(p + q)) · (3/17) + (q/(p + q)) · (2/11)
and must, since p and q are positive numbers, have a value between the limits 3/17 and 2/11; and when as here these limits are not much different, one would not make a serious mistake by arbitrarily letting p = q and thus using the probability 1/2 for the unknown. In calculations pertaining to insurances one repeatedly needs these theorems, denoted as the direct calculus of probability, and often cases occur that show the necessity of precise criticism and a safe method for treating the problems, which is well illustrated by the following example.

Example 2. What is the probability that three living normal persons A, B, and C would die in the order A, then B, and finally C, or in any cyclic permutation of this: first B, then C, and finally A, or first C, then A, and finally B; that is, the total probability of this as opposed to any of the three reversed orderings? It should be obvious to attempt to reduce the problem to the determination of the rather easily calculated probabilities (a, b) that B survives A, (b, c) that C survives B, and (c, a) that A survives C. But although the probability in question only depends on these three, and the problem in fact can be solved in this manner, there is a great risk of mistakes when following this route. For the probability of the specific outcomes (a, b, c) that A dies first, then B, and finally C, and the analogous (b, c, a) and (c, a, b), the sum of which is equal to the probability in question, are not obvious functions of (a, b), (b, c), and (c, a), even though this could seem to be the case, for these are not in full generality equal to the probability that one person survives the other; such a probability changes with the age of a person, even without any death, whereas the probabilities (a, b), (b, c), and (c, a) are expressly to be understood as survival probabilities at the time when the contract, the bet, or whatever one would call it, is made. One can reduce the problem to these elementary survival probabilities in the following way. The six outcomes with probabilities

(a, b, c), (b, c, a), (c, a, b)
and those with the deaths in the opposite order

(c, b, a), (a, c, b), (b, a, c)
exclude each other and exhaust all possibilities, such that the sum of these probabilities is equal to 1, and if they were all known, one could from these calculate the probability of all orderings of the deaths, in particular

(a, b) = (a, b, c) + (a, c, b) + (c, a, b),
(b, c) = (a, b, c) + (b, a, c) + (b, c, a),
(c, a) = (b, c, a) + (c, b, a) + (c, a, b).
Adding the equations yields

(a, b) + (b, c) + (c, a) = 1 + (a, b, c) + (b, c, a) + (c, a, b)
and consequently

(a, b, c) + (b, c, a) + (c, a, b) = (a, b) + (b, c) + (c, a) − 1.
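This identity lends itself to a direct check. The following sketch assumes exponential lifetimes with arbitrary hazard rates (our own illustrative choice; nothing in Thiele's argument depends on it) and compares the simulated probability of the three cyclic orderings with (a, b) + (b, c) + (c, a) − 1:

import random

def check(trials=200_000, seed=2):
    """Compare the probability of the three cyclic orderings of death with
    (a,b) + (b,c) + (c,a) - 1, using assumed exponential lifetimes."""
    rng = random.Random(seed)
    rates = {'A': 1.0, 'B': 0.8, 'C': 1.3}   # illustrative hazard rates
    cyclic = ab = bc = ca = 0
    for _ in range(trials):
        t = {k: rng.expovariate(r) for k, r in rates.items()}
        if sorted(t, key=t.get) in (['A', 'B', 'C'], ['B', 'C', 'A'], ['C', 'A', 'B']):
            cyclic += 1
        ab += t['A'] < t['B']     # (a,b): B survives A
        bc += t['B'] < t['C']     # (b,c): C survives B
        ca += t['C'] < t['A']     # (c,a): A survives C
    return cyclic / trials, ab / trials + bc / trials + ca / trials - 1

print(check())   # the two numbers agree up to sampling noise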
However, if the problem is understood such that these simple survival probabilities are not given, but rather the mortality tables of the persons, that is when la, lb, and lc denote the number of persons that have the same status and age as A, B, and C respectively, and la(z), lb(z), and lc(z) denote how many of these would be alive after z years, then this example instar omnium serves as information about a general and safe method that should be used in all similar problems. At some point in time z between the limits 0 and ∞ years, namely at the second of the three deaths, the order of death among the three persons must become determined. We initially ask what the probability would be that the order is first A, then B, and finally C if this determination happens in the moment between z and z + dz: that is, what is the probability that at this very moment A is dead, B is dying, whereas C is still alive after that time? As none of these outcomes is assumed to influence any of the others, this probability is equal to

(1 − la(z)/la) · (−lb′(z)/lb) dz · (lc(z)/lc).
Since B cannot die in any other moment if he dies in this one, these combinations of outcomes for different values of z are mutually exclusive; and if—what is ordinarily
done without hesitation—it is assumed that all futures without taking the lives of these persons into account would lead to a determination, then the probability (a, b, c) is obtained by integrating this differential between the limits 0 and ∞,

(a, b, c) = ∫0^∞ (1 − la(z)/la) (−lb′(z)/lb) (lc(z)/lc) dz.
But as analogously
we have
and by adding the analogous expressions for (b, c, a) and (c, a, b), we find

(a, b, c) + (b, c, a) + (c, a, b) = (a, b) + (b, c) + (c, a) − 1,

because lr(0) = lr, but in any mortality table lr(∞) = 0.
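The integral for (a, b, c) is easy to evaluate numerically. In the sketch below a Gompertz survival function stands in for the mortality table la(z)/la; the parameters and the three ages are our own illustrative assumptions, not values from Thiele:

import math

def S(z, age, B=5e-5, c=1.09):
    """Hypothetical Gompertz survival function, standing in for l(z)/l:
    probability that a person of the given age is still alive after z years."""
    return math.exp(-B / math.log(c) * c**age * (c**z - 1))

def prob_abc(age_a, age_b, age_c, zmax=120.0, steps=12_000):
    """(a, b, c) = integral over z of (1 - S_A(z)) * (-S_B'(z)) * S_C(z),
    evaluated with the trapezoid rule and a numerical derivative."""
    dz, h = zmax / steps, 1e-5
    total = 0.0
    for i in range(steps + 1):
        z = i * dz
        w = 0.5 if i in (0, steps) else 1.0
        dens = (S(z - h, age_b) - S(z + h, age_b)) / (2 * h)   # -dS_B/dz
        total += w * (1 - S(z, age_a)) * dens * S(z, age_c) * dz
    return total

print(prob_abc(60, 50, 40))   # probability A dies first, then B, C last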
because lr(0) = lr, but in any mortality table lr(∞) = 0. But if we were pedantic, we would admit that it is not quite certain that every future scenario will resolve the matter. Our Earth could for example collide with a planet yet unseen, and even though our three persons would die during this incident, it would not be following the mortality tables. But there is no need to insist on an ordinary termination of everything; whatever special interest there might be attached to the probability under consideration, the future is full of probable disasters that are fully sufficient to annihilate the question about the order of death of the three persons. And we could easily make our formulae independent of such things by multiplying the probability for B's death in the moment from z to z + dz after A, but before C, with a function φ(z), which expresses the probability that the disaster has not yet appeared by the year z. As this probability in the next moment is φ(z) + φ′(z)dz = φ(z)(1 − ρ(z)dz), where ρ(z)dz denotes the probability that the disaster occurs between z and dz, we find φ(z) = e−∫ρ(z)dz; under the assumption that the disaster always has the same probability equal to ρ, we thus have φ(z) = e−pz, since φ(0) = 1. If we denote the survival probabilities
according to this conception by square brackets instead of round ones, we have
and
or
3.1.3 Mathematical expectation

The calculus of probability has its most practical application in cases where the outcome of the event can be expressed by numbers and most frequently when it represents an amount of money, that is in games, insurance, commerce, etc. The mathematical expectation is the product of the numerical value of the outcome and its probability, which implies that in infinitely many observations the sums of the mathematical expectations of successive outcomes tend to be equal to the sums of the numerical values of the favourable outcomes. The direct profitability of a game or an investment is judged by comparing the amount invested with the mathematical expectation of winning. If the first value is larger than the second, the venture is inadvisable, unless there are special satisfactions or other circumstances attached, which cannot be taken into account when calculating the mathematical expectation but are of sufficient personal importance for the player or the investor. In lotteries or games of chance against a bank, the mathematical expectation of the individual player is still in most cases considerably smaller than his stake; and the arousal of excitement that drives so many to this kind of game would be considered a painful additional loss if the hope of rare and huge winnings, which could suddenly lead to freedom from economic pressure and improvement of social position, did not have such a big and disappointing influence. In insurance as in games, the mathematical expectation of the insured is always smaller than his payment, because a part of this must be used to pay the work of the insurance institution. Here the difference is amply compensated by the state of security which the insurance provides for the insured, and the increased ability to work that the person therefore acquires. The monetary value of security cannot be calculated, but the fact that it is considerable is seen from the ever increasing
use of insurance, and it can approximately be measured through the fact that not only can it cover the loss already mentioned, to be used for the work of the institution (running costs), but under all types of insurance it must additionally cover an excess premium, a security charge, which the institutions, who gamble themselves, by nature of the case are forced to demand to avoid ruin by random losses that would benefit a few and damage all the other insured, and thus to give them a security which is as perfect as possible. In general commerce, the difference between the price that the producer receives and the price that the final user must pay is not just determined by the cost of transporting and selling the goods, but also to an essential degree by taking into account that, because the goods before their use can become partly or completely tainted or become unsellable, the producer cannot receive the utility value, but at most a value which is calculated as a composite mathematical expectation, based on the full and reduced prices and their probabilities.
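The comparison between stake and mathematical expectation can be made concrete with a small computation; the lottery figures below are hypothetical, chosen only to illustrate the definition:

from fractions import Fraction

# A hypothetical lottery: a prize of 800 with probability 1/1000, else nothing.
outcomes = {Fraction(800): Fraction(1, 1000), Fraction(0): Fraction(999, 1000)}

# Mathematical expectation: sum of value times probability.
expectation = sum(value * prob for value, prob in outcomes.items())
stake = Fraction(1)

print(expectation)            # 4/5
print(expectation < stake)    # True: by the rule above the venture is inadvisable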
3.1.4 Examples concerning life insurance and life annuity

The computation of the price of a life insurance or annuity can serve as an example of application of mathematical expectation. According to the calculus of interest, a payment of a unit of value in z years will be equivalent to an immediate payment of (1 + r)^−z, and the probability that a living person of age a will die in the moment between z and z + dz years will be equal to −(la′(z)/la) dz. The mathematical expectation that the life insurance is going to be active in that exact moment is thus equal to −(la′(z)/la)(1 + r)^−z dz and the net price of the life insurance is the integral of this from z = 0 to z = ∞, so that it equals

∫0^∞ −(la′(z)/la) (1 + r)^−z dz.
For a life annuity that is considered paid continuously with one unit per unit of time (the year), the discounted elementary share of annuity that is to be paid in the time unit between z and z + dz will have a value of (1 + r)^−z dz. The probability that the owner of the annuity is alive at this time, such that the annuity must be paid, is equal to la(z)/la. The element of mathematical expectation is thus equal to (la(z)/la)(1 + r)^−z dz, and thus the net price of the life annuity becomes the integral

∫0^∞ (la(z)/la) (1 + r)^−z dz.
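Both net prices can be evaluated numerically. The sketch below reuses a hypothetical Gompertz survival function for la(z)/la (illustrative parameters, not Thiele's) and applies the trapezoid rule to the two integrals:

import math

def S(z, age, B=5e-5, c=1.09):
    """Hypothetical Gompertz survival function standing in for la(z)/la."""
    return math.exp(-B / math.log(c) * c**age * (c**z - 1))

def net_prices(age, r=0.03, zmax=120.0, steps=12_000):
    """Net single premiums by the trapezoid rule:
    insurance: integral of -(la'(z)/la) (1 + r)^-z dz,
    annuity:   integral of  (la(z)/la)  (1 + r)^-z dz."""
    dz, h = zmax / steps, 1e-5
    insurance = annuity = 0.0
    for i in range(steps + 1):
        z = i * dz
        w = 0.5 if i in (0, steps) else 1.0
        disc = (1 + r) ** (-z)
        dens = (S(z - h, age) - S(z + h, age)) / (2 * h)   # -dS/dz
        insurance += w * dens * disc * dz
        annuity += w * S(z, age) * disc * dz
    return insurance, annuity

print(net_prices(40))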
3.1.5 Indirect or posterior determination of probability from observed frequency

The most important way to determine probability, on which all other determination must rest, is the inference from actual observed frequency to probability. When a series of n = m + l observations in m cases has given the same outcome, the answer ‘yes’, but ‘not-yes’ in l cases, one wishes to determine the probability p/(p + q) for the outcome ‘yes’ as it would be in trials using the same method. This problem cannot be dismissed and it is solved over and over again; but in a pure mathematical sense it cannot be solved apart from those (impossible) cases where n = ∞. Whatever special value one would assign to p/(p + q), except 0 and 1, one cannot deny the possibility that this could be the correct probability. No one can force a particular general rule upon himself or others for deducing p/(p + q) from m and l. The law of large numbers does not limit the freedom of choice in any other way than requiring that p/(p + q) = lim(m/(m + l)). That the solution depends on a choice can often be very unpleasant as a major responsibility can be associated with this choice; and mathematically educated people would generally be very reluctant to quote numbers at will rather than based upon proof. It is therefore not strange that, as we shall describe below, one has stretched mathematics beyond its limits to establish a method with proof. If one is willing to yield to the impossibility and refrain from a mathematical proof, it is not difficult to choose a method, even one that can always be used. For lim m/(m + l) can be written as a general series expansion for the probability

(m + a + φ)/(m + l + b + ψ),
where a and b are constants and φ and ψ are functions that vanish for m = ∞ and l = ∞, and if one simply lets φ = 0 and ψ = 0, an expression is obtained which can be used for arbitrary numbers m and l and gives an applicable expression for the probability when in

(m + a)/(m + l + b)
a and b are chosen positive and b greater than a; but if this expression should be usable in general, it must conform with the corresponding expression for the probability of the opposite outcome,

(l + a)/(m + l + b),
and therefore we must take b = 2a. Thus when a ≥ 0,

p/(p + q) = (m + a)/(m + l + 2a)   (3.4)
will be among the rules that would be feasible. For a one can choose any positive number; nobody could convince me that the choice is inadmissible; however, the question is whether I can feel personally satisfied by my choice and that depends on my attitude towards the consequences of such a choice. If one is taking a very large number for a, it follows that the probability will be almost equal to 1/2, at least when the determination of the relative frequency does not itself involve a very large number of observations. A person who has often been disappointed by expecting that determinations of relative frequencies should converge quickly to a definite number would subsequently probably feel tempted to choose a large value for a; but few are so sceptical that they seriously and in the long run could bear the consequence of this and reject all determinations of relative frequencies that are not based upon plenty of observations, for example thousands; and it is obvious that this is what is done by forcing all probabilities towards the value 1/2, indicating that one may as frequently expect one outcome as its opposite. In the literature, the choice only varies between the limits a = 1 and the lowest possible limit a = 0. The difference between these is only minor and the objection that can be raised against any other choice than a = 0, that probabilities less than a/(n + 2a) or larger than (n + a)/(n + 2a) not just with difficulty but in fact not at all could result from n observations, has even for a = 1 no more weight than one could defend oneself by referring to the fact that this inconvenience is an unavoidable consequence of the chosen formula for approximation of the probability. But to choose between the assumptions a = 0 and a = 1 and everything in between these, one must realize what light the assumed probability (m + a)/(m + l + 2a) throws[30] on the fact that there has been observed m times ‘yes’ against l times ‘not-yes’. The assumed probability implies the validity of an error law for the combined outcome of the m + l observations, according to which all conceivable frequencies, including the actual one, would appear more or less frequently by infinitely many new determinations and could be compared with respect to their frequency; and what should guide our choice of probability and the value of a must be properties of the relationship to this error law distinguishing the observed frequency from those that could have been—but were not—observed.
[30] This argument leads to the idea of likelihood, see page 123. tr.
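The practical effect of the constant a in (3.4) is easy to tabulate; the counts below are hypothetical:

def estimate(m, l, a):
    """Thiele's family of rules (3.4): (m + a)/(m + l + 2a)."""
    return (m + a) / (m + l + 2 * a)

m, l = 7, 3   # hypothetical: 7 'yes' against 3 'not-yes'
for a in (0, 0.5, 1, 10):
    print(a, estimate(m, l, a))
# a = 0 returns the raw frequency 0.7; a very large a drags the estimate
# towards 1/2, the behaviour described above for the extreme sceptic.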
We have above in Sections 2.2.5 and 2.3.4 treated a problem concerning frequencies with a regular distribution described by the terms of the binomial formula. The main significance of this problem is that it can be applied to the present question. For when the probability of ‘yes’ in a single trial is p/(p + q), thus q/(p + q) for ‘not-yes’, the main theorems of probability state that the probability that in n = m + l mutually independent observations of the same trial ‘yes’ occurs m times and ‘not-yes’ occurs l times in a specific order is equal to p^m q^l/(p + q)^(m + l), so the probability of the same distribution in an arbitrary order will be equal to

((m + l)!/(m! l!)) · p^m q^l/(p + q)^(m + l).   (3.5)
Here we thus have an expression for the error law that governs the various outcomes of the frequencies in n = m + l observations with assumed constant probability p/(p + q) in a single trial. From what we have developed earlier, we see that the same error law can be expressed by the halfinvariants of the frequencies:

μ1 = np/(p + q),  μ2 = npq/(p + q)^2,  μ3 = npq(q − p)/(p + q)^3,  …,   (3.6)

previously given in formula (2.25).
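As a check, the first halfinvariants can be computed directly from the probabilities (3.5); the closed forms printed at the end are the standard binomial halfinvariants, matching (3.6) as given above (with p + q normalized to 1):

from math import comb

def pmf(m, n, p):
    """Formula (3.5) with p + q = 1: probability of m 'yes' in n trials."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def halfinvariants(n, p):
    """First three halfinvariants of the count m, straight from the pmf."""
    mu1 = sum(m * pmf(m, n, p) for m in range(n + 1))
    mu2 = sum((m - mu1)**2 * pmf(m, n, p) for m in range(n + 1))
    mu3 = sum((m - mu1)**3 * pmf(m, n, p) for m in range(n + 1))
    return mu1, mu2, mu3

n, p = 10, 0.275
q = 1 - p
print(halfinvariants(n, p))
print((n*p, n*p*q, n*p*q*(q - p)))   # the closed forms of (3.6)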
previously given in formula (2.25). An investigation of the error curve, where we recommend the reader to perform the calculations and draw the figures in selected examples, will show that this curve, which strictly speaking is limited to integer values between 0 and n, follows the axis of abscissae (m) for all values where either m or l = n − m negative and in the interval between bends away in a simple curve that in general is not quite symmetric and never has more than one maximum with ordinate φ(m/(m + l)). This also follows directly from the fact that
for two consecutive relative frequencies to be equally probable, we must have

(n − m)p = (m + 1)q,
and thus

m = (np − q)/(p + q).
As long as m is smaller than this, the probability must increase with m but decrease with larger m; hence the only maximum for positive m and l must correspond to an m with

(np − q)/(p + q) ≤ m ≤ (n + 1)p/(p + q).
A more definite value for the maximum of the error curve can only be given under a somewhat arbitrary assumption about the course of the curve between the isolated points that are given directly, corresponding to integer values of m and l. For further illustration a figure is provided,[31] constructed for the case m + l = 10 and under the assumption that the continuous error curve can be determined by ordinary interpolation after Newton's formula or similar theories. It shows a series of 20 curves for the cases where the probabilities φ(m/(m + l)) of outcomes have the constant values 1/20, 2/20, …, 1. For example, the probability of 1 ‘yes’ in 10 trials, each with probability 0.275, is found to be 3/20. The 11 thick vertical parallel lines correspond to constant values of the observed frequencies of m against l, whereas the 41 finer parallel horizontal lines represent the assumed probability for a single ‘yes’. As one can see, all the curves coincide at four points, two at each end of one of the diagonals in the figure (p/(p + q) = 0 with m = 0 and m = −1, and p/(p + q) = 1 with l = 0 and l = −1). Only the four most extreme of the curves, corresponding to the smallest probabilities 1/20, 1/10, 3/20, and 1/5, run along the diagonal from one end to the other; the other 16 curves branch out in two at the ends. The last one, the curve of certainty, falls of course completely outside the extreme boundaries (m = 0, l = 10 and m = 10, l = 0) of the part of the figure that really has significance for our problem. The figure corresponding to n = 10 has been chosen because on one hand it accentuates the symmetry around the diagonal which becomes the most apparent feature of the curves as n increases, whereas n = 10 is still small enough to display a characteristic skewness, breaking this symmetry. For our purpose one should think of the curves as level curves in a landscape, and think of the entire relief as being cut according to the horizontal parallel lines assuming constant probability; the profiles of these cuts would then display the error curves for the probabilities of the frequencies. Apart from the curves and the two systems of parallel straight lines, one will also notice three fine lines through the middle of the figure; they correspond to the equation

p/(p + q) = (m + a)/(m + l + 2a)
[31] At the back of Thiele's book. Reproduced on page 197 of this volume. tr.
for a having the values 0, 1/2, and 1. (Abscissa equal to m, ordinate equal to p/(p + q).) The most central of these straight lines (a = 1/2) intersects the level curves of the figure at least so close to the points of maxima of the curves that drawings like this do not show a proper deviation. This conforms with the above indication of the limits for this maximum, since these limits are equivalent to

m/(m + l + 1) ≤ p/(p + q) ≤ (m + 1)/(m + l + 1),
and considering the undetermined definition of the continuous error curves at points with finite differences of abscissa, one can very well think of the straight line for a = 1/2 as the geometric locus for the maxima of the error curves. On the basis of this, it is easily found by measurement on the figure that these error curves for constant probabilities are not symmetric except for the probability 1/2. The error curves descend more steeply towards the two boundaries m = 0, l = n and m = n, l = 0. And this conforms with the sign of μ3 in (2.25). The equation (2.25) for μ1 shows that the mean of the relative frequency is equal to the assumed probability, corresponding to the determination a = 0. Thus the values of 0 and 1/2 for a commend themselves[32] by directing the choice of the probability in question to the particularly interesting values for which the observed frequency is equal to the mean or centre of gravity of the error curve in the first case (a = 0), and to its maximal value in the second case (a = 1/2). The choice a = 1 does not commend itself through any such simple and conspicuous property, but although this choice does not lead to the probability which makes the observed outcome most probable of all and, as our figure shows, gives only a slightly greater probability than that which corresponds to a = 0, it does nevertheless lead to probabilities which make the observed outcome relatively highly probable. The assumption a = 1, which under the name of Bayes' rule[33] has played an important role and at least has historical significance, has hardly any more distinctive property related to these error curves and their graphical representations as in ours for m + l = 10 than that which is associated with the contour-figure's transects along the straight lines for the observed frequencies; such transects, normal to those previously mentioned, would also represent lines with properties characteristic of skew error curves, and whereas the maximal heights[34] in these
[32] The translation of the next few pages has been made in cooperation with A. W. F. Edwards, Cambridge. The excerpt is also printed in Edwards (2001). tr.
[33] In fact, this is Laplace's rule of succession rather than Bayes' rule. tr.
[34] Although Thiele here stumbles across the fact that a = 0 is the maximum likelihood estimate, he does not seem to pay any particular attention to this fact. tr.
cuts now correspond to a = 0, the centre of gravity of the curves, or the mean probabilities, are determined by their intersections with a = 1. If x denotes the abscissae of these curves and y = k x^m (1 − x)^l their ordinates, the abscissa of the maximum and the mean abscissa are determined by

x = m/(m + l)
and

x̄ = ∫0^1 x^(m + 1) (1 − x)^l dx / ∫0^1 x^m (1 − x)^l dx = (m + 1)/(m + l + 2).   (3.7)
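Both abscissae of (3.7) can be verified directly; the counts m = 7, l = 3 below are arbitrary illustrative choices:

from fractions import Fraction

def mode_and_mean(m, l):
    """Maximum and mean abscissa of y = k x^m (1 - x)^l:
    mode = m/(m + l); mean = (m + 1)/(m + l + 2), formula (3.7)."""
    return Fraction(m, m + l), Fraction(m + 1, m + l + 2)

def numeric_mean(m, l, steps=100_000):
    """Direct midpoint-rule evaluation of the two integrals in (3.7)."""
    num = den = 0.0
    for i in range(steps):
        x = (i + 0.5) / steps
        y = x**m * (1 - x)**l
        num += x * y
        den += y
    return num / den

print(mode_and_mean(7, 3))   # (7/10, 2/3)
print(numeric_mean(7, 3))    # ~0.6667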
To relate these properties to the question of determining the probability from relative frequency, one must completely give up thinking about this probability as a particular unknown number; but one can clearly also think of the unknown probability as composite. One could imagine the observations ordered according to an unknown inessential circumstance, such that the individual probabilities v0, v1, … became functions of this inessential circumstance, while again one could imagine these different values occurring with varying probabilities u0, u1, …, such that the resulting combined probability became

v = u0v0 + u1v1 + ⋯, where u0 + u1 + ⋯ = 1.
It is obvious that such a general consideration cannot help us determine the probability by the observed frequency; it is clearly no easier to determine the many unknown partial probabilities than the total probability. But the consideration could still be valuable if one could say something general about the dependence between v and u. A similar consideration lies at the bottom of the previously accepted theory of ‘a posteriori probabilities’. When a single relative frequency is observed and one is uncertain about which of the infinitely many possible probabilities between 0 and 1 to derive from this frequency, then the thought appears that although all probabilities are possible, they may apparently not all be equally likely as hypothetical causes of the observed frequency; the likelihood[35] must for each of them be
[35] The word likelihood is used here exactly in the Fisherian sense. Indeed, Thiele introduces the word Rimelighed for likelihood, in contrast to Sandsynlighed (probability). See Edwards (2001) for further comments on Thiele's treatment of this concept. tr.
expressible in terms of a number, and it is not obviously incorrect to measure this likelihood with the probability φ(m/(m + l)) of the observed frequency, calculated on the assumption of p/(p + q) being the unknown value for the probability. That for these measures of the likelihood of the hypotheses people have used the term ‘the probability of the correctness of the individual hypothesis’ would not have done any harm unless they had additionally allowed themselves to calculate with these ‘probabilities’ using all the rules from the direct calculus of probability, even though the definition of probability does not apply to them. In Tidsskrift for Mathematik, 1879, Mr Bing[36] has, however, shown that such an extension of the notion of probability leads to contradictions when applied without additional precautions. Yet I do not think that for this reason one should discard the method completely, in particular if it is only applied to cases where the observation is either/or. If one does assume that the likelihoods for the individual hypothetical probabilities in this case can be treated as probabilities or according to the rules of mathematical expectation and are considered proportional to the probability of the observed frequency as given by every hypothesis, then the probability under consideration should be determined after Bayes' rule,[37] as we then have

p/(p + q) = ∫0^1 x^(m + 1) (1 − x)^l dx / ∫0^1 x^m (1 − x)^l dx = (m + 1)/(m + l + 2).
But even in this determination it is apparent that the result is not a mathematical necessity, but rather just as arbitrary as a direct choice of the probability itself or as the choice of the constant a in the general formula. One could perhaps measure the likelihoods in a different and better way, and in particular it is not obvious why these likelihoods, (1 − x)^l x^m, should be calculated with the observed frequencies, as these only correspond to a finite number of observations, and not to the probabilities themselves. Enough about this! As I see it, the main thing is that all rules of the form

p/(p + q) = (m + a)/(m + l + 2a)
[36] Bing (1879). Thiele writes ‘Direktør Bing’, using Mr Bing's title as an actuary in the Danish State Life Insurance Company, which in Danish is the more polite way of referring to a person. The direct translation of ‘Direktør’ is ‘Manager’. tr.
[37] Here ends the excerpt translated in Edwards (2001). tr.
with a between 0 and 1 can be used as approximations, whereas none of them can be expected to give any mathematically certain value for the desired probability. Whatever value one chooses, it should not be considered an exact number, but only the main value in an error law for the probability. Any such determination needs to be supplemented with an indication of additional details for the error law concerning the likely limits for the result. This indication is best given through the series of halfinvariants and must in its general form look as follows:
and so on. It is now easily seen that the standard deviation[38] almost always will be larger than the difference between two probabilities calculated with different values of a and c (between 0 and 1). For
only demands that
Here it is obvious that as long as neither m nor l is equal to 0, but both are positive integers, this inequality must hold. Only in the extreme case where all the observations have given the same outcome can the deviations reach or exceed the standard deviation and therefore be termed large; that is, only when the number of observations is infinite should one be certain of this outcome, whereas with finite numbers a rigorous justification for concluding certainty can never be obtained. If we then let m = 0 and l = n, the inequality becomes
the requirement
is even stricter, but this is achieved for
or approximately a > 3/8. Both the assumption a = 1 and a = 1/2 are thus completely safe in this respect;
[38] The halfinvariants above are for the counts m. For this remark to be correct, μ2 must be interpreted as the halfinvariant of the frequency, μ2(m/(m + l)) = μ2(m)/(m + l)^2. tr.
among the three choices discussed, only a = 0 is subjected to the criticism that in the extreme cases, where all observations have given the same result, one would subsequently conclude that ‘yes’ would be certain or impossible. But even then one would not dare to reject the choice a = 0 since though it gives a standard deviation 0 when either m or l is equal to 0, the higher halfinvariants still show that
so the error law is not just asymmetric, but asymmetric to an infinite degree, whereby one would still not dare to conclude certainty. Yet, to remove the danger of any conclusion to certainty, it can in isolated cases be recommended to use one of the other choices, and in particular a = 1/2 suffices completely. However, one can also quite well apply Bayes' rule, but as with any of these only in isolated cases where no other material is available to determine the probability and where this is to be used for an immediate judgement, but rather not for further calculations, and in particular not in cases where the determined probability is to form part of a calculation of adjustment by the method of least squares. As a preliminary result for use in further calculations one should prefer the case a = 0, that is

p/(p + q) = m/(m + l);
for the sake of consistency one should in particular do this for use in adjustment by the method of least squares, where all results of the observations are assumed to be averages. But since in these calculations one also applies the standard deviations and assumes that the error law is exponential and symmetric, one must be careful with cases where m = 0 or l = 0 and either omit such observations from the adjustment or better successively calculate standard deviations using values for p/(p + q) that eventually appear from further calculations in the adjustment. Hereafter it should in no way be overlooked that empirical determinations of probability rarely can achieve great accuracy. This is evident from the fact that the standard deviation for relative frequencies, no matter what probability is assumed for a single outcome, is inversely proportional to the square root of the number of
observations. For as

μ2(m) = (m + l) pq/(p + q)^2,

we get

μ2(m/(m + l)) = pq/((p + q)^2 (m + l)).
The first observations reduce the standard deviation considerably; but then it becomes slower and slower; while by 20 to 50 observations one can expect the first digit to be correct for the relative frequencies and thereby for the probability, between 2000 and 5000 observations are needed for the second digit, and between 1/5 and 1/2 million for the third. This should not be forgotten when the question is to test a series of observations to prove that the law of large numbers does not apply for it, nor when one is testing a hypothetical probability by experiments. Because a die has given the value ‘six’ in 22 cases out of 100 throws, another may be only 12 times, whereas 16.7 is normal, there is only extremely little reason to suspect them to be false. For the standard deviation is 3.7, and the limits between large and small deviations are thus 20.4 and 13.0, so rather large deviations are not rare. This also shows that the theorems of the calculus of probability about how one probability can be derived from another, and with some difficulty also a priori, are as important as all seem to agree. When it takes such a long time and it is that difficult and that expensive to obtain accurate empirical determinations of the probabilities, one must use those that have already been obtained in an economical way, not missing the opportunity to determine all other probabilities that can be derived from them. Even conclusions by analogy and rather uncertain hypotheses should not be scorned. In statistical determinations of frequency and probability, where one usually lets a = 0, that is p/(p + q) = m/n, one finds the squared standard deviation on the number to be μ2(m) = ml/(m + l), on the probability ml/(m + l)^3. For absolute statistical counts of infinitely possible numbers m + l = ∞ one thus finds μ2(m) = m.
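Thiele's dice arithmetic can be reproduced in a few lines:

from math import sqrt

n, p = 100, 1/6                 # 100 throws of a die, one face in question
mean = n * p                    # ~16.7 sixes is 'normal'
sd = sqrt(n * p * (1 - p))      # ~3.7
print(mean, sd)
print(mean + sd, mean - sd)     # ~20.4 and ~12.9: so 22 or 12 sixes in 100
                                # throws give almost no ground for suspicion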
3.2 The error law of the method expressed by symmetric functions

The calculus of probability is needed to find error laws of methods in cases where the results of the trials cannot be expressed in terms of numbers; it is also convenient for results with numbers, when the question is an either/or; but as all other
cases can be reduced to determination of probability, it can, although with some difficulty, also be applied to the continuous type of observations in the theory of measurements. We have, however, already seen the advantages of applying the method of halfinvariants to the error laws of probability, and the advantages of these will also be confirmed when determining the error law of the method for an actual error law which is expressed by halfinvariants. While the main task must be to determine the error law of the method from the actual error law, we will initially assume that the error law of the method has halfinvariants λ1, λ2, λ3, …, whereby, according to the law of large numbers, we understand the limits of the halfinvariants for infinitely many observations:

λ1 = lim μ1, λ2 = lim μ2, λ3 = lim μ3, ….
For the error law of the method we will also substitute the name standard deviation with mean error, λ2 will be called the mean square error, and μ2 the squared standard deviation. A prior determination of the error law of the method will always rely on thinking of every observation as a function of several simpler observations, whose error laws are known or assumed to be known. As regards the shape of the assumed function, it is given directly in all the cases where the result, which is considered as an observation, is in fact obtained by a calculation based upon observed values of the independent variables. But as in this case it does not matter whether one is preserving or discarding the calculations, one can also, when the observation has been generated without any calculation by composite operations, think of each of these as represented by a number and, to the extent one knows the error law for these without their special values having been known to the observer, one can calculate the error law of the composite observation using the character of the function that would have been used to calculate the value of the observation with these special values. But, of course, in this way most often one only obtains a partial determination of the error law with unknown values for some of the numbers involved. What we have learned earlier about actual error laws for functions has an unchanged validity also concerning the error law of the method; we have by actual error laws not excluded the case where the number of observations could be infinite. But by the law of large numbers, the limitation which applied to the actual error laws ceases to be valid: that the values of the function expressly should be considered as generated by free combination of each of the independent variables. When the error law generated by infinitely many observations is a particular one, it does not matter how the infinite number is generated, the error law of the method
must be the same both by restricted and free combination of the values for all the independent observations, provided that the restricted combinations do not prevent us from viewing the determinations of values of functions as independent observations, as happens in particular when a result for one variable appears as an essential circumstance for the values of other variables with which it is combined. For the error law of the method as for the actual error laws, the main theorem about linear functions given in (2.45) holds: when

x = a0 + a1o1 + a2o2 + ⋯ + anon
then

λ1(x) = a0 + a1λ1(o1) + a2λ1(o2) + ⋯ + anλ1(on),
λ2(x) = a1^2 λ2(o1) + a2^2 λ2(o2) + ⋯ + an^2 λ2(on),

and in general λr(x) = a1^r λr(o1) + a2^r λr(o2) + ⋯ + an^r λr(on) for r ≥ 2.
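The theorem is easy to check by simulation. In the sketch below the ingredients are exponential variables (our own choice; the r-th halfinvariant of an exponential with rate 1 is (r − 1)!), and the first three sample halfinvariants of a linear combination are compared with the predictions:

import random
from statistics import fmean

def halfinvariants(xs):
    """First three sample halfinvariants: mean, variance, third central moment."""
    m1 = fmean(xs)
    d = [x - m1 for x in xs]
    return m1, fmean(y * y for y in d), fmean(y ** 3 for y in d)

rng = random.Random(3)
N = 200_000
a1, a2 = 2.0, -1.0
o1 = [rng.expovariate(1.0) for _ in range(N)]   # halfinvariants 1, 1, 2
o2 = [rng.expovariate(1.0) for _ in range(N)]
x = [a1 * u + a2 * v for u, v in zip(o1, o2)]

print(halfinvariants(x))
# the theorem predicts (a1 + a2, a1^2 + a2^2, 2 a1^3 + 2 a2^3) = (1, 5, 14)
print((a1 + a2, a1**2 + a2**2, 2 * a1**3 + 2 * a2**3))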
For non-linear functions one must also express the error law here by sums of powers in combination with repeated summation for every independent variable, assuming their free combination. The most important example of determining the error law of the method for functions is the determination of the error law for the individual halfinvariants in an actual error law by a finite number m of observations, because only the solution of this problem, when the error law of the method for the observations is given, can show us the extent to which the actual error laws in the form of halfinvariants can be considered to depend on the number of observations, and what form this dependence will have.
3.2.1 The error law of the average

First of all it will be shown that the average μ1 has a mean value independent of the number of observations, and that it is the more accurately determined the larger this number. For the average is a linear function

μ1 = (o1 + o2 + ⋯ + om)/m
and the error laws of the method for o1, o2, …, om are identical, so

λ1(μ1) = λ1, λ2(μ1) = λ2/m, and in general λr(μ1) = λr/m^(r − 1).   (3.8)
Thus the standard deviation of the average is always smaller than the standard deviation of a single observation, and its relation to this is inversely
proportional to the square root of the number of observations. Also the higher halfinvariants are smaller for the average than for a single observation, both absolutely and relative to the standard deviations, as

λr(μ1)/(λ2(μ1))^(r/2) = (λr/λ2^(r/2)) · m^(1 − r/2).
For large values of m the error law of the average will approach the exponential form, even though the law of a single observation is very far from being exponential.
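This approach to the exponential form (in Thiele's vocabulary, the Gaussian error law) can be watched numerically. The sketch below averages strongly skew exponential variables, chosen only for illustration, and tracks the standardized third and fourth halfinvariants:

import random
from statistics import fmean

def standardized(xs):
    """Standardized third and fourth halfinvariants (skewness, excess kurtosis)."""
    m1 = fmean(xs)
    d = [x - m1 for x in xs]
    m2 = fmean(y * y for y in d)
    m3 = fmean(y ** 3 for y in d)
    m4 = fmean(y ** 4 for y in d) - 3 * m2 * m2
    return m3 / m2 ** 1.5, m4 / (m2 * m2)

rng = random.Random(4)
for m in (1, 4, 16, 64):
    avgs = [fmean(rng.expovariate(1.0) for _ in range(m))
            for _ in range(50_000)]
    print(m, standardized(avgs))
# a single exponential gives (2, 6); by (3.8) the average should give about
# 2/sqrt(m) and 6/m, shrinking towards the symmetric Gaussian form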
3.2.2 The error law of the standard deviation

As the square of the standard deviation,

μ2 = ((o1 − μ1)^2 + (o2 − μ1)^2 + ⋯ + (om − μ1)^2)/m,
is not a linear function of o1 … om, the determination of its error law of method λr(μ2) must be done by the more difficult detour via the sums of powers; and initially such that the independent variables o1, o2, …, om are separated in the expression for μ2 in such a way that they or their powers are factors in every single additive term, exploiting the accompanying tables.[39] Thus for example
The expressions for μ2 ordered in this way could be added by an m-double infinite summation over all values of the observations for o1, o2, …, om that enter into the error laws of the methods, and as these error laws here are mutually identical and could be expressed by the infinite sums of powers of an individual o,
whose relation to the infinite number σ0 is known as functions (see the equations (2.16)) of the given halfinvariants λ1, λ2,…, it holds that
where r is the number of the different o that are represented in the product oφ oψ oω …. As the order of summation is irrelevant and if we substitute
[39] See page 194. tr.
∑ by ∑, then all terms will be identical in the last ∑-sum, their number will be equal to m!/(m − r)!, and in every term the m-double S-sum can be made first for o1, then for o2, and so on. In this way we find
and as
we get

λ1(μ2) = ((m − 1)/m) λ2,
λ2(μ2) = ((m − 1)^2/m^3) λ4 + (2(m − 1)/m^2) λ2^2.   (3.9)
From this we see that the mean of the squared standard deviation is not completely independent of m and therefore not identical to λ2 except for very large numbers m. The squared standard deviation of μ2 depends not only on λ2 and m, but also on λ4, although not on λ3. When λ4 is negative, that is for flat or centrally hollow error curves, λ2(μ2) can become small, also for small values of m, and quickly approach 0; clearly it cannot become negative. The equation for the borderline case λ2(μ2) = 0
will, apart from when m = 1 or m = ∞, demand that

λ4 = −(2m/(m − 1)) λ2^2;

the most negative possible value will then be equal to −2λ2^2, and more negative values would at least for some values of m make λ2(μ2) negative. In this extreme
case, which happens for the game of ‘heads and tails’ (only two possible outcomes and both equally probable), we have

λ2 = 1/4, λ4 = −2λ2^2 = −1/8, and λ2(μ2) = (m − 1)/(8m^3),
so here the standard deviation itself and not its square will be approximately inversely proportional to the number of observations. But whereas in this and other similar cases one can find particularly accurate values for the standard deviation also with small numbers, it will mostly be a very slow process to obtain some accuracy in the squared standard deviation.
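A small simulation illustrates the contrast. The sketch below (sample size and comparison distribution are our own choices) estimates λ2(μ2) for heads and tails and for a Gaussian law with the same λ2, and compares both against (3.9) as reconstructed above:

import random
from statistics import pvariance

def lambda2_of_mu2(draw, m, reps=20_000, seed=5):
    """Monte Carlo estimate of lambda_2(mu_2), the squared mean error of the
    squared standard deviation, for samples of size m."""
    rng = random.Random(seed)
    values = [pvariance([draw(rng) for _ in range(m)]) for _ in range(reps)]
    return pvariance(values)

m = 20
coin = lambda rng: 1.0 if rng.random() < 0.5 else 0.0   # lambda_2 = 1/4, lambda_4 = -1/8
gauss = lambda rng: rng.gauss(0.0, 0.5)                 # lambda_2 = 1/4, lambda_4 = 0
print(lambda2_of_mu2(coin, m))    # ~ (m - 1)/(8 m^3) = 0.000297
print(lambda2_of_mu2(gauss, m))   # ~ 2(m - 1) lambda_2^2/m^2 = 0.00594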
3.2.3 The error laws of the higher halfinvariants

Similarly as for μ2, the error laws for μ3, μ4, etc., can be calculated, just with more difficulty. We get (3.10)
Also (3.11)
and (3.12)
(3.13)
When a series of m proper observations has been made by a method for which one wants to know the error law λ1, λ2, …, and the question is asked whether these experiences can confirm or reject the correctness of the hypothesis, whether one can sustain the opinion that the error law of the method is as assumed, then one should compare each term of the actual error law with the corresponding term of the error law of the method, and in particular notice whether the observed μr lie within the limits λ1(μr) ± √λ2(μr).
If some μr is far outside these limits without λ3(μr) and higher halfinvariants explaining this deviation as being admissible, then the hypothesis must be rejected. In contrast, one can undisputedly sustain the assumption that the true error law of the method is λ1, λ2 …, when all the equations
are satisfied within the limits of standard deviation, which of course is best when the agreement is perfect.
3.2.4 The empirical determination of the error law of the method
In the main case, when the error law of the method is completely unknown and is to be determined from the m observations represented by their actual error law, this problem cannot be solved exactly; but temporarily (until more observations become available) the best agreement is ensured by determining λ1, λ2, … such that they satisfy the equations μr = λ1(μr), expressing that each of the halfinvariants determined by all the available observations must, for want of something better, be
taken as their own ideal,40 thus (3.14)
As this determination of the error law is in itself a hypothesis, its accuracy must be judged according to the rules above, thus in particular by the values for λ2(μr) calculated from the above values of λr; one could also have estimated λr to be as far to both sides as these mean errors indicate, without in general having to consider the deviations as contradicting the assumptions. For reasonably large m, the differences between μr and λr are insignificant and the limits can therefore directly be taken as limits for the uncertainty of λr. In the example used earlier (page 72 and page 85) with 500 repetitions of an observation (a modified ‘game of patience’ with cards) where the individual results 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, and 19 had occurred respectively 3, 7, 35, 101, 89, 94, 70, 46, 30, 15, 4, 5, and 1 times, we had found the actual error law to be
40. In modern terms, Thiele forms a non-parametric estimate of a distribution by equating the first r empirical halfinvariants (cumulants) to their expectation. Fisher (1928) turns (independently) the argument around and identifies the symmetric functions which are unbiased estimates of the theoretical cumulants, the so-called k-statistics; see also p. 212 ff. tr.
leading to
Whereas the mean and standard deviation are well determined, and also λ3 to the extent that it can be considered proven not to be equal to 0 and positive, λ4, λ5, and λ6 are very uncertain, in particular because their standard deviation has been calculated by the abridged formula (valid for the exponential error law when m is not too small) (3.15)
which in this case undoubtedly gives too small values; and despite the large absolute values for λ4, λ5, and λ6, one could in no way deny the possibility that they really might be negligibly small. This gross undeterminedness of the higher halfinvariants is to be considered a rule which has exceptions (e.g. heads and tails), but only rarely so in proper observations. As the factor r! in the last approximation formula shows, two observations do not count as more than one for the determination of λ2, for λ3 one must calculate the effective number of observations as m/6, for λ4 as m/24, for λ5 as m/120, and so on; thus as a rule one needs an extraordinarily numerous series of observations for an acceptably accurate determination of the higher halfinvariants of the method, and, for this reason, cases where deviation from an exponential error law can be documented are rather rare. One has argued that the exponential error law could be considered general for sufficiently numerous observations; it is more correct to say that in general one must be content with considering the error laws as exponential, because one rarely has a sufficient number of observations to prove deviation from this law. The rare occurrence of non-exponential error laws is partly due to the fact that error laws for linear functions of several observations must always approach the exponential form, and that the same holds when none of the inessential circumstances for the observation dominates, but several of these work together so that
the value of the observation
approximately has the form
and its error law of method must be of a similar kind as for a linear function of (unknown) elementary observations. For finer observations this result is a consequence of the fact that the most obvious incidental circumstances are removed as they are discovered, by theoretical calculation of corrections, annihilating the effect of these circumstances on the individual observations. One can therefore, in particular concerning measurements, take the point of view of not just rejecting the observations that do not obey the law of large numbers and thus do not have any error law of method, but also of letting observations with non-exponential error laws share their fate.
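To make the arithmetic of the ‘effective number of observations’ above concrete, here is a small sketch, assuming (as I read the abridged formula (3.15)) that λ2 of the estimate of λr is approximately r!·λ2^r/m for an exponential error law; all names are mine:

```python
# Sketch only: approximate standard deviations of estimated halfinvariants,
# lambda_2(lambda_r) ~ r! * lambda_2**r / m, so the effective number of
# observations for lambda_r is about m / r! (m/2, m/6, m/24, m/120, ...).
import math

def approx_sd(r, lam2, m):
    return math.sqrt(math.factorial(r) * lam2**r / m)

m, lam2 = 500, 4.0
for r in (2, 3, 4, 5):
    print(r, m // math.factorial(r), round(approx_sd(r, lam2, m), 3))
```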
3.3 On adjustment
3.3.1 Reflections on the possibility and the conditions for a general method of adjustment
The main task generally in the theory of observations, to determine simultaneously all the error laws and the relations between the observations from an arbitrary number of observations of which possibly none is a repetition, is definitely not solvable in a strict mathematical sense, and at the least very difficult even under the practical assumptions that the theory of observations must accept. One must constantly make additional assumptions, but before we commit ourselves to these, it will be useful to inspect the difficulties around the general problem. In general the available observations will be functions, each of its essential circumstances, which again must be either given in terms of other observations or possibly determined from exact theoretical considerations. But in general these functions are completely unknown and we wish to determine them from the observations. As known from interpolation, it is theoretically impossible to determine an unknown function—let alone several—from a finite number of its values, even though these values are given exactly; and obviously it is even more impossible to squeeze such determinations out of the approximate values that we are confined to here. This deficiency can only be remedied by hypothesis. One must, as already in interpolation, assume the shape of the functions to be known and be content with empirical determination of a number of constants of the function.
When the entire system of observations decomposes in as many parts as there are constants to determine, and each part presents itself as such a numerous series of repetitions of one observation that its error law of method can be determined from it, then the problem reduces to the usual determinations of constants or interpolation in connection with what we have developed above concerning error laws for functions. Determinations of this kind must be regarded as the best—most assumption free—in all empirical science; this should be sought and realized by the way the observations are made. But quite frequently the individual series of repetitions are too few in number, and one must take into account scattered observations or short series of repetitions in a number so much larger than the number of constants that one series can remedy the insufficiency of the other, and then the general problem appears which has been named adjustment to reflect the mutual support and indulgence of the observations. To search for the principles of all adjustment we provisionally postpone the question about how the constants are determined as a secondary, purely technical problem. We assume that in some way a system of values for all the constants of the problem are given and our question is just: how can we verify whether such a system of values represents a solution to the problem? In pure mathematics the conditions were that inserting the values of the constants would satisfy the equations exactly; the answer cannot be that simple within the theory of observations: when we recalculate each individual observation with the constants found, deviations must in general remain. We can only demand that these deviations conform with the theoretical error laws in the shape these assume when inserting the constants. However, demanding that the deviations conform with the available error laws is too vague to be applied immediately as a criterion for correctness of the solution. If all the observations were so frequently repeated that one could sharply determine the theoretical error law of each series of repetitions, one would be able to understand and apply this criterion. With isolated observations one may be able to verify whether each individual observation falls within its standard deviation limits or, from a somewhat larger collection, whether the deviation of the calculated result too frequently exceeds the standard deviation limit or not, and whether this happens in a way that the skewness and other properties of the error law make acceptable. But all these verifications are uncertain and the question whether the extraordinarily favourable outcome of some verifications could balance the questionable outcome of others can hardly be answered as long as one is speaking about different or varying error laws.
Since the error laws of the individual observations are known when the constants of the solution are available, one can imagine a series of transformations whereby the individual real observations are conceived as functions each of its own fictitious observation with one given error law, say the exponential, where λ1 = λ3 = λ4 = ⋯ = 0 and just λ2 = 1, where these functions are determined (see page 96) such that the error law of the actual observation is exactly the error law of this function. Thus if the equations that connect all the observations and the desired constants are arranged such that, if all the constants were exactly determined, the numbers which represent the observations would all have this error law, λr = 0 except λ2 = 1, then for each system of values of the constants one could determine the deviations of the fictitious observations from this theory, and calculate actual error laws for all the deviations or arbitrary parts of them. In so far as the corresponding error laws of the method can be calculated, one can demand, as a criterion that the adjustment has been successfully done, that these error laws of the method should be the assumed exponential error law with mean equal to 0 and standard deviation equal to 1. Using this principle for the adjustment reduces the problem to the mentioned transition from actual error law to an error law of the method, but this latter difficulty remains. One can obviously not apply the previously given rules without modification. First of all, it is not immediately obvious whether the deviations of the functions of the original observations which have replaced them can be considered as simple repetitions of the same observation just because they have the same error law of the method. But also when nothing prevents this from being justified, a distinction remains. Secondly, because we conceive of this earlier determination of the error law of the method from a numerous series of repetitions as a problem of adjustment, it appears that this also has a special nature as there is only one constant to determine for the mean λ1, one for λ2, etc. However, the transformations used to reduce the error laws to a typical form depend in general on several constants, some of which will be associated with the mean, some with the standard deviation, etc., and one can easily appreciate that this must affect the transition from actual error law to error law of the method. I do not dare to give a definite opinion whether these difficulties could at all be solvable in a perfect way for a theory of adjustment with general validity; for the time being one must be content with solving various simpler adjustment problems by the so-called method of least squares. And to approach the description of this method we will first consider a case which could seem to be very special, even
non-practical, in which all the observations fall in two groups so that in one of these each observation by itself determines one of the unknown constants, whereas in the other group the observations are completely independent of these constants, as either theory or hypothesis gives a value for the mean of each observation. As an example of such a series of observations one can imagine having measured the three sides of a triangle simultaneously once, and also made a series of measurements of the angular sum of the triangle. The three measurements of distance represent the observations in the first group and they determine the constants of the triangle. The measurements of the angular sum represent the second group of observations and neither could give a contribution to a better determination of the distances, nor could they be affected by the determinations of the distances. As useless as these measurements of angular sums are for any particular determination, they could give excellent information about the error law of the applied measurement method. If the theory is correct that it is the angular sum which is measured, then its mean must be λ1 = 180°, and the deviations of the individual results from this value are, as long as this theory is maintained, the true and absolute errors of the observations. In such cases the latter part of the difficulties can be overcome rather easily.41 For it is clear that only the second group of observations have significance for determining the theoretical error law by the actual error law for the observations reduced to normal form; additionally, the μr of the actual error law can immediately be considered to be approximations to the λr of the theoretical error law, so that one should let λr = μr instead of using the formulae (3.14). In these cases one can only speak about the true value of the errors in exceptional cases. If the hypothesis is true, or as long as one is just considering it as being true, then the mean here is equal to 0 and the squared standard deviation is therefore determined as the mean of the squares of deviations from the hypothetical mean, λ3 as the mean of the third power of the deviations. After determining the theoretical error law, the concern is then whether the hypothesis can pass the test of agreement with the error law that is assumed, thus that the higher λ-values, in so far as these can be determined, should vanish, that λ2 = 1, and not least that λ1 = 0, and that all of this holds both for the entire group of observations and also for arbitrary (impartially chosen) subsets of these. This does not say anything about the first part of the difficulty, whether this group of reduced observations can be considered as simple repetitions of the
41. This argument leads to the invariance principle used later to justify the method of least squares, see page 150. tr.
same observation, and it is clear that this doubt becomes further aggravated if the reduced observations obviously have not had the simple character assumed here, being independent of the constants desired from the problem of adjustment, but were only generated by a process of elimination that made every single, reduced observation a function of not just one, but of several original observations.
3.3.2 The conditions for the method of least squares
The simplest case of adjustment that is to be treated in the sequel as an object of the method of least squares is the case where all the observations, reduced to having an error law of exponential form, can be considered as mutually independent, and where all equations between them are linear as well as the equations between the observations and the unknown constants of the adjustment, or where all these at least to a satisfactory degree of approximation can be expressed in a linear form by series expansion using Taylor's formula, with all terms of second and higher order being sufficiently small in comparison with the standard deviation for them to be ignored. If all error laws turned out to be exponential, not only according to preliminary investigations of the methods of observation, but also after the criticism that always should be made at the end of any adjustment, and if all the equations between them can be regarded as linear, or if—as usually would be the case—there is just not sufficient reason to reject these assumptions, then adjustment can be made and it is unnecessary to perform the earlier mentioned reduction of all error laws to the simple exponential form, where the mean is 0 and the standard deviation equal to 1, as this could be implemented just by changing the origin and scale for the measurements which would thus not disturb the linear form of the equations.
3.3.3 Weights for adjustment; normal values
If some among the collection of experiences for a problem of adjustment are mutually simple repetitions, the part of the adjustment which is concerned with determination of the constants can be simplified, by using the mean of the actual error law for each such series of repetitions as an observation with mean square error λ2(μ1) = λ2/m; when the error law is exponential this identifies what is needed. The number m of individual repetitions for the summary observation is then called the weight of the observation. Such a summary observation should be treated exactly as a proper observation, just with the modification that its standard deviation becomes 1/√m times as large as the standard deviation of each original observation which it replaces. Conversely one can thus also in this respect think of
the individual observations as being summaries of certain numbers of fictitious, equally good observations, the good observations being summaries of many and the less good ones of fewer, according to the rule that λ2 is inversely proportional to the number of these fictitious observations. Observations that refer to the same scale can in this way be referred to observations with standard deviation unity and thus 1/λ2 is the weight of the individual observation. Observations that only approximately could be considered as repetitions are often summarized in the same way, making first a preliminary easy adjustment, determining a common result with small standard deviation (thus a large weight). Ignoring less valuable side results from the preliminary adjustment, this can be used as a summarized observation in the final adjustment; such summarized observations are most commonly called normal values (normal places).42 However, in the part of the adjustment where hypotheses should be tested by determining the error law from the deviations, one cannot use this simplification in cases where the third or higher halfinvariants are investigated, and even where the investigation is limited to a study of the standard deviation, one must advise against using the summary observations or normal values, or at least demand that the determinations of mean deviations which could have been obtained by the adjustment are included in the analysis, and that the criteria which the adjustment should meet are set up accordingly.
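A minimal sketch of the weighting rule (the numbers are invented for illustration): a series of m repetitions is replaced by its mean, treated as one observation of weight m with mean square error λ2/m; on a scale of unit variance the weight of an observation is 1/λ2:

```python
# Sketch only: summary observations and weights.
import numpy as np

obs     = np.array([10.2, 9.8, 10.1, 10.4])  # summary observations (means of series)
weights = np.array([3.0, 1.0, 2.0, 5.0])     # numbers of (fictitious) repetitions

summary = np.sum(weights * obs) / np.sum(weights)  # weighted mean
lam2_one = 0.09                                    # lambda_2 of one unit-weight observation
lam2_summary = lam2_one / np.sum(weights)          # mean square error of the summary
print(round(summary, 4), lam2_summary)
```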
3.3.4 Special rules for elimination and determination of the unknowns in linear equations for observations
When an adjustment is to be made by the method of least squares, after the equations between the observations, assumed to have exponential error laws, have been expressed in a linear form and one has arranged these with normal values and weights, then the first task is a process of elimination where one obtains practically usable elements for the values of every individual mean of observations or the linearly related unknown numbers. As in the elimination of linear equations between exact numbers, one can also with observations solve all such tasks by exclusively using linear functions. But otherwise there is a significant difference between the treatment of such tasks in pure mathematics and in the theory of observations. In exact mathematics one has full and unrestricted freedom in the choice of transformations. When each unknown is just isolated from all others in its own equation, it does not matter how this goal is achieved; but when dealing with
42. See Farebrother (1999, pp. 214–16) for more about this ‘method of normal places’. tr.
such systems of supernumerary equations as here, the exact mathematical analysis would lead to declaring the problem as being overdetermined, not being able to make a choice between the different numerical results which can be obtained by different routes. However, in the theory of observations the freedom of choosing transformations is partly restricted; and when this limitation is taken into account, one is led to a single acceptable result in spite of the discrepancies between the observations. Thus when an author has recently raised the question whether in a given case one is justified in using the method of least squares rather than that of pure mathematics, and has answered in the negative, this must be due to a misunderstanding. If the available supernumerary linear equations are in mutual agreement, then it does not matter which route is taken; but if not, then one is outside the pure mathematics whether or not a personal observer has provided the equations, and a correct result can only be obtained by the method of least squares, the basic aspects of which are to be described now. There is a distinction between the error laws for real observations, which are here assumed to be independent, and the error laws for linear functions of these. Let us consider a number of linear functions u, v,…, w of the observations o1, o2,…, on, and let us as everywhere in the sequel give the functions a homogeneous form by subtracting the constant terms from the function itself, that is
Then, according to (2.45), the error laws are
but if we now consider a linear function of u, v,…, w,
then the difference shows up in the way that we cannot calculate the error law of U from the error laws of u, v, and w using (2.45), but one has to go back to the
original independent observations themselves
whereby one gets
which in general is different from
Linear functions of the same observations could each be considered as observations, but—naturally—they cannot always be considered as mutually independent observations. In general only the expressions for the means, r = 1, coincide for the two error laws. Here, where the error law of the individual observations is assumed to be exponential, that is with λr = 0 for r ≥ 3, both for observations and functions, the difference only matters for the mean square errors. But the linear functions u, v, …, w could have such a character that even in this respect there is no difference between them for any function of them U = eu + fv + ⋯ + gw with arbitrary coefficients e, f, …, g. To obtain that
all that is needed is that the product sums vanish on the left-hand side, since the quadratic terms are identical. When (3.16)
there is complete agreement in all the direct and indirect ways of calculation with the error laws, and nothing then prevents such functions u, v, …, w being considered and treated exactly as mutually independent observations, with exponential error laws whose mean square error is determined by the sums of squares (3.17)
By any elimination, in any problem concerning error laws for linear functions, it becomes completely irrelevant whether the original observations or such a system of functions are considered observed. I will call such functions free43 or, if necessary, mutually free functions. And the restriction, which the theory of observations makes on the freedom of exact mathematics to choose the linear functions, can now be formulated such that by adjustments, one has the right to transform the independent series of observations to an arbitrary system of mutually free functions, but only to such systems, if one wants to use the transformed observations to determine error laws.
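A small numerical check of this restriction (notation and numbers mine): for two linear functions with vanishing product sum [abλ], the indirect computation of the mean square error of eu + fv from those of u and v agrees with the direct computation from the observations:

```python
# Sketch only: two linear functions are 'free' when sum(a*b*lam2) == 0;
# then their error laws may be combined as if they were independent observations.
import numpy as np

lam2 = np.array([1.0, 2.0, 0.5])   # variances of the independent observations
a = np.array([1.0, -1.0, 0.0])
b = np.array([1.0, 0.5, -2.0])     # chosen so that the product sum vanishes
print(np.sum(a * b * lam2))        # 0.0 -> [ao] and [bo] are mutually free

e, f = 2.0, 3.0
direct   = np.sum((e * a + f * b) ** 2 * lam2)                      # via the observations
indirect = e**2 * np.sum(a**2 * lam2) + f**2 * np.sum(b**2 * lam2)  # via u, v alone
print(direct, indirect)            # equal precisely because the functions are free
```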
3.3.5 Theorems about systems of mutually free functions
Mutual freedom between two functions. Two linear functions44 [ao] and [bo] of the same observations are thus mutually free when [abλ2(o)] = [abλ2] = [abλ] = 0. Two functions with a constant ratio, [ao] and c[ao], cannot be mutually free, since the product sum here becomes equal to c[a²λ], and a sum of squares cannot vanish unless all its terms vanish. Functions of totally different observations are mutually free.
43. Thiele's free functions are what nowadays would be known as orthogonal linear combinations. The following theorems on free functions all have their counterparts in linear Euclidean geometry. tr.
44. In the following the word ‘function’ will always denote a linear function unless the opposite is stated explicitly. Thiele's footnote.
Mutual freedom between two systems of functions of the same observations. Two systems of linear functions of the same observations
could be said to be mutually free when each function in one system is free of all functions in the other system, thus when
Every function of the functions in one system will then also be free of any function in the other system. Making functions free from a given system by decomposing into the sum of a free and a linearly dependent function. In general any function [fo] can uniquely be decomposed into a sum of functions
of which one [f ′o] is free of a given system of functions [ao], [bo], …, [do], while the other is a function of these. The condition that [f′o] is free of [ao], [bo], …, [do] is
and according to these we get (3.18)
Any system of functions [ao], [bo], …, [do], [eo], [fo], …, [go] can in this way be replaced by two mutually free systems [ao], [bo], …, [do] and [e′o], [f′o], …, [g′o]. According to (3.18), none of the functions [e′o], [f′o], …, [g′o] which have been made free can be linear functions of [ao], [bo], …, [do] unless some of these are themselves functions of the others. But in this case the denominator on the right-hand side of (3.18) vanishes, and this equation becomes inapplicable for determination of the free functions until the system of functions has been cleansed for mutually dependent functions [ao], [bo], …, [do]. For if [uo] = x[ao] + y[bo] + ⋯ + z[do] should be free of [ao], [bo], …, [do], we would have
without x, y, …, z all being equal to 0, thus
and this demands that ξam + ηbm + ⋯ + ζdm = 0 for every m.
In particular, between two functions, any one of these can be replaced by a function which is free of the other. For [ao] and [bo] we thus have that c([abλ][ao] − [aaλ][bo]) is free of [ao] and c([bbλ][ao] − [abλ][bo]) is free of [bo]. A system of functions which is itself free. A system of functions is said to be free itself when each of its functions is free of all the others. Any system of given functions of the same observations can, if it is not originally free, be replaced by a free system and thereby also be cleansed for all mutually dependent functions. One can also obtain such complete freeness by arbitrarily choosing a function [ao] in or outside the system and making all others free of it by the rule that (3.19)
among the functions which are made free from [ao] in this way, one can again choose a [b′o] arbitrarily and make all others free of it in the same way
and with these functions, now free of both [ao] and [b′o], one can continue in the same way, whereby the remaining functions eventually are either free or have been reduced to being identically equal to zero. None of the functions [ao], [b′o], …, [d⁽ⁿ⁾o] made free in this way can be functions of the others; this follows directly from the fact that if some linear combination α[ao] + β[b′o] + ⋯ of them vanished identically,
then its mean square should on one side be equal to 0, and on the other side equal to α²[a²λ] + β²[(b′)²λ] + ⋯, which is obviously impossible when the observations have not been given exactly. Consequently the number of free functions in a system can never exceed the number of original observations upon which the system is based. If the number of free functions is equal to this maximum, the system is said to be complete. Theorems about complete systems of free functions. Orthogonal substitutions. In a complete system of n mutually free functions its n² coefficients will fulfil n(n − 1)/2 conditions. Thus n(n + 1)/2 coefficients would get arbitrary values, but not any arbitrary selection of that many coefficients. The properties of free systems, in particular complete systems, of mutually free functions are essentially
treated in the theory of orthogonal substitutions. When all the orthogonal coordinates are mutually independently observed and with identical mean square errors, orthogonal coordinates in any other arbitrary coordinate system are an example of a corresponding complete system of free functions and could be treated directly as independent and equally good observations. However, we could not be content with referring to the textbooks in this way, since several of the main theorems need to be given in the slightly more general form when not all observations are equally good. For the complete system of functions [ao], [bo], …, [do] one must have the following equations of conditions, where each should be conceived as being written twice and supplemented with the equations for the mean square errors:
From this arrangement one easily finds a theorem about the determinants of all coefficients
since
and thus (3.20)
It is also easy to determine the minors of the single coefficients of R. If these are denoted by Greek letters with the same index as the Latin letters denoting the corresponding coefficients, such that γi, with a suitable sign is the determinant of
the cth row and ith column, we have (3.21)
since then and only then one obtains that
Using these expressions for the minors, the transformation back to the original observations as a complete system of free functions of [ao], [bo], …, [do] is found to be (3.22)
As further
and
we get (3.23)
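The successive free-making described in (3.19) and used throughout this section is, in modern terms, Gram–Schmidt orthogonalization in the inner product weighted by the λ2 of the observations. A hedged sketch (function names and example data are mine):

```python
# Sketch only: produce a mutually free system by the rule (3.19), dropping
# any function that is reduced to zero (i.e. depends linearly on the others).
import numpy as np

def make_free(coeffs, lam2):
    free = []
    for row in np.asarray(coeffs, dtype=float):
        for g in free:
            row = row - np.sum(row * g * lam2) / np.sum(g * g * lam2) * g
        if np.sum(row * row * lam2) > 1e-12:
            free.append(row)
    return np.array(free)

lam2 = np.array([1.0, 1.0, 2.0])
system = [[1, 1, 0], [1, 0, 1], [0, 1, -1]]   # third row = first minus second
free = make_free(system, lam2)
print(free @ np.diag(lam2) @ free.T)          # diagonal: all product sums vanish
```

Note that the dependent third function is ‘cleansed’ away automatically, as in the text, and that the number of free functions can never exceed the number of observations.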
3.4 The principle of adjustment by the method of least squares
In Section 3.3.1, where we were concerned with the general problem of adjustment, a particular case was mentioned which had an obvious solution. While it is dubious whether any general problem of adjustment can be reduced to this case, it can be done without difficulty when the equations have the linear form that justifies use of
the method of least squares, and we have now developed the tools that are needed for doing so. For we can now replace any given collection of observations by a complete system of free linear functions, and choose this system in such a way that any given linear function can be made free of all other linear functions of the same observations. When theory or hypothesis implies that there are a number of linear relations between the means of the observations, we can separate these during the process; and while the functions in the complete system, where theory gives no information about their values, must be determined directly by the observations, for those, where theory gives a new value different from the observed, one must adhere strictly to the theoretical value. The correctness of this statement appears to me as self-evident;45 but this is not the principle which originally established the method of least squares, it is not the principle that lies behind the name of the method. This is derived from a theorem which requires the sum of squares of the remaining deviations to be at its minimum, a theorem that we shall prove below on page 157, and which moreover has not been considered suitable for assumption without proof in earlier expositions. But as all proofs made to establish the principle rest on the calculus of indirect46 probability, on the theory that one can calculate with the so-called probabilities for hypotheses in the same way as with probabilities conforming with their definition, I have had to reject the principle after Mr Bing had proved the incorrectness of this theory.
3.4.1 Adjustment by correlates
When the given collection of observations is to be summarized and used in such a way that there are no unknown constants, in the given equations between the observations or in the functions to be determined, which should first be determined or eliminated, then such a task is to be solved with adjustment by correlates.47 The complete system of functions of the observations is then arranged to be made free; both such functions whose values are theoretically given as well as others, including the observations themselves which are desired to be improved
45. Thiele thus argues for the method of least squares by invariance, reducing every linear adjustment problem to a canonical form, which has a self-evident solution. tr.
46. The statement is incorrect. Laplace (1811, 1812) and Gauss (1823) derive the method of least squares by a minimum variance argument (Hald 1998, sec. 20.3 and 21.4). tr.
47. ‘Adjustment by correlates’ refers to the case when a linear model is given in the form Hη = γ, where H and γ are known. The alternative is ‘adjustment by elements’, cf. page 161. tr.
according to the theoretical relations. And the functions should be arranged in such a way that those which are theoretically known are placed in the first rows; next such functions that it is interesting to determine, and finally and in fact quite pro forma, so many others that one can obtain a complete system of free functions; it will turn out that the observations themselves never need to be in this list. Let us represent these functions as
The first thing to be done is the preparation for the functions to be made free and the investigation whether these functions or some of them are mutually free by calculating the mean square errors and the sums of products; this is in practice done by adjoining the sums of squares and products to the theoretical and observed values of the functions in a table as (3.24)
In so far as the first function does not turn out to be free of the others by having [abλ] = [acλ] = [adλ] = 0, all numbers in the first row are to be multiplied by β, γ, and δ respectively, and the products then subtracted from the corresponding numbers in the rows of these factors. In the table that then appears (3.25)
it holds that [b′o] = [bo] − β[ao], [c′o] = [co] − γ[ao], and [d′o] = [do] − δ[ao] are all free of [ao] as shown in (3.19), and b′i = bi − βai, c′i = ci − γai, d′i = di − δai
are their coefficients; but the sums of products and squares of the coefficients of these new functions can also be found directly calculated at their places in the second table; because, when for example [b′²λ] has been calculated as equal to [b²λ] − β[abλ] = [bb′λ], then this value is identical to [b′²λ], because [b′o] is free of [ao] and thus [ab′λ] = 0. And similarly for the sums of products, since [b′c′λ] has been calculated as equal to [bcλ] − γ[abλ] = [bc′λ], which, for the same reason, is equal to [b′c′λ]. If now [b′o], [c′o], and [d′o] are not mutually free, we proceed in the same fashion, calculating the factors γ′ and δ′, which are then used to separate the next theoretically known function [b′o], which is free of [ao], and free of both free functions [c″o] and [d″o] in the table (3.26)
since
which eventually gives a table with only one line (3.27)
After executing these operations the original functions [ao], [bo], [co], and [do] can be recovered as (3.28)
expressed as homogeneous linear functions of the mutually free functions [ao], [b′o], [c″o], and [d‴o]. The main thing in this operation is that the first part of the series for the free functions will represent only the theoretically given functions and represent all of them. It is not permitted either to forget any theoretically given condition among the observations, or to mix the theoretically given functions with those that are to be determined by the observations, because only in this way will the theoretically known functions become free of the others; in contrast, it is far from always necessary to continue to make a complete system of free functions, and one can omit the supplementary unimportant functions represented by [do].
3.4.2 Determination of the adjusted means
As the free functions are conceived of as mutually independent observations, the observations [ao], …, [b′o], as many as there are theoretical values A, …, ℬ′, must give way to these, and in principle they only supply the means for criticizing the theory and the error laws used. If any of the free functions [c″o], [d‴o] have otherwise been of direct significance, then their value as calculated from the observation must definitively be adhered to and is therefore not affected by adjustment and theory; otherwise the results to be obtained from the adjustment are calculated as linear functions of the free functions that are only observed and of the theoretical values A for [ao], ℬ′ for [b′o], for the remaining free functions. Thus we get (3.29)
as adjusted values. In general, for any function of the observations, the adjusted value which agrees with theory is found by expressing it in terms of free functions and then replacing observation with theory as far as this reaches. In particular we get for the individual observations (3.30)
and thus for the adjusted value (3.31)
and consequently simpler (3.32)
showing that when the functions are made free during adjustment of the observations, one needs to go no further than to those given theoretically. One can still calculate the adjusted mean of any arbitrary function, even if it has not been placed explicitly in the table.
3.4.3 Determination of the standard deviation after adjustment
To calculate the standard deviation of the adjusted values one must notice that the system [ao], [b′o], [c″o], and [d‴o] preserves its property of being a free system
when some of these are replaced by theoretical values, with the modification that they have mean square error equal to 0 after adjustment. Thus (3.29) implies that (3.33)
and (3.30) that (3.34)
Even for these formulae it is unnecessary to pursue the calculation of c″i, d‴i and so on, that is, to carry through the transformation to free functions, any further than the region of theoretically given functions; only it must somehow be ensured that all functions for which the standard deviation is needed after the adjustment are made free of the functions theoretically given, and that the coefficients are known for those parts which according to (3.18) are linear functions of these. This demand can usually only be fulfilled for the observations themselves in a simpler way than by representation in the table. Using the standard deviation of the functions before the adjustment, but supplemented with the coefficients γ, γ′, δ, δ′ in (3.29) and in general h and k, which necessarily must be determined, we have
and in general, when O = h[ao] + k[b′o] + l[c″o] + m[d‴o], the mean square error can be calculated as (3.35)
But for the mean square error of the observations themselves we have from (3.23) and (3.34) (3.36)
As these formulae show, the adjustment achieves in general a reduction of the standard deviations. But after the adjustment the improved observational values ui will in general no longer be mutually free, because, although both the observed and adjusted functions [ao] … [d‴o] are free as well as oi, the reduction in the standard deviations of the adjusted functions changes the condition for the free functions to be free themselves.
3.4.4 Error criticism
From the equations (3.32) we find, since [abλ] = 0, by squaring and adding, that (3.37)
The individual terms in this equation are thus the ratios of the squared differences between the observed and adjusted values, and between the observed and theoretical values of the free functions, to the corresponding mean square errors. These are the numbers that form the basis for testing the theory and hypothesis, the error criticism.48 Where real observations or free functions are compared with the pure theoretical values which, if the hypothesis is true, should be equal to their means λ1 in their error law of method, the squares of their differences should on average be equal to the λ2 of their error law of method, which are the denominators of these ratios. Thus on the right-hand side of the equation (3.37) each term must on the average have the value 1, and as the number of terms of this kind is equal to the number of theoretical conditions, m, the right-hand side of this equation must be close to m if the hypothesis is to be maintained. But the left-hand side of the equation has more, n, terms of a similar type, and these must therefore on the average be less than 1, clearly because the differences in the numerator are not between pure theoretical values and observations, but between these and the adjusted values, which are influenced both by theory and observation, and which are not mutually free. It is often important to know which value each individual term (ui − oi)²/λ2(oi) ought to have to maintain theory and hypothesis, but for this purpose one can in no way use the number m/n, which is the average value of all the terms on the left-hand side of (3.37). If the formulae used for calculating the mean squares λ2(oi) prior to the adjustment are given in such a way that they do not contain any unknown constant, an error criticism for confirmation of the hypothesis can be based upon the agreement between the theoretically known functions and their observed values, on having (3.38)
However, if the outcome of this test is not satisfactory so that, through the character of the deviations, one must seek information about the formation of a better
48. Today we would use terms such as ‘model criticism’, ‘model diagnostics’, or ‘residual analysis’. Thiele has a very strong emphasis on this activity in all his writings, which was quite ahead of his time. tr.
hypothesis, or include unknown constants in the λ2(oi), which is extremely frequent in practice as one far from always has enough pure repetitions to calculate the standard deviations from these, or if eventually the question becomes the deviation of the error law from the exponential form, then this summary equation is not adequate, and the possibility of the necessary determination of constants and of the testing rests on the fact that for each individual observation one can not only calculate λ2(ui) and λ2(oi), but also λ2(ui − oi). Since A, …, ℬ′ are theoretically given values in the equation (3.32) for ui − oi, it holds that
then according to (3.36)
and consequently the average value of the individual ratio of the squared error to the λ2(oi) of the observation49 will be (3.39)
For quite isolated single observations one cannot trust this determination, since its uncertainty is considerable; but it is then possible to partition the series of observations into arbitrary groups and test the sums of squares (3.40)
in each of them, thereby obtaining what is needed to determine the unknown constants in λ2(oi) and also for the test of the hypothesis. However, as long as it is not proved that the individual (ui − oi)²/λ2(oi) can be considered as mutually independently observed in relation to the determination of constants and adjustments concerning the mean square errors, one should be careful with such a partitioning of the sums of squares and if at all possible one should only do it when the series of observations in a natural way can be partitioned into subgroups of essentially different character (e.g. quasi-systematic50 errors).
49. We would have set an expectation sign on the left-hand side of this equation. Thiele identifies here what we would call squared, standardized residuals. tr.
50. Thiele is most probably thinking of the model used in his first article on least squares (Thiele 1880), Chapter two of this volume. tr.
3.4.5 The method of least squares as a minimization problem
If the sums of squares
in a correctly calculated adjustment—the equation itself is a test of the correctness of the calculations—shows too large a value, it would be fruitless to seek to reduce it by replacing the set of u-values with any other set of values vi that satisfies the theoretical conditions. For the system ui which has been found has the property that it sets this sum of squares at its minimum. This property, which hitherto always51 has been used as the principle of adjustment, has given the method its name: the method of least squares. According to one of the previous main theorems about free functions, the equations (3.23), we have identically for any arbitrary vi that
but if vi should fulfil the theoretical conditions
then, comparing with (3.37), we get that the only and absolute condition for the minimum is that further
whereby all non-constant terms in the first line of the identity are made to vanish, but then the minimum must be attained for ui = vi.
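As a hedged modern sketch of this minimum property (example and notation mine, echoing the angular-sum illustration earlier in the chapter): the adjusted values are the weighted projection of the observations onto the theoretical conditions, and they minimize Σ(v − o)²/λ2 among all v satisfying those conditions:

```python
# Sketch only: adjusted values as the weighted least-squares projection
# onto the linear condition H v = g (here: three angles summing to 180).
import numpy as np

o    = np.array([59.8, 60.3, 60.5])   # observed angles
lam2 = np.array([1.0, 1.0, 1.0])      # their mean square errors
H    = np.array([[1.0, 1.0, 1.0]])
g    = np.array([180.0])

HLH = (H * lam2) @ H.T                # H diag(lam2) H'
u = o - lam2 * (H.T @ np.linalg.solve(HLH, H @ o - g))
print(u, u.sum())                     # [59.6 60.1 60.3] 180.0
```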
3.4.6 An example
As an example of adjustment by correlates we could treat the case where there are mutually independent measurements of all distances between four points on a straight line. Each of the measurements is assumed to be performed with a standard deviation of √(3q). Between the six observations, which we could denote
51. This is incorrect. See footnote on page 150. tr.
by double subscripts after the endpoints
there are theoretical relations which most naturally are expressed in the form
Without using the obvious dependence between these equations that, not just in theory but also for arbitrary observations, we have ω1 + ω2 + ω3 + ω4 = 0, we form the table

              Coefficients                          Sums of squares and products
Th.   Obs.    12    13    14    23    24    34
 0    ω4       1    −1     0     1     0     0       9q   −3q   −3q   −3q
 0    ω3      −1     0     1     0    −1     0      −3q    9q   −3q   −3q
 0    ω2       0     1    −1     0     0     1      −3q   −3q    9q   −3q
 0    ω1       0     0     0    −1     1    −1      −3q   −3q   −3q    9q
As the adjusted results are not sought for any other function than the observations themselves, we can refrain here from completing the table and only include the most necessary elements; multiplying the first row by β, γ, and δ, and subtracting from respectively the second, third and fourth row yields

Th.     12     13     14     23     24    34
 0     −2/3   −1/3     1     1/3    −1     0        8q   −4q   −4q
 0      1/3    2/3    −1     1/3     0     1       −4q    8q   −4q
 0      1/3   −1/3     0    −2/3     1    −1       −4q   −4q    8q
Making the second function free gives

Th.     12     13     14     23     24    34
 0      0      1/2   −1/2    1/2   −1/2    1        6q   −6q
 0      0     −1/2    1/2   −1/2    1/2   −1       −6q    6q
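The tables above can be re-run numerically; the following sketch (with q set to 1, and the arrangement mine) reproduces the sums 9q/−3q, then 8q/−4q, then 6q/−6q, and finally a vanishing row:

```python
# Sketch only: successive free-making on the sums of squares and products
# of omega_4, omega_3, omega_2, omega_1 (each distance with lambda_2 = 3q).
import numpy as np

q = 1.0
lam2 = np.full(6, 3 * q)
#                  o12 o13 o14 o23 o24 o34
omega = np.array([[ 1, -1,  0,  1,  0,  0],   # omega_4
                  [-1,  0,  1,  0, -1,  0],   # omega_3
                  [ 0,  1, -1,  0,  0,  1],   # omega_2
                  [ 0,  0,  0, -1,  1, -1]],  # omega_1
                 dtype=float)

S = (omega * lam2) @ omega.T          # 9q on the diagonal, -3q elsewhere
print(S)
for _ in range(3):                    # make one function free at a time
    S = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / S[0, 0]
    print(S)                          # 8q/-4q, then 6q/-6q, then 0
```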
and now the nullity of the computational process discloses the dependence between the relations which was initially overlooked. Using the top rows of each table we
next get, using (3.32),
Setting, for the sake of simplicity,
whereby x1, x2, x3, and x4 would be the abscissae of the points calculated from their centre of gravity, and inserting the expressions for ω2, ω3, and ω4 in terms of the observations, gives
According to (3.36) we next get
This result could have been anticipated because of the symmetry between the points; as there are three conditions between such six observations, the mean square error of all of them must be reduced by one-half. This way of solving the problem is not the most elegant, but for that reason more informative than the following, which is based upon a simple trick which unfortunately is not universally applicable, but which on the other hand can illustrate other
details when we try to search for the adjustment result for the general linear function [ao] of the six measurements. As the theoretically given functions one should here have chosen three non-identical sums of ω4, ω3, ω2, and ω1,
together with the function of interest
The table then becomes

Th.   Observation                   12    13    14    23    24    34     Sums of sq. and prod.
 0    o12 − o14 + o23 + o34          1     0    −1     1     0     1     12q    0     0     A1
 0    o12 − o13 + o24 − o34          1    −1     0     0     1    −1      0    12q    0     A2
 0    o13 − o14 − o23 + o24          0     1    −1    −1     1     0      0     0    12q    A3
      [ao]                          a12   a13   a14   a23   a24   a34    A1    A2    A3   3[aa]q
where
The three theoretically given functions are then immediately mutually free, and the function [ao] can therefore be made free in one operation by subtracting the first three rows, multiplied by A1/12q, A2/12q, and A3/12q, from the fourth row, which, again introducing the abscissae x1, x2, x3, and x4, gives
with the mean square error
or
From the two expressions for λ2U[ao] one can derive the conditions for [ao] to belong to either the system of theoretical functions or to the free part of the system that is not affected by the adjustment. In the first case we must have λ2U[ao] = 0 and thus
of which each can be derived from the other three. In the second case the adjustment must not have reduced the mean square errors of [ao], thus we have λ2U[ao] = λ2[ao] = 3q[aa] and consequently
These conditions are for example fulfilled by the abscissae x1, x2, x3, and x4, which were chosen for the sake of simplicity; and their common mean square error is found to be λ2 = 9q/16. These abscissae are not mutually free functions. Such freeness would have been the case had the abscissae been measured from an arbitrary origin and not from the centre of gravity, under the assumption that the abscissa o of the centre of gravity had been observed independently of the measurement of the lengths with a mean λ1(o) = 0 and λ2(o) = 3q.
3.5 Adjustment by elements
If the theory on the observations is given in the form that the mean λ1(oi) is a linear function of a number n − m of unknown elements to be determined, this number being in general smaller than the number of observations, n, then for convenience one needs a modified method of adjustment which is called adjustment by elements.52
52. ‘Adjustment by elements’ refers to the case when a linear model is given in the form η = Xβ, where the design matrix X (the matrix of elements) is known. The alternative is ‘adjustment by correlates’, cf. page 150. tr.
If the desired elements are x, y, …, z and the equations take the form (3.41)
with n(n − m) theoretically given constants p1 … rn, it is clearly seen that the adjustment could have been performed as adjustment by correlates by making an arbitrary selection among the equations (3.41) to determine x, y, …, z as functions of n − m observations, inserting these expressions into the remaining equations, expressing these in the simple form which adjustment by correlates demands from its theoretical equations of conditions, and finally including the explicit equations for the elements in the table. One would have to proceed in this way in the most common cases of adjustment (including adjustment by both elements and correlates) with the method of least squares, namely in such cases where instead of the left-hand sides in the equations (3.41) there had been functions of at most all n observations. In such cases, which can luckily often be avoided, one could be forced to transform the entire system of observations to a complete system of n free functions. One could then keep a number n − m of original equations [c′o] = f1(x, y, …, z), necessary for determining the elements, in unchanged form; the remaining equations would then by elimination of elements be changed to a system of m theoretical equations of conditions [ao] = 0. From these two systems one should then, as described in Section 3.3.5, page 144, make the first [c′o] free of the other [ao] and replace it with n − m of the free equations of the theory [co] = f(x, y, …, z), from which the elements could be calculated from their observed values without theoretical changes. But in both the special cases this is only an unnecessary detour. In adjustment by correlates the equations [ao] are given and, in so far as no other elements or functions are to be determined, this all reduces to making this system free; and in adjustment by elements the particular simple form of the equations (3.41) introduces a possibility of arranging the functions [co], without explicitly representing the theoretical equations [ao] = 0, in such a form that they become free of [ao] and therefore unaffected by the substitution of their observed values with the theoretical values [ao] = 0. The theoretical equations of conditions appearing from the elimination of the elements from (3.41) must be linear and have the form
and if such equations should be produced by elimination of elements in (3.41), then the identity
shows that
must hold; conversely any system of n values a that satisfies these equations can be considered as belonging to the system of theoretical equations. Such equations [ao], for which the theoretical value was given, would have been placed up front in an adjustment by correlates. In the second row one would have placed such systems of functions [co], to be used for calculating the elements, which were free of these and should be determined by observations only. In adjustment by elements one bypasses the coefficients a and proceeds directly to the determination of the coefficients c in the functions that do not depend on those theoretically given, but only on the observations. But if a function [co] should be free of all the functions [ao], we must for all a have that
is identical to
and thus it must hold that
for some system of values for φ, ψ, …, ω. Conversely such ci would make [co] free of every [ao]. Every such function [co] will thus be computable from the observations alone, without taking into account the equations of conditions [ao], which therefore do not have to be specified. The simplest complete system [co] is the following, which appears by letting one of φ, ψ, …, ω equal 1 and the others
equal 0: (3.42)
and this would comprise as many equations as there are unknown elements and thus in general be sufficient for their determination, and precisely for the one correct determination: for whereas the theoretical demands are fulfilled as soon as the elements are given just one value, this and only this particular solution agrees with the observed with respect to everything in the observation that is free of the theory. For any other function which would not belong to the class [co], the observed value should be corrected according to the theory; only [co] remains unchanged by the adjustment.
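In modern matrix notation (a sketch under my reading of (3.42), with invented numbers): writing the theory as λ1(o) = Px and W = diag(1/λ2), the principal equations are the weighted normal equations PᵀWo = PᵀWPx:

```python
# Sketch only: adjustment by elements as weighted least squares.
import numpy as np

P    = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])   # e.g. two lengths and their sum are measured
o    = np.array([2.1, 3.0, 5.2])
lam2 = np.array([0.5, 0.5, 1.0])
W    = np.diag(1.0 / lam2)

A = P.T @ W @ P                 # coefficients of the principal equations (3.42)
x = np.linalg.solve(A, P.T @ W @ o)
print(x, np.diag(np.linalg.inv(A)))   # elements and their mean square errors
```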
3.5.1 The table for adjustment by elements: simultaneously eliminating and making the functions free
As the equations (3.42) give what is needed for determining the elements in the unique way that completely satisfies the justified demands of the observations, we have through these the complete basis for adjustment by elements; the table for implementation of adjustment by elements is made to calculate the coefficients in (3.42) and is best represented by precisely these equations.
to be free, namely (3.43)
while [pq/λ2] = 0 will show that [po/λ2] is free of [qo/λ2]. If all these sums of products vanish, each equation determines its element as observed free of the others and
having mean square error
In general one would not have the elements themselves but only certain functions of these observed as free; for the determination of mean square errors of arbitrary functions of the elements it is thus necessary to replace the elements and the equations (3.42) by functions that are observed free and their corresponding equations, though this can be done directly by solving the equations (3.42). As free functions one can always choose one which depends on up to all n − m elements, one where one element is missing, one where two elements are missing, and so on. If for example one chooses [po/λ2] to be the first of these functions, then its relation to the element x is such that this must be the element which should be eliminated from the other functions, and this is done by multiplying the first equation in (3.42) by respectively (3.44)
and subtracting from each of the others; if we then write (3.45)
the following functions are formed (cf. the similar operation in adjustment by correlates (3.24), (3.25), (3.26), and (3.27)),53 which are free of [po/λ2]: (3.46)
Here for example [q′o/λ2] is free of [po/λ2] because
53 The equation numbers in Thiele's original correspond to one non-existent equation as well as eqns (3.26) and (3.27). tr.
If for example [q′o/λ2] is then chosen to be made free, y is eliminated by multiplying this by the factors
and if we write (3.47)
and so on, we eventually obtain as free functions and equations (3.48)
so that (3.49)
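In modern matrix terms (a sketch, not Thiele's own formulation) the whole of (3.42)-(3.49) is the formation and triangular factorization of the weighted normal equations; the forward-substituted values z play the role of the mutually free functions. Names and layout below are illustrative.

```python
import numpy as np

def adjust_by_elements(A, o, lam2):
    """A: n x m array of coefficients p, q, ..., r; o: n observations;
    lam2: mean square errors lambda^2 of the observations."""
    W = np.diag(1.0 / lam2)
    N = A.T @ W @ A            # sums of squares and products [pq/lam2], cf. (3.43)
    h = A.T @ W @ o            # right-hand sides [po/lam2] of eqns (3.42)
    L = np.linalg.cholesky(N)  # the successive elimination (3.44)-(3.47)
    z = np.linalg.solve(L, h)  # values of the mutually free functions, cf. (3.48)
    x = np.linalg.solve(L.T, z)  # the adjusted elements
    return x, z, L
```

Cholesky factorization is exactly the repeated subtraction described in the text, carried out on the symmetric array of coefficients.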
3.5.2 Determining the adjusted values for the elements or for given functions of these
By establishing the equations (3.48) we have performed what is common to any adjustment by elements; but if luck or skill in choosing the procedure has not ensured the immediate presence of all the elements and functions of elements which are so interesting that one would like to know both their adjusted means and standard deviations, then it is necessary to expand every such function F = Ax + By + ⋯ + Dz as a function of the free functions (3.50)
and thus calculate the coefficients α, β, …, δ from the equations (3.51)
the mean is then determined by (3.50) and the mean square error by (3.52)
This method for applying the results (3.48) and (3.49) of adjustment by elements will in particular be useful in cases where one has circumvented a difficult adjustment by correlates, with almost as many theoretical conditions m as observations n, by considering n − m of the observations as elements, such that the others are theoretically given functions of these, and then adjusting by elements. Conversely one would use adjustment by correlates instead of elements when m is small compared with n. But in proper problems of adjustment by elements one wishes to determine not only a single function of the elements, but most often the entire system of elements, and sometimes even more functions of these, and then the procedure given in (3.51) becomes inadequate. In such cases one should compute all the elements x, y, …, z from (3.48), but preferably, and definitely if the mean square errors should be determined, the calculations should not be done with the numerical values, but with the algebraic symbols for the left-hand sides of the equations, the mutually free functions of the observations, and in such a way that when calculating the coefficients one obtains all the elements expressed as functions (3.50) of free functions. These expressions can then be inserted into the other F = Ax + By + ⋯ + Dz to put these also into the form (3.50), whereafter the numerical calculations of the means and the standard deviations after (3.52) can be performed.
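A hedged sketch of this step in the same matrix terms: the adjusted mean of F = Ax + By + ⋯ + Dz and its mean square error (3.52) can be obtained from the inverse of the normal-equation matrix, which is equivalent to the expansion (3.50) in free functions. Variable names are my own.

```python
import numpy as np

def adjusted_function(a, N, x_hat):
    """a: coefficients (A, B, ..., D) of F in the elements;
    N: normal-equation matrix A^T W A; x_hat: adjusted elements."""
    mean_F = a @ x_hat                  # adjusted mean of F, via (3.50)
    lam2_F = a @ np.linalg.solve(N, a)  # mean square error of F, cf. (3.52)
    return mean_F, lam2_F
```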
3.5.3 The adjusted values for the observations
However, with numerous repetitions this work can become quite cumbersome. It is much easier to determine x, y, …, z, or rather their means, numerically from
the equations (3.48), and then from these one can calculate the adjusted means for any function of the elements, in particular the adjusted values of the observations ui; but in this way one does not get information about the standard deviations after adjustment. This becomes particularly oppressive in adjustments where error criticism must be made, and in particular where the λ2(oi) are only given as depending on unknown constants, which, as we have seen under adjustment by correlates in Section 3.4.4, demands determination of partial sums, ∑(oi − ui)2/λ2(oi), where the λ2(ui) are indispensable for calculation of the corresponding theoretical values. In this case, that is for the observations themselves, some relief can be achieved; for when F = oi we determine
so that (3.53)
and (3.54)
One can obtain tolerably simple calculations by computing
from each individual coefficient of the observations.
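For the adjusted observations themselves, (3.53)-(3.54) amount in matrix form to applying the 'hat' projection; a hedged sketch under the same conventions as above:

```python
import numpy as np

def adjusted_observations(A, o, lam2):
    """Adjusted values u_i and their mean square errors lambda^2(u_i)."""
    W = np.diag(1.0 / lam2)
    N = A.T @ W @ A
    x_hat = np.linalg.solve(N, A.T @ W @ o)
    u = A @ x_hat                                  # adjusted observations, cf. (3.53)
    Ninv = np.linalg.inv(N)
    lam2_u = np.einsum('ij,jk,ik->i', A, Ninv, A)  # lambda^2(u_i), cf. (3.54)
    return u, lam2_u
```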
3.5.4 Error criticism and the minimization theorem
Since the particular point of adjustment by elements is precisely to avoid a detailed consideration of the functions that are both observed and theoretically given, one cannot as in the case of adjustment by correlates directly test the theories about
means on the system of free [ao] functions. But as both methods of adjustment are identical in principle and one in both cases knows the differences ui − oi between the adjusted and original values of the proper observations, the sums (3.55)
remain the starting point of the error criticism, and for the total sum, for which the theoretical value is , one can also establish here its minimal property (and thereby obtain an alternative proof of the identity of the methods) and give convenient formulae for the calculation of this sum of squares, not just in terms of the observations, but also in terms of the elements and free functions. For arbitrary values of x, y, …, z and the corresponding values vi of the observed functions, we have
By differentiation with respect to x, y, …, z the equations for the minimum are obtained, in complete agreement with the main equations (3.42) for adjustment by elements. If these are assumed fulfilled, so that vi = ui are the adjusted values, then (3.56)
which can thus be used for calculation when [o2/λ2] has been computed in addition to the sums of squares and products desired earlier, on the assumption that one will carry the adjustment through to calculation of the elements x, y, …, z
themselves. However, one can immediately calculate [(o − u)2/λ2] when the operation of making the functions free has been performed and the equations (3.48) are available. Simultaneously, while making the functions free by the coefficients φ, ψ, …, ω, one can eliminate x, y, …, z from (3.42) and (3.56), the elimination of x giving
continuing the elimination eventually yields (3.57)
where the individual terms show how much each term in (3.53), expressing the observations as a series expansion after free ‘expansion functions’, contributes to reducing the sum of squares of the residual deviations.
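The minimal property and the decomposition (3.57) are easy to check numerically: the weighted sum of squared deviations equals [oo/λ2] minus the sum of the squared free-function values, each term measuring one expansion function's contribution. A sketch with invented illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 3))      # illustrative coefficients
o = rng.normal(size=12)           # illustrative observations
W = np.eye(12)                    # unit weights for simplicity
N, h = A.T @ W @ A, A.T @ W @ o
L = np.linalg.cholesky(N)
z = np.linalg.solve(L, h)         # values of the free functions
x = np.linalg.solve(L.T, z)
u = A @ x
rss = (o - u) @ W @ (o - u)       # [(o - u)^2 / lam^2]
assert np.isclose(rss, o @ W @ o - np.sum(z**2))   # eqn (3.57)
```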
3.5.5 Undetermined equations, tests, and the method of fictitious observations
In general all the elements are determined by the adjustment. It is only in special possible cases that the equations (3.42) could give undetermined results so that one or more of the equations (3.48) reduce to identities. The determinant of the coefficients in (3.42) can by a well-known theorem be written as a sum of squares of determinants for arbitrary combinations of n − m of the original equations (3.41) with positive factors. It is only when the expressions for all the individual observations in terms of the elements could have been expressed linearly as functions of fewer elements that undeterminedness can occur; thus in such cases, by performing the adjustment via transfer to the method of adjustment by correlates, one would have had the possibility of forgetting one of the theoretical conditions. In the form of adjustment by elements this exceptional case does not do any harm other than having the effect of only determining those elements which in fact can be determined from the observations. The process of making the functions free as in (3.48) weeds out the superfluous equations and transfers them from the class of observed equations to those given theoretically; and from the equations (3.48) one finds all functions of the elements which can be determined and which have a dependency relationship with the observations.
One even uses artificially produced undeterminedness by adding elements for which one will later insert the value 0, to get an effective check on some of the long calculations. Such an element can just be inserted in all the equations (3.41), multiplied by a coefficient for which one has chosen an easily computable linear function of the coefficients of the other elements p, q, …, r; one then has an effective and easily computable control that can be performed during the entire process of calculation. An involuntary undeterminedness does not lead to quite the same advantages. Just as in exact mathematics one sometimes introduces additional unknowns to give the equations and their solutions a more symmetric form, this is also done in the method of least squares and in particular under adjustment by elements. Through introduction of additional elements one can often give the equations (3.41) and thereby (3.42) relatively simple forms which, apart from the control which is gained, make the entire calculation easier. And not only can one remove the undeterminedness when desired, by giving arbitrary values to the superfluous elements, but this can also often be used to make the solution simpler. In this respect one can take advantage of what could be called the method of fictitious observations:54 assigning arbitrary values to both mean and standard deviation for such functions of the elements that are neither observed nor derivable from the expressions for the real observations in (3.41), but treating them exactly as real observations. When such inventions are only used to remedy proper undeterminedness, this method cannot make any real difference to any result of the adjustment that could be obtained by the original, undetermined equations. And as a sign that licentia poetica has not been misused in any particular case, that is, that no more or other functions of the elements have been invented than what the problem allows, the error criticism yields the following method of control. When O denotes a fictitious observation with the invented mean square error λ2(O), it should hold after adjustment that U(O) = O and that λ2(U(O)) = λ2(O). According to (3.39) such inventions will have no influence at all on the numbers that are important for the error criticism.
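A hedged sketch of this device on a deliberately undetermined problem (the two-way model of Example 2 below with two stars and two threads; all data invented): appending one fictitious observation removes the undeterminedness, and the control U(O) = O verifies that the fiction is admissible.

```python
import numpy as np

# Rows express o_nm = g_n + f_m; unknowns ordered (g1, g2, f1, f2).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], float)
o = np.array([10.2, 11.1, 20.3, 21.0])   # invented observed times
c = np.array([0., 0., 1., 1.])           # fictitious function f1 + f2
A_f = np.vstack([A, c])
o_f = np.append(o, 0.0)                  # invented value O = 0
x, *_ = np.linalg.lstsq(A_f, o_f, rcond=None)
assert np.isclose(c @ x, 0.0)            # control: U(O) = O exactly
print(x[3] - x[2])                       # f2 - f1 is determined by the data alone
```

Because the fictitious row has a component along the null space of the real equations, adjusting it costs nothing: its residual is exactly zero and no determination obtainable from the real observations is disturbed.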
3.5.6 Examples
Example 1. A number n of observations or are to be adjusted after a function of degree 1 of a circumstance r and with given weights; thus under the assumption
54 This 'method of fictitious observations' is an alternative to using generalized inverse matrices (Rao and Mitra 1971). tr.
that their means are equal to λ1(or) = x + yr and λ2(or) = qr. We then have
If the first equation is kept (with the coefficients of x) and the second is made free by subtracting [r/qr]/[1/qr] times the first equation, we get
and for the individual observation
and consequently
and
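A minimal numerical sketch of this fit (data and q-values invented): the second principal equation is made free of the first by subtracting [r/qr]/[1/qr] times it, after which each equation determines its element.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])   # the circumstance r
o = np.array([2.1, 3.9, 6.2, 7.8])   # invented observations o_r
q = np.array([0.1, 0.1, 0.2, 0.2])   # given mean square errors q_r
w = 1.0 / q
rbar = (w @ r) / w.sum()             # [r/q] / [1/q]
y = (w * (r - rbar)) @ o / (w @ (r - rbar) ** 2)   # from the freed equation
x = ((w @ o) - y * (w @ r)) / w.sum()              # from the first principal equation
lam2_y = 1.0 / (w @ (r - rbar) ** 2)               # mean square error of y
lam2_u_at_rbar = 1.0 / w.sum()                     # mean square error of u at r = rbar
print(x, y)
```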
In this and similar cases one can replace the entire series of observations with the necessary number, here 2, of fictitious observations, for an arbitrary pair of mutually free functions, independent of theoretical conditions, would have the form α + yβ and thus for r = α/β not be essentially different from observations of the same kind as others. If we denote such fictitious observations by o′ and o″
and their λ2 by q′ and q″, and if r′ and r″ are the corresponding values of the affecting circumstance, it must hold that
r′ or r″ can be chosen arbitrarily but there is involution between r′ and r″
By r′ and r″ one finds the values for o′ and o″ just as by r one found ur and λ2(ur). The value corresponding to r″ = ∞ is r′ = [r/qr]/[1/qr], corresponding to the o′ value [or/qr]/[1/qr], which is apparent from the main equation formed with the coefficients of x. The graphical representation of ur as a function of r is a straight line; the diameter of the hyperbola that indicates the limits for the values which are acceptable after determination of the standard deviation is equal to .
Example 2. A number st of observations on,m indicate the times at which s stars pass the same t parallel threads. If the nth star is assumed to pass an ideal central thread at the moment gn with known speed hn, and the position of the mth thread in relation to the central thread is denoted by fm, theory thus implies that55 (3.58)
where the elements to be determined are the s-values gn and the t-values fm. The mean square error is assumed to be the same for all threads, but dependent on the different stars through their speed (3.59)
The s + t elements gn and fm are distributed here such that in each of the theoretical equations only two elements appear, one gn and one fm. As these
55 This example is an instance of what is known today as the additive model for two-way classification, apparently first studied by Edgeworth (1885); see Stigler (1978). tr.
equations can be arranged in clearly separated rows, t equations with g1, t others with g2, and so on, a rather important and almost obvious theorem applies: the summation by which the principal equations (3.42) of adjustment by elements are formed from the theoretically given equations (3.41) can be executed in parts; and to the extent that the theoretical equations in one or several groups are the only equations which contain one or several elements, the elimination of such elements by the process described in (3.44), (3.45), and (3.46) can be made just as well before as after the grouping. If we make here a group for each star, all the elements gn can be eliminated and free equations for each can be obtained prior to the addition to form principal equations for the other elements fm. From the theoretical equations for the nth star we thus form the principal equations by separate additions (3.60)
(3.61)
By dividing the first equation (3.60) by hnt and subtracting it from each of the others (3.61), gn is eliminated and the first equation (3.60) is made free of the others, which then by summation yields the following principal equations for fm, for convenience introducing the mean vn = (on,1 + ⋯ + on,t)/t of the passages of the same star over all threads and [1/h2qn] = σ, while [ ] denotes summation of s similar terms with indices n = 1, 2, …, s. (3.62)
It is easy to see that these equations are not linearly independent. Their sum is identically equal to 0, and this could also be foreseen, as our observations could not possibly determine origins for fm and gn. However, it is not difficult to make them mutually free; quite the contrary. The right-hand sides of the equations are
essentially unchanged, only t is replaced by t − 1, …, 2, 1. This way of proceeding would, however, be somewhat tedious and the calculations much less simple than they can be shown to be in another way. The number of theoretical conditions is obviously st − s − t + 1. A homogeneous linear function of f1 … ft which is just not allowed to vanish for f1 = ⋯ = ft, can be chosen arbitrarily here after the method of fictitious observations. In the current example it is easily seen that the assumption (3.63)
and the inclusion of such a fictitious observation in the formation of the principal equations, would change the equations (3.62) to the mutually free system (3.64)
This is sufficient reason to choose this form of the fictitious observation and thereby the entire system of mutually free equations becomes (3.65)
and (3.66)
so that under this fiction one would have
and
for calculation of the standard deviations after adjustment of every function of the f and the g. In particular we get for the adjusted value U = f1 + ⋯ + ft of O that
so that the adjustment really has not affected our fictitious observation. This fact is sufficient evidence that our fiction (3.63) is algebraically completely independent
of the adjustment equations. Therefore it can of course not influence any determination that could be generated by the adjustment without it. We can therefore use the equations (3.65) and (3.66) as mutually free and with standard deviations of the indicated quantities for calculation of any function of the observations and elements that do not depend algebraically on O. Thus for the difference of any two of the f or the g we find
and
In general it holds that
provided that a1 + ⋯ + at = 0; and
provided that b1 + ⋯ + bs = 0. But if these conditions are not fulfilled then both ∑ af and ∑ bg depend on O and are thus completely unknown, whereby their correct mean square errors are equal to ∞. For the adjusted values of the observations
we similarly find independently of O and as a function of mutually free functions
and thus by inserting the observations and their mean vn (3.67)
while for the mean square error one gets (3.68)
In the special case where all the stars have moved with the same speed, or more generally where , and then , we have
With λ2(un,m) one finds the theoretical values of (o − u)2/q (3.69)
In particular, for the summary error criticism one therefore has in general that
But from (3.69) one can also very easily get more detailed tests. We have for example after summation over m for the passage of each individual star
If the task is to investigate the correctness of the dependence assumed in the adjustment between the speed hn and the mean square error, or if one wants to modify this law of dependence using the result of the adjustment, one must demand that these last equations be satisfied, if not exactly for all of the s individual stars, then for all major groups of stars that move at an approximately equal speed.
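A hedged sketch of the equal-speed special case of this example (all hn equal, unit weights; data simulated, not from the original): with the fictitious observation fixing the sum of the f, the adjusted values (3.67) come from row and column means.

```python
import numpy as np

rng = np.random.default_rng(1)
s, t = 4, 3                                    # stars, threads
g = rng.normal(10.0, 1.0, size=s)              # simulated transit moments
f = rng.normal(0.0, 0.5, size=t); f -= f.mean()
o = g[:, None] + f[None, :] + 0.01 * rng.normal(size=(s, t))
f_hat = o.mean(axis=0) - o.mean()              # thread positions, sum(f_hat) = 0
g_hat = o.mean(axis=1)                         # the means v_n over all threads
u = g_hat[:, None] + f_hat[None, :]            # adjusted observations, cf. (3.67)
```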
Example 3. An error law that only deviates a little from the exponential form is to be expanded in a series
It is assumed that counts are available that indicate the number of repetitions in each of a series of consecutive intervals of equal length, for every r ≤ x ≤ r + 1. Initially, as in Section 2.2.6, eqns (2.31) and (2.32), a mean m = μ1 and standard deviation for f(x) have been determined, treating f(x) as a simple exponential error law. m has been taken equal to 0 and the standard deviation as a unit for the interval limits, thereby ensuring that the error law has the simpler form
and according to (2.4) this gives a similar series for the observed values of the definite integrals between the interval limits
which is the equation to be used for adjustment. The mean square errors λ2 should be determined here after formula (3.6) in Section 3.1.5 as those μ2 = λ2 which correspond to the probabilities p/(p + q) of having oz out of m + 1 repetitions falling in the interval in question. But when the intervals are sufficiently numerous, so that the probabilities of each interval are small, one can without much error let λ2 be proportional to . The assumption about the intervals being small is further extended so far (see Section 2.2.7) that one can without danger replace the sums by integrals from −∞ to ∞. The equations (3.42) for adjustment by elements give for
the very simple form
and so on, such that the elements c0, c1, c2, … are free functions of the observations. Had one used the form (2.5), where kn = (−1)nn!cn as the basis for
adjustment, that is
and written [znoz] = sn one would have found an even simpler form
and so on. This is the property of the error law function that we were referring to on page 77. It should be noted also that the rules for development of the error law functions after binomial coefficients and their differences (eqns (2.2) and (2.3)) are based upon the fact that the elements in this expansion are free functions of the observations.
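As an illustration of why the coefficients are free functions of the observations (a modern reconstruction using probabilists' Hermite polynomials, not Thiele's own computation; data simulated): with the error law written as a series in derivatives of the Gaussian, orthogonality makes each coefficient a separate weighted sum of the counts, with no system of equations to solve.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(2)
sample = rng.normal(size=10_000)              # standardized data (simulated)
counts, edges = np.histogram(sample, bins=40, range=(-4, 4))
z = 0.5 * (edges[:-1] + edges[1:])            # interval midpoints
# Each coefficient is estimated by one sum: sum_z He_n(z) * o_z / N.
k = [hermeval(z, np.eye(n + 1)[n]) @ counts / counts.sum() for n in range(5)]
print(np.round(k, 3))   # near (1, 0, 0, 0, 0) for purely Gaussian data
```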
3.6 On systematic errors
At the end of every adjustment the error criticism takes the floor, unfortunately most frequently to refrain from its rights because of the difficulty of the problem. Its most essential tool is the sums of squares for which we have shown above that one can calculate both the values they actually get according to the deviations o − u between observed and adjusted values, and the values they should have according to the theoretical assumptions on which the adjustment is based. It is easy enough to use this tool for error criticism in the simplest cases where one can or must be content with considering the total sum [(u − o)2/λ2] for all the observations, but surely this one test can only give one piece of information, namely about the average size of the deviation. In the cases where the test turns out unfavourably, and where one would then wish to learn about the particular types of error, thereby becoming able to judge in which directions the hypotheses should be improved, this is clearly not enough. It is most often considerably more difficult to calculate several such tests using the formula (3.40) or (3.55) after partitioning the series of observations. It is therefore valuable that there is a second type of test for which one at least cannot complain that it gives too little information: the criticism based upon whether what is called ‘a trend in the
errors' occurs. After having calculated the deviation of every observation from its adjusted value56—and preferably also the corresponding standard deviations—these are partitioned and ordered in tables for such series of observations that are the same function of the varying essential or known circumstances, and with these as arguments. One can advantageously use here graphical representations instead of tables. By inspection of these tables or figures one must then judge whether the errors seem to vary at random about the constant 0, in which case everything is fine; or whether there are indications of some movement or trend, presenting the errors as a reasonably simple function of certain known circumstances for the observations. In this connection one should naturally pay attention to the signs of the deviations. Positive and negative deviations should occur with about the same frequency, as the symmetry of the assumed exponential error laws gives the same probability of positive and negative errors, and since this mostly holds to an even higher degree for the deviations between observation and adjustment than for proper errors. With n similar observations and m theoretical conditions between them, one should expect positive deviations. If the entire series of deviations is arranged in a table with a known circumstance as argument, changes of sign and neighbouring pairs of the same sign should be equally well represented with of each, when deviations that are exactly equal to 0 are counted as half a sign change and half a pair of the same sign and otherwise omitted from the counting. But what is obtained in this way is not worth much compared with the test of the sum of squares and, like that test, gives little guidance in identifying the location of the error. When there are proper systematic errors present and instead of the most correct equation
between the observation and its essential circumstances one adjusts according to another equation
this will show up as a trend in the errors, indeed such a trend which would be represented by the difference F(v1, v2, …, vn) − G(v1, v2, …, vn), ignoring random observational errors.
56 A pity that Thiele did not invent the term 'residual'. This would have made the translator's job easier. However, here Thiele gives a rather modern and extensive guidance for the analysis of residuals from linear models. tr.
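The sign-change count described above is easy to mechanize; a hedged sketch (the handling of exact zeros follows one reading of the rule in the text):

```python
import numpy as np

def sign_change_counts(dev):
    """Count sign changes and same-sign neighbour pairs in an ordered
    series of deviations o - u; zeros count half to each category."""
    s = np.sign(dev)
    changes = pairs = 0.0
    for a, b in zip(s[:-1], s[1:]):
        if a == 0 or b == 0:
            changes += 0.5
            pairs += 0.5
        elif a == b:
            pairs += 1.0
        else:
            changes += 1.0
    return changes, pairs  # for purely random errors, each near (n - 1) / 2 (assumption)
```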
To correct for such systematic errors by means of error criticism one is thus referred to using the trend in the errors to infer the difference between the functions F and G. How to do this will most often depend on the particular nature of the problem, on the position of the specific theory or hypothesis within its science, and in this respect it is not the concern of the general theory of errors, within which all theories are equally good if they conform equally well with the observations. Here one can only consider such general methods and precautions by which, assuming that the failed adjustment can still be seen as an approximation, one tries to improve it or at least to represent the deviation in such a functional form that its significance becomes easier to see than by just referring to the table of deviations itself. During the unavoidable initial inspection and evaluation of the tables of deviations, one must first and foremost be aware of the size of the standard deviation and the unpredictable way in which the individual deviations group according to the exponential error law, so that one is not too easily tempted to see a trend in the errors also where only randomness prevails. If one gives in to such temptations, one could even for the best adjustments assign an ‘improvement’ consisting of a periodic function with almost half as many waves as there are sign changes in the table of deviations. When it comes to the point of expressing the deviations o − u in a functional form F(v) − G(v) in relation to a certain variable circumstance v, graphical adjustment is an excellent preliminary tool. Using v as abscissa, one plots either o − u as a single ordinate or, better, as ordinates for two boundary points for each observation. Then one draws by hand a simple and beautiful curve which is either as close as possible to the first-mentioned points or within a considerable majority of the limit determinations; and finally one measures the ordinates of the curve for equidistant abscissae, thus producing a table of F(v) − G(v) that by ordinary interpolation can be expressed in a functional form to guide the new adjustment towards an improved theory. To relate this to the earlier adjustment, the procedure must first of all depend on whether one has used adjustment by correlates or elements. To take as much advantage as possible of the previous calculation, with adjustment by correlates one must give up one or more of its conditioning equations, preferably one that was last in the process of making the equations free; it is therefore not unimportant that in the first adjustment one has already placed in the first rows the equations for such conditions as would definitely never be abandoned. It can often be very difficult to determine which conditions to drop, and apart from special theoretical considerations one can only be guided by the variation in the coefficients ai, … b′i … in eqn (3.32) and partly by the size of the corresponding terms on the right-hand side
of eqn (3.37). If for example one can see that the coefficients bi vary in proportion to the deviations oi − ui, and in this connection
corresponds in size to the amount by which the actual sums of squares exceed the theoretical value m, then this is a strong indication that the condition equation ℬ = [bo] is the one which should be abandoned to obtain a satisfactory adjustment. It can therefore be recommended to include the values of the coefficients b′i on the figure for the graphical adjustment. The improvement of an adjustment by elements has a somewhat better possibility of being successful. For it must be considered easier to make suitable additions to the functions of the elements that are used to express the observations than it is to find free equations of conditions which can suitably be dropped in adjustment by correlates. In the theoretical equations for the adjustment by elements (3.41) one or more terms si u are simply added:
where si denotes the shape of the chosen supplementary function of the circumstances relevant for the observation, while the factor u, determining the magnitude of the correction, is adjoined to the elements to be found. In other words, the theoretical expression for the observations is treated as a series expansion where one adds an extra term when the series would otherwise not fit. When choosing the functional form for si one must, apart from possible special theoretical considerations, be guided by the investigation of the difference F(vi) − G(vi), but one is in no way forced to follow this. If one can simplify si by adding an arbitrary linear function of pi, qi, …, ri then the result is unchanged, since such an addition will again disappear during the process of making the functions free from (3.42) through (3.48) to (3.53). In this calculation one can keep all calculations from the previous adjustment and it is only necessary to add the calculation of the influence of the new term on the observations after (3.53). Finally (3.57) shows how much this term contributes to reducing the empirical value of the sum of squares. No matter how one has chosen the additional term si, according to this equation the result will always be a reduction and never an increase of this value. However, one must in no way deduce that addition of a new term always means an improvement of the adjustment, because by adding u to the system of elements, the theoretical value of the same sum of squares is definitely reduced by one unit, and by more than that if the term, apart from the unknown factor,
contains constants with values determined from what one in the particular case has been able to estimate from the graphical adjustment or similar things; and thus the reduction of the empirical sum of squares of deviations must at least amount to that much for the addition of the term to be justified. By such a successive addition of new terms one will get in (3.53) the adjusted observations expanded in a series of the particular type which Dr Gram57 has treated under the name of a series for interpolation after expansion functions, which are exactly distinguished by the property that all the terms are mutually free functions. But when the demand here is not just that the series should at all be convergent, but that they should converge so quickly that only a few terms are necessary, it will naturally always be a big question whether one succeeds in making a satisfactory adjustment without eventually ending up with a graphical adjustment and a table of deviations as an expression for the remaining trend in the errors. However, where the rejection of a previous adjustment is based upon deviations with respect to the summary tests, in particular that of the sum of squares, while there is no trend at all in the errors or at least only an insignificant one, the reason should not be sought in an incorrect assumption concerning the theories which connect the means of the observations, but rather in an incorrect determination of their standard deviations. If these have been based upon previous treatment of repeated observations it will often just be the case that the number of repetitions has been too small to give reliable values for the standard deviation of the individual observations. If it is possible, for this purpose one should reinforce the proper repetitions of these observations with such observations where the essential circumstances have only varied very little, so a preliminary adjustment will suffice for a reduction to simple repetitions. In any case one should then use the equations (3.40) or (3.55) for a subsequent determination of the standard deviation of the observations and base the determination of standard deviations for the adjusted observations on this. But if one must acknowledge that the previous determinations of the standard deviations have been reliable and thus admit that the departures from theory are real, it is most straightforward to blame the deviations of the error laws on the assumed exponential form. This can be justified; in particular gross errors, errors of writing, calculation, or typographical errors could play such a role, since these are easily discovered and corrected when they are many times larger than the standard
57
Om Rækkeudviklinger, bestemte ved mindste Kvadraters Methode. København 1879. (On series expansions, determined by the method of least squares). Thiele's footnote.
deviations, but extremely difficult to see when they are of the same magnitude, and they are always more easily avoided during simple repetitions, where the attention can be confined to the last decimals, than by isolated observations under varying circumstances. But whereas repetitions cannot immediately show departures in the third and higher half-invariants, it must be characterized as being hopeless to establish such departures from error criticism of an adjustment. The explanation that the method of observation can change much more than expected under variation of the essential circumstances has more practical significance, since there can easily be dependencies between essential circumstances and circumstances that are ignored as being inessential, which let the standard deviation by real repetitions be much smaller than by apparently similar observations with the essential circumstances varying; the value of λ2 which is calculated from the repetitions will therefore only be a fraction of what corresponds to the scattered observations in the adjustment. Also under this assumption one must then use the equations of the sums of squares of the adjustment itself for calculation of the standard deviations of the observations and the weights to be used for a renewed adjustment, where in particular the proper repetitions are no longer considered completely mutually independent, but must be reduced to normal values with standard deviations that are easily calculated by the formula λ2 = λ′2/n + λ″2
when the number of repetitions is n, λ′2 is the mean square error for pure repetitions, and λ′2 + λ″2 is the mean square error for scattered observations. When there is a dependence between essential and inessential circumstances, not only the pure repetitions but also more or fewer of the scattered observations could lose their right to be considered as mutually independent. This can become a crucial obstacle for all adjustment, as it is an indispensable assumption of the method that the observations to be adjusted are mutually independent; where this does not hold, the functions that are assumed to be free cease to be so in reality. Under such circumstances there will surely be in most cases a trend in the errors that would destroy every attempt to establish a valid adjustment by modifications of the theory through adding additional terms in the way described above. It would not help that one in general acknowledges that there are dependencies present between the observations; it will be necessary to show exactly what these dependencies are; and even then there will often not be anything else to do other than to consider the real observations as linear functions of an extended system of observations of which some represent circumstances that have been ignored as
being inessential, having a mean equal to 0 and a standard deviation according to an acceptable hypothesis, expressing what most likely, but not with theoretical certainty, can be assumed to vanish. This often leads to very difficult problems. But the desired mutual freeness of the observations, which preferably should be achieved by renewed observations after improved methods, seems in many important cases (quasi-systematic errors) to be achievable by the method briefly indicated here.
4 Tables and Figures
According to what has been mentioned above on page 77, the following tables for the values of the integrals of the error function can be interpolated according to the formula
[Table: values of the integrals of the error function, tabulated for x from 0.00 to about 5.54; the multi-column numerical table occupying pages 186-192 of the original is not reproduced here.]
[Table of differential quotients of the function, page 193 of the original; not reproduced here.]
Auxiliary tables for the calculations on pages 130–132
The following seven tables display the coefficients in the expansions of the functions on the left-hand sides, of degree 2 to 8, of the sums of powers of the observations in terms of sums ∑ of simple products of the powers of the observations. The ∑-sums here are made such that the indices of the observations o are assumed to be arranged in an invariant order, that is as ‘repeated sums’, which is most convenient for our particular application. If one were to apply sums, [ ], ‘without repetitions’, the coefficients would be multiplied by factors r!, s!, t!, …, where r, s, and t are the numbers of mutually identical powers in the products under the ∑-sign in the heading of the corresponding column. For example, we have from the first table that
and from the third table
[The seven auxiliary coefficient tables, for degrees 2 to 8, occupying pages 194-197 of the original, are not reproduced here.]
CHAPTER FIVE
T. N. Thiele's contributions to statistics58
Presentation
This paper, a reprint of Hald (1981), contains a detailed discussion of Thiele's contributions to statistics and a brief summary of some of his contributions to other areas. It focuses mainly on Thiele (1889) but discusses the entire statistical production of Thiele, with three exceptions: Thiele's contributions to probability and likelihood, Thiele's correction for grouping, and Thiele's series expansions. These three issues have recently been discussed in Edwards (2001), Hald (2001), and Hald (2000b), respectively. Thiele's work is placed in a historical perspective and explained in modern terms. S.L.L.
Fig. 1 Sketch portrait of T. N. Thiele; study by P. S. Krøyer for use in his painting ‘Et Møde i Videnskabernes Selskab’ (A meeting of the Royal Danish Academy of Sciences and Letters), 1897. Statens Museum for Kunst.
58 Reprinted with permission from International Statistical Review.
1 The background
The works of Laplace and Gauss at the beginning of the 19th century mark a new era in the theory of statistics. The central limit theorem proved by Laplace in 1812 provided the basic tool for the asymptotic theory of estimation and was used in his discussion of the method of least squares. Gauss (1809) gave a probabilistic basis for the method of least squares assuming a uniform distribution of the parameters and defining the best estimate as the one maximizing the posterior density. In his second proof, Gauss (1821–1823) gave a non-Bayesian theory of estimation for the linear model defining the best estimate as the linear unbiased estimate with minimum variance. Within the chosen framework, Gauss's theory of estimation was rather complete. Nevertheless, an immense number of papers on the method of least squares was published in the remaining part of the 19th century with the purposes of disseminating knowledge of the method, giving examples of applications and studying the distribution of residuals, discussing the numerical problems in solving the normal equations, deriving the solution for special cases of the linear model, and giving other motivations for using the method than those presented by Gauss. Danish astronomers, geodesists and actuaries took part in this development. Heinrich Christian Schumacher (1780–1850), Professor of Astronomy at the University of Copenhagen, studied under Gauss in 1808–1809 and was therefore well acquainted with the method of least squares at the time of Gauss's first publication on this subject. When the Danish Geodetic Institute was founded in 1816, Schumacher became its director, and as such he initiated a close cooperation between a geodetic survey of Denmark, starting in 1817, and a geodetic survey of Hannover, which Gauss was commissioned to begin in 1818. A biography (in Danish) of Schumacher has been written by Einar Andersen (1975). Whereas Schumacher accepted the method of least squares on the authority of Gauss, the next two generations of Danish statisticians were rather critical towards Gauss's justifications of the method. Carl Christopher Georg Andræ (1812–1893) succeeded Schumacher as director of the Danish Geodetic Institute. Besides completing the survey and carrying out the analysis of the data he wrote several papers on the method of least squares. Andræ (1867) assumed like Gauss (1809) that the parameters are uniformly distributed, but in contradistinction to Gauss he defined the best estimate as the one giving maximum concentration of posterior probability for absolute deviations of any size. For a symmetric distribution this leads to the minimum variance
estimate as the best. Andræ pointed out that the best estimate depends on the error distribution. As an example he considered a uniform error distribution and proved that the best estimate is the average of the smallest and the largest observation. On these grounds he criticized Gauss's second model and concluded that the method of least squares leads to the best estimate only if the observations are normally distributed. Thus he combined the ideas of the two proofs of Gauss by first deriving the most probable value in the posterior distribution and afterwards proving that this leads to the minimum mean square error. Andræ (1860) also wrote an interesting paper on the estimation of the median and the interquartile distance in a symmetric distribution by means of linear functions of empirical percentage points. Georg Karl Christian Zachariae (1835–1907), who succeeded Andræ as director, wrote a textbook (1871) on the method of least squares. As a motivation for basing his exposition on the normal distribution he gave a detailed discussion of the hypothesis of elementary errors, i.e. the hypothesis that any observed error may be considered as the sum of a large number of independent elementary errors having different (symmetric) distributions with finite moments, and proved the central limit theorem using characteristic functions following the proof given by Bessel (1838). Based on the ideas of Andræ he gave an excellent exposition of the method of least squares. About the same time another excellent textbook was written by the German geodesist F. R. Helmert (1872) giving a rather complete survey of the method of least squares as developed by Gauss and his pupils. Helmert's book is essentially based on the second proof by Gauss. Frederik Moritz Bing (1839–1912), actuary at the State Life Insurance Company, attacked the use of Bayesian methods in statistics in a paper published in Tidsskrift for Mathematik, 1879. A rather heated discussion evolved between Bing and Ludvig Valentin Lorenz (1829–1891), Professor of Physics at the Military Academy, who defended the Bayesian view. Bing demonstrated that Bayes's postulate may lead to contradictory results because the posterior probability depends on what transformations of the hypotheses (or parameters) are assumed to have equal probabilities. Thiele (1889, p. 55; 1903, p. 135) accepted Bing's argumentation. An account of Bing's arguments has been given by Arne Fisher (1922, pp. 54–81). Ludvig Henrik Ferdinand Oppermann (1817–1883), Professor of German at the University of Copenhagen and actuary at the State Life Insurance Company, exerted a profound influence on Thiele. In a paper of 1872 on the justification of the method of least squares he introduces a loss function, L say, and like Laplace and Gauss he defines the best estimator as the one minimizing L, but, whereas
Laplace and Gauss arbitrarily chose the functional form of L, Oppermann wanted to derive the functional form from fundamental principles without specifying the distribution of the observations. We shall sketch his main ideas. He restricted the class of estimators considered to location and scale equivariant functions. Let L = L(γ1 − m, …, γn − m), where the γ's denote n independent observations and m denotes an estimator. First he requires that L for all the observations should equal the sum of the L's for subgroups of observations, which leads to L = ∑ F(γi − m), where F is unknown. Next he requires that the best estimator based on all the observations should also be obtainable by finding the best estimators from subgroups of observations and combining them by means of the loss function as if they were direct observations. This leads to a differential equation which has the solution F(x) = x², and hence to the method of least squares.

Oppermann used the distribution which today is known as the Gram–Charlier type A series, and estimated the parameters by the method of moments (Gram 1879, pp. 6–7, 94). Furthermore, he also pointed out the advantages of orthogonal transformations of observations in the linear model; see Gram (1879, p. 7) and Thiele (1903, p. 54).

Jørgen Pedersen Gram (1850–1916), mathematician and actuary, worked closely together with Thiele. They were both actuaries in the life insurance company Hafnia, Thiele from 1872 and Gram from 1875. Gram wrote an important thesis (1879) on series expansions by means of orthogonal functions and the determination of the coefficients by the method of least squares. One of the series discussed is the Gram–Charlier type A series. He also gave the general formula for variance-stabilizing transformations, g′(x) = 1/h(x), where g(x) denotes the transformation function and h(x) gives the relation between the mean and the standard deviation of the variable in question (Gram, 1879, p. 98). Furthermore, he discussed stratified sampling to estimate the total of a population and derived the rule for optimum allocation of the sample, which was rediscovered by Neyman in 1934 (Gram, 1883, p. 181); a sketch of this allocation rule is given below.
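In modern notation the rule makes the stratum sample sizes proportional to Nh·σh. The following minimal Python sketch (the function name, stratum sizes and standard deviations are ours, invented purely for illustration) computes such an allocation:

import numpy as np

def optimum_allocation(N_h, sigma_h, n_total):
    """Allocate a total sample of size n_total across strata in
    proportion to N_h * sigma_h (the rule Gram derived in 1883,
    rediscovered by Neyman in 1934)."""
    weights = N_h * sigma_h
    return n_total * weights / weights.sum()

# Three strata of unequal size and spread (illustrative numbers).
N_h = np.array([5000, 3000, 2000])       # stratum sizes
sigma_h = np.array([10.0, 40.0, 25.0])   # stratum standard deviations
print(optimum_allocation(N_h, sigma_h, 300))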
2 Thorvald Nicolai Thiele, 1838–1910

T. N. Thiele was born into a well-known Danish family of book-printers, instrument-makers and opticians. His father, Just Mathias Thiele (1795–1874), was a man of many talents, practical as well as artistic. For many years he was private librarian to King Christian VIII, director of the Royal Collection of Prints and secretary of the Royal Academy of Fine Arts. He made a name for himself as dramatist, poet and folklorist. The home in which T. N. Thiele grew up was thus
a highly cultured one, where many leading intellectuals of the time were frequent guests. He was named after the sculptor Bertel Thorvaldsen, who was one of his godfathers. ‘Little Ida’, to whom Hans Christian Andersen first told ‘Little Ida's Flowers’, was his half-sister, Ida Holten Thiele.

T. N. Thiele got his master's degree in astronomy in 1860 and his doctor's degree in 1866, and worked from 1860 as assistant to Professor Heinrich Louis d'Arrest (1822–1875) at the Copenhagen Observatory. During 1870–1871 he worked on establishing the actuarial basis for the life insurance company Hafnia, which was started in 1872 with Thiele as actuary. In 1875 he became Professor of Astronomy and director of the Copenhagen Observatory. He retired in 1907. In 1895 he became a corresponding member of the Institute of Actuaries, London. He took the initiative to found the Danish Society of Actuaries in 1901 and was its President until he died. In 1867 Thiele married Marie Martine Trolle (1841–1889). They had six children.

Thiele worked essentially in applied mathematics. Most of his papers on pure mathematics were inspired by practical problems. He worked in astronomy, numerical analysis, actuarial mathematics and applied and mathematical statistics. In the following sections we shall give a detailed account of his contributions to mathematical statistics. To get a fuller picture of Thiele's scientific activities we shall, however, indicate his contributions to the other fields mentioned above.

In pure mathematics he wrote on elliptic functions, number theory, the calculus of symbols or operators and the theory of continued fractions. His works in astronomy are mainly of a theoretical or numerical nature. He wrote several papers on the determination of the orbits of double stars and was particularly interested in the elimination of systematic observational errors. He also wrote a paper on the three-body problem. He participated in a determination of the difference in longitude between Copenhagen and Lund (Sweden).

Thiele made an important contribution to numerical analysis through his introduction of reciprocal differences (1906b) and his book Interpolationsrechnung (1909). Thiele's interpolation formula with reciprocal differences gives a method of obtaining a rational function which agrees in value with a given function at any finite number of prescribed points, just as Newton's formula solves the same problem by means of divided differences and a polynomial. The interpolant is expressed as a continued fraction depending on the reciprocal differences. Introducing reciprocal derivatives as limits of reciprocal differences, Thiele's interpolation formula leads to a development of a given function as a continued fraction which terminates when the function is rational, just as Taylor's formula terminates when the
function is a polynomial. An account and extension of Thiele's work has been given by Nørlund (1924); see also Milne-Thomson (1933).

In 1870 Oppermann used the method of least squares to fit a mortality formula to data collected in the Danish State Life Insurance Company covering the age period from 5 to 84 years. Presumably because Oppermann had to leave the Company the same year, only the resulting mortality table was published, and not the formula used nor the comparison of observed and graduated values (Gram, 1884; Hoem, 1980). The young Thiele (1871) used the same data to construct his own mortality table. First he formulated his own hypothesis for the force of mortality, assuming: ‘that the causes of death naturally fall into 3 or 4 groups characteristic for different age groups, the three groups corresponding to childhood, manhood and old age, respectively. The fourth group includes the causes of death which afflict all ages with nearly the same force but I believe that such causes of death are relatively unimportant’.
Accordingly he proposed that the force of mortality should be written as a sum of three terms, which he specified as

μ(x) = a1 exp(−b1x) + a2 exp{−½b2(x − c)²} + a3 exp(b3x),

where x denotes age. The idea of decomposing the mortality according to different causes of death has later been taken up by many authors. The estimation of the parameters is of course a problem in nonlinear regression and, since the distribution of the observations is approximately normal, Thiele states that the best estimates are obtained from the method of least squares with weights inversely proportional to the variances. However, since the variance depends on μ(x), this procedure becomes very complicated and Thiele therefore uses a simpler estimation method. To judge whether the graduation is satisfactory he carefully studies the differences between the observed and graduated values with regard to size and changes of sign. Finally he remarks that if he had used the method of least squares, then the sum of weighted squared deviations, q say, plus the number of estimated parameters, k, has expectation equal to n, the number of observations, and the standard deviation of q + k is known, equal to √{2(n − k)}. He concludes that his graduation is nearly satisfactory since his sum of weighted squared deviations plus k differs by less than two times the standard deviation from n. By the same method he shows that his graduation is better than the one based on Oppermann's five-parameter formula. Later Thiele (1900, 1904) proposed a more complicated formula and demonstrated how to use it in practice.
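As a rough modern sketch of such a fit, assuming the three-term form given above (the data, weights and starting values below are invented purely for illustration), one may use nonlinear least squares and apply Thiele's q + k check:

import numpy as np
from scipy.optimize import curve_fit

def thiele_mu(x, a1, b1, a2, b2, c, a3, b3):
    """Three-term force of mortality: childhood decay, a mid-life
    hump and Gompertz-type old-age growth."""
    return (a1 * np.exp(-b1 * x)
            + a2 * np.exp(-0.5 * b2 * (x - c) ** 2)
            + a3 * np.exp(b3 * x))

# Hypothetical crude death rates by age (simulated, not Thiele's data).
age = np.arange(5, 85)
true = thiele_mu(age, 0.02, 0.3, 0.004, 0.002, 25.0, 1e-4, 0.09)
rng = np.random.default_rng(1)
obs = true * (1 + 0.05 * rng.standard_normal(age.size))

p0 = [0.02, 0.3, 0.004, 0.002, 25.0, 1e-4, 0.09]
popt, _ = curve_fit(thiele_mu, age, obs, p0=p0, maxfev=20000)

# Thiele's check: the weighted sum of squared deviations q plus the
# number k of fitted parameters should differ from n by less than
# about 2*sqrt(2(n - k)).
resid = (obs - thiele_mu(age, *popt)) / (0.05 * true)  # approximate weights
q, k, n = np.sum(resid ** 2), len(popt), age.size
print(abs(q + k - n) < 2 * np.sqrt(2 * (n - k)))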
Thiele's most important contribution to actuarial mathematics is his differential equation for the premium reserve, which he invented in 1875. It is the natural foundation for life insurance mathematics in continuous time and shows how the rate of change of the premium reserve depends on the reserve itself and on the forces of mortality, interest and premium. It was first published by Gram (1910); see also Jørgensen (1913, p. 253). A detailed treatment with extensions has been given by Hansen (1946).

Thiele particularly enjoyed the formulation of mathematical models and the analysis of data, by graphical and numerical methods, and he was a master of calculations. Besides analysing his own astronomical data and data from vital statistics of importance for insurance, he helped his colleagues, Professor of Genetics W. Johannsen and Professor of Chemistry Jul. Thomsen, with the analysis of their data (Thiele, 1892b, 1906a). In the 1892 paper he formulated his philosophy about work in applied statistics as a number of recommendations which may still serve as a model worthy of imitation.

Thiele's fundamental contributions to statistics are contained in a book, Almindelig Iagttagelseslære (General Theory of Observations), 1889, and a paper, ‘Om Iagttagelseslærens Halvinvarianter’ (On the half-invariants from the theory of observations), 1899. They are rather difficult to understand, and presumably for that reason he wrote a popular version of the book called Elementær Iagttagelseslære (Elementary Theory of Observations), 1897, covering nearly the same theory but without some of the more difficult proofs and with more examples. A very poor English translation, Theory of Observations, was published in 1903 and reprinted in Annals of Mathematical Statistics, 1931. In my opinion Thiele's first book is considerably better than the following two versions. At the time of publication it must have been considered highly unconventional as a textbook because it concentrates on skew distributions (instead of the normal), cumulants (instead of central moments) and a new justification and technique for the method of least squares (instead of the Gaussian minimum variance, linear unbiased estimation). Since both the book and his 1899 paper are written in Danish they are not widely known, and I shall therefore give rather detailed references to help the interested reader. It is easy to find the corresponding results in the English version of his book.

Thiele had the bad habit of giving imprecise references or no references at all to other authors; he just supposed that his readers were fully acquainted with the literature. (From that point of view it therefore serves him right that he was himself neglected, for instance by K. Pearson and R. A. Fisher.) I have tried to track down
the origin of his ideas as far as possible but some of my comments on this matter are pure guesswork. In his obituary of Thiele, Gram (1910) writes: ‘He thought profoundly and thoroughly on any matter which occupied him and he had a wonderful perseverance and faculty of combination. But he liked more to construct his own methods than to study the methods of other people. Therefore his reading was not very extensive and he often took a one-sided view which had a restrictive influence on the results of his own speculations. Furthermore, he had very great difficulties in finding correct expressions for his thoughts, in writing as well as verbally, even if he occasionally could be rather eloquent. Thiele's importance as a theoretician lies therefore more in the original ideas he started than in his formulations, and his ideas were in many respects not only original but far ahead of his time. Therefore he did not get the recognition he deserved and some time will elapse before his ideas will be brought in such a form that they will be accessible to the great majority, but at that time they will also be fully valued because of their fundamental importance.’
Perhaps the time has come now, 90 years after the publication of his book. We have grouped Thiele's main contributions to statistics under the headings: skew distributions, cumulants, estimation methods, the linear model, analysis of variance and a time series model combining Brownian motion and the linear model. We shall use modern terminology and notation to make Thiele's work easier to understand. For example, the symmetric functions defined by him and called half-invariants will here be called cumulants. Expectation and variance of a random variable are denoted by E and V, respectively, and the variance will also be denoted by σ2. Unless the contrary is explicitly stated the random variables (observations) considered are assumed to be independent. Thiele made no new contribution to the theory of discrete distributions. We shall therefore assume that the distributions considered are continuous, apart from the considerations on cumulants and estimation methods in §§4 and 5 which are of a general nature. The standardized normal density will be denoted by φ(x). For brevity we shall use matrix notation in the discussion of the linear model. Matrices are assumed to be conformable and of full rank unless otherwise explicitly stated.
3 Skew distributions

At the time when Thiele became interested in statistics the normal distribution played a predominant role. However, Oppermann, Thiele and Gram, working with economic and demographic data as actuaries, realized the need for developing a theory of skew distributions.
The Gram–Charlier type A distribution. Let f(x) denote a continuous density with finite moments. Thiele (1889, pp. 13–16, 26–28) writes the density in the form

f(x) = c0φ(x) + c1φ′(x) + c2φ″(x) + ⋯ + crφ⁽ʳ⁾(x) + ⋯
and determines the unknown coefficients in terms of the cumulants; see §4. Today this expansion is usually called the Gram–Charlier type A series. The name is, however, incorrect from a historical point of view. Gnedenko & Kolmogorov (1954, p. 191) point out that the expansion occurs in the work of Chebychev. Särndal (1971) and Cramér (1972) have written a history of the subject with special regard to Swedish contributions. As mentioned above, Oppermann used the type A series before Gram. Thiele (1873) wrote a paper about numerical aspects of the series. In his books Thiele did not discuss the convergence of the series, presumably because he only considered the use of a finite part; neither did he comment on the property that the density may become negative.

There are no indications of how they invented the series. I suppose they considered it trivial, for the following two reasons. As actuaries they were familiar with the technique of finding approximation formulae by first transforming the given function by subtraction of, or division by, a suitably chosen simple function and then using a polynomial as approximation to the difference or quotient. Hence, the normal distribution multiplied by a polynomial of low degree was a natural starting point for approximating skew distributions. The second reason was the hypothesis of elementary errors. Using characteristic functions, Bessel (1838) had derived the type A series under the assumption that the distributions of the elementary errors were symmetric, and Zachariae (1871) had included a discussion of this problem and of Bessel's proof in his book. Oppermann, Thiele and Gram were therefore fully aware of the probabilistic background of the expansion, and presumably they considered an extension to nonsymmetric distributions a commonplace. In his discussion of the distribution of the circumference of trees Gram (1889, p. 114) remarks that a density of the form f(x) = (a + bx) exp{−k(x − c)²} seems to be adequate. Of course they were also aware of the possibility of using densities of the form of a polynomial multiplied by a gamma density.
As a special case Gram (1879, pp. 105–107) fitted a gamma distribution to the distribution of the marriage age for men, using the method of moments to estimate the parameters.
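A minimal sketch of the type A series in code (using probabilists' Hermite polynomials; the truncation after the fourth cumulant and all names are our choices) also exhibits the defect, not commented on by Thiele, that the truncated density can go negative for large shape coefficients:

import numpy as np
from scipy.stats import norm
from numpy.polynomial.hermite_e import HermiteE

def gram_charlier_a(x, gamma1, gamma2):
    """Type A density for a standardized variable, truncated after the
    fourth cumulant: phi(x) * [1 + (g1/3!)He3(x) + (g2/4!)He4(x)]."""
    he3 = HermiteE([0, 0, 0, 1])(x)
    he4 = HermiteE([0, 0, 0, 0, 1])(x)
    return norm.pdf(x) * (1 + gamma1 / 6 * he3 + gamma2 / 24 * he4)

x = np.linspace(-4, 4, 9)
print(gram_charlier_a(x, gamma1=0.5, gamma2=0.3))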
Transformation of skew distributions. The first paper by Thiele (1878) on statistics has the title ‘Bemærkninger om skæve Fejlkurver’ (Remarks on skew frequency curves). Let the density of x be φ(x), the standardized normal density. Thiele remarks that the density of y = f(x) then will be p(y) = φ(x)/f′(x), which gives a general expression for skew distributions when f(x) is nonlinear. As a simple example he discusses the transformation y = α + βx + γx² and estimates the three parameters by setting the first three empirical moments about the origin equal to the corresponding theoretical moments. Thiele (1889, pp. 29–31) returns to this principle of transformation and points out that if the transformation is not one-to-one, p(y) will be the sum of terms of the form φ(x)/f′(x) corresponding to the roots of the equation f(x) = y. For y = α + βx + γx² he derives the cumulants of y in terms of the cumulants of x and gives the equations for the determination of the parameters. (This section has not been included in the 1897 and 1903 books.)

Thiele had this idea of transformation, and used it, long before his 1878 paper. In his thesis (1866, pp. 7–8) he pointed out that the geometric, instead of the arithmetic, mean should be used to estimate distances based on certain astronomical measurements. In 1875 he fitted a logarithmic normal distribution to data on the marriage age for females, treating log(y − α) as normally distributed, where y denotes the age at marriage (Gram 1910).
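Thiele's rule for transformations that are not one-to-one is easily sketched in code; the function below (our naming, and with an absolute value on the derivative so that decreasing branches are handled) sums φ(x)/|f′(x)| over the roots, and is checked against the χ² density with one degree of freedom, which arises from y = x²:

import numpy as np
from scipy.stats import norm

def density_of_transform(y, f_inv_roots, f_prime):
    """Density of y = f(x) for standard normal x, summing
    phi(x)/|f'(x)| over all real roots x of f(x) = y."""
    return sum(norm.pdf(x) / abs(f_prime(x)) for x in f_inv_roots(y))

# Example: y = x**2, whose two roots are +sqrt(y) and -sqrt(y).
f_inv_roots = lambda y: [np.sqrt(y), -np.sqrt(y)] if y > 0 else []
f_prime = lambda x: 2 * x
print(density_of_transform(1.0, f_inv_roots, f_prime))  # chi^2_1 density at 1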
4 Cumulants

Thiele's (1889) fundamental contribution to the nonparametric theory of statistics is the introduction of cumulants (half-invariants, as he called them) and the development of a corresponding theory, the only assumption being the existence of moments. Thiele's starting point is the one-to-one correspondence between n observations and a set of symmetric functions of order 1 to n. Beginning with the moments about the origin, m′r, he next defines the moments about the sample mean, mr, and uses the mean, the variance and the standardized moments to characterize the distribution. He then writes (1889, p. 19): ‘It is, however, better to use what we shall call the half-invariants defined by the formula

(4.1)  m′ᵣ₊₁ = ∑ᵢ₌₀ʳ C(r, i) hᵢ₊₁ m′ᵣ₋ᵢ (r = 0, 1, …),
where hi denotes the ith empirical half-invariant’. Solving these equations, he finds hr in terms of the first r moments about the origin and he also derives the simpler relations between the cumulants and the
moments about the mean. In particular he finds h1 = m′1, h2 = m2, h3 = m3 and h4 = m4 − 3m2².
If we replace the empirical moments in (4.1) by the theoretical moments, μ′r, the theoretical cumulants, κr, are defined. Thiele derived the cumulants for the binomial, the rectangular, the normal and the Gram–Charlier type A distribution, and also gave the cumulants of y = α + βx + γx² in terms of the cumulants of x. In 1903 he also gave the cumulants for the Poisson and the mixed normal distribution.

Thiele's (1889, pp. 21–2) comments on the interpretation of the cumulants are as follows. The mean, κ1, depends on both location and scale; the variance, κ2, depends on the scale but is independent of the location; and the quantities γ1 = κ3/(κ2√κ2), γ2 = κ4/κ2², … are independent of both location and scale and therefore describe the shape of the distribution. The first four cumulants are the most important; the third characterizes the skewness and the fourth the flatness or peakedness of the distribution. His interpretation of the first four cumulants is presumably influenced by his results for the type A series. Let the random variable x have the cumulants κ1, κ2, … and consider the standardized variable y = (x − κ1)/√κ2. Thiele proves that the density of y, g(y) say, may be written as

(4.2)  g(y) = φ(y) − (γ1/3!)φ⁽³⁾(y) + (γ2/4!)φ⁽⁴⁾(y) + ⋯,
the following terms being more complicated. Actually he gives the coefficients up to the 8th term (1889, p. 28). Considering only the first three terms we can easily see how γ1 and γ2 influence the shape of the distribution. For independent random variables, x1, …, xn, he proves the fundamental (addition) theorem (1889, p. 36)

(4.3)  κr(x1 + ⋯ + xn) = κr(x1) + ⋯ + κr(xn).
Finally, he derives the (theoretical) cumulants of the empirical cumulants, that is κr(hi) (1889, pp. 60–62), for the most important values of r and i. First of all

κr(h1) = κr/nʳ⁻¹,

which he uses to prove that the distribution of the mean of independent and identically distributed random variables is asymptotically normal, regardless of the distribution of the observations, if only the moments exist. Expressing the higher empirical cumulants in terms of sums of products of observations (auxiliary tables are given), Thiele derives κ1(hi) for
i = 1, …, 6, κ2(hi) for i = 1, …, 4, κ3(hi) for i = 1, 2 and (in 1903) κ4(hi) for i = 1, 2. He regrets that he has not succeeded in finding the general formula. Among these results is the important formula (1889, p. 61)

κ2(h2) = (κ4 + 2κ2²)/n, approximately.

The version used today,

(4.4)  V{h2} = κ4/n + 2κ2²/(n − 1),

is given on p. 39, 1897. Thiele points out that the variance of the empirical cumulants of higher order is relatively large. For the normal distribution he finds (1889, p. 64) V{h2} ≈ 2κ2²/n for n large.

Looking back at the above results one may ask two questions: (i) How did Thiele find the recursion formula (4.1) defining the cumulants? (ii) Why did he prefer the cumulants to the central moments? He does not give any answer to the first question. For the second question it is clear from his comments that the first reason was the simple property of the cumulants for the normal distribution and also the simple extension (4.2) to the type A series. Furthermore, the addition theorem (4.3) and its usefulness in proving asymptotic normality of estimators were decisive for him.

Ten years after the publication of his book Thiele finally succeeded in finding the general definition of the cumulants. In a short paper, ‘Om Iagttagelseslærens Halvinvarianter’, from 1899, he defines the cumulants by the equation

(4.5)  ∫ exp(tx) f(x) dx = exp(κ1t + κ2t²/2! + κ3t³/3! + ⋯),
where f(x) denotes the density for the random variable with cumulants equal to κ1, κ2, …. Equating the coefficients of tⁱ on both sides of (4.5) he finds

(4.6)  μ′1 = κ1, μ′2 = κ2 + κ1², μ′3 = κ3 + 3κ1κ2 + κ1³, ….
Taking logarithms on both sides of (4.5) he obtains similarly

(4.7)  κ1 = μ′1, κ2 = μ′2 − μ′1², κ3 = μ′3 − 3μ′1μ′2 + 2μ′1³, ….
Finally, he obtains the recursive definition (4.1) by differentiating (4.5) and equating coefficients of equal powers of t.

Next he turns to the operational properties of the cumulants. He remarks that t in (4.5) may be replaced by any suitably chosen operator and that an obvious choice is t = −D, where D denotes differentiation. Noting that exp(−aD)f(x) = f(x − a), he proves that, for a suitable function g,

(4.8)  exp{κ1(−D) + κ2D²/2! + κ3(−D)³/3! + ⋯} g(x) = ∫ g(x − y)f(y) dy.
Setting f(x) = φ(x/σ)/σ, he finds that the effect of the operator exp(bD²/2) on a normal density is to add b to the variance. (It is easy to see from his formula that this is true for any distribution.) Applying the equation (4.5) with t = −D to the normal distribution he finds

exp{−ξD + σ²D²/2!} f(y) = ∫ f(y − x) φ((x − ξ)/σ) dx/σ.
Writing y = (x − ξ) + (y − x + ξ) and replacing f(y) in the integral above by its Taylor expansion, he obtains

(4.9)  f(x) = exp{(κ1 − ξ)(−D) + (κ2 − σ²)D²/2! + κ3(−D)³/3! + κ4(−D)⁴/4! + ⋯} φ((x − ξ)/σ)/σ,

i.e. a series expansion of a density f(x) with given cumulants in terms of a normal density and its derivatives. A particularly simple result is obtained by choosing ξ = κ1 and σ² = κ2. He finally states the general formula connecting two densities f and f* having the same mean and variance,

(4.10)  f(x) = exp{(κ3 − κ3*)(−D)³/3! + (κ4 − κ4*)(−D)⁴/4! + ⋯} f*(x),

where κr and κr* denote the corresponding cumulants.
He adds that one has to check that the series is convergent and gives an example of a divergent series. In his book of 1903 he uses (4.5) for defining the cumulants and he is therefore able to simplify some of his previous proofs. He does not give the general formulae (4.6) and (4.7), neither does he mention the operational properties. However, he gives a reference to his 1899 paper on p. 49, 1903.
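In code, the recursion (4.1) and its inverse are easy to implement. The following minimal Python sketch (the function names are ours, not Thiele's) converts between cumulants and raw moments and checks the round trip on the Poisson distribution, whose cumulants are all equal to the mean:

from math import comb

def raw_moments_from_cumulants(kappa):
    """Apply the recursion m'_{r+1} = sum_i C(r,i) k_{i+1} m'_{r-i},
    with m'_0 = 1; kappa[i] holds k_{i+1}."""
    m = [1.0]
    for r in range(len(kappa)):
        m.append(sum(comb(r, i) * kappa[i] * m[r - i] for i in range(r + 1)))
    return m[1:]

def cumulants_from_raw_moments(m):
    """Solve the same recursion the other way round."""
    kappa = []
    for r in range(len(m)):
        s = sum(comb(r, i) * kappa[i] * ([1.0] + m)[r - i] for i in range(r))
        kappa.append(m[r] - s)
    return kappa

# Round trip for a Poisson(2) variable: all cumulants equal to 2.
m = raw_moments_from_cumulants([2, 2, 2, 2])
print(m)                            # raw moments 2, 6, 22, 94
print(cumulants_from_raw_moments(m))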
K. Pearson and R. A. Fisher have both commented on Thiele's work. In his fundamental paper on ‘Skew variation in homogeneous material’ K. Pearson (1895) comments on Thiele's 1889 book, but he seems to have overlooked Thiele's most important results. He writes: ‘Dr. Thiele does indeed suggest the formation of certain “half-invariants”, which are functions of the higher moments, μ3, μ4, etc., of the above memoir. He further states (pp. 21–2) that a study of these quantities corresponding to the half-invariants for any series of observations would provide us with information as to the nature of the frequency distribution. They are not used, however, to discriminate between various types of generalized curves, nor to calculate the constants of such types.’
Pearson does not mention that Thiele introduced the shape coefficients, γr, neither does he mention Thiele's treatment of skew distributions by means of transformations or by the Gram–Charlier series. Pearson's main comment refers to a method invented by Thiele to represent a discrete distribution by means of binomial coefficients and differences of these. In his fundamental paper on cumulants and k statistics Fisher (1928) does not mention Thiele. Fisher begins with the definition (4.5) and goes on to derive (4.6) and (4.7). He ascribes (4.4) to Student; it is of course originally due to Gauss (1823, Art. 39). Although it is evident from Fisher's early writings that he was not familiar with the Continental literature, it is rather odd that he did not know the English version of Thiele's book, in particular as it is mentioned by Arne Fisher (1922) and by Whittaker & Robinson (1924, pp. 171–172). The paper by Cornish & Fisher (1937), in which the name cumulant is introduced instead of half-invariant, again starts from (4.5) and goes on to find the operational properties of the cumulants and derives (4.9) which by inversion gives the Cornish–Fisher expansion. They do not mention Thiele. In the fourth edition of his Statistical Methods Fisher (1932) inserted ‘a historical note on the principal contributors to the development of statistical reasoning’ containing only six names: Bayes, Laplace, Gauss, K. Pearson, Student and Thiele. Having mentioned Laplace he writes about Thiele: ‘These [the cumulants] seem to have been later discovered independently by Thiele (1889), but mathematically Laplace's methods were more powerful than Thiele's, and far more influential on the development of the subject in France and England’
(1932, p. 22). This seems to imply that Fisher's reason for not crediting Thiele for developing the theory of the cumulants was that Laplace had already done so, which is an astounding attitude in view of the facts.
5 Estimation methods and k statistics

Between the times of Gauss and Fisher it was customary to state the result of a statistical analysis in the form of an estimate, t say, and its empirical standard error, σ^t (or the probable error 0.67449σ^t), usually as t ± σ^t. This statement implied that t was approximately normally distributed with standard deviation σ^t. Tests and confidence intervals were then computed in the usual manner. Thiele too used this method, with the additional remark that one should take the third and fourth cumulant into account if necessary.

As mentioned above, Thiele used the method of moments in 1878. Let us briefly consider this method before discussing Thiele's principle of estimation. The method of moments uses the first r empirical moments about the origin, m′1, …, m′r, as estimates of the analogously defined theoretical moments μ′1, …, μ′r. We may, however, just as well use the empirical central moments, m′1, m2, …, mr, or the empirical cumulants, h1, …, hr, as estimates of the corresponding theoretical quantities, μ′1, μ2, …, μr or κ1, …, κr, respectively, because of the one-to-one relations between the three sets of symmetric functions and the analogous relations between the theoretical quantities; i.e. the method of moments has the important property that the same estimates are obtained whatever set of symmetric functions is used as starting point. It should also be noted that the relations referred to are nonlinear, at least for symmetric functions of higher order.

Obviously, Thiele wanted to improve the method of moments, and in 1889, p. 63, he postulates that the best method of estimation is to set the empirical cumulant equal to its expectation. We shall quote some of his remarks on this matter from 1903, pp. 47–48. ‘The struggle for life, however, compels us to consult the oracles. But the modern oracles must be scientific …. It is hardly possible to propose more satisfactory principles than the following: The mean value of all available repetitions can be taken directly, without any change, as an approximation to the presumptive mean. If only one observation without repetition is known, it must itself, consequently, be considered an approximation to the presumptive mean value …. If necessary, we complete our predictions with the mean errors and higher half-invariants …. The ancient oracles did not release the questioner from thinking and from responsibility, nor do the modern ones; yet there is a difference in the manner.’
Applying this principle to the empirical cumulants, Thiele arrives at the following method of estimation. Let E{hi} = fi(κ1, …, κi). With hi as the best estimate of E{hi}, the estimates κ^1, …, κ^r of the theoretical cumulants are obtained
by solving the r equations hi = fi(κ^1, …, κ^i) (i = 1, …, r). Thiele gives the solution for r = 5 (1889, p. 63). He does not mention the fact that the estimates are biased, apart from the first three. He does not give formulae for the variance of the estimates but remarks that the main term of the variance may be found by means of the variances of the h's, which he has previously derived.

Thiele did not prove his postulate that hi is the best estimate of E{hi}. Today we know that he was right in the sense that any symmetric function is the minimum variance estimate among unbiased estimates of its expectation. As pointed out by Steffensen (1923b, 1930), Thiele's method of estimation leads to a dilemma (compared with the method of moments) because other estimates of the cumulants will be obtained if we start from the central moments, say, using mi as estimate of E{mi}. The explanation of this fact is, of course, that the relation between the moments and the cumulants is nonlinear, so that the conversion formulae are destroyed by introducing unbiasedness of one particular symmetric function.

After Steffensen's (1923b) discussion, Tschuprow (1924, pp. 468–472) turned the problem around by asking for an unbiased estimate of a given parameter. (This had presumably been done by many other authors, but Tschuprow has a fairly general formulation of the problem.) He did not, however, solve the problem for the cumulants. Bertelsen (1927, p. 144) found the first four symmetric functions, ki say, such that E{ki} = κi for i = 1, …, 4. Finally, Fisher (1928), by giving the general rules for finding the k statistics, solved the problem and thus completed the work of Thiele. As explained by Cornish & Fisher (1937) – without mentioning Thiele – Thiele's formula (4.7) is the natural starting point for constructing the k statistics. Fisher points out that the requirement E{ki} = κi leads to manageable formulae for the k's, which is not the case for the analogous requirement, E{qi} = μi say, for the determination of qi.

Steffensen indicates that Thiele considered his method of estimation as a generalization of the method used by Gauss for estimating the variance. This interpretation assumes that Gauss started from the second empirical moment (or cumulant), found its expectation E{h2} = σ²(n − 1)/n and solved the equation h2 = σ^2(n − 1)/n to find σ^2. However, one may just as well say that Gauss was looking for a symmetric function of order two with expectation σ², and this interpretation will lead to Tschuprow's procedure. Gauss did not formulate a principle, he just solved the problem at hand.

Consider the linear model and let Q denote the sum of the n squared residuals. Gauss (1823, Article 38) proved that E{Q} = (n − m)σ², where m denotes the number
of parameters, and formulated his conclusion nearly as follows: ‘The value of Q considered as a random variable may be larger or smaller than the expected value but the difference will be of less importance the larger the number of observations so that one may use √{Q/(n - m)} as an approximate value for σ’.
Hence, Gauss derived an unbiased estimate of σ² but stated his result as a biased estimate of σ. Of course, the standard deviation is the quantity of practical interest. Gauss required unbiasedness only for ‘natural’ parameters of the model, i.e. the expectations and the variance, presumably because they combine linearly. There is no doubt that Thiele considered the cumulants as the ‘natural’ symmetric functions to use for exactly this reason; see (4.3). Hence, if any ‘moment’ function is to be chosen for unbiased estimation, then it should be the cumulant. Thiele's mistake was to start from the empirical cumulants instead of the theoretical ones.

If we do not go into details of the history of the χ² distribution, it seems reasonable to put on record that Oppermann (1863) in a query states that he has derived the distribution of

s² = {(x1 − x̄)² + ⋯ + (xn − x̄)²}/(n − 1)

for n normally distributed observations, the proof being by induction. Unfortunately he does not give his result, because he does not like the method of proof. Instead he asks the following question. Let f(y) denote the density for the distribution of s², so that

∫0^y f(u) du = ∫⋯∫ φ(x1)⋯φ(xn) dx1 ⋯ dxn,

the integration on the right extending over the region where s² ≤ y. How is f(y) to be found by evaluation of this integral? He received no answer. It is well known that Abbe (1863) in the same year solved the simpler problem of finding the distribution of ∑xi², assuming that E{xi} = 0, by evaluating the corresponding integral, and that Helmert (1876a, b) found the distribution of m2. It is a peculiar fact that Helmert did not present this distribution in his book (1907). Thiele (1903, p. 46) derived the moment-generating function of h2 = m2, but did not go any further.
6 The linear model with normally distributed errors

Thiele points out that the observations, apart from the errors, usually will be unknown functions of certain parameters. It is therefore necessary to consider the
following three problems: to formulate a hypothesis about the means and the error distribution; to estimate the parameters in this model; and to criticize the model by means of an analysis of the residuals (Thiele, 1889, p. 66; 1903, pp. 67–68). After some very general remarks he states that the only problem which so far has been solved satisfactorily is the one where the means are linear functions of the parameters (or the original hypothesis may be linearized) and the errors are independent and normally distributed. He also adds that the problem of estimation is a technical matter, whereas the specification and the criticism are the important problems.

Thiele could not accept the proofs of Gauss (1809) and Andræ (1867) of the method of least squares because they were based on a uniform prior distribution and Thiele was a non-Bayesian (Thiele, 1889, p. 77). He was, however, of the same opinion as Andræ that the method of least squares should only be used for normally distributed observations. Nevertheless, it is rather strange that he does not comment on the second proof by Gauss (1821) based on the principle of minimum variance.

Contrary to Laplace, Gauss and Oppermann, who had used minimization of a loss function to find the best estimate, Thiele wanted to derive estimates of the parameters in the linear model with normally distributed errors from ‘self-evident’ principles, i.e. without the use of a loss function. His great invention was what we today call the canonical form of the linear hypothesis. On p. 68 (1889) he writes as follows: ‘We shall first discuss a special case, which does not occur in practice, namely the case where the observations fall into two groups so that in the one group every observation determines its unknown mean whereas in the other group the expected value of each observation is a specified number, given by theory and independent of the means in the first group’.
Similar considerations may be found on pp. 66–67, 1903. Hence, in modern terminology, Thiele's special case may be formulated as follows. Let γ1, …, γn be independent and normally distributed with means E{γi} = ηi for i = 1, …, m, the η's being unknown, E{γi} = ηi0 for i = m + 1, …, n, the η0's being given numbers, and unknown variance σ2. He then states as evident that γi should be used as estimate of ηi for i = 1, …, m, and since E{(γi - ηi0)2} = σ2 for i = m + 1, …, n that
s² = ∑ᵢ₌ₘ₊₁ⁿ (γi − ηi0)²/(n − m)

should be used as estimate of σ². Furthermore, the criticism of the model should be based on the n − m errors.
Having solved the estimation problem for the canonical form, Thiele goes on to show how the general linear model may be transformed to the canonical form. He points out that the adequate mathematical tool is the theory of orthogonal transformations and gives an exposition of this theory adjusted to the needs of statistics. We shall give an account of the most important of Thiele's results, keeping to his ideas in the proofs but using matrix notation for brevity.

Let (γ1, …, γn) = Y′ denote n independent and normally distributed random variables with mean (η1, …, ηn) = η′ and common variance σ². In Thiele's exposition the variances were supposed to be proportional to σ² with known proportionality constants. We shall, however, keep to the simpler model since the generalization is trivial.

Thiele called independent linear functions of the observations ‘free functions’. We shall summarize his most important results on free functions without giving the proofs. Let A = (A1, …, Am) denote m given vectors of dimension n, and let similarly B = (B1, …, Br). Two functions A′1Y and B′1Y are said to be independent (free) if and only if A′1B1 = 0. The definition is based on the requirement that V(A′1Y + B′1Y) = V(A′1Y) + V(B′1Y). Two systems of functions A′Y and B′Y are said to be independent if any function in the one system is independent of all the functions in the other system, i.e. if A′B = 0. Any given function B′1Y may uniquely be written as a sum B′1Y = B′0Y + L′A′Y, say, where the first component B′0Y is independent of the m given functions A′Y, which means that L = (A′A)⁻¹A′B1 and B0 = B1 − AL. In particular, for two functions A′1Y and B′1Y, we have that B′0Y is independent of A′1Y, where

B0 = B1 − A1(A′1B1)/(A′1A1).
From a system of m functions given by the coefficients A1, …, Am, we may construct a system of m independent functions using the method above. Using A′1Y as the first, we find the coefficients of m − 1 functions independent of it as A2i = Ai − A1(A′1Ai)/(A′1A1), since A′1A2i = 0 for i = 2, …, m. Similarly we construct m − 2 functions independent of A′1Y and A′22Y by means of the coefficients A3i = A2i − A22(A′22A2i)/(A′22A22), since A′22A3i = 0 for i = 3, …, m, and so on. This procedure is identical with Gauss's algorithm for solving the normal equations.

A system of n independent functions is called a complete system. Z = Q′Y, where Q′Q = I, is a complete system. Let Q be partitioned into two matrices A
and B so that Q = (A, B), A′A = I, B′B = I and A′B = 0.
It follows from the results above that any linear function of Y may be written as a sum of two independent functions, the first being independent of A′Y and the second independent of B′Y. In particular, this is true for each of the elements of Y. Thiele naturally also mentions that distances are invariant under orthogonal transformations.

Thiele writes that Oppermann and Helmert have pointed out the usefulness of working with independent functions. Helmert (1872, p. 164) introduced the concept of an equivalent system of linear functions, defined by the requirement that it should give the same estimates as the original observations, and as the most important example he used the reduced normal equations, whose right-hand sides are independent functions of the observations. It was, however, Thiele who developed a general theory of free functions.

As was usual at that time, Thiele considers two formulations of the linear model. In the first case, which according to Gauss is called ‘Adjustment by Correlates’, the problem is to estimate η and σ² under the restriction A′1η = ζ10, where A′1 denotes a given (r × n) matrix of rank r = n − m (0 < r < n), and ζ10 is a given vector of dimension r. In the second case, called ‘Adjustment by Elements’, it is assumed that η = Xβ, where X denotes a given (n × m) matrix of rank m and β is an unknown vector of dimension m.

Adjustment by correlates. In the following we shall introduce partitioned vectors and matrices of dimensions r, m, (r × r), (r × m), etc., without each time defining the symbol and specifying its dimension, because it follows from the context. Let us supplement the r linear restrictions A′1η = ζ10 by m linear functions A′2η = ζ2 which we want to estimate, so that A′2 is known whereas ζ2 is unknown. The linear model

Z = A′Y, E{Z} = A′η,

will, in general, not be in canonical form, and we therefore transform to

T = BZ = BA′Y = Q′Y,

and, to get independence, we require that BA′ = Q′ and Q′Q = I.
It is known from Gauss's algorithm for solving the normal equations that there exists a lower-triangular matrix B satisfying the equation BA′AB′ = I. Writing B in partitioned form as

B = (B11, 0; B21, B22),

we find

T1 = B11Z1, T2 = B21Z1 + B22Z2, E{T1} = B11ζ10 = θ10, E{T2} = θ2,

where the elements of T are independent and normally distributed random variables, the first r having known means and the last m having unknown means. Having found the canonical form, we may solve any estimation problem by expressing the function to be estimated, L′η say, in terms of θ10 and θ2 and replacing θ2 by T2. Equivalently we may start from L′Y, transform to a linear combination of T1 and T2 and replace T1 by its true value θ10.

Let us first investigate the estimation of ζ2 using the inverse transformation of the one above. Setting

C = B⁻¹ = (C11, 0; C21, C22),

we find

Z2 = C21T1 + C22T2,

which gives

ζ^2 = C21θ10 + C22T2

as estimate of ζ2. Using Z2 − ζ^2 = C21(T1 − θ10), Thiele proves that

V{Z2i} = V{ζ^2i} + V{(Z2 − ζ^2)i},

where Z2i denotes the ith element of Z2. Hence, the variance of ζ^2i is smaller than the variance of Z2i.
The estimate of η is obtained from Y = Q1T1 + Q2T2, which leads to

η^ = Q1θ10 + Q2T2.

Hence, to compute η^ we need only find T1 and θ10. Note that A′1η^ = ζ10. From

η^ − η = Q2(T2 − θ2)

it follows that

V{η^i} = σ²(Q2Q′2)ii.

Since

Q1Q′1 + Q2Q′2 = I,

we have

V{η^i} = σ²{1 − (Q1Q′1)ii},

so that V{η^i} ≤ V{γi}. Considering the sum of squared residuals we get

∑(γi − η^i)² = (T1 − θ10)′(T1 − θ10),

and since the expectation of the right-hand side equals rσ², the estimate of σ² becomes

s² = (T1 − θ10)′(T1 − θ10)/(n − m).

For the sum of squared errors under the restriction A′1η = ζ10 we have

∑(γi − ηi)² = (T1 − θ10)′(T1 − θ10) + (T2 − θ2)′(T2 − θ2).

Hence, the minimum with respect to η is obtained for θ2 = T2, that is

min ∑(γi − ηi)² = (T1 − θ10)′(T1 − θ10),

which according to the result above gives η = η^. This property has led to the name ‘the method of least squares’. To test a hypothetical value of σ², we compare the quantity (n − m)s²/σ² with its expectation (n − m), taking the standard deviation √{2(n − m)} into account.
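In modern terms, Thiele's reduction to canonical form amounts to an orthogonal (QR) factorization of the design matrix. The following Python sketch (simulated data; numpy's QR routine stands in for Gauss's algorithm, and all names are ours) recovers the estimates and s² exactly as described above:

import numpy as np

rng = np.random.default_rng(2)
n, m = 12, 3
X = rng.standard_normal((n, m))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + 0.1 * rng.standard_normal(n)

# Orthogonal transformation to canonical form: the first m elements of
# T = Q'y have unknown means, the last n - m have known mean zero.
Q, R = np.linalg.qr(X, mode='complete')   # Q is an n x n orthogonal matrix
T = Q.T @ y
beta_hat = np.linalg.solve(R[:m], T[:m])  # estimates from the first m
s2 = np.sum(T[m:] ** 2) / (n - m)         # variance from the last n - m
print(beta_hat, s2)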
Thiele continually stresses the importance of the criticism of the model. He recommends three procedures: graphical analysis of the residuals, analysis of the variation of signs of the residuals, and comparison of the residuals with their standard deviations. For all three procedures he points out the importance of using rational subgroups of observations to detect systematic deviations from the assumptions. Of course, if T1 and θ10 have been computed, then the n − m differences T1 − θ10 should be investigated. However, usually the criticism is based on γi − η^i and the corresponding estimated variance, Ve{γi − η^i} say. For subgroups, Thiele uses the approximate test procedure of comparing ∑(γi − η^i)² with its estimated mean ∑Ve{γi − η^i} = M, say, using √(2M) as standard deviation. He used this form of criticism also in cases where the parameters had been estimated by other means than the method of least squares; see Thiele (1871) on the graduation of mortality data.

Finally we note the following special case of the general theory. Let Z1 = A′1Y and Z2 = A′2Y be independent and let E{Z1} = ζ10 be known. The least squares estimate of ζ2 = A′2η is then Z2. This result follows from the assumption that A′1A2 = 0, which leads to C21 = 0, so that ζ^2 = Z2 and V{ζ^2} = V{Z2}.

Adjustment by elements. Let us partition the n equations η = Xβ into the first n − m and the last m as indicated by

η1 = X1β, η2 = X2β,
assuming that the rank of the (m × m) matrix X2 equals m. Eliminating β we get β = X2⁻¹η2 and

η1 − X1X2⁻¹η2 = 0, that is A′1η = 0 with A′1 = (I, −X1X2⁻¹). Setting

Z1 = A′1Y,
where Y has been partitioned analogously to η, we find E{Z1} = A′1η = 0. Hence, the estimation problem may be solved by means of the methods given in the previous section, putting ζ10 = 0. Noting that A′1X = 0, it follows that the simplest choice of Z2 is Z2 = X′Y, which makes Z1 and Z2 independent. The least squares estimate of ζ2 = E{Z2} = X′Xβ is then Z2, so that the equation for β^ becomes X′Xβ^ = X′Y, that is, the normal equation. Using Gauss's algorithm we multiply the equation by a lower-triangular matrix, G say, which leads to GX′Xβ^ = GX′Y, where GX′XG′ = D = diag(d1, …, dm). (A detailed discussion of the Gaussian algorithm in matrix notation has been given by Henry Jensen (1944).) Thiele observes that GX′Y = T, say, represents m independent functions and that V{ti} = σ²di.
If we introduce λ as a new parameter by the transformation β = G′λ, the reduced normal equation becomes Dλ^ = T, so that the elements of λ^ are independent and V{λ^i} = σ²/di. Transforming backwards, Thiele finds the properties of η^ = Xβ^ = XG′λ^ and of P′β^ = P′G′λ^ by means of the properties of λ^. Based on the equation

E{diλ^i²/σ²} = diλi²/σ² + 1,
Thiele remarks that if η is a series with an unknown number of terms, then an orthogonal transformation makes it possible to judge the importance of each term (i.e. test the significance of each new coefficient included) by comparing diλ^i²/σ² with its expectation, which equals 1 when λi = 0. He illustrates this with an example using orthogonal polynomials.

The singular case. The method of fictitious observations. Let the rank of X be less than m, so that there is no unique solution of the normal equations. Thiele points out that a practical method for handling this case is to introduce fictitious observations so as to make the solution determinate. Let the rank of X be m − 1. Thiele introduces the enlarged model, in which the observation vector (Y′, z)′ has mean ((Xβ)′, c′β)′,
where z denotes a fictitious observation with mean c′β and variance σ², the row vector c′ being linearly independent of the rows of X. The normal equation for the enlarged model is

(X′X + cc′)β^0 = X′Y + cz.
We may then ask under what conditions the unique solution β^0 will satisfy the original normal equation X′Xβ^ = X′Y. Eliminating X′Y we find

X′Xβ^0 − X′Xβ^ = c(z − c′β^0),
so that X′Xβ^0 = X′Xβ^ if and only if c′β^0 = z, which is Thiele's condition for using fictitious observations (1889, p. 94). The estimate β^0 obviously depends on the fictitious observation z. Hence, we are able to estimate only linear functions P′β for which P′β^0 does not depend on z, that is, what today are called estimable functions. Thiele uses this method in the two-way analysis of variance. The method used today is to introduce identifiability constraints, which may be considered as a special case of Thiele's method.
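A minimal numerical sketch of the method (a hypothetical rank-deficient one-way layout; the numbers are invented) shows that the solution of the enlarged normal equations depends on z while an estimable contrast does not:

import numpy as np

# Rank-deficient design: mean mu + alpha_1 or mu + alpha_2 (rank 2, 3 params).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([3.0, 3.2, 5.1, 4.9])

# Fictitious observation z with coefficient row c' = (0, 1, 1) makes the
# enlarged normal equations (X'X + cc') beta = X'y + cz non-singular.
c = np.array([0.0, 1.0, 1.0])
for z in (0.0, 10.0):
    b = np.linalg.solve(X.T @ X + np.outer(c, c), X.T @ y + c * z)
    # The contrast alpha_1 - alpha_2 is estimable: it does not depend on z.
    print(z, b, b[1] - b[2])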
7 Analysis of variance

One-way classification. This problem is discussed only in the 1897 edition; see pp. 41–44. Thiele formulates very clearly the basic ideas of the analysis of variance and carries out such an analysis for an example with 20 groups and 25 observations per group. He first finds the means and variances within groups, x̄i and s²i, and the average variance within groups, s²w say. As a preliminary investigation he computes the 20 standardized deviations (x̄i − x̄)√25/sw, and remarks that the variation is not significant because all these values lie between −2.1 and +2.1. He then adds the following important remark: ‘The most efficient test for the hypothesis is obtained by comparison of the variance between groups and the variance within groups since any systematic variation in the true means will increase the variance between groups’. A sketch of this comparison in modern form is given below.
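The sketch uses simulated data with 20 groups of 25 and equal true means, as in Thiele's example; the variable names are ours:

import numpy as np

rng = np.random.default_rng(3)
k, n = 20, 25
x = rng.normal(10.0, 2.0, size=(k, n))

group_means = x.mean(axis=1)
s2_w = x.var(axis=1, ddof=1).mean()  # average variance within groups
s2_b = group_means.var(ddof=1)       # variance between group means

# Under the hypothesis of equal true means, E{s2_b} = sigma^2 / n, so we
# compare d = s2_b - s2_w/n with zero, estimating the standard deviation
# of s2_b from (4.4) with the fourth cumulant set to zero.
d = s2_b - s2_w / n
sd_s2b = s2_b * np.sqrt(2.0 / (k - 1))
print(d, sd_s2b, abs(d) < 2 * sd_s2b)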
In the example he computes the difference between the variance between groups, s²b say, and the (suitably scaled) average variance within groups, and concludes that this difference is insignificant because it is smaller than the standard deviation of s²b, which he estimates by means of (4.4) setting k4 = 0. Of course, he should have estimated the standard deviation of the difference itself, but he has presumably not found it worth while to do so. Having carried out an analysis of the variation of the means, he gives an analogous analysis of the variances within groups to test the hypothesis that the true variances are equal.

Two-way classification without interaction. Thiele treats this model as a special case of the general linear model and derives estimates of the parameters in the form used today. His starting point (1889, pp. 96–99) is the following problem. Consider the observation of the passage times for k stars over m parallel threads. The true values of the observations may then be written as
where hi denotes a known velocity, and the Y's are supposed to be independent and normally distributed with V{γij} = σ2qi, where qi is known. For simplicity we shall set qi = 1 in the following. Corresponding to the special structure of the design matrix Thiele develops a computational technique for the construction of the normal equations, called the method of partial eliminations, by which he gets the unknowns separated. Furthermore, he adds a fictitious observation z to make the solution determinate.
He then finds the estimates
Thiele remarks that αi and βj are not estimable, as the estimates depend on the fictitious observation z, but that differences such as hiαi - hναν and βj - βμ are estimable. Generalizing, he adds that contrasts (as they are called today) are estimable and gives the following estimates:
where V denotes the variance of the estimate. He also gives the estimate η^ij and its variance, finds the variance of γij - η^ij and proves that
which leads to the usual estimates of σ². For hi = 1 the formulae simplify to the standard ones used today. In 1897 and 1903 he only discusses the simple model with ηij = αi + βj and the same variance for all γij (1903, pp. 100–103). He gives an example with 11 measurements of the abscissae of 3 points on a line, the position of the scale being different for each of the 11 measurements. Moreover, in 6 of the 11 cases one of the measurements is lacking. He derives the normal equations for this two-way analysis of variance with missing observations and expresses the solution in terms of a fictitious observation as above. Finally, he computes η^ij and s², and uses the ratios (γij − η^ij)²/Ve{γij − η^ij}, where Ve denotes the estimated variance, and sums of these over rational subgroups for the criticism of the model.
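As a rough illustration of the two-way analysis with a missing observation, the following sketch (invented data, 3 rows and 4 columns with one cell missing) fits the additive model by least squares, using a fictitious observation as in the sketch above to make the normal equations determinate:

import numpy as np

rows, cols = 3, 4
pairs = [(i, j) for i in range(rows) for j in range(cols) if (i, j) != (2, 3)]
rng = np.random.default_rng(4)
y = np.array([1.0 + 0.5 * i - 0.3 * j + 0.05 * rng.standard_normal()
              for i, j in pairs])

# Additive design matrix for eta_ij = alpha_i + beta_j; the model is rank
# deficient, so we add a fictitious observation z = 0 on sum(beta).
X = np.zeros((len(pairs), rows + cols))
for r, (i, j) in enumerate(pairs):
    X[r, i] = 1.0
    X[r, rows + j] = 1.0
c = np.concatenate([np.zeros(rows), np.ones(cols)])
b = np.linalg.solve(X.T @ X + np.outer(c, c), X.T @ y)

eta_hat = X @ b
s2 = np.sum((y - eta_hat) ** 2) / (len(pairs) - (rows + cols - 1))
print(np.round(b, 3), s2)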
8 A time series model combining Brownian motion and the linear model with normally distributed errors

Thiele's (1880) first paper on the method of least squares is an impressive example of his ability to formulate a completely new mathematical model for an experimental situation, to solve the mathematical problems involved in the estimation of
the parameters, to analyse the data, and to criticize the model and the experimental set-up in view of the outcome of the statistical analysis. He must himself have considered the paper an essential new contribution, since he published a French version the same year from a Danish publishing house. Helmert (1907) gave a summary of Thiele's main results. A discussion of the paper will be given by S. Lauritzen (1981). Here we shall only indicate the model and the main results.

Consider a stochastic process (Brownian motion) which at times t0, t1, …, tn takes on the values z0, z1, …, zn, and assume that zᵢ₊₁ − zi (i = 0, 1, …, n − 1) are normally distributed with mean 0 and variance σi² = σ²(tᵢ₊₁ − ti), say. Suppose further that zi is unobservable and that the observation γi for a given value of zi is normally distributed with mean zi and variance ω². The problem is to estimate the z's, σ² and ω². This is Thiele's (1880) formulation of a model which he used to describe the ‘quasisystematic’ variations of an instrument-constant.

Starting from preliminary values of σ² and ω², Thiele solves the problem of estimating the z's by minimizing

(8.1)  ∑ᵢ₌₀ⁿ (γi − zi)²/ω² + ∑ᵢ₌₀ⁿ⁻¹ (zᵢ₊₁ − zi)²/σi².
This leads to the normal equations for the determination of the estimates, z^i. Also V{z^i} and V{z^i + 1 - z^i} are found. Noting that the estimate of σ2 in the linear model equals
Thiele proposes to use
as estimate of ω2, and
as estimate of σ². The problem is then solved by iteration. Finally Thiele extends the model by letting the mean of γi for given zi be a linear function of x1, …, xm.

Thiele and Gram also contributed to the theory of graduation (smoothing) by means of moving weighted averages; see Gram (1879, p. 112). Looking at the problem in the present section from the point of view of graduation, the z^i's are graduated values of the γi's, and Thiele's minimization of (8.1) is a special case of what today is known as the Whittaker–Henderson method.
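In modern terminology Thiele's estimation of the z's is a smoothing problem (indeed a precursor of Kalman smoothing). A minimal sketch, assuming equal time steps and known ω² and σ² (so no iteration; all names are ours), solves the tridiagonal normal equations arising from (8.1):

import numpy as np
from scipy.linalg import solve_banded

def thiele_smooth(y, omega2, sigma2):
    """Minimize sum (y_i - z_i)^2/omega^2 + sum (z_{i+1} - z_i)^2/sigma^2
    over z; the normal equations form a tridiagonal system."""
    n = len(y)
    lam = omega2 / sigma2
    ab = np.zeros((3, n))              # banded storage for solve_banded
    ab[1, :] = 1.0 + 2.0 * lam         # diagonal
    ab[1, 0] = ab[1, -1] = 1.0 + lam   # end points have one neighbour
    ab[0, 1:] = -lam                   # superdiagonal
    ab[2, :-1] = -lam                  # subdiagonal
    return solve_banded((1, 1), ab, y)

rng = np.random.default_rng(5)
z_true = np.cumsum(0.3 * rng.standard_normal(100))  # Brownian-motion path
y = z_true + 0.5 * rng.standard_normal(100)         # noisy observations
z_hat = thiele_smooth(y, omega2=0.25, sigma2=0.09)
print(np.mean((z_hat - z_true) ** 2) < np.mean((y - z_true) ** 2))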
Acknowledgements My thanks are due to Sven Danø for information on the Thiele family and to Jan M. Hoem, Søren Johansen and Steffen Lauritzen for discussions on the manuscript.
Résumé

The background of the works of Thiele (1838–1910) is outlined. He contributed to the theory of skew distributions, defined the cumulants and studied their properties. He invented the canonical form of the linear model, and reduced the general linear model to canonical form by means of an orthogonal transformation, thereby creating a new justification of the method of least squares. He also carried out one-way and two-way analyses of variance. He extended the linear model by the addition of a Brownian motion, and developed the theory of estimation and prediction for this time series.
CHAPTER SIX

On the halfinvariants in the theory of observations 59
Presentation

In this little paper (a reprint of part of Hald (2000a)), read before the Royal Danish Academy of Sciences and Letters, Thiele concludes the mathematical theory of his halfinvariants. This paper also concludes our selection of translations of Thiele's original work. S.L.L.

Fig. 1 ‘Et Møde i Videnskabernes Selskab’ (A meeting of the Royal Danish Academy of Sciences and Letters), painting by P. S. Krøyer, 1897. Thiele is seated in the foreground with his left hand under his cheek and right hand on his knee. Reproduced from Lomholt (1954), who has written about this painting.
59 Reprinted with permission from International Statistical Review.
I have in several works on the theory of observations shown that many and great advantages are obtained by expressing the laws of error by certain symmetric functions—the halfinvariants—of the repeated observations. But I have not failed to notice that my use of these functions also had its drawbacks. In particular, the definition I had given of the halfinvariants was imperfect, because it was both indirect and complicated; indirect because it demanded another system of symmetric functions, the sums of powers, inserted as intermediate between the observations and the halfinvariants, and complicated because only in the form of recurrent equations was it simple enough to be remembered and used in proofs. It was a no lesser defect that a relation between the halfinvariants and the frequency functions was lacking. It must be recognized that the frequency functions are the most direct and in certain cases the most advantageous expression for the law of error, and previously they have nearly always been employed as if the frequency function was the only possible mathematical expression for laws of error. These defects I can now remedy.

The relation between the halfinvariants μr and the sums of powers sr can be written as the following identical equation in the variable z:

(1)  μ0 + μ1z + μ2z²/2! + μ3z³/3! + ⋯ = log(s0 + s1z + s2z²/2! + s3z³/3! + ⋯),
which means that the μ-series equals the logarithm of the s-series. In this identity the right-hand side may be resolved into a sum in which every term depends only on one of the repeated observations oi; hence we have

(2)  μ0 + μ1z + μ2z²/2! + ⋯ = log{exp(zo1) + exp(zo2) + ⋯},
which is excellently suited as a definition of the halfinvariants μr. From the identity (1) my previous systems of equations may be derived by the method of undetermined coefficients. Comparing directly the coefficients of zⁱ, the explicit equations for si will result:

s0 = exp(μ0),
s1 = exp(μ0) μ1,
s2 = exp(μ0) (μ2 + μ1²),
s3 = exp(μ0) (μ3 + 3μ1μ2 + μ1³),
…
Taking the logarithms of both sides of the identity (1), the explicit equations for the μi will result:
Finally, differentiating the identity (1) with respect to z before comparing coefficients of equal powers of z, we find
resulting in the recursive system of equations
whose relative simplicity and analogy to the binomial formulae allowed me, in my previous lack of anything better, to use it as the definition of the halfinvariants. Since both sr and μr are constants for a law of error it is tempting occasionally to use the complete arbitrariness of z to interpret this variable as a symbol of operation; the only danger is that it is impossible a priori to secure the convergence of an infinite series in powers of an operator. But otherwise the symbol of differentiation D is to be recommended for operations on laws of error because of the definition (2), since, as is well known, exp(aD)f(x) = f(x + a); but in order to produce not sums but differences between x and the observed numbers oi it will be practical to set z = −D instead of z = D in our definition. In particular we shall apply the symbols, which are identical according to equation (2), to operations on frequency functions, trying to derive one function from another. As a first example we shall naturally choose the general typical or exponential frequency function exp(−(x − m)²/2n²)/(n√(2π)) with mean m and standard error n. Since this law of error is continuous we have to assume about the values oi in (2) that their number s0 is infinite and that the frequency of a value between o and o + do is Φ(o)do, where also the frequency function Φ(o) must be assumed continuous.
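Definition (2) can be checked mechanically. The following is a minimal sketch (my own illustration, not part of Thiele's paper), using SymPy and a small set of hypothetical observations: r! times the coefficient of z^r in the logarithm of the s-series recovers the halfinvariants, with μ1 the mean and μ2 the mean square deviation.

```python
# Sketch of definition (2): mu_r is r! times the coefficient of z^r in
# log( (exp(o_1 z) + ... + exp(o_n z)) / s_0 ).  Observations are hypothetical.
import sympy as sp

z = sp.symbols('z')
obs = [1, 2, 2, 3, 7]          # hypothetical repeated observations o_i
s0 = len(obs)

s_series = sum(sp.exp(o * z) for o in obs) / s0      # the s-series divided by s0
mu_series = sp.log(s_series).series(z, 0, 5).removeO()
mu = [sp.factorial(r) * mu_series.coeff(z, r) for r in range(1, 5)]

mean = sp.Rational(sum(obs), s0)
var = sum((o - mean)**2 for o in obs) / s0
assert sp.simplify(mu[0] - mean) == 0    # mu_1 is the mean
assert sp.simplify(mu[1] - var) == 0     # mu_2 is the mean square deviation
print(mu)
```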
From (2) follows the symbolic equation
Furthermore we shall assume that Φ(o) = Φ(x − m + (o + m − x)) can be developed in powers of o + m − x according to Taylor's series. Inserting this in the right-hand side of the equation, from which the symbols temporarily had disappeared, we again introduce the differentiation symbol D, which is identical with the previous one because m is constant. Hence
so (3)
It is thus possible to get such a general law of error as that given by the halfinvariants μr symbolically derived from the general typical. In particular, if we dispose of the constants such that m = μ1, n² = μ2, that is, neither the mean nor the standard error is changed, then we have

(4) f(x) = exp(μ3(−D)³/3! + μ4(−D)⁴/4! + ⋯) exp(−(x − m)²/2n²)/(n√(2π))
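To make (4) concrete, here is a small numerical check (my own sketch, not Thiele's): truncating the operator after its first correction term for r = 3 gives (1 + (μ3/3!)(x³ − 3x))φ(x) for the standardized typical function φ, since (−D)³φ = (x³ − 3x)φ. Numerical integration shows that the mean and standard error are unchanged while the third halfinvariant equals μ3; the value μ3 = 0.2 is an arbitrary choice.

```python
# Truncation of (4) after the first correction term, for the standardized
# typical function phi (mean 0, standard error 1); mu3 = 0.2 is arbitrary.
import numpy as np

mu3 = 0.2
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g = (1 + mu3 / 6 * (x**3 - 3 * x)) * phi   # first-order form of (4)

m1 = (x * g).sum() * dx
m2 = ((x - m1)**2 * g).sum() * dx
m3 = ((x - m1)**3 * g).sum() * dx
print(m1, m2, m3)   # approximately 0, 1 and 0.2
```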
Accordingly it should be possible to derive any law of error from anyone else with the same mean and standard error by the operation (5)
of course under the assumption that Φ(x) is not a function such that any of the necessary expansions fail to be convergent. To investigate whether the derivation also may be extended to changes of the mean and the standard error we shall treat the case of the typical laws of error. From (3) it follows, setting μ3 = μ4 = ⋯ = 0, that
(6) exp(½bD²) exp(−(x − m)²/2n²)/(n√(2π)) = exp(−(x − m)²/2(n² + b))/(√(n² + b)√(2π))
To prove this theorem we have to study the effect of the symbol exp(½bD²). For an arbitrary function f(y) we obtain
but since
we get
if the Taylor expansion can be used on f(x). Choosing particularly
we find
The operation given by the symbol exp(½bD²) applied to the frequency function thus results in an addition of b to the mean square error. This corroborates (6).
It has to be remembered, however, that it is essential that the standard error is real for the typical law of error. Hence, if the addition of b to the mean square error should cause a change of sign the validity of the formula ceases. Also the limiting case that the standard error may become equal to zero ought to be excluded. Observations without error do not have a proper law of error. Closer investigations cause greater doubts regarding the general formulae (3), (4), and (5). Even if they may be useful in special cases it is easy to show that they will not always lead to convergent series. In particular, one could consider using them for elucidating the importance of individual halfinvariants in relation to the law of error by deriving frequency functions for such laws of error that deviate from the typical form by having only one of the higher halfinvariants different from zero. A law of error where μ1 = m = 0, μ2 = n² = 1, μ3 = ⋯ = μr−1 = 0, μr+1 = μr+2 = ⋯ = 0, so that only μr differs from zero, would according to (4) have the frequency function obtained by applying exp(μr(−D)^r/r!) to the typical frequency function;
but this series is not convergent for r > 2, which is particularly apparent when after the differentiations x is put equal to zero; for example, both for r = 3 and for r = 4 the terms of the resulting series at x = 0 grow without bound.
It will be seen that one cannot freely combine arbitrarily chosen values of the halfinvariants μ3, μ4, … to get a usable law of error.
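Thiele's divergence claim is easy to verify numerically. The sketch below (my illustration, under the paper's setup) evaluates the terms of the formal series for the case r = 4, μ4 = 1 at x = 0, using the fact that the even derivatives of the typical function at its mean are proportional to (−1)^m(2m − 1)!!; the terms shrink at first but eventually grow without bound.

```python
# Terms of the formal series exp((mu4/4!)(-D)^4) phi at x = 0, for mu4 = 1.
# The k-th term is (mu4/24)^k He_{4k}(0) / k!, with He_{2m}(0) = (-1)^m (2m-1)!!.
from math import factorial

def double_factorial(n):
    out = 1
    while n > 1:
        out, n = out * n, n - 2
    return out

mu4 = 1.0
for k in range(1, 11):
    he = (-1)**(2 * k) * double_factorial(4 * k - 1)   # He_{4k}(0)
    term = (mu4 / 24)**k * he / factorial(k)
    print(k, term)   # the terms decrease at first but then grow: divergence
```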
CHAPTER SEVEN The early history of cumulants and the Gram–Charlier series
Presentation This paper, a reprint of part of Hald (2000a), contains a detailed discussion of Thiele's halfinvariants and their use in series expansions. S.L.L. Fig. 1 Portrait of Thiele at the age of 70. Painting by Herman Vedel. Reproduced from a photograph made for Hafnia's celebration of Thiele's 100th birthday, 1938.
Reprinted with permission from International Statistical Review.
1 Introduction The normal distribution was introduced as an error distribution by Gauss in 1809 and as a large-sample distribution of the arithmetic mean by Laplace in 1810. It was not long before a generalization was carried out by considering the normal distribution as the first term of a series expansion, later known as the Gram–Charlier series. We shall discuss the historical development of this series from three different points of view. In section 2 we describe how Poisson, Bessel and Bienaymé generalized the Laplacean central limit theorem by including more terms in the expansion of the logarithm of the characteristic function. By means of the inversion formula they found an expansion, the Gram–Charlier series, for the density of a sum of independently and identically distributed random variables with the normal density as the leading term. The following terms contain Hermite polynomials multiplied by complicated moment coefficients derived from the underlying distribution, which they determined up to the moments of the sixth order. These coefficients were simplified by Thiele, who expressed them in terms of the cumulants and derived a recursion formula for them. In section 3 we discuss Chebyshev's least squares fitting of a polynomial to the observed values of a function by means of orthogonal polynomials. He pointed out the advantages of the successive determination of the coefficients and the residual sum of squares. Generalizing this method Chebyshev found the least squares approximation to an arbitrary integrable function by means of an infinite series of orthogonal polynomials, and choosing the normal density as weight function the Hermite polynomials were introduced. Applying this method to a continuous density the Gram–Charlier series follows. The generalized central limit theorem discussed in section 2 may thus be considered as the least squares approximation to a continuous density. In section 4 we explain how Thiele and Gram introduced the Gram–Charlier series from a completely different point of view. Realizing that the normal distribution was unsatisfactory for describing economic and demographic data, they proposed to multiply the normal density by a power series and determine the coefficients by least squares, which led to the Gram–Charlier series and thus a new system of skew distributions. Thiele pointed out the one-to-one relationship between n observations and the first n symmetric functions and stated that the cumulants are the simplest for describing the data. He stressed that the first four cumulants and the corresponding
terms of the Gram–Charlier series will often give a satisfactory characterization of a distribution. As mentioned above he expressed the coefficients of the Gram–Charlier series in terms of the cumulants. In the first instance Thiele (1889) defined the cumulants recursively in terms of the moments; we shall suggest how he may have arrived at this formula. In a little known paper (1899) he gave the modern definition of the cumulants as the coefficients in the power series for the logarithm of the moment generating function; he did not mention the similar expansion of the logarithm of the characteristic function in the derivation of the central limit theorem. In this paper he also introduced the operator M(−D), where M(t) = E(e^{xt}) and D denotes differentiation, and derived the Gram–Charlier series by applying this operator to the normal distribution. Finally, he showed that the operator exp[αr(−D)^r/r!] applied to a density with cumulant κr increases κr by αr. By means of a product of such operators he thus transformed a density with given cumulants to another with specified cumulants. We have translated Thiele's 1899 paper from Danish into English in the Appendix (Chapter 6 of this volume). In the early proofs the authors tacitly assumed that all moments are finite and that the moment generating function exists; only Thiele and Gram discussed problems of convergence. In accordance with common usage in statistics we assume that the Hermite polynomials Hr(x) are defined by differentiation of exp(−x²/2) and that the reader is familiar with the properties of these polynomials.
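For readers who want the convention stated explicitly, the following is a small sketch of my own (not from the paper) generating the polynomials from the stated definition by differentiation of exp(−x²/2):

```python
# The Hermite polynomials as used here: (d/dx)^r exp(-x^2/2) = (-1)^r H_r(x) exp(-x^2/2).
import sympy as sp

x = sp.symbols('x')
for r in range(1, 6):
    hr = sp.expand(sp.simplify((-1)**r * sp.exp(x**2 / 2)
                               * sp.diff(sp.exp(-x**2 / 2), x, r)))
    print(r, hr)
# H_3(x) = x**3 - 3*x, H_4(x) = x**4 - 6*x**2 + 3, ...
```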
2 The central limit theorem, the moments, and the Gram–Charlier series Let x be a random variable with finite moments μ′r and μr, r = 0, 1, 2, …, cumulants κr, moment generating function M(t) = E(e^{xt}), characteristic function ψ(t) = E(e^{ixt}), and cumulant generating function κ(t) = ln ψ(t). The following discussion is based on the expansion

(2.1) ln M(t) = κ1t + κ2t²/2! + κ3t³/3! + ⋯
and the corresponding power series for ln ψ(t). Laplace uses the characteristic function and its logarithm in his proofs of the central limit theorem. For an arbitrary distribution he (1810, Art. VI; 1812, II, §22) finds
and for a symmetric distribution with zero mean he (1811, Art. VIII; 1812, II, §20) obtains
where we have introduced the notation μ′r instead of Laplace's kr/k. These expansions were studied in more detail by Poisson (1829, pp. 8–9; 1837, p. 269), who gives his result as
where
The expressions for h and g in terms of μ′r are due to Poisson; he leaves the derivation of l to the reader. We have added the corresponding formulas in terms of μr and κr. Referring to Laplace, but not to Poisson, Bienaymé (1852, p. 45) gives a simpler proof of Poisson's result, including the expression for l. Bessel (1838) refers to Laplace and Poisson and extends Laplace's expansion for a symmetric distribution by including the term
It follows from Laplace's definition of the characteristic function that the characteristic function for a linear combination of independent random variables equals the product of the characteristic functions for the components, and hence that the logarithm equals the sum of the logarithms. This property is used by Poisson, Bessel and Bienaymé in their proofs. As pointed out by Bru (1991) in his discussion of Bienaymé's proof this implies that

(2.2) κr(x1 + ⋯ + xn) = κr(x1) + ⋯ + κr(xn).
Let sn = x1 + ⋯ + xn be the sum of n independently and identically distributed variables with a continuous density. We shall sketch a proof of the central limit theorem based on a combination of the methods used by the four authors mentioned above. The basic idea and technique are due to Laplace. For brevity we shall
introduce the cumulants in the expansion of ln ψ(t) instead of the coefficients expressed in terms of the moments about zero, that is, we write

(2.3) ln ψ(t) = κ1(it) + κ2(it)²/2! + κ3(it)³/3! + ⋯
Like Bessel, we shall stop at the sixth term. The density of sn is found from the inversion formula
where
which follows from the expansion of n ln ψ(t). The main term becomes (2.4)
where
and φ(u) denotes the standardized normal density. To evaluate the following terms the authors mentioned differentiate the two sides of (2.4) with respect to sn with the result that
where we have denoted the polynomials by Hr since they are equal to the Hermite polynomials. By means of this result the terms involving R(t) are easily found. Setting (2.5) the expansion may be written as (2.6)
which today is known as the Gram–Charlier series.
The first term is due to Laplace, who also found the third term for a uniformly distributed variable. The general form of the second term is given by Poisson. For a symmetric distribution Bessel derives the third and the fifth term. Bienaymé derives the first three terms, to which he adds one term more because it is of the same order in n as the third term. Ordering the terms according to powers of n^(−1/2), the resulting series is called the Edgeworth series. One may wonder why the synthesis above does not occur in the early literature on the central limit theorem. The reason may be that the second term, the correction for skewness, was deemed sufficient for practical applications. Laplace developed his large-sample estimation and testing theory using the asymptotic normality of the linear estimates involved. Poisson took the correction for skewness into account in some of his examples but found that the effect was negligible with the sample sizes at hand. As noted by Molina (1930), Laplace (1811, Art. V; 1812, II, §17) derives the complete Gram–Charlier expansion in his discussion of a diffusion problem, using orthogonal polynomials proportional to the Hermite polynomials with a rescaled argument. A more detailed discussion of the early history of the central limit theorem is given by Hald (1998, Chapter 17). In the early proofs of the central limit theorem the cumulants thus implicitly occur. However, nobody thought of defining the coefficients in question as separate entities and of studying their properties. This had to wait for Thiele (1889).
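As a small numerical illustration of the synthesis described above (my own sketch, not in the original paper): for a sum of n uniform(0,1) variables the uniform cumulants κ2 = 1/12, κ3 = 0, κ4 = −1/120 give a standardized fourth cumulant of −1.2/n, and keeping the first two non-vanishing terms of (2.6) already matches the exact Irwin–Hall density closely at n = 4.

```python
# First two non-vanishing terms of (2.6) for a sum of n iid uniform(0,1)
# variables, compared with the exact (Irwin-Hall) density; n = 4 for illustration.
from math import comb, factorial, sqrt, pi, exp

def irwin_hall_pdf(s, n):
    return sum((-1)**k * comb(n, k) * max(s - k, 0.0)**(n - 1)
               for k in range(n + 1)) / factorial(n - 1)

n = 4
mean, sd = n / 2, sqrt(n / 12)
g4 = -1.2 / n                                  # standardized fourth cumulant
for u in (0.0, 0.5, 1.0, 2.0):
    he4 = u**4 - 6 * u**2 + 3                  # Hermite polynomial H_4(u)
    phi = exp(-u**2 / 2) / sqrt(2 * pi)
    approx = phi * (1 + g4 / 24 * he4) / sd    # density of s_n at mean + u*sd
    exact = irwin_hall_pdf(mean + u * sd, n)
    print(u, round(approx, 5), round(exact, 5))
```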
3 Least squares approximation by orthogonal polynomials The problem of fitting a polynomial of a given degree, m say, to n + 1 observations (x0, y0), (x1, y1), …, (xn, yn), m < n, where (3.1)
E(εi) = 0 and var(εi) = σ²/wi, wi being a known positive number, is a special case of the Gaussian linear minimum variance estimation by the method of least squares. In a series of papers between 1835 and 1853 Cauchy extends the linear model by considering the case where the number of terms is unknown; in particular he discusses the polynomial model with unknown m. Rewriting (3.1) in the form (3.2)
where hr(x), r = 0, 1, 2, …, is a suitably chosen polynomial of degree r, he shows how the true value of y may be estimated stepwise such that the inclusion of one or more terms does not change the coefficients of the previous terms. After adding a new term the residuals are calculated and used for judging the goodness of fit and thus for deciding when to stop the procedure. Cauchy does not use the method of least squares so his polynomials are nonorthogonal. Bienaymé criticizes Cauchy for not using the method of least squares but does not himself work out the corresponding theory. This was done by Chebyshev, who did not know Cauchy's work when he wrote his first paper (1855). However, in his second paper (1859a) he refers to Cauchy and mentions that the regression coefficients are determined successively and independently of one another. A survey of Cauchy's work may be found in Hald (1998, Chapter 24). Let there be given n + 1 observations f(x0), f(x1), …, f(xn) and let the reciprocal of the variance of f(xi) be w(xi) > 0. By the method of least squares Chebyshev (1855) fits a polynomial of degree m to the n + 1 observations, m < n, using w(xi) as weight. We shall write the model as (3.3)
By means of the theory of continued fractions he proves that there exists a set of m + 1 orthogonal polynomials [hr(x)] satisfying the relations
hr being of degree r, and that the least squares approximation may be written as
with
where ar, the estimate of αr, depends on hr only. The polynomials may be normed in any convenient way, in particular they may be made orthonormal. In the next paper Chebyshev (1859a) proves that hr(x) may be found as the denominator of the r-th convergent in the continued fraction for the function
and finds a recursion formula for hr in terms of hr - 1 and hr - 2. He notes that the minimum sum of squares equals
which may be used to decide when to stop. Referring to Chebyshev (1855), Hermite (1859) gives a simpler derivation of hr(x). Neither Chebyshev nor Hermite observes that Laplace's (1816) orthogonalization of the equations of condition for the linear model provides a formula from which the orthogonal polynomials may be found by recursion. Inspired by Fourier's representation of an arbitrary function as an infinite trigonometric series and by his own results above Chebyshev (1859b) presents a method for approximating an arbitrary integrable function f(x) by an infinite series in terms of orthogonal polynomials. Briefly told, he replaces the sums above by integrals, assuming that x and w(x) are continuous and that the integrals involved are finite; f(x) is defined on a finite or an infinite interval, we shell leave out the limits of integration. Hence, (3.4)
and (3.5)
The polynomial hr(x) is found as the denominator for the r-th convergent in the continued fraction for the function
By suitable choices of w(x) Chebyshev obtains the series that today are named after Maclaurin, Fourier, Legendre, Laguerre, and Hermite; we shall only discuss the last one. Choosing as weight function
Chebyshev finds that
Hence, in our notation Chebyshev's results may be written as (3.6)
and
Chebyshev remarks that for ω → 0 the series tends to Maclaurin's series since Hr(u)ω^r → x^r. Chebyshev does not discuss further applications of the series. Hermite (1864) begins by defining the polynomials Ur(x) by the equation
He discusses the properties of these polynomials and the corresponding infinite series, remarking that it follows from Bienaymé's note to the translation of Chebyshev's 1855 paper that the series belongs to the wide category of expansions which give interpolation formulas derived by the method of least squares. Most of Hermite's paper is taken up with generalizations to two and more variables. Gnedenko & Kolmogorov (1954) call the polynomials Hr(x) Chebyshev–Hermite polynomials; presumably they did not know that the polynomials previously had been used by Laplace, as noted by Molina (1930) and Uspensky (1937, p. 72). To see the connection with the cumulants and the Gram–Charlier series we shall consider the expansion of a frequency function written in the form g(x) = φ(x)f(x). Using Chebyshev's formula (3.6) for ω = 1 we get
Since Hr(x) is a polynomial of degree r, E[Hr(x)] is a linear combination of the moments of g(x). For a standardized variable we get (3.7)
This is the Gram–Charlier series for g(x) in the form given by Thiele (1889). It is equal to (2.6) for n = 1 and κ2 = 1. Chebyshev did not notice this coincidence in 1859. However, in 1887 he derives the main terms of the central limit theorem and gives limits for the remainder using a new method of proof, the method of moments. At the end of this paper he briefly states that p(sn) may be expanded in an infinite series by the method given in his 1859b paper and indicates the result in the form (2.6) but leaves it to the reader to find the coefficients in terms of the moments. Hence, the series (2.6) may be interpreted as the least squares approximation to p(sn) = φ(u)f(u).
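A quick numerical check of this least squares interpretation (my own sketch, not from the paper): by the orthogonality of the Hermite polynomials with weight φ, the coefficient cr in the expansion of g(x) = φ(x)f(x) is E_g[Hr(x)]. For a standardized exponential variable (x = y − 1 with y exponential of rate 1) this gives c3 = κ3 = 2 and c4 = κ4 = 6, which the quadrature below reproduces.

```python
# Coefficients c_r = E_g[H_r(x)] for a standardized exponential variable,
# computed by simple quadrature; c_3 and c_4 should equal kappa_3 = 2, kappa_4 = 6.
import numpy as np

y = np.linspace(0.0, 60.0, 600001)
dy = y[1] - y[0]
g = np.exp(-y)              # exponential density; mean 1, variance 1
x = y - 1.0                 # standardized variable

c3 = ((x**3 - 3 * x) * g).sum() * dy          # H_3(x) = x^3 - 3x
c4 = ((x**4 - 6 * x**2 + 3) * g).sum() * dy   # H_4(x) = x^4 - 6x^2 + 3
print(round(c3, 3), round(c4, 3))             # approximately 2.0 and 6.0
```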
4 Thiele's halfinvariants, their operational properties, and the Gram–Charlier series A survey of Thiele's statistical work is given by Hald (1981). Here we present some supplementary remarks on Thiele's invention and use of the cumulants and the Gram–Charlier series. Thiele named his new symmetric functions halfinvariants. For linguistic reasons some later authors called them semiinvariants. Fisher (1929) introduced the term cumulative moment function instead of semi-invariant, and Fisher & Wishart (1931) abbreviated this to cumulant. In a personal communication S. M. Stigler has pointed out to me that Hotelling (1933) claims priority for the term cumulant. We shall transcribe Thiele's formulas to modern notation using hr and κr for the empirical and theoretical cumulants, respectively, following Fisher's notation. Thiele (1889) notes that there exists a one-to-one relationship between the n observations and the first n symmetric functions. He introduces the raw moments m′r, r = 0, 1, …, n, the moments about the arithmetic mean mr, and the reduced moments, remarking that the observations without loss of information may be represented by either the raw moments, or the mean and the central moments, or the mean, the variance and the reduced moments. He adds that the first four terms of either of the last two sets will often give a good characterization of the empirical distribution. However, he (p. 21) points out that it is a drawback that the moments of even
order and also the reduced moments increase rapidly with r and that it would be advantageous to introduce a new symmetric function not having this property. We believe that this is the key to his recursive definition of the cumulants. In his discussion of the relation between the moments he derives the formula
To get a symmetric function of even order which is smaller than mj it is necessary to replace m′1^(r−j) by a larger number, a natural choice being m′r−j. This may have induced Thiele to define the cumulants by the recursion

(4.1) m′r = Σi C(r − 1, i) hr−i m′i, the sum extending over i = 0, 1, …, r − 1, with m′0 = 1, where C(n, i) denotes a binomial coefficient,
and h0 = 1. Solving for the h's Thiele finds the first six empirical cumulants in terms of m′ and m, the first four being h1 = m′1, h2 = m2, h3 = m3, and h4 = m4 − 3m2², the last one showing the desired effect. Since the theoretical distribution is the limit of the empirical for n → ∞ the theoretical cumulants are defined by the recursion

(4.2) μ′r = Σi C(r − 1, i) κr−i μ′i, the sum again extending over i = 0, 1, …, r − 1, with μ′0 = 1,
and κ0 = 1. By means of this formula Thiele derives the cumulants for several distributions; we shall only give his results for the normal distribution and the Gram–Charlier series. For the standardized normal distribution he finds that μ′2r+1 = 0 and μ′2r = 1·3 ⋯ (2r − 1). It follows from (4.2) that the cumulants of uneven order equal zero and those of even order satisfy the relation
which shows that κ2r = 0 for r = 2, 3, …. It must have been a great satisfaction for Thiele to find that the mean and the variance of a normally distributed variable equal κ1 and κ2, respectively, and that all the cumulants of higher order equal zero. Thiele (1903, p. 25)
writes: “This remarkable proposition [κr = 0 for r ≥ 3] has originally led me to prefer the half-invariants to every other system of symmetrical functions.” Obviously, the normal distribution is insufficient for describing distributions of economic and demographic data as encountered by actuaries. It seems that the many-talented Danish philologist, politician, forester, statistician, and actuary L. H. F. Oppermann is the first to propose a system of skew frequency functions obtained by multiplying the normal density by a power series. Moreover, he proposed to estimate the coefficients by the method of moments. Oppermann did not himself publish these ideas; they are reported by his younger actuarial colleagues Thiele (1873) and Gram (1879, pp. 93–94). Thiele (1873) writes the density in the form
In the textbook he (1889, p. 14) uses (4.3)
Regarding the two forms of φ he (1903, p. 17) remarks: “In the majority of its purely mathematical applications e^(−x²) is preferable, unless (as in the whole theory of observations) the factor ½ in the index is to be preferred on account of the resulting simplifications of most of the derived formulæ.” He disapproves of the then commonly used form (h/√π)e^(−h²x²), which goes back to Gauss. Independently of Thiele, Stigler (1983) has proposed the form e^(−πx²). To estimate the coefficients in the expansion for a grouped frequency distribution Thiele (1889, p. 100) introduces the standardized variable z = (x − x̄)/s, where x̄ and s denote the sample mean and standard deviation for the n observations. Taking the length of the constant class interval as unit and denoting the i-th relative frequency by fi, ∑fi = 1, Thiele writes
where f(zi) denotes the density at the midpoint of the interval. Using 1/φ(zi) as weight the least squares estimates of the coefficients cr are obtained by minimizing
which gives the normal equations (4.4)
if the sum on the right side can be approximated by the integral. The estimate is thus obtained by equating the empirical and the theoretical means of Hr(x). This is Gram's explanation of Oppermann's method of estimation. The justification for the choice of weight function and the replacement of the sum by the integral was provided by Thiele. Thiele (1873) investigates the difference between the sum and the integral in (4.4) and concludes (wrongly) that it is insignificant for small values of r if the length of the class interval is less than the standard deviation. He missed the Sheppard corrections for grouping. In (1889, p. 28) he states that for r = 3 and 4 the class interval should be at most one-fourth of the standard deviation. Thiele (1889, p. 100) considers g(x) as a special case of the linear model. He notes that var(fi) = fi(1 − fi)/n ≅ fi/n, if the number of classes is large. The weight is thus proportional to 1/fi which he approximates by 1/φ(xi), and the least squares solution then follows as shown above. However, Thiele overlooks that fi and fj, i ≠ j, are correlated with covariance −fifj/n so that the assumptions for applying the method of least squares are not satisfied. Nevertheless the method leads to simple estimates of high efficiency for moderately skew distributions. Thiele should of course have minimized the quadratic form obtained by taking the covariance into account, which would have led him to the minimum χ² estimate. To find the coefficients of the Gram–Charlier series in terms of the cumulants Thiele starts from the relation cr = E[Hr(x)], which gives cr in terms of μ′1, …, μ′r. He then introduces the cumulants instead of the moments by means of the recursion formula. We shall give a slightly simplified version of Thiele's proof using the umbral symbol μ′^r for μ′r and similarly κ^r for κr, so that (4.2) becomes
and E[Hr(x)] = Hr(μ′). Thiele writes the Hermite polynomial in the form
and
From the recursion formula
he obtains
Hence
An analogous result holds for odd subscripts. The proof above corresponds to Thiele's (p. 27) more elementary procedure consisting of successive elimination of the moments, by which he finds the first six expressions for cr + (r − 1)cr−2 in terms of the cumulants. From the pattern obtained he concludes that (4.5)
or
and c0 = 1. This is the first time that a simple formula is given for the calculation of the coefficients in the Gram–Charlier series.
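The formula can also be reproduced from the generating-function relation implicit in section 2 (my sketch, not Hald's derivation): with κ1 = 0 and κ2 = 1 the coefficients satisfy Σ cr t^r/r! = exp(Σr≥3 κr t^r/r!), and expanding this with SymPy gives c3 = κ3, c4 = κ4, c5 = κ5 and c6 = κ6 + 10κ3², as quoted in the next paragraph.

```python
# Gram-Charlier coefficients from cumulants via sum_r c_r t^r / r! =
# exp( sum_{r>=3} kappa_r t^r / r! ), assuming kappa_1 = 0 and kappa_2 = 1.
import sympy as sp

t = sp.symbols('t')
k3, k4, k5, k6 = sp.symbols('kappa3 kappa4 kappa5 kappa6')

inner = (k3 * t**3 / sp.factorial(3) + k4 * t**4 / sp.factorial(4)
         + k5 * t**5 / sp.factorial(5) + k6 * t**6 / sp.factorial(6))
series = sp.series(sp.exp(inner), t, 0, 7).removeO()

for r in range(3, 7):
    print(r, sp.expand(sp.factorial(r) * series.coeff(t, r)))
# c_3 = kappa3, c_4 = kappa4, c_5 = kappa5, c_6 = kappa6 + 10*kappa3**2
```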
Setting κ1 = 0 and κ2 = 1 Thiele finds c1 = c2 = 0, c3 = κ3, c4 = κ4, c5 = κ5, and
which gives
We have already used these results by writing g(x) in the form (3.7). Thiele remarks that the right side of (4.5) is of the same form as (4.2). Noting (p. 30) that κ1(ax + b) = aκ1 + b and κr(ax + b) = a^r κr, r = 2, 3, …, Thiele (p. 60) states the fundamental formula
For the arithmetic mean he finds κr(h1) = κr/n^(r−1) and γr(h1) = γr/n^((r−2)/2) and thus the asymptotic normality of h1. Moreover, he proposes to test the significance of hr in relation to a hypothetical distribution by using the asymptotic normality of hr and if necessary to take the higher cumulants of hr into account. Thiele realized that his proofs are cumbersome because of the recursive definition of the cumulants. Looking for a more direct definition he (1899) found the formula

(4.6) κ1t + κ2t²/2! + κ3t³/3! + ⋯ = ln M(t)
which is the definition used today. Replacing t by it the moment generating function M(t) on the right side becomes the characteristic function ψ(t), which leads to the alternative definition (2.3). By means of (4.6) Thiele derives the direct expression for κ in terms of μ′ and vice versa. He also proves the recursion formula. Setting t = −D, where D denotes differentiation, and operating on the normal distribution with mean ξ and variance σ² he obtains

g(x) = exp(κ3(−D)³/3! + κ4(−D)⁴/4! + ⋯) φ(x; ξ, σ²), with ξ = κ1 and σ² = κ2,
which is the Gram–Charlier series in symbolic form. Finally, he proves that the operator exp[αr(−D)^r/r!] applied to a density with cumulant κr increases κr by αr
so that ‘any law of error may be derived from anyone else with the same mean and standard deviation by the operation’
Instead of proving these results we shall let Thiele speak for himself through the translation of his paper given in the Appendix (Chapter 6 of this volume). In the English version of his textbook Thiele (1903) starts from the definition (4.6) which he uses for simplifying the proofs from 1889. He does not mention the operator M(-D). The results in Thiele's 1899 paper were rediscovered by Cornish & Fisher (1937); they do not refer to Thiele. In the fourth edition of Statistical Methods (1932) and in later editions Fisher gives a misleading evaluation of Thiele's theory of the cumulants. An even greater distortion may be found in Fisher's letter to Hotelling of 1 May 1931, published by Bennett (1990, pp. 320–321). Independently of Thiele, Hausdorff (1901) defines the cumulants, which he calls ‘canonical parameters’, by (4.6) and derives the same results as Thiele, except for the operational properties depending on M(-D). A survey of Hausdorff's contributions to probability theory is due to Girlich (1996). The contributions of Charlier and Edgeworth have been discussed in the fundamental paper by Cramér (1928), see also Särndal (1971) and Cramér (1972).
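The two definitions can be checked against each other symbolically. Below is a sketch of my own (not from Thiele or Hald) that computes cumulants both by the direct definition (4.6) and by the recursion (4.2), for the Poisson distribution, whose cumulants all equal λ.

```python
# Two routes to the cumulants: the direct definition (4.6) and the
# recursion (4.2), checked on the Poisson distribution (all cumulants = lambda).
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)

# Route 1, definition (4.6): kappa_r = r! times the coefficient of t^r in ln M(t).
M = sp.exp(lam * (sp.exp(t) - 1))            # Poisson moment generating function
kgf = sp.series(sp.log(M), t, 0, 7).removeO()
kappa_direct = [sp.factorial(r) * kgf.coeff(t, r) for r in range(1, 7)]

# Route 2, the recursion (4.2): mu'_r = sum_i C(r-1, i) kappa_{r-i} mu'_i.
mgf = sp.series(M, t, 0, 7).removeO()
mu = [sp.factorial(r) * mgf.coeff(t, r) for r in range(0, 7)]   # raw moments
kappa = [sp.Integer(0)] * 7
for r in range(1, 7):
    kappa[r] = sp.expand(mu[r] - sum(sp.binomial(r - 1, i) * kappa[r - i] * mu[i]
                                     for i in range(1, r)))

print([sp.simplify(k) for k in kappa_direct])   # six times lambda
print([sp.simplify(k) for k in kappa[1:]])      # the same
```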
Acknowledgement I thank Professor E. Arjas for suggestions that have improved the presentation of the paper.
Résumé L'histoire ancienne de la série de Gram et Charlier est discutée selon trois points de vue: (1) une généralisation du théorème central-limite de Laplace, (2) une approximation par la méthode des moindres carrés d'une fonction continue au moyen de polynômes de Chebyshev et Hermite, (3) une généralisation de la distribution normale de Gauss à un système de distributions asymétriques. Thiele a défini les cumulants en fonction des moments, d'abord par une formule de récurrence, puis par un développement du logarithme de la fonction génératrice des moments. Il a construit un opérateur différentiel qui ajuste un cumulant quelconque à une valeur
désirée. Son article peu connu de 1899, portant sur les propriétés des cumulants, est traduit du danois en anglais en annexe. Mots clés: Bessel; Bienaymé; Théorème central-limite; Chebyshev; Cumulants; Gram; Série de Gram et Charlier; Semiinvariants; Hausdorff; Hermite; Laplace; Moindres carrés; Moments; Polynômes orthogonaux; Poisson; Thiele.
CHAPTER EIGHT Epilogue After this presentation of Thiele's work it seems obvious to ask whether Thiele has been unduly neglected. This question does not have a simple answer. It is obvious that much of his work was not well understood and appreciated by his contemporaries in Denmark and the rest of Europe. On the other hand, the Danish Actuarial Society had a medal struck in his honour on the occasion of his 70th birthday in 1908. He was elected a Corresponding Member of the Institute of Actuaries in London in 1895. And although R. A. Fisher certainly did not give Thiele proper credit and recognition, in the fourth edition of Statistical Methods Fisher (1932) provides a list of the main contributors to statistics containing only six names: Bayes, Laplace, Gauss, K. Pearson, Student, and Thiele. This is good company, even for Thiele. Thiele's elementary book from 1903 was reprinted in full in the second volume of the Annals of Mathematical Statistics, in 1931. So to the extent that Thiele was understood, it was indeed widely acknowledged that he was an important figure in the early development of statistics. The mystery is why he was apparently forgotten in Denmark in more recent times, until Hald began his historical research in 1979. As pointed out by Stigler (1999), Thiele is not mentioned at all in the most influential Danish textbook in statistics (Hald 1952), and only briefly in the classic of Harald Cramér (Cramér 1946). The booklet written on the occasion of the 70th birthday of G. Rasch (Jensen 1971) mentions Rasch's predecessors H. Cl. Nybølle and H. Westergaard, but not a word about Thiele. There are many and complex reasons for this, one of them being the lack of a proper academic statistical environment in Copenhagen in the period between Thiele's death in 1910 and the appointment of J. F. Steffensen to the Chair of Insurance Mathematics in 1919. Steffensen was a great admirer of his spiritual predecessor Thiele and Thiele's work. He was also aware of the important developments in statistics that took place in England and elsewhere in his
Fig. 1 The gold medal struck to honour Thiele on his 70th birthday. Later, the medal was given as a present to the Institute of Actuaries in London on the occasion of the 100th anniversary of this institute, and can now be seen in its museum (P. Johansen 2001, personal communication). The inscription on the front of the medal reads: T. N. Thiele – Founder of the Society of Danish Actuaries. In the centre of the back of the medal the inscription reads: From the Actuarial Society for Scientific Achievements; and that around the edge: Struck on the occasion of Thiele's 70th birthday – 24 December 1908.
own time, although he did not include this in his lectures. Anders Hald and Georg Rasch therefore took their primary inspiration directly from R. A. Fisher's ground-breaking work. Indeed, although Hald was well aware of Thiele's existence, and had been influenced indirectly by Thiele through the study of Steffensen's textbook, Hald never had the urge or opportunity to read Thiele's originals before the 500th anniversary of the University of Copenhagen (A. Hald 2001, personal communication), when the importance of Thiele's contributions to statistics was rediscovered.
Bibliography Abbe, E. (1863). Über die Gesetzmässigkeit in der Verteilung der Fehler bei Beobachtungsreihen. Dissertation, Jena 1863. See Kendall, M. G. (1971). The work of Ernst Abbe. Biometrika, 58, 369–373. Andersen, E. (1975). Heinrich Christian Schumacher. Geodætisk Instituts Forlag, København. Andersen, K. (1999). Wessel's work on complex numbers and its place in history. Det kongelige danske Videnskabernes Selskabs Skrifter, Nye Samling, 46, 65–98. Andræ, C. G. (1860). Udvidelse af en af Laplace i Mécanique céleste angivet Methode for Bestemmelsen af en ubekjendt Størrelse ved givne umiddelbare Iagttagelser. Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger 1–38. Andræ, C. G. (1867). Om den rette Begrundelse af de mindste Quadraters Methode. Den danske Gradmaaling, 1, 556–561, Kjøbenhavn. Atkinson, A. C. and Bailey, R. A. (2001). One hundred years of the design of experiments on and off the pages of Biometrika. Biometrika, 88, 53–97. Bennett, J. H. (ed.) (1990). Statistical inference and analysis. Selected correspondence of R. A. Fisher. Clarendon Press, Oxford. Bertelsen, N. P. (1927). On the compatibility of frequency constants, and on presumptive laws of error. Skandinavisk Aktuarietidsskrift, 10, 129–56. Bessel, F. W. (1838). Untersuchungen über die Wahrscheinlichkeit der Beobachtungsfehler. Astronomische Nachrichten, 15, 358–359, 369–404. Bienaymé, I. J. (1852). Sur la probabilité des erreurs d'après la méthode des moindres carrés. Journal de Mathématiques Pures et Appliquées, 17, 33–78. Bing, F. M. (1879). Om aposteriorisk Sandsynlighed (On posterior probability). Tidsskrift for Mathematik, 3, 1–22, 66–70, 122–131. Branner, B. (2000). Danish Mathematical Society. European Mathematical Society Newsletter, 35, 14–15. Brenner, S. O. (1939). Bogtrykker Johan Rudolf Thiele's samtlige Efterkommere (All of the descendants of Printer Johan Rudolf Thiele). Personalhistorisk Institut, Copenhagen. Bru, B. (1991). A la recherche de la démonstration perdue de Bienaymé. Mathématiques Informatique et Sciences Humaines, 29, 5–17. Burrau, C. (1926). Ved Den Danske Aktuarforenings 25-Aars Stiftelsesdag (At the 25th anniversary of the Danish Actuarial Society). Copenhagen. Burrau, C. (1928). T. N. Thiele 1838–1910. Nordic Statistical Journal, 1, 340–8. Chebyshev, P. L. (1855). Sur les fractions continues [in Russian]. Journal de Mathématiques Pures et Appliquées, (1858) 3, 289–323 [in French].
Chebyshev, P. L. (1859a). Sur l'interpolation par la méthode des moindres carrés. Mém. Acad. Sci. St. Pétersbourg, 1(7), No. 15, 1–24. Chebyshev, P. L. (1859b). Sur le développement des fonctions à une seule variable. Bull. phys.-math. Acad. Sci. St. Pétersbourg, 1, 193–200. Chebyshev, P. L. (1887). Sur deux théorèmes relatifs aux probabilités. Bull. phys.-math. Acad. Sci. St. Pétersbourg, 55 [in Russian]. Acta Mathematica (1890–1891), 14, 305–15 [in French]. Cornish, E. A. and Fisher, R. A. (1937). Moments and cumulants in the specification of distributions. Revue de l'Institut International de Statistique, 4, 1–14; 5, 307–20. Cramér, H. (1928). On the composition of elementary errors. Skandinavisk Aktuarietidsskrift, 11, 13–74, 141–180. Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press, Princeton, NJ. Cramér, H. (1972). On the history of certain expansions used in mathematical statistics. Biometrika, 59, 205–207. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22. Edgeworth, F. Y. (1885). A method of ascertaining variations in the rate of births, deaths and marriages. Journal of the Royal Statistical Society, 48, 628–49. Edwards, A. W. F. (2001). Estimating a binomial parameter using the likelihood function. Comments on Thiele (1889). In Annotated readings in the history of statistics, (ed. H. A. David and A. W. F. Edwards). Springer-Verlag, New York. Farebrother, R. W. (1999). Fitting linear relationships: A history of the calculus of observations 1750–1900. Springer-Verlag, New York. Fisher, A. (1922). The mathematical theory of probabilities and its application to frequency curves and statistical methods. Macmillan, New York. Fisher, R. A. (1929). Moments and product moments of sampling distributions. Proceedings of the London Mathematical Society, Series 2, 30, 199–238. Fisher, R. A. (1932). Statistical methods for research workers, (4th edn). Oliver and Boyd, Edinburgh. Fisher, R. A. and Wishart, J. (1931). The derivation of the pattern formulae of two-way partitions from those of simpler patterns. Proceedings of the London Mathematical Society, Series 2, 33, 195–208. Gauss, C. F. (1809, 1821, 1823). Abhandlungen zur Methode der kleinsten Quadrate von Carl Friedrich Gauss. In deutscher Sprache herausgegeben von A. Börsch und P. Simon. (1887), Stankiewicz, Berlin. Gauss, C. F. (1823). Theoria combinationis observationum erroribus minimis obnoxiae. Dieterich, Göttingen. Girlich, H.-J. (1996). Hausdorffs Beiträge zur Wahrscheinlichkeitstheorie. In Felix Hausdorff zum Gedächtnis, Ed. E. Brieskorn, Band I, 31–70. Westdeutscher Verlag, Opladen.
Gnedenko, B. V. and Kolmogorov, A. N. (1954). Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge, Mass. Gram, J. P. (1879). Om Rækkeudviklinger, bestemte ved mindste Kvadraters Methode (On series expansions, determined by the method of least squares). Høst, Copenhagen. Gram, J. P. (1883). Om Beregning af en Bevoxnings Masse ved Hjælp af Prøvetræer. Tidsskrift for Skovbrug, 6, 137–98. Gram, J. P. (1884). Om Udjævning af Dødelighedsantagelser og Oppermann's Dødelighedsformel. Tidsskrift for Mathematik, Femte Rk. 2, 113–39. Gram, J. P. (1889). Om Konstruktion af Normal-Tilvæxtoversigter, med særligt Hensyn til Iagttagelserne fra Odsherred. Tidsskrift for Skovbrug, 11, 97–151. Gram, J. P. (1910). Professor Thiele som Aktuar. Dansk Forsikrings-Aarbog, 26–37. Gyldenkerne, K. and Darnell, P. B. (1990). T. N. Thieles embedsperiode (The period of office of T. N. Thiele). In Dansk astronomi gennem firehundrede år (Danish astronomy through 400 years), (ed. C. Thykier). Rhodos, Copenhagen. Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, 11, 1–12. Hald, A. (1952). Statistical theory with engineering applications. John Wiley and Sons, New York. Hald, A. (1981). T. N. Thiele's contributions to statistics. International Statistical Review, 49, 1–20. Hald, A. (1983). Statistikkens teori (Theoretical statistics). In Københavns universitet 1479–1979 (University of Copenhagen 1479–1979), pp. 213–27. Gad, Copenhagen. Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. John Wiley and Sons, New York. Hald, A. (2000a). The early history of the cumulants and the Gram–Charlier series. International Statistical Review, 68, 137–53. Hald, A. (2000b). On the early history of skew distributions. Thiele's three frequency functions. Preprint No. 6, Department of Statistics and Operations Research, University of Copenhagen. Hald, A. (2001). On the history of the correction for grouping. Scandinavian Journal of Statistics, 28, 417–28. Hansen, Chr. (1946). Om Thieles Differentialligning for Præmiereserver i Livsforsikring. Hagerup, København. Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320–38. Hausdorff, F. (1901). Beiträge zur Wahrscheinlichkeitsrechnung. Ber. Verh. Kgl. Sächs. Ges. Wiss. Leipzig, 53, 152–78. Helmert, F. R. (1872). Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate, 2nd edition (1907). Teubner, Leipzig.
Helmert, F. R. (1876a). Über die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und über einige damit im Zusammenhange stehende Fragen. Zeit. Math. Phys., 21, 192–218. Helmert, F. R. (1876b). Die Genauigkeit der Formel von Peters zur Berechnung des wahrscheinlichen Beobachtungsfehlers direkter Beobachtungen gleicher Genauigkeit. Astronomische Nachrichten, 88, 113–32. Hermite, C. (1859). Sur l'interpolation. Comptes Rendus de l'Académie des Sciences, Paris, 48, 62–7. Hermite, C. (1864). Sur un nouveau développement en série des fonctions. Comptes Rendus de l'Académie des Sciences, Paris, 58, 93–100, 266–73. Hoem, J. M. (1980). Who first fitted a mortality formula by least squares? Blätter der DGVM, 14, 459–60. Hoem, J. M. (1983). The reticent trio: Some little-known early discoveries in life insurance mathematics by L. H. F. Oppermann, T. N. Thiele, and J. P. Gram. International Statistical Review, 51, 213–21. Hotelling, H. (1933). Review of Statistical methods for research workers, by R. A. Fisher. Fourth Edition. Edinburgh: Oliver and Boyd. 1932, xiii, 307 pp. Journal of the American Statistical Association, 28, 374–75. Jensen, H. (1944). An attempt at a systematic classification of some methods for the solution of normal equations. Geodætisk Institut, Meddelelse No. 18, København. Jensen, A. (1971). Speech on the occasion of the 70th birthday of Georg Rasch. Institute of Mathematical Statistics and Operations Research, Technical University of Denmark. (In Danish). Johnson, N. L. and Kotz, S. (ed.) (1997). Leading personalities in statistical sciences. John Wiley and Sons, New York. Jørgensen, N. R. (1913). Grundzüge einer Theorie der Lebensversicherung. Fischer, Jena. Kalman, R. E. and Bucy, R. (1961). New results in linear filtering and prediction. Journal of Basic Engineering, 83 D, 95–108. Laplace, P. S. (1810). Mémoire sur les approximations des formules qui sont fonctions de très grands nombres et sur leur application aux probabilités. Mémoires de l'Académie Royale des Sciences de Paris, 1809, 353–415. Laplace, P. S. (1811). Mémoire sur les intégrales définies et leur application aux probabilités, et spécialement à la recherche du milieu qu'il faut choisir entre les résultats des observations. Mémoires de l'Académie Royale des Sciences de Paris, pp. 279–347. Laplace, P. S. (1812). Théorie analytique des probabilités. Courcier, Paris. Laplace, P. S. (1816). Sur l'application du calcul des probabilités à la philosophie naturelle. Théorie analytique. Premier supplément. Larsen, O. (1954). Universitetsminder (University memories). Unpublished manuscript. Lauritzen, S. L. (1976). Appendix to Winkel et al.: Method for monitoring plasma progesterone concentrations in pregnancy. Clinical Chemistry, 22, 427–8.
Lauritzen, S. L. (1981). Time series analysis in 1880: A discussion of contributions made by T. N. Thiele. International Statistical Review, 49, 319–31. Lauritzen, S. L. (1999). Aspects of T. N. Thiele's contributions to statistics. Bulletin of the International Statistical Institute, 58, Book 1, 27–30. Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157–224. Lomholt, A. (1954). ‘Et Møde i Videnskabernes Selskab’. P. S. Krøyers maleri og dets tilblivelse (A meeting of the Royal Danish Academy of Sciences and Letters. The painting of P. S. Krøyer and its making). Munksgaard, Copenhagen. Milne-Thomson, L. M. (1933). The calculus of finite differences. Macmillan, London. Molina, E. C. (1930). The theory of probability: Some comments on Laplace's Théorie analytique. Bulletin of the American Mathematical Society, 36, 369–92. Nielsen, H. (1965). SKAK i tusind år (Chess for a thousand years). On the occasion of the 100th anniversary of Copenhagen Chess Club. Ejbybro, Kirke Hyllinge. Nielsen, N. (1910). Matematiken i Danmark 1801–1908. Gyldendalske Boghandel/Nordisk Forlag, Copenhagen/Kristiania (Oslo). Nörlund, N. E. (1924). Differenzenrechnung. Springer-Verlag, Berlin. Norberg, R. (2001). Thorvald Nicolai Thiele. In Statisticians of the centuries, (ed. C. C. Heyde and E. Seneta), pp. 212–15. Springer-Verlag, New York. Oppermann, L. (1863). En Forespørgsel (Af en Dilettant). Tidsskrift for Mathematik, Første Række, 5, 16. Oppermann, L. (1872). Zur Begründung der allgemeinen Fehlertheorie. Methode der kleinsten Quadrate. Hamburg, 20pp. Patterson, H. D. and Thompson, R. (1971). Recovery of interblock information when block sizes are unequal. Biometrika, 58, 545–54. Pearl, J. (1998). Graphs, causality, and structural equation models. Sociological Methods and Research, 27, 226–84. Pearson, E. S. (ed.) (1948). Karl Pearson's early statistical papers. Cambridge University Press, Cambridge. Pearson, K. (1895). Contributions to the mathematical theory of evolution, II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, Series A, 186, 343–414. Reprinted in Pearson (1948). Petersen, J. H. (1986). Three precursors of modern theories of old-age pensions: A contribution to the history of social-policy doctrines. History of Political Economy, 18, 405–17. Plackett, R. L. and Barnard, G. A. (ed.) (1990). ‘Student’ – A statistical biography of William Sealy Gosset. Clarendon Press, Oxford. Poisson, S. D. (1829). Suite du Mémoire sur la probabilité du résultat moyen des observations, inséré dans la Connaissance des Temps de l'année 1827. Conn. des Temps pour 1832, 3–22.
Poisson, S. D. (1837). Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Bachelier, Paris. Ramskou, K. (2000). Danish Mathematical Society through 125 years. Historia Mathematica, 27, 223–42. Rao, C. R. (1973). Linear statistical inference and its applications, 2nd edition. John Wiley and Sons, New York. Rao, C. R. and Mitra, S. K. (1971). Generalized inverse of matrices and its applications. John Wiley and Sons, New York. Särndal, C.-E. (1971). The hypothesis of elementary errors and the Scandinavian school in statistical theory. Biometrika, 58, 375–92. Schweder, T. (1980). Scandinavian statistics, some early lines of development. Scandinavian Journal of Statistics, 7, 113–29. Smith, K. (1916). On the ‘best’ values of the constants in frequency distributions. Biometrika, 11, 262–76. Smith, K. (1918). On the standard deviation of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12, 1–85. Steffensen, J. F. (1923a). Matematisk Iagttagelseslære (The mathematical theory of observations). Gad, Copenhagen. Steffensen, J. F. (1923b). Et Dilemma fra Iagttagelseslæren. Matematisk Tidsskrift B, 3, 72–76. Steffensen, J. F. (1930). Some recent researches in the theory of statistics and actuarial science. Cambridge University Press, Cambridge. Stigler, S. M. (1978). Francis Ysidro Edgeworth, statistician. Journal of the Royal Statistical Society, Series A, 141, 287–322. Stigler, S. M. (1983). A modest proposal: a new standard for the normal. The American Statistician, 36, 137–138. Stigler, S. M. (1999). Discussion for ISI history session—IPM 3. Bulletin of the International Statistical Institute, 58, Book 3, 67–8. Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics, 1, 49–58. Thiele, T. N. (1866). Undersøgelse af Omløbsbevægelsen i Dobbeltstjernesystemet Gamma Virginis (An investigation of the orbit movements in the double-star system γ Virginis). Copenhagen. D.Sc. Thesis. Thiele, T. N. (1869). Bemærkninger om Kædebrøker (Remarks on continued fractions). Tidsskrift for Mathematik, 2, 144–6. Thiele, T. N. (1870). Den endelige Kædebrøksfunktions teori (The theory of finite continued fractions). Tidsskrift for Mathematik, 2, 145–70. Thiele, T. N. (1871). En mathematisk Formel for Dødeligheden. Reitzel, Kjøbenhavn. Thiele, T. N. (1873). Om en Tilnærmelsesformel (On an approximation formula). Tidsskrift for Mathematik, 3, 22–31.
Thiele, T. N. (1874). Om talmønstre (On patterns of numbers). In Forhandlingerne ved de skandinaviske Naturforskeres 11te møde, København, Juli 1873 (Proceedings of the 11th Meeting of the Scandinavian Natural Scientists), pp. 192–5. Copenhagen. Thiele, T. N. (1878). Bemærkninger om skæve Fejlkurver. Tidsskrift for Mathematik, Fjerde Række, 2, 54–7. Thiele, T. N. (1879). Bemærkninger om periodiske Kædebrøkers Konvergens (Remarks on the convergence of periodic continued fractions). Tidsskrift for Mathematik, 4, 70–4. Thiele, T. N. (1880). Om Anvendelse af mindste Kvadraters Methode i nogle Tilfælde, hvor en Komplikation af visse Slags uensartede tilfældige Fejlkilder giver Fejlene en ‘systematisk’ Karakter. Det kongelige danske Videnskabernes Selskabs Skrifter, 5. Række, naturvidenskabelig og mathematisk Afdeling, 12, 381–408. French version: Sur la compensation de quelques erreurs quasi-systématiques par la méthode des moindres carrés. C. A. Reitzel, København, 1880. Thiele, T. N. (1886). Om Definitionerne for Tallet, Talarterne og de tallignende Bestemmelser (On the definitions of the number, the types of numbers and the number-like concepts). Det kongelige danske Videnskabernes Selskabs Skrifter, 6. Række, naturvidenskabelig og mathematisk Afdeling, 2, 453–514. Thiele, T. N. (1889). Almindelig Iagttagelseslære: Sandsynlighedsregning og mindste Kvadraters Methode. C. A. Reitzel, Copenhagen. Thiele, T. N. (1891). Contribution to the proceedings at the meeting of Nationaløkonomisk Forening on January 6, 1891. Nationaløkonomisk Tidsskrift, 29, 59–61. Thiele, T. N. (1892a). Om Forholdet mellem Afstemningskunst og almindelig Iagttagelseslære (On the relationship between the art of voting and the general theory of observations). Thieles Bogtrykkeri, Copenhagen. Thiele, T. N. (1892b). Iagttagelsesteoretiske Regninger angående Bestemmelser af Professor Jul. Thomsen af Varmefylde og Vægtfylde for visse Stoffers vandige Opløsninger, Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger, 71–141. Thiele, T. N. (1897). Elementær Iagttagelseslære (The elementary theory of observations). Gyldendal, Copenhagen. Thiele, T. N. (1899). Om Iagttagelseslærens Halvinvarianter. Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger, 3, 135–41. Thiele, T. N. (1900). Om Dødelighedstavlers Beregning. Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger, 139–42. Thiele, T. N. (1903). Theory of observations. Layton, London. Reprinted in Annals of Mathematical Statistics, 2, 165–307 (1931). Thiele, T. N. (1904). Adjustment of tables of mortality. Aktuaren, 1–10. Thiele, T. N. (1906a). Et Arvelighedsspørgsmaal belyst ved iagttagelseslære. Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger, 149–152. Thiele, T. N. (1906b). Différences réciproques. Oversigt over det kongelige danske Videnskabernes Selskabs Forhandlinger, 153–71.
Thiele, T. N. (1909). Interpolationsrechnung. Teubner, Leipzig. Tschuprow, A. A. (1924). Ziele und Wege der stochastischen Grundlegung der statistischen Theorie. Nordisk Statistisk Tidsskrift, 3, 433–93. Uspensky, J. V. (1937). Introduction to mathematical probability. McGraw-Hill, New York. Wahba, G. (1983). Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45, 133–50. Wessel, C. (1798). Om Directionens analytiske Betegning. Et Forsøg anvendt fornemmelig til plane og sfæriske Polygoners Opløsning. Det kongelige danske Videnskabernes Selskabs Skrifter, Nye Samling, 5, 469–518. Translation into English: see Wessel (1999). Wessel, C. (1897). Essai sur la représentation analytique de la direction, traduction du mémoire intitulé: Om direktionens analytiske betegning. Publié avec trois planches de l'original et préfaces de MM H. Valentiner et T. N. Thiele (A treatise on the analytic representation of direction, a translation of the paper entitled: Om direktionens analytiske betegning. Published with three figures from the original and prefaces by Mr. H. Valentiner and Mr. T. N. Thiele) Academie Royale des Sciences et des Lettres de Danemark, Copenhagen. Wessel, C. (1999). On the analytical representation of direction; an attempt applied chiefly to solving plane and spherical polygons. Matematisk-Fysiske Meddelelser fra Det Kongelige Danske Videnskabernes Selskab, 46, 101–43. Translation of Wessel (1798) by Flemming Damhus. Whittaker, E. T. and Robinson, G. (1924). The calculus of observations. Blackie, London. Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20, 557–85. Zachariae, G. (1871). De mindste Kvadraters Methode. 2nd edition (1887). Nyborg, Kjøbenhavn.
Index
χ²-distribution, 214; non-central, 95 Abbe, E., 214, 252 adjustment, 136; by correlates, 150, 217; example, 157; by elements, 161, 217, 220; algorithm, 164; for coefficients, 78; geometric, 25, 49 Andersen, E., 199, 252 Andersen, H. C., 202 Andersen, K., 4, 252 Andræ, C. C. G., 199, 215, 252 ARIMA process, 43 Arjas, E., 247 Atkinson, A. C., 4, 252 Bailey, R. A., 4, 252 Barnard, G. A., 4, 256 Bayes, T., 122, 200, 211, 249 Bayesian argument, 124, 150, 200, 215 Bennett, J. H., 3, 247, 252 Bertelsen, N. P., 213, 252 Bessel, F. W., 200, 206, 235, 252 Bienaymé, I. J., 233, 235, 237, 238, 240, 252 Bing, F. M., 124, 150, 200, 252 Branner, B., 6, 252 Brenner, S. O., 2, 252 Brownian motion, 12, 42, 224 Bru, B., 252 Bucy, R., 16, 47, 255 Burrau, C., 2, 6, 252 Cauchy, A. L., 237 causality, 61 central limit theorem, 104, 234 characteristic function, 234 Charlier, C. V. L., 247 Chebyshev, P. L., 206, 238–240, 252 Chebyshev–Hermite polynomials, 240 chess, 6 Christensen, M., 55 Chuprov, A. A., 213, 259 components of variance, 41; estimation of, 25 continued fraction, 18, 49, 202, 238 Cornish, E. A., 211, 213, 247, 253 Cramér, H., 206, 249, 253 cumulants, see halfinvariants, 3 d'Arrest, H. L., 202 Damhus, F., 259 Danish Actuarial Society, 6, 202, 249 Danish Mathematical Society, 6 Danø, S., 225
Darnell, P. B., 2, 254 David, H. A., 253 degrees of freedom; equivalent, 28, 34, 49 Dempster, A. P., 41, 54, 253 dependent observations, 184 design of experiments, 4 diagnostics, see error, criticism, 155 differential equation; Thiele's, 2 distribution, see error law, 65 double stars, 2, 202 Edgeworth, F. Y., 173, 247, 253, 257 Edwards, A. W. F., v, 122, 124, 253 elliptic functions, 202 EM algorithm, 3, 25, 41, 54 error; accidental, 63; criticism, 155, 177, 179, 215; and minimization, 168; by grouping, 156, 180; by sign variations, 180; graphical, 180; quasi-systematic, 7, 156, 185, 224; random, 10; systematic, 9, 63, 179, 180; trend in, 63, 180 error law, 65; actual, 65; exponential,
262
INDEX
10; of average , 129; of halfinvariants , 132; of linear transformations , 101; of standard deviation , 130; of the method , 105, 108; of transformations , 93, 206; theoretical , 65 estimable functions , 221 estimation ; methods of , 212; non-parametric , 133, 212; of frequency , 118; of halfinvariants , 134; of variance , 213; of variance components , 25, 49 excess equations, see degrees of freedom , 28 expectation , 116 Farebrother, R. W. , 2, 13, 141, 253 fictitious observations ; method of , 14, 45, 171, 175, 221 Fisher, A. , 3, 200, 211, 253 Fisher, R. A. , 3, 82, 134, 204, 211, 213, 241, 247, 249, 252, 253 flatness , 85, 208 Fourier, J. B. J. , 239 free functions , 20, 48, 144, 216; for coefficients , 178; systems of , 48, 144complete , 147 Frisch, R. , 3 Gauss, C. F. , 150, 199, 211, 213, 215, 233, 243, 249, 253 geometric adjustment , 25, 49 Girlich, H.-J. , 247, 253 Gjersing, H. E. , 6 Gnedenko, B. V. , 206, 240, 254 graduation ; of mortality data , 203, 220; Whittaker–Henderson , 224 Gram, J. P. , 2–4, 70, 183, 201, 203, 205, 224, 243, 244, 254 grouping , 67; Sheppard's correction , 244 Gyldenkerne, K. , 2, 254 Haavelmo, T. , 63, 254 Hafnia , 2, 6, 201, 202 Hald, A., v , 2, 3, 6, 13, 15, 55, 70, 74, 82, 150, 238, 251, 254 halfinvariants , 3, 82, 232, 241; additivity , 103, 129, 211; as operator , 210, 228; by expansion , 210, 227; error law of , 132; for binomial distribution , 86; for Gram–Charlier series , 91; for normal distribution , 90; for relative frequency , 125; for transformations , 93linear , 103, 129logarithmic , 104multivariate , 96univariate , 93; interpretation of , 85, 211; of halfinvariants , 132, 211; recursion for , 82, 209, 227, 243; symbolic expansion , 227; transformation of , 207 Hansen, Chr. , 204, 254 Harville, D. A. , 54, 254 Hausdorff, F. , 254 Helmert, F R. , 41, 200, 214, 217, 224, 255 Hermite polynomial , 236, 244 Hermite, C. , 239, 254 Heyde, C. C. , 256 Hoem, J. M. , 2, 225, 255 Hotelling, H. , 3, 82, 241, 247, 255 Institute of Actuaries , 202, 249 instrument-constant , 10
insurance, 4; company, 6, 124, 200; life, 2, 117; mathematics, 2
interpolation, 76, 185; theory of, 2, 202
Jensen, A., 249, 255
Jensen, H., 220, 255
Johannsen, W., 204
Johansen, P., 249
Johansen, S., 225
Johnson, N. L., 2, 255
Jørgensen, N. R., 204, 255
Kalman filter, 3, 16, 41
Kalman, R. E., 16, 47, 255
Kolmogorov, A. N., 206, 240, 254
Kotz, S., 2, 255
Krøyer, P. S., 198, 256
k-statistics, 134, 212
Laird, N., 41, 54, 253
Laplace, P. S., 122, 150, 199, 200, 211, 215, 233–235, 237, 239, 240, 249, 252, 255
Larsen, O., 3, 4, 255
Lauritzen, S. L., 2, 6, 7, 25, 54, 224, 225, 255, 256
law of large numbers, 105, 108
least squares; method of, 140, 157, 200; justification, 149, 215
likelihood, 3, 119, 123; diagram, 197; maximum, 122
linear normal model, 214; canonical form, 150, 215
Lomholt, A., 256
Lorenz, L. V., 200
marriage age; distribution of, 208
master of calculations, 204
master of computation, 4
mathematical expectation, 116
maximum likelihood; restricted, 54
mean, 80
method of; fictitious observations, 14, 45, 171, 175, 221; least squares, 140, 157, 200; normal places, 13, 141
Milne-Thomson, L. M., 203, 256
minimum variance; principle of, 215
Mitra, S. K., 171, 257
model criticism, see error, criticism, 155
model improvement, 182
Molina, E. C., 237, 240, 256
moments, see sums of powers, 79
Neyman, J., 201
Nielsen, H., 1, 6, 256
Nielsen, N., 2, 256
Norberg, R., 2, 256
Nørlund, N. E., v, 203, 256
normal; distribution, 10; logarithmic, 208; equations, 216; places, 13, 141
numbers; complex, 4; patterns of, 4; theory of, 4, 202
Nybølle, H. Cl., 249
one-way classification, 222
operator calculus, 202
Oppermann, L. H. F., 3, 20, 48, 200, 201, 203, 214, 215, 217, 243, 244, 256
orthogonal polynomials, 237
orthogonal systems, see free functions, 48
Patterson, H. D., 41, 54, 256
Pearl, J., 63, 256
Pearson, E. S., 256
Pearson, K., 3, 85, 204, 211, 249, 256
pension system, 4
Petersen, J., 6
Petersen, J. H., 256
Plackett, R. L., 4, 256
Poisson, S. D., 233, 235, 237, 256, 257
powers; sums of, 79
probability, 109; calculus of, 111; indirect, 124, 150; inverse, see indirect, 124; posterior, 123
Ramskou, K., 6, 257
Rao, C. R., 45, 171, 257
Rasch, G., 249, 251, 255
reciprocal differences, 202
REML, 54
residual analysis, see error, criticism, 155
Robinson, G., 211, 259
Rubin, D. B., 41, 54, 253
sampling; optimal allocation, 201; stratified, 201
Särndal, C. E., 206, 247, 257
Schreiber–Samuelsen model, 4
Schumacher, H. C., 199, 252
Schweder, T., 2, 257
Seneta, E., 256
series expansion, 69; adjustment of coefficients, 178; binomial, 70; coefficients as free functions, 178; Cornish–Fisher, 211; Edgeworth, 237; Gram–Charlier, 74, 201, 206, 208, 233, 237, 242, 244–246; halfinvariants of, 91; Hermite, 239; Maclaurin, 239; Thiele type B, 70
shape coefficients, 211
skew distributions, 205
skewness, 85, 208
Smith, K., 3, 4, 257
Spiegelhalter, D. J., 7, 256
standard deviation, 85; after adjustment, 153
Steffensen, J. F., 3, 212, 249, 257
Stigler, S. M., v, 82, 173, 241, 243, 249, 257
structural equation, 63
‘Student’, 211, 249, 256
sums of powers, 79, 227; reduced, 81
Sundberg, R., 51, 257
symmetric functions, 227, 241
Thiele's; applied statistics, 204; differential equation, 2; free functions, 144; halfinvariants, 82, 241; medal, 250; reciprocal differences, 202; type A series, see series expansion, Gram–Charlier, 232; type B series, 70
Thiele, I. H., 202
Thiele, J. M., 1, 201
Thiele, J. R., 252
Thompson, R., 41, 54, 256
Thomsen, J., 204
Thorvaldsen, B., 202
Thykier, C., 254
time series, 40, 223
Toft, B., v
transformation; variance stabilizing, 201
Trolle, M. M., 2, 202
Tschuprow, A. A., see Chuprov, A. A., 213
two-way classification, 173, 222
Uspensky, J. V., 240, 258
Valentiner, H., 4, 259
Wahba, G., 28, 258
weight, 140
Wessel, C., 4, 252, 259
Westergaard, H., 249
Whittaker, E. T., 211, 259
Wishart, J., 3, 82, 241, 253
Wright, S., 63, 259
Zachariae, G. K. C., 200, 206, 259
Zeuthen, H. G., 6