Editor’s Letter
Mike Larsen, Executive Editor
Dear Readers, This issue of CHANCE contains articles about history, probability, statistical modeling, messy data, risk, graphics, and sports. Actually, several of the articles concern sports in one context or another. The sports articles are not just for the sports enthusiast, however. These articles explore diverse statistical topics and methods and illustrate a variety of concerns in probability and statistics. The first couple of articles concern historical themes and the development of statistical practice. Brian Clauser discusses the relationship between R. A. Fisher and Karl Pearson and the impact of strict adherence to set significance levels for hypothesis testing. The author suggests personal relations, copyright restrictions, and economic conditions related to World War I influenced the format of tables and subsequent statistical practice. The ‘file drawer problem’ and the registration of clinical trials prior to initiation are two related modern issues. In a related article, Stephen Stigler presents his view of the origin of the 5% significance level standard for hypothesis testing. As I mentioned, many articles have sports themes. Two international contributions concern winter sports. Moudud Alam, Kenneth Carling, Rui Chen, and Yuli Liang investigate the progression of young skiers in Sweden. Their growth curve modeling adjusts for age and gender effects and produces individualized evaluations of progress. It is conceivable that such methods would be useful in health or education studies. Three of the authors of this paper are graduate students. Bill Hurley’s sports example is the National Hockey League. His substantive topic is the birthday matching problem and probability calculations. Did you know birthdays of hockey players in the NHL are becoming increasingly nonuniform? Read the article to find out why and what impact this has on probability calculations. Rachel Croson, Peter Fishman, and Devin Pope compare golf and poker. Is poker a game primarily of skill or of luck? The authors confront the difficulty of answering this question when data are available on only the top players. Philip Price discusses a variant of a betting pool for college basketball’s March Madness single elimination tournament. Here, there are data that could be useful in estimating probabilities, expectations, and variability. These data and
the questions raised in the article could be used to motivate teaching introductory concepts. Can your students devise a better strategy? Brian Schmotzer examines the relation of leads and time remaining to the eventual winner of a college basketball game. The data are noisy, so the author employs smoothing techniques. He then translates solutions into simple rules and equations that you can use in real time. Katherine McGivney, Ray McGivney, and Ralph Zegarelli address the same question, but for professional basketball. The authors employ logistic regression models to describe the performance of simple rules. Simulation is used to further evaluate the rules. Both articles are possible only because the authors made considerable efforts to produce usable data. In Mark Glickman’s Here’s to Your Health column, Stephanie Land critically comments on failings of risk perception and the challenge of effectively communicating risks. Clear comparative graphical presentations of risk are part of her story. In the Visual Revelations column, Paul Velleman and Howard Wainer take a look at blood sugar measurements and diabetes. The data here are for one individual. Outliers and trends are investigated in detail. In this article, teachers of statistics will find comparisons of measures of center and examples of smoothing noisy time-varying data. The issue is completed with a new puzzle from Jonathan Berkowitz in his Goodness of Wit Test column and the announcement of a CHANCE graphics contest. As always, a one-year (extension of your) subscription to CHANCE will be awarded for each of two correct puzzle solutions chosen at random from among those received by the deadline. The graphics contest similarly will award one-year (extensions of) subscriptions to winners as described in the announcement. We look forward to your puzzle and graphics entries! And, as always, I look forward to your comments, suggestions, and article submissions. Enjoy the issue! Mike Larsen
War, Enmity, and Statistical Tables
Brian E. Clauser
“ I am surely not alone in having suspected that some of Fisher’s major views were adopted simply to avoid agreeing with his opponents.” — Leonard J. Savage, Annals of Statistics, 1976
Although the use of strict nominal levels for significance testing (e.g., .01 or .05) has passed in and out of vogue, it is beyond question that this approach to statistics has had a profound impact on 20th-century science. Numerous authors have raised concerns about the theoretical appropriateness and practical impact of this approach. In fact, a recent book by Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), goes so far as to argue that the emphasis on significance testing (rather than on measures of effect size) has damaged the study of economics, psychology, and medicine, and, in the process, cost society both jobs and lives.
The File-Drawer Effect

Among the unintended and problematic consequences of statistical testing using fixed levels of significance is the potential for this practice to lead to publication bias. If experiments that fail to achieve the nominal level of significance are unlikely to become part of the literature, significant results arising from type I error may appear to provide a basis for sound conclusions. This phenomenon, referred to as the file-drawer effect, has been recognized as a theoretical problem in the psychological literature for decades. Theodore Sterling, in 1959, was among the first to document the impact nominal levels of significance have had on the psychological literature. He reviewed 362 articles selected at random from four journals.
From left: R. A. Fisher, Karl Pearson, and William Gosset
Of the 294 articles that included significance tests, only eight reported nonsignificant results; 97% reported results that were significant at a level of p≤.05. Emphasis on these nominal levels of significance has been explicitly espoused by both journal editors and the influential Publication Manual of the American Psychological Association. As editor of the Journal of Experimental Psychology, Arthur Melton stated in 1962 that papers were unlikely to be accepted if they did not report a statistically significant result. Later, the Publication Manual supported strict adherence to nominal levels of significance when it advised authors, “Do not infer trends from data that fail by a small margin to meet the usual levels of significance.” Although there has been active debate more
recently about the appropriate application of tests of statistical significance, the impact of this practice continues to be evident in numerous fields. In medical research, such as drug trials, publication bias literally may be a matter of life and death. When pharmaceutical companies support hundreds of drug trials, many of the significant findings reported for the smaller number of studies that make it to publication are likely to have occurred by chance. This is so serious a problem that the Food and Drug Administration (FDA) has instituted reporting requirements that require all trials to be recorded, regardless of whether the outcome results in publication. Similarly, the International Committee of Medical Journal Editors, which represents the
editors of hundreds of journals—including the New England Journal of Medicine, The Lancet, and the Journal of the American Medical Association—has instituted a policy stating a clinical trial must have been registered before the first subject was recruited to qualify for publication. In an editorial announcing the policy, the committee stated that individuals who have agreed to participate in clinical trials “deserve to know that the information that accrues from their altruism is part of the public record, where it is available to guide decisions about patient care, and deserve to know that decisions about their care rest on all of the evidence, not just the trials that authors decided to report and that journal editors decided to publish.” A recent paper in the New England Journal of Medicine used these reporting requirements to examine the extent of publication bias for studies of selected antidepressant medications. The authors concluded that publication of a study was dependent on whether it produced a significant result. Out of 74 registered studies, 51 were published and 23 were not. Of the studies viewed by the FDA as having significant positive results, 37 were published and only one was not. Meta-analysis indicated the publication bias increased the effect size for the 12 studied medications from 11% to 69% (32% overall). This suggests a strong bias toward publication for studies that show significant positive results and a sizable change in the resulting interpretation of the efficacy of the medications.
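The mechanism is easy to see in a small simulation. The sketch below is only an illustration of the file-drawer effect in general—the true effect size, sample size, number of trials, and the use of a two-sample t-test are arbitrary assumptions of mine, not a reanalysis of the antidepressant data—but it shows how publishing only the significant trials inflates the apparent effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_effect = 0.2   # assumed standardized mean difference
n_per_arm = 50      # assumed trial size
n_trials = 1000     # assumed number of trials run

all_estimates, published = [], []
for _ in range(n_trials):
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    estimate = treated.mean() - control.mean()
    _, p_value = stats.ttest_ind(treated, control)
    all_estimates.append(estimate)
    if p_value <= 0.05:          # only "significant" trials leave the file drawer
        published.append(estimate)

print(f"true effect:               {true_effect:.2f}")
print(f"mean estimate, all trials: {np.mean(all_estimates):.2f}")
print(f"mean estimate, published:  {np.mean(published):.2f}")   # markedly larger
print(f"share of trials published: {len(published) / n_trials:.0%}")
```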
Textbook Prescriptions

Emphasis on the use of strict levels of significance for interpreting experimental results also has been reflected in introductory statistical textbooks (used to train future researchers). In Elementary Statistics, originally published in 1954 and republished in 1968, Janet Spence, Benton Underwood, Carl Duncan, and John Cotton present an example to illustrate the use of Student’s t-statistic. Results are presented for an experiment with five degrees of freedom that results in a t-statistic of 2.19, and the reader then is instructed to compare this value to the value in the table associated with a .05 level of significance (2.34). The conclusion that follows is, “In short, we cannot reject the null hypothesis.” The authors are being consistent with the previously noted advice from the Publication Manual and teaching by example that the values in the table (in this text, .05 and .01) are the values that matter. In this example, the t-statistic was associated with a probability of .06, but there is no way for the reader to know this.

More recent texts are designed for use with computer programs that automatically provide the specific level of significance associated with the calculated test statistic, but the use of strict levels of statistical significance for interpreting the results continues. David Moore and George McCabe, in Introduction to the Practice of Statistics, identify a four-step process for interpreting the results of a statistical test: 1) state the null hypothesis, 2) calculate the test statistic, 3) identify the associated level of significance, and 4) state a conclusion. They go on to say, “One way to do this is to choose a level of significance α.” What they don’t offer is an alternative approach to drawing conclusions.
Origin of the 0.05 Cut-Off Given the widespread and potentially problematic influence of nominal levels of significance described so far, it is of interest to consider the origin of the use of specific levels of statistical significance as strict guidelines for interpreting the results of experiments. Historically, this development is of particular interest because the originators of the statistical procedures recommended the application of rough guidelines, not strict cut-offs. For example, Howard Wainer and Daniel Robinson summarize Fisher’s approach in an article published in Educational Researcher by stating: Throughout Fisher’s work, he used statistical tests to come to one of three conclusions. When p (the probability of obtaining the sample results when the null hypothesis is true) is small (less than .05), he declared that an effect had been demonstrated. When it is large (p is greater than .2), he concluded that if there is an effect, it is too small to be detected with an experiment this size. When it lies between these extremes, he discussed how to design the next experiment to estimate the effect better. Ultimately, it may be impossible to provide a definitive answer to the question of why these nominal levels of significance have become so widespread, but one possibility is that this trend in statistical practice arose out of historical conditions essentially unrelated to statistical theory. It may be that specific levels of statistical significance are used simply because that is how R. A. Fisher structured the tables included in his 1925 monograph, Statistical Methods for Research Workers. Furthermore, it may well be that it was war and personal enmity that brought about the existence of these tables in the first place. That is, Fisher may have introduced tables structured around these levels of significance for the sole purpose of creating a new and alternative format for his tables—a format created not because Fisher believed the new format was advantageous, but because he was prevented by copyright restriction from publishing the tables in their original form. Finally, it may be that the enforcement of those copyright restrictions resulted in part from the economic difficulties faced by the journal Biometrika after World War I and in significant part from the copyright being controlled by Karl Pearson. Pearson disliked Fisher and never would have considered allowing him to reproduce the original tables.
Figure 1. A section from “Student’s” table of the z statistic Reprinted from “The Probable Error of a Mean,” “Student” (1908), Biometrika.
Figure 2. Section from Fisher’s table of the t-statistic Reprinted from Statistical Methods for Research Workers, R. A. Fisher (1925), Oliver & Boyd.
Though all of this is speculation, it is speculation based on historical evidence. The argument is presented in two parts. First, it is suggested that Fisher’s modification of the format of the statistical tables appearing in his monograph changed the way the tables were read and unavoidably moved toward a focus on specific levels of significance. Second, it is argued that the change in the format of the tables can be traced to a variety of historical factors, but that there is little evidence to support Fisher’s claim that he viewed the format as superior.
The Tables

Consider, for example, two statistical tables that existed before Fisher’s monograph: Pearson’s χ² distribution and “Student’s” t (or as “Student” originally referred to it, z) distribution. In both Pearson’s table from the 1900 publication in Philosophical Magazine and the table in “Student’s” 1908 Biometrika paper, rows and columns are defined in terms of the observed values, and the body of the table is comprised of probabilities. Figure 1 shows a section of the table from Biometrika. With this table, the researcher completes the calculations, finds the row corresponding to the sample size and the column corresponding to the magnitude of the statistic, and thus identifies the level of significance within the body of the table. So, for example,
with a sample size of five and a z equal to 1.0, the associated probability level is 0.9419. Interpolation is required for the user to identify the magnitude of the statistic that would be associated with a .01 or .05 significance level. Fisher inverted this design so the user selects the level of significance and identifies the value of the statistic associated with that level. With the earlier design, the user read the level of significance associated with the results. With Fisher’s approach, shown in Figure 2, the user finds the value of the statistic required to achieve a specified level of significance. So, for example, with five degrees of freedom, a significance level of .05 is achieved when t is at least 2.571. With this change, specified levels of significance take on a new and central importance. Fisher acknowledged that the original tables could not be reproduced due to copyright restrictions, but he described the new format of his tables as that “which experience has shown to be more convenient.” It appears to be a matter of historical fact that Fisher is the first to have published tables in this form. It is also clear that Fisher’s versions of the tables have become ubiquitous. It is impossible to prove that Fisher’s table led to the use of established levels of significance, but it is clear the new format facilitated their use, and it is difficult to believe they did not influence practice.
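The two designs amount to reading the same distribution in opposite directions, which is easy to check with modern software. The sketch below is my illustration (it relies on the standard relation t = z√(n − 1) between “Student’s” original z and the modern t statistic); it reproduces both numbers quoted above.

```python
from scipy import stats

# "Student's" 1908 design: enter with the sample size n and the observed
# statistic z; the body of the table is a probability.
n, z = 5, 1.0
prob = stats.t.cdf(z * (n - 1) ** 0.5, df=n - 1)
print(f"Student-style lookup: n = {n}, z = {z} -> P = {prob:.4f}")    # 0.9419

# Fisher's 1925 design: enter with the degrees of freedom and a chosen
# (two-tailed) significance level P; the body of the table is the critical t.
df, P = 5, 0.05
t_crit = stats.t.ppf(1 - P / 2, df=df)
print(f"Fisher-style lookup: df = {df}, P = {P} -> t = {t_crit:.3f}")  # 2.571
```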
Fisher and Gosset

One opportunity for a direct comparison of how the two forms of the table are interpreted comes from the writings of William Gosset and Fisher. Gosset (who published under the name “Student”) provided an illustration of the use of his table in his 1908 paper. The example compared the additional hours of sleep gained from the use of two separate medications. After discussing the effect of each of the medications in comparison to no medication, Gosset stated:

I take it the real point of the authors was that [medication] 2 is better than [medication] 1. This we must test by making a new series, subtracting 1 from 2. The mean value of this series is +1.58, while the S.D. is 1.17, the mean value being 1.35 times the S.D. From the table the probability is 0.9985, or the odds are about 666 to 1 that [medication] 2 is the better soporific.

Gosset makes no comparison of his results to nominal levels of significance. By comparison, Fisher uses the same data set to illustrate the use of the t-test in Statistical Methods. He calculates t to be equal to 4.06 and then concludes, “For n=9 only one value in a hundred will exceed 3.250 by chance, so the difference between the results is clearly significant.” The format of the table makes it necessary for Fisher to draw a conclusion relative to this single nominal level (p≤.01).

When Fisher published his 1925 monograph, the copyrights to Pearson’s χ² table and Student’s z table were owned by Biometrika and controlled by Karl Pearson. Fisher was unable to publish the tables in their original form without Pearson’s permission. Fisher’s comment about the advantage of this new form of the table aside, there is considerable evidence that the choice of format was the result of copyright restriction and not attributable to any advantage to the reader. During the years leading to the publication, Fisher discussed his interest in publishing the tables—in their original form—in correspondence with Gosset. Although the letters Gosset received from Fisher have been lost, Fisher carefully filed his correspondence from Gosset, and Gosset’s comments strongly suggest that, as late as the end of 1923, Fisher hoped to publish Student’s table in its original format. In July of that year, Gosset responded to an apparent inquiry about the possibility of reproducing the table from the 1908 Biometrika paper:

As to “quoting” the table in Biometrika it depends just what you mean by quoting. I imagine they have the copyright and would be inclined to enforce it against anyone. The journal doesn’t now pay its way though it did before the war and they are bound to make people buy it if they possibly can. I don’t think, if I were editor, that I would allow much more than a reference!

Subsequently, Gosset referred to the table he and Fisher had been working on, which ultimately appeared in Metron, saying, “I take it that this table is, if it gets finished in time, to be published in your book.” Fisher’s continued interest in publishing the table in its original form appears to be confirmed by a reply from Gosset, written at a point when Gosset was considering offering this new table to Pearson for publication in Biometrika:
Re your postscript about publication, I quite agree: when the thing is put together I will either send it or take it to K.P. [Karl Pearson] and will make it quite clear that you wish to have the right of publication in case you wish to include it in any book that you may be bringing out.

Clearly, Gosset assumed Fisher intended to publish the tables in their original form in December of 1923. By July of 1924, Fisher had instructed his publisher to forward proofs of Statistical Methods to Gosset; these reflected the new form of the tables. Apparently, Fisher’s change of view occurred as publication approached. In this context, it is also worth noting that Fisher continued to work with Gosset on the new table. That table ultimately was published after Statistical Methods, but it was published in the earlier format.
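Both readings of the sleep data can be reproduced directly. This is a small check of my own, assuming the original data set of 10 patients (hence 9 degrees of freedom); small discrepancies with the quoted figures are rounding in the original tables.

```python
from scipy import stats

mean_diff, sd_n, n = 1.58, 1.17, 10   # Gosset's figures; his S.D. uses divisor n
z = mean_diff / sd_n                  # "1.35 times the S.D."
p = stats.t.cdf(z * (n - 1) ** 0.5, df=n - 1)
print(f"Gosset: z = {z:.2f}, P = {p:.4f}, odds roughly {p / (1 - p):.0f} to 1")
# P comes out at about 0.9985, i.e., odds on the order of Gosset's "about 666 to 1"

t = z * (n - 1) ** 0.5                # Fisher's t (he reported 4.06)
t_01 = stats.t.ppf(1 - 0.01 / 2, df=n - 1)
print(f"Fisher: t = {t:.2f} versus the tabled 1% point {t_01:.3f}")   # 3.250
```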
Fisher and Pearson The impact of the war and the continuing economic difficulties for Biometrika to which Gosset referred may have been real factors in restricting use of the copyright, but when matters relating to Pearson arose, it is clear Gosset engaged in a level of diplomacy that recognized the extremely strained relationship between Pearson and Fisher. In a 1922 letter, Gosset commented, “In most of your differences with Pearson I am altogether on your side …” Few professional antagonisms have been more public than that between Fisher and Pearson. It is difficult to identify the original source of the antagonism. From the start, Pearson may have felt threatened by Fisher’s mathematical ability. In 1912, Gosset received a letter from Fisher providing a mathematical proof of Student’s formula for the probable error of the mean. Such a proof was missing from Student’s 1908 paper and was beyond Gosset’s mathematical ability. Gosset forwarded the proof to Pearson for his opinion with the suggestion that the
proof be published in Biometrika. Pearson responded, “I do not follow Mr. Fisher’s proof …” The relationship between Fisher and Pearson certainly was not helped by Fisher’s identification of an error in the way Pearson applied the χ² test in his 1900 paper. In Fisher’s article, he identified the importance of degrees of freedom in interpretation of the table; the suggestion that Pearson had failed to understand what was arguably his own most important contribution to mathematical statistics must have been a considerable affront. Regardless of the origin of this animosity, it was clear Pearson took his anger to the grave and that even Pearson’s death did not end Fisher’s animosity. Pearson’s final Biometrika publication was an attack on a paper by R. S. Koshal, who had argued in print for the superiority of Fisher’s maximum likelihood estimation over Pearson’s method of moments. Although Koshal’s paper was the immediate target of the attack, Pearson was explicit about his attack being directed at Fisher. The paper had a single author, Koshal, but Pearson refers to it as the Koshal-Fisher paper. Pearson complained about the methodology used by Koshal and identified the fact that some of the tabulated results contained computational errors. Pearson, of course, pointed these errors out: “Had Professor Fisher … investigated … Koshal’s arithmetic, he would probably have been less dogmatic about the importance of the Koshal-Fisher results.” Pearson then could not resist repeatedly reminding his readers of these errors: “Two of which be it remembered are in error,” “… there are blunders in the arithmetic,” “… owing to faulty arithmetic,” and “… also springs from the occurrence of blunders in Koshal’s arithmetic.” Shortly after Pearson’s death, Fisher responded in the pages of the Annals of Eugenics with a paper titled “Professor Karl Pearson and the Method of Moments.” Fisher identified what he considered to be errors in the procedure implemented by Pearson. Based on these errors, he referred to Pearson as a “clumsy mathematician” and went on to say, “Had it not been for his arrogant temper, his taste for numerical exemplification might well have saved him from serious theoretical mistakes.” Judging the magnitude of Pearson’s errors to be 20 times those of Koshal’s, Fisher continued, “The English language does not seem to possess a word 20 times as forcible as ‘blunder.’ May we hope that some Eastern tongue known to Koshal is more amply provided.” In 1950, more than a decade later, Fisher’s introduction to this same paper reprinted in Contributions to Mathematical Statistics showed his hostility had not diminished: “Pearson was an old man when it occurred to him to attack Koshal, but it would be a mistake to regard either the errors or the venom of that attack as a sign of failing powers. … If peevish intolerance of free opinion in others is a sign of senility, it is one which he had developed at an early age. Unscrupulous manipulation of factual material is also a striking feature of the whole corpus of Pearson’s writing …”
Summary

History does not provide the sort of randomized experiments Fisher would have advocated; we cannot know whether the loose guidelines advocated by Fisher for interpreting levels of significance would have evolved into the strict rules they
became if Fisher had published the tables in their original form. It is difficult to believe the format did not have an impact on the way the tables were used and the specific values perceived. Was the mutual dislike between Pearson and Fisher the factor that led to Fisher’s development of a new format for statistical tables (highlighting nominal levels of significance)? Stephen Stigler, in the article that follows, suggests that the complexity of presenting the tables for Fisher’s F statistic may have led to Fisher’s choice of format for his new tables. Surely, Fisher’s decision was influenced by a combination of factors, but Fisher’s own words suggest that copyright restriction led to the construction of the new tables. Fisher acknowledged in Statistical Methods for Research Workers that he did not present the previously available table of the χ² owing to copyright restriction. His correspondence with Gosset during the period in which he was working on this monograph similarly made clear his interest in publishing the original form of Student’s table. Was it inevitable that the owner of the copyright for these tables would prevent publication by others? The simple answer is “no.” To this day, many of us learn statistics from texts that credit Fisher as the source of the tables; certainly, Fisher and his publisher—Oliver and Boyd—were not in the business of losing money. Fisher allowed G. Udny Yule and M. G. Kendall to reproduce his table of the z distribution when he was still producing revised editions of his own monograph, and he allowed the same consideration to nearly 200 other authors. Clearly, it is possible that the commonly used nominal levels of significance would have gained prominence without the series of events described in this article. But, the available evidence suggests the conditions created by World War I and the subsequent economic difficulties of Biometrika—combined with Pearson’s tight-fisted nature and, most of all, his animosity for R. A. Fisher—produced the circumstances that led to Fisher’s tables establishing nominal levels of significance.
Further Reading

Box, J.F. (1978) R. A. Fisher: The Life of a Scientist. New York: John Wiley & Sons.

De Angelis, C., Drazen, J.M., Frizelle, F.A., Haug, C., Hoey, J., Horton, R., et al. (2004) “Clinical Trial Registration: A Statement from the International Committee of Medical Journal Editors.” Annals of Internal Medicine, 141:477–478.

Gigerenzer, G. (2004) “Mindless Statistics.” Journal of Socio-Economics, 35:587–609.

Pearson, E.S. (1990) “Student”: A Statistical Biography of William Sealy Gosset. (R. L. Plackett & G. A. Barnard, eds.). Oxford: Clarendon Press.

Rosenthal, R. (1979) “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin, 86:638–641.

Turner, E.H., Matthews, A.M., Linardatos, E., Tell, R.A., Rosenthal, R. (2008) “Selective Publication of Antidepressant Trials and Its Influence on Apparent Efficacy.” New England Journal of Medicine, 358:252–260.

Wainer, H. and Robinson, D.H. (2003) “Shaping Up the Practice of Null Hypothesis Significance Testing.” Educational Researcher, 32:22–30.
Fisher and the 5% Level
Stephen Stigler
Surely R. A. Fisher played a major role in the canonization of the 5% level as a criterion for statistical significance, although broader social factors were involved. Fisher needed tables for his 1925 book and, evidently, Karl Pearson would not permit the free reproduction of the Biometrika tables, so Fisher computed his own. Fisher found it convenient to table values in the extremes for levels such as 10%, 5%, 2%, 1%—roughly halving the level with each step. One simple explanation for the format he selected lies in the fact that the book introduced “analysis of variance,” or ANOVA. For most readers, this would be their first exposure to ANOVA, and Fisher needed a way to make the new test accessible—essentially the F-test, although he preferred to work in terms of z = log(F). The table here was entirely novel, requiring entry via two parameters: the numerator and denominator degrees of freedom (df). It would have been impractical to provide a full table of the distribution for each pair of values: With the 10 levels of both dfs he wished to include, 100 tables would have been required if he gave the same level of detail he gave for his normal distribution table, or 10 tables if he gave the reduced level of detail that Gosset gave in his 1908 table for the t-distributions. So, Fisher initially settled on only giving one table for the 5% point. Once that was decided, it is not implausible that Fisher chose (in a book for practical workers) to make the other tables conform to that same simple format. This was not a huge task, and it had the bonus of casting all assessments of significance in the same accessible form.

The first edition (1925) of Fisher’s book Statistical Methods for Research Workers had six tables:

I. and II. Tables of the inverse cumulative normal distribution (of z in terms of P, where P = F(−z) + 1 − F(z) = Pr{|Z| > z} and Z has a standard normal distribution). He gave this for P = .01 to .99 (increments of .01) and for P = .001, .0001, ..., .000000001.
III. Percent points y, where P = 1 − F(y), for chi-square, df = 1, 2, ..., 30, and P = .99, .98, .95, .90, .80, .70, .50, .30, .20, .10, .05, .02, .01.

IV. Percent points for the t-distributions, df = 1, 2, ..., 30, ∞, and P = .9, .8, .7, .6, .5, .4, .3, .2, .1, .05, .02, .01.

V. Percent points for the correlation coefficient r, for n = 1 (1) 20 (5) 50 (10) 100 and for P = .1, .05, .02, .01. He also gave (as Table V (B)) the hyperbolic tangent transformation of r.

VI. Table VI gave only the P = .05 percent points for the distribution of z (the log of the F-statistic) by numerator df and denominator df, for df = 1, 2, 3, 4, 5, 6, 8, 12, 24, ∞.

By the third edition (1930), he had added a table giving the 1% points and enlarged the range of denominator df considerably. Note that only Fisher’s Table VI strongly emphasized the 5% point. The others gave varying degrees of extended coverage, especially for the Normal, t, and chi-square distributions, where they gave a pretty good idea of each whole distribution. Later editions of Statistical Methods for Research Workers (from the seventh of 1938) moved all the tables from the end of the book and interspersed them through the text. All these tables and more were given in Fisher and Frank Yates’ book, Statistical Tables for Biological, Agricultural and Medical Research. There, the table for (essentially) the F-distribution was expanded to include a range of values from the 20% to 0.1% points.

My own view is that while Fisher’s initial Table VI (but only that table) fixed attention at the 5% level (rather than, say, 6%, 10%, or 2%), that fixation is largely the result of a social process extending back well before Fisher. Even in the 19th century we find people such as Francis Edgeworth taking values “like” 5%—namely 1.5%, 3.25%, or 7%—as a criterion for how firm evidence should be before considering a matter seriously.
Odds of about 20 to 1, then, seem to have been found a useful social compromise with the need to allow some uncertainty—a compromise between (say) .2 and .0001. That is, 5% is arbitrary (as Fisher knew well), but fulfills a general social purpose. People can accept 5% and achieve it in reasonable-size samples, as well as have reasonable power to detect effect sizes that are of interest. In my 1986 book, The History of Statistics, I speculate that the lack of such a moderate standard of certainty was among the factors that kept Jacob Bernoulli and Thomas Bayes from publishing. The use of Fisher’s tables only served to make the choice more specific. One may look to Fisher’s table for the F-distribution and his use of percentage points as leading to subsequent abuses by others. Or, one may consider the formatting of his tables as a brilliant stroke of simplification that opened the arcane domain of statistical calculation to a world of experimenters and research workers who would begin to bring a statistical measure to their data analyses. There is some truth in both views, but they are inextricably related, and I tend to give more attention to the latter, while blaming Fisher’s descendants for the former. After all, a perceptive 1919 article by the psychologist Edwin G. Boring warning of the potential misuse of what we now call statistical significance is ample evidence that the abuse predated Fisher.
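Any entry of the tables described above is now a single function call. A brief sketch (my illustration, using scipy) that reproduces a few of Fisher's percent points, respecting the conventions noted above—an upper-tail P for Table III and a two-tailed P for Table IV:

```python
from scipy import stats

# Table III (chi-square): the tabled value y satisfies P = 1 - F(y).
for df in (1, 4, 30):
    print(f"chi-square, df = {df:2d}, P = .05 point: {stats.chi2.ppf(0.95, df):.3f}")
# 3.841, 9.488, 43.773

# Table IV (t): P is two-tailed, so the P = .05 point is the 0.975 quantile.
for df in (5, 10, 30):
    print(f"t,          df = {df:2d}, P = .05 point: {stats.t.ppf(0.975, df):.3f}")
# 2.571, 2.228, 2.042
```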
Further Reading

Boring, Edwin G. (1919) “Mathematical vs. Scientific Significance.” Psychological Bulletin, 16:335–338.

Fisher, Ronald A. (1925) Statistical Methods for Research Workers (first ed.), Edinburgh: Oliver & Boyd.

Fisher, Ronald A., and Yates, Francis (1938) Statistical Tables for Biological, Agricultural and Medical Research (first ed.), London: Oliver & Boyd.

Stigler, Stephen M. (1986) The History of Statistics: The Measurement of Uncertainty Before 1900, Cambridge, Mass.: Harvard University Press.
How to Determine the Progression of Young Skiers?
Moudud Alam, Kenneth Carling, Rui Chen, and Yuli Liang
Skiers at the ages of 14–17 as forerunners in Svenska Skidspelen 2006. The style of skiing is skate.
Sports are popular among children in most countries, and many parents enjoy watching their children participate in sports as leisure activities. Various surveys suggest children play sports because it provides an arena for social relations, and as the child grows older, the stimulus from getting positive feedback from training in terms of race results plays a greater role. A sad aspect of sports and children is the high drop-out rate in the early teen-age years. To the extent the drop-out is caused by an emerging interest in other activities or a greater focus on studies, it is fine. But a substantial number of teen-agers drop out because they do not get positive feedback from the training in terms of racing results. Sports associations could probably take various measures to reduce the drop-out rate, and an ongoing debate concerns such measures.
One explanation for the high drop-out frequency is that puberty affects individuals differently and the differential effects impact performance in sports. It is well-known that puberty occurs at somewhat different ages for individual teenagers. The risk of dropping out of sports could be high for individuals who experience puberty late because they might perform worse than their peers. That is, it is possible they would be dejected by comparatively weak racing results. Val Abbassi summarized male and female progression curves and the puberty effect in an article in American Academy of Pediatrics titled “Growth and Normal Puberty.” According to this work, the female curve is at its steepest around the age of 12. For males, it is around 14 years of age and steeper than for females. These curves are conventionally assumed to approximate the progression in sports. During the period of rapid change, the estimation of the progression curve is challenging.
Figure 1. Average skiing velocity for Julia Forsmark, computed from her participation in two annual races in six consecutive years
Table 1—Number of Skiers Divided by Gender and Age

Age      Boys   Girls   Total
18         23      15      38
17         25      24      49
16         32      25      57
15         27      32      59
Total     107      96     203
Repeated Race Measures Are Unreliable in Skiing

To get to the core of the problem, consider a girl named Julia Forsmark whose progression in skiing velocity is shown in Figure 1. The figure shows the average racing velocities in two races each year from the age of 10 to the age of 15. (She turned 15 in 2008.) Julia’s slowest speed was recorded at 10 years, as one would expect. However, her fastest race was at the age of 12, when she reached almost seven meters per second. Is her recent training scheme bearing no fruit? Did she experience an early puberty effect and then level off? Or are the data unreliable? The answer is that the data are unreliable in the sense that there are confounding variables that need to be taken into consideration. To understand why, one needs to know a bit about cross-country racing. Obviously, the velocity of skiing depends on the skill of the skier, but many other factors will influence the speed in a particular race. Wind and temperature are two such factors; strong wind and low temperatures generally decrease the speed. Snow condition is an even more important factor, as the friction of the skis on the snow depends on the type of snow. Fresh snow and a low dew point will make the skis glide slowly on the snow surface, whereas icy, granular snow will make the skis glide fast.
How the skis interact with the snow condition is also affected by the ski-waxing. The classical style of skiing uses glide-wax on one part of the ski, which is always in contact with the snow, and grip-wax on another part of the ski, which is in contact with the snow only when the skier is kicking forward. In freestyle skiing (or skate), only a glide-wax is used, as the skis are constantly in contact with the snow. The skier must prepare the waxing before the race; she cannot change the skis during the race. Consequently, the race result will be contingent on whether the applied wax was optimal for the conditions at hand. Yet another factor that will influence the speed is the profile of the track (e.g., the total climbing meters and the sharpness of curves along the track). Most races take place in forests, and the organizers may need to modify the tracks from year to year depending on ongoing timber logging and the snow depth. Hence, the tracks for the same race might differ over time. The Swedish Ski Association and International Ski Federation (FIS) provide recommendations for the profile of a track, but the race organizers are fairly free to decide the details. Another influencing factor is the actual distance, as the velocity is computed as the ratio of the stipulated distance to racing time. The stipulated racing distances are, at best, indicative of the actual length of the track. Indeed, due to the complicating factors, cross-country ski organizers are not careful in setting a track that matches the stipulated distance. Even though it is hard to tell, we would not be surprised if a stipulated racing distance of 3 kilometers was revealed to be anything between 2.5 and 3.5 kilometers. The problem of unreliable race data due to confounding factors extends to other sports. The reader might, for instance, want to reflect upon potential confounders in mountain and road biking, rowing, and cross-country running. A distinction can be made between sports. Some sports, such as swimming and indoor track and field, have standardized race environments. Others, such as those mentioned in this article, are subject to more confounding factors and variability in conditions. For the former class of sports, the repeated race data are reliable because the race environment is standardized. Therefore, it is fairly simple to determine an adolescent’s progression. For the latter class of sports, it is more difficult to measure progression.
Examination of Young Skiers Selected from Two Races

The race data for Julia shown in Figure 1 came from her participation in Lilla SS, which has been held annually in mid-February since 1974 in Falun, and Morapinglan, which has been held annually around New Year’s Eve since the 1980s in Mora. Her race data come from 2003–2008 (i.e., six years of data). We have selected all cross-country skiers ages 15–18 who participated in one or both of the races in 2008. Thereafter, we have traced the skiers’ results back to 2003. Table 1 shows the number of skiers in the sample, divided by gender and age. There are several reasons why the two races—Lilla SS and Morapinglan—were selected. Lilla SS started the same year the World Championship in skiing (cross-country skiing, ski-jumping, and Nordic combined) was held in Falun. Lilla SS benefited from a fast technological development of time recording and electronic recording due to the staff members involved with the championship. From the beginning, the race
Table 2—Median Velocity in Lilla SS 2004 (m/s)

Age     10     11     12     13     14     15     16     17     18
Boys    2.87   3.28   3.51   3.68   3.85   5.25   5.33   5.40   5.40
Girls   2.79   3.21   3.27   3.53   3.68   4.44   4.68   4.80   4.80

Table 3—Deciles of Boys’ (18 Years Old) Velocities in Lilla SS 2004 and MP 2008

Decile          10%    20%    30%    40%    50%    60%    70%    80%    90%
Lilla SS 2004   5.07   5.14   5.21   5.35   5.40   5.46   5.61   5.71   5.81
MP 2008         5.12   5.18   5.26   5.32   5.40   5.53   5.58   5.63   5.76
has attracted skiers ages 7–20, who have tried to return every year. The same is true for Morapinglan. Moreover, Mora and Falun are geographically close and belong to the same skiing district. Therefore, the likelihood of a skier participating in both races is high. The races also have in common the fact that skiers are racing individually in the sense that one skier at a time departs and her racing time is recorded. An important consequence is that the skier’s racing time can be regarded as independent, which renders the statistical modeling simpler. For the 203 skiers, we have 967 racing results, which means not every skier participated in all 12 occasions. We have six or more repeated measures for 35% of the skiers, and the average number of racing times per skier is 4.8. We do not believe this will bias the results because the main reasons for a skier to not participate are illness, the race falling on the same date as another important race, or the skier being committed to another sport having an event on the same date. Sports competitions for adolescents are usually organized in age classes, and this is true for skiing. Conventionally, participants born in the same year would compete in the same class. As a consequence, there might be an age difference of one year between the participants, and the data obtained from Lilla SS and Morapinglan makes it possible to deduce the year of birth for the skiers in the sample. Such a measure of age is too crude for a valid estimate of the progression curve. Luckily, we have had access to the skiing clubs’ member registers, from which we have retrieved the date of birth of all the skiers in the sample. We used this information jointly with the race dates to compute the age of the skiers, measured in days and converted into fractions of years.
Standardization of the Velocity

Figure 1 showed that Julia skied her fastest race, at almost 7 m/s, at the age of 12 (in Morapinglan). In Lilla SS the same year, her velocity was 4 m/s. These two values could not possibly describe the difference in her performance that year, but rather the difference in the confounding factors of the two races. To render the velocities of the 12 race occasions comparable, we standardized the velocity by using Lilla SS in 2004 as the reference race. That occasion was selected as
reference because there is reason to believe the race distances are accurate and other conditions were fairly stable. Moreover, the number of participants in each gender and age class was sufficiently large. Table 2 shows the median velocity for each racing class by age and gender. The median is calculated based on all participants, not only the 203 skiers included in this study of progression. The skiers who are ages 17 and 18 race in the same class at Lilla SS, which is the reason the medians are equal for these two age groups. The median velocity was calculated instead of the mean because it is fairly common to find outliers (e.g., breaking a ski pole during the race would make the velocity very low and aberrant). If we believed the confounders altered the distribution of velocity by shifting its center, it would be natural to add or subtract the races’ medians. However, we believe the variance of the velocity is also affected by the confounders. We therefore compute the standardized velocity by scaling the skier’s velocity at one race occasion by a factor equal to the ratio of the median velocity in Lilla SS 2004 for the relevant racing class and the corresponding median velocity at the race occasion. After standardizing the velocity, we believe the distribution should be about the same for the 12 race occasions. This belief is a consequence of seeing no reason why the ability of the skiers should differ between the races. One potential problem with scaling the velocity might be that the standardizing only works for the center of the distribution, but fails to render the extremes—such as the top velocities—comparable. To check this, we compared the deciles of the standardized velocities by age and gender. Table 3 provides one example of this checking procedure. It shows the deciles for 18-year-old boys at Lilla SS in 2004 and the deciles for boys at the same age in Morapinglan in 2008. We find the standardization has made the distributions similar for the 12 race occasions and different classes.

Figure 2 shows box plots of the standardized velocities for each age. Most noteworthy is the substantial jump in velocity that appears to occur at about 14. For the boys, the jump is high, whereas the girls seem to have a more prolonged and weaker progression around 14.
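For readers who want to try the median-ratio standardization described above on their own race data, here is a minimal sketch. The data frame, column names, and velocities are made up for illustration; in the real application the class medians are computed from all participants in a racing class, not only the sampled skiers.

```python
import pandas as pd

races = pd.DataFrame({
    "skier":     ["A", "A", "B", "B"],
    "gender":    ["girl", "girl", "girl", "girl"],
    "age_class": [12, 12, 12, 12],
    "occasion":  ["LillaSS2004", "MP2004", "LillaSS2004", "MP2004"],
    "velocity":  [3.30, 6.60, 3.10, 6.00],   # m/s, made-up numbers
})

# Median velocity per racing class (gender x age) at each race occasion.
medians = races.groupby(["occasion", "gender", "age_class"])["velocity"].median()

def standardize(row):
    # Scale by the ratio of the reference-race median (Lilla SS 2004) to the
    # median at the occasion actually raced, within the same class.
    ref = medians[("LillaSS2004", row["gender"], row["age_class"])]
    here = medians[(row["occasion"], row["gender"], row["age_class"])]
    return row["velocity"] * ref / here

races["std_velocity"] = races.apply(standardize, axis=1)
print(races)
```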
Figure 2. The distribution of the 203 skiers’ standardized velocity by age and gender. Boys to the left and girls to the right.

The Four-Parameter Logistic Model

Figure 2 suggests the relationship between velocity and age is neither linear nor quadratic. Instead, we use the four-parameter logistic (FPL) model, which is suitable for modeling an S-shaped relationship. Details about this model can be seen in Mixed-Effects Models in S and S-PLUS by José Pinheiro and Douglas Bates. The nonlinear logistic growth function is defined as

Velocity = β1 + (β2 − β1) / (1 + exp[(β3 − Age)/β4]),   (1)

where the parameters have a meaningful interpretation in this context. Parameters β1 and β2 are the lower and upper asymptotes for the response variable. The upper asymptote is β2, which states the velocity the skier will eventually reach (provided she continues the training and racing). The lower asymptote β1 refers to the lowest velocity of the skier (as a toddler, presumably) and can be taken as another example of the danger of extrapolating outside the range of data. The parameter β3 is the age of the skier when she has reached halfway to her highest velocity—we think of it as identifying the age at puberty. The parameter β4 expresses the rate of progression at puberty. To illustrate the role of β3 and β4 on the shape of the curve, we provide three examples in Figure 3. In the examples we set β1 and β2 equal to three and six, which correspond roughly to the estimates for the boys’ progression. The four-parameter logistic model can be estimated by use of the freeware R, even though we extend the model to allow for random effects. As it is reasonable to believe the individual progression curve might differ from individual to individual in the sample, the random effects for the four parameters are required for estimating an individual curve for each skier.

Figure 3. Examples of the four-parameter logistic curve. β3 and β4 are set to 14 and 1 (solid line), 14 and 5 (dashed line), and 15 and 1 (dotted line).
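The authors fit this model in R, with random effects for all four parameters. As a simplified illustration of the fixed part of the curve only, here is a sketch that fits equation (1) to made-up race results for a single skier with scipy's curve_fit; the data values and starting values are assumptions, not the study data.

```python
import numpy as np
from scipy.optimize import curve_fit

def fpl(age, b1, b2, b3, b4):
    """Four-parameter logistic curve of equation (1)."""
    return b1 + (b2 - b1) / (1.0 + np.exp((b3 - age) / b4))

# Hypothetical (age, standardized velocity) pairs for one skier.
age = np.array([10.1, 11.0, 12.1, 13.0, 14.1, 15.0, 16.1, 17.0])
velocity = np.array([3.4, 3.5, 3.6, 3.8, 4.6, 5.1, 5.2, 5.3])

start = (3.0, 6.0, 14.0, 1.0)   # lower/upper asymptote, age at halfway, rate
params, _ = curve_fit(fpl, age, velocity, p0=start)
b1, b2, b3, b4 = params
print(f"lower asymptote {b1:.2f}, upper asymptote {b2:.2f}, "
      f"halfway age {b3:.2f}, rate {b4:.2f}")
```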
Table 4—Fixed Effects and Random Effects Estimates (Standard Errors in Parentheses)

Fixed effects               Boys              Girls
β1                          3.518 (0.032)     3.179 (0.044)
β2                          5.327 (0.034)     4.871 (0.063)
β3                          14.044 (0.041)    14.098 (0.087)
β4                          0.219 (0.017)     0.723 (0.056)

Random effects (σj; j = 1, 2, 3, 4)
                            Boys              Girls
β1i                         0.199             0.186
β2i                         0.277             0.248
β3i                         0.294             0.343
β4i                         <0.001            <0.001
Residuals                   0.226             0.180
No. of observations         495               302
No. of skiers               107               38

Figure 4. Typical progression curves for boys (solid curve) and girls (dashed curve)
Including the random effects for the four parameters, equation (1) is re-expressed as

Velocityit = β1i + (β2i − β1i) / (1 + exp[(β3i − Ageit)/β4i]) + εit,   (2)

where εit ∼ N(0, σ²) independently for i = 1, 2, …, n (the number of boys or girls) and t = 1, 2, ..., 12 (race occasions), with βji = βj + bji, bji ∼ N(0, σj²), and εit ⊥ bji for j = 1, 2, 3, 4.

In equation 2, βj represents the population average of the βji parameters. For example, β4i is the slope at puberty for skier i. The term bji is the individual (random) effect for skier i on parameter j. For example, b4i is the deviation from β4 for skier i; β4i = β4 + b4i. The random effects for each parameter have a distribution with a constant variance that can be different for each parameter. The parameter σj² represents the between-individual variation in a certain aspect (βj). There also is a random error term: εit is a random error term for individual i at a certain race occasion t. The nonlinear random effects model (or nonlinear mixed model) provides a relationship between velocity and age that can be regarded as a typical curve and a skier’s deviation from this typical curve. To enable estimation of the model, we assume the random effects are normally distributed with a common variance.
Progression Curves for Boys and Girls

We experienced difficulties in estimating the progression curve for girls, so we had to omit data from girls for whom we had fewer than six repeated race measures. As a sensitivity analysis, we estimated the boys’ progression curve with all skiers and those with six or more measures. For the boys, the results were
almost identical, and therefore we believe the girls’ progression curve is reliable, even though some observations were removed. Figure 4 shows the estimated, typical progression curves for boys and girls. The estimated progression curves are very much in line with the theoretical prediction from the medical literature on puberty. The girls start their improvement at about 12 and show a strong improvement that extends over four years. At about age 13, they are as fast as the boys. The typical boy, on the other hand, will experience an enormous improvement when the velocity increases by 50% in one year around the age of 14. The typical boy will be about 10% faster than the girl after puberty, an observation that is in line with the common belief of gender difference. However, most people involved in cross-country racing would claim boys and girls are equally fast before the teen-age years. Some would even say girls would be faster due to an earlier motoric development. The estimated curves do not confirm that belief.
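Plugging the fixed-effect estimates from Table 4 into equation (1) gives a quick numerical reading of Figure 4 (a check of my own, not part of the original analysis); it reproduces the near-equality at 13 and the roughly 10% gap after puberty.

```python
import math

def fpl(age, b1, b2, b3, b4):
    return b1 + (b2 - b1) / (1.0 + math.exp((b3 - age) / b4))

boys  = (3.518, 5.327, 14.044, 0.219)   # fixed effects from Table 4
girls = (3.179, 4.871, 14.098, 0.723)

for age in (12, 13, 14, 15, 18):
    vb, vg = fpl(age, *boys), fpl(age, *girls)
    print(f"age {age}: boys {vb:.2f} m/s, girls {vg:.2f} m/s, ratio {vb / vg:.2f}")
# The curves are nearly equal at 13; by 18 the typical boy is about 9-10% faster.
```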
Figure 5. Progression curves for eight girls who participated in 10 or more races. The solid lines show the typical progression, whereas the dashed lines show the individual skier’s curve. The circles show the velocity at different races. (Panels shown: D16-1, D17-1, D18-9, D16-4, D16-2, D16-15, D16-12, D15-7.)
The individual variation in the progression is quite substantial, as one would expect. However, the standard deviation of the random effect parameter β4 is practically zero, which means the rate of progression around puberty is the same for the skiers (though there is an important gender difference as evident from Figure 4). The age of the skiers when this puberty effect appears varies a lot. Taking the estimate of the standard deviation for the random parameter β3 at face value and recalling it is assumed to be normally distributed, one can deduce that a difference of one year between two skiers is fairly common for both boys and girls. Finally, the estimated standard deviations for the two asymptote parameters show there is variation in the ability of the skiers before and after puberty, but that variation seems to be similar in magnitude for boys and girls. The variation of the residuals also is worth noting, as the residuals could be interpreted as how much one skier deviates in one particular race from what would be a typical result for her. We have examined the residuals, and they provide interesting information to the individual skier that we do not disseminate here. We note, however, that the residual analysis suggests the four-parameter logistic model does not entirely fit the progression of boys around the age of 10. The model suggests a minor progression between 10 and 12 years of age, whereas we believe the actual progression to be somewhat stronger based on alternative, more complicated models.
Skiers’ Progression Curves

Figures 5 and 6 present a time series display of the progression of velocity for the individual girls and boys, along with the goodness of fit of the FPL model. The solid lines present the typical (or population-specific, when the random effects are set to 0) progression curves and the broken lines present the individual-specific (i.e., when the random effects are set to their predicted values) progression curves. To distinguish individual skiers, we create and assign a unique identification number (ID) to each individual. For example, we denote Julia with the ID D15-7. We include only those girls and boys who participated in 10 or more races in the figures, as such a display is not very informative for those who have only a few observations. In Figure 1, we took a female skier—Julia—as an example of race performance before standardizing and modeling the data. In Figure 5, we show her estimated progression curve (Panel: D15-7) jointly with the typical curve for the girls. The figure also provides her (standardized) velocity at the 12 races in which she has taken part. First, one can note that both the observations and her progression curve suggest a steady improvement. Before the teen-age years, she was about average, but she has since been faster than average. The progression curve suggests she will stay above average in the years to come. Skier Johan Eriksson provides another example (see Panel: H18-2 in Figure 6). He was very fast before his teen-age years and seems to have entered puberty relatively early. After puberty, he is still fast compared with the typical male skier, but less so compared with his childhood.
Figure 6. Progression curves for six boys who participated in 10 or more races. The solid lines show the typical progression, whereas the dashed lines show the individual skier’s curve. The circles show the velocity at different races. (Panels shown: H16-12, H18-2, H17-7, H15-8, H18-7, H16-6.)
Discussion

We found results that surprised us. One is the boys’ puberty effect, which is stronger than we anticipated. Second is the accuracy of the model for individual skiers as exemplified in figures 5 and 6. The latter fact suggests it would be possible to use the model to make statements about the individual’s future progression with good precision. Third is the magnitude of the confounding factors, which implies it is practically impossible to evaluate one’s progression based on race data. In view of this difficulty, it has been gratifying to see that the standardizing method and statistical model have enabled a successful disclosure of the underlying progression curves. As an alternative to standardization, one can think about collecting information on confounding variables such as temperature, snow condition, and wind condition and including them in the statistical model. We did not consider including the confounding variables in the four-parameter logistic model, as precise data on the confounding variables are difficult to obtain. To clarify, let us consider the wind condition. The formal data one can obtain from the weather bureau are a measure of the average speed and direction of the wind flow over a wide area around the skiing track, whereas the data we need concern specific parts of the track, as the track may be partially open and partially surrounded by trees. Therefore, the effects of the wind condition on a skier’s performance are different in different parts of the track. Moreover, it is well-known that the wind condition changes frequently; hence, a continuous measure of the wind condition may be required. Similar reasoning applies to the other confounding variables. With the aid of modern technology, such as a GPS device, the organizers might control the exact length of the skiing track. In that case, the velocity of the skiers at different races would be comparable. However, the organizers are reluctant to commit to preparing a ‘perfect track’ for a couple of reasons. First, their only intention is to find the first and second places in a specific race. Second, due to the sudden fluctuation of weather conditions, they often have to decide about the specific tracks to be used just a few hours before the race starts. Therefore, the preparation of a perfect track is often impossible.
It might be interesting to note that adult skiers try to circumvent the problem of unreliable race data by relying on laboratory testing of physical strength, as described in "Assessment of the Reliability of a Custom Built Nordic Ski Ergometer for Cross-Country Skiing Power Test," published in the Journal of Sports Medicine and Physical Fitness. However, no one has been able to demonstrate that lab test results obtained in a noncompetitive setting are valid for a competitive race setting. There is one direct application of the estimated progression curves that could help in lowering the drop-out rate. The race organizers could provide, in addition to the race results, individual progression curves that would encourage the teen-agers to focus on their own development, rather than their relative ranking in the races. Another application would be to define race classes based on the skiers' progression, rather than their age, even though this is unlikely to be implemented in the short term. We have strong confidence in the model, but one should note that the variation in the puberty effect might be underestimated. We sampled 15–18-year-olds in 2008, which means we selected skiers conditional on their not having dropped out because of puberty. It is possible, perhaps even likely, that some skiers dropped out before age 15 because of late puberty. If so, the model would have underestimated the actual variation in the puberty effect. There is scope for analyzing drop-outs as a function of the progression curve, which could be done by following a sample of, say, 10-year-old skiers and examining whether dropping out is strongly related to late puberty.
Further Reading Abbassi, V. (1998) "Growth and Normal Puberty." Pediatrics, 102:507–511. Bortolan, L., Pellegrini, G., Finizia, G., and Schena, F. (2008) "Assessment of the Reliability of a Custom Built Nordic Ski Ergometer for Cross-Country Skiing Power Test." Journal of Sports Medicine and Physical Fitness, 84:177–182. Pinheiro, J.C. and Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS. New York: Springer.
The Birthday Matching Problem When the Distribution of Birthdays Is Nonuniform W. J. Hurley
“How many people must be in a room before the probability that some share a birthday, ignoring year and ignoring leap days, becomes 50 percent?” — Richard von Mises
If birthdays are uniformly distributed and independent of one another, then the answer is surprisingly few: 23 will bring the probability to just over 50%; with 40, a match happens close to 90% of the time. This problem is a good illustration of how our intuitions are sometimes poor when it comes to assessing the probabilities of chance events.
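These figures follow from the classical calculation under uniform, independent birthdays (ignoring leap days):
$$\Pr(\text{at least one shared birthday among } n \text{ people}) \;=\; 1 - \prod_{i=0}^{n-1}\frac{365 - i}{365},$$
which is approximately 0.507 for $n = 23$ and 0.891 for $n = 40$.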
[Photo: Richard von Mises. Courtesy of Smithsonian Institution Libraries, Washington, DC]
In a popular blog article, Alan Bellows wrote the following: So does this mean that you can walk into a math class of 40 students, bet them that at least two people in the room share a birthday, and win 90% of the time? Not exactly. In real life, where math is not particularly welcome, birthdays are not distributed perfectly throughout the year. More people are born in the springtime, which throws the numbers off. He is suggesting that the probabilities are not as high when the distribution of birthdays is nonuniform. This certainly doesn’t agree with my intuition. Nor is it likely to agree with the intuitions of most readers of CHANCE. Surely the probability of a match increases as the distribution moves away from a uniform distribution.
It is well known that the distribution of National Hockey League (NHL) player birthdays is nonuniform. There are many more players born in the first six months of the year than in the last six months. Consequently, if we looked at a collection of NHL teams (each team has a roster of 24), we would expect to see a higher percentage of teams with a match than what the von Mises probability suggests. Let’s look at some theory, and then ask, “What do the data show?”
A Sketch of the Proof That the von Mises Probability Is a Lower Bound Suppose there are only two people and $d_0$ possible birthdays in a year. We assume that $d_0 = 365$ and the days are numbered consecutively $1, 2, \ldots, d_0$. Let the probability a person is born on day $k$ be $p_k$, so that $\sum_k p_k = 1$. Assuming independence, the probability that the two people have the same birthday is
$$p_1^2 + p_2^2 + \cdots + p_{d_0}^2. \qquad (1)$$
We seek to solve the problem of finding the set of probabilities such that (1) is as small as possible, with the constraints that all the probabilities are between 0 and 1 and they add to 1:
$$\min\; p_1^2 + p_2^2 + \cdots + p_{d_0}^2 \quad \text{s.t.} \quad p_1 + p_2 + \cdots + p_{d_0} = 1,\;\; 0 \le p_k \le 1 \text{ for all } k. \qquad (2)$$
It is well known that this problem has a global minimum at
$$p^* = (1/d_0, 1/d_0, \ldots, 1/d_0). \qquad (3)$$
Table 1—NHL Birthday Data as Reported in 'More on the Relative Age Effect' by Terry Daniel and Christian Janssen

                          1961     1972     1982
1st Half (Jan–Jun)          48      156      442
2nd Half (Jul–Dec)          55      160      273
First Half Proportion    0.466    0.494    0.618
Table 2—NHL Birthday Data for the 1987, 1997, and 2007 Seasons

                          1987     1997     2007    Totals
1st Half                   299      363      401     1,063
2nd Half                   205      261      319       785
First Half Proportion    0.593    0.582    0.557     0.575
Table 3—The Distribution of Male Birthdays in Canada for Selected Years by Quarter

              1969             1979             1989             1999
Quarter 1   45,891 (24.2%)   45,767 (24.3%)   47,750 (23.8%)   42,173 (24.4%)
Quarter 2   49,727 (26.2%)   48,718 (25.9%)   52,191 (26.0%)   45,028 (26.0%)
Quarter 3   48,886 (25.8%)   48,186 (25.6%)   52,250 (26.0%)   44,931 (26.0%)
Quarter 4   45,294 (23.8%)   45,405 (24.2%)   48,585 (24.2%)   40,925 (23.6%)
Totals     189,798          188,076          200,776          173,057
To see this, readers might consider the two-dimensional instance:
$$\min\; p_1^2 + p_2^2 \quad \text{s.t.} \quad p_1 + p_2 = 1,\;\; 0 \le p_1, p_2 \le 1. \qquad (4)$$
The level sets of the objective function are circles, and the point where the level set corresponding to the minimum touches the constraint is $(1/2, 1/2)$. Hence, for two people, any nonuniform distribution of birthdates gives a higher probability of a match than a uniform distribution, and therefore the von Mises probability is a lower bound. For the outline of a more general proof, see the sidebar.
The NHL and the Relative Age Effect Table 1 presents NHL birthday statistics reported by Terry Daniel and Christian Janssen for three years: 1961, 1972, and 1982. Player birthdates are classified as either first half (January to June) or second half (July to December). Note that the distribution was essentially uniform in 1961 and 1972 but, in 1982, there were significantly more birthdays in the first half. Table 2 gives the same breakout for the 1987, 1997, and 2007 seasons. The 2007 season data is taken from www.nhl.com and is based on the first 24 players listed on each team’s roster. Data for the 1987 and 1997 seasons are taken from
www.hockeydb.com. Table 2 indicates the 1982 distribution has persisted. But the distribution of male birthdays in Canada throughout the last 40 years has been uniform. Table 3 presents numbers from Statistics Canada of live male births in Canada by quarter in each of four years, one year in each of the last four decades. This birth rate is remarkably constant across quarters. In fact, based on a chi-square test, the hypothesis that the distribution of male births is uniform across quarters cannot be rejected. As it turns out, hockey is not much different than other sports where the minor system feeding the professional system is characterized by a calendar-year grouping system and early streaming (children are selected at early ages to play on elite all-star teams). It turns out that children born early in the year are more likely to play on these all-star teams. This phenomenon is termed the Relative Age Effect (RAE), and it is observed in professional soccer worldwide, Major League Baseball in North America and Japan, and professional hockey at all levels in North America. Research on the RAE and its relation to various population characteristics has a rich history, dating to the work of Ellsworth Huntington (1938), Corrado Gini (1912), and R. Pintner and G. Forlano (1933). Jochen Musch and Simon Grondin have written a comprehensive review of the RAE in sports, and its discovery in hockey is generally attributed to Roger Barnsley, Angus Thompson, P. R. Barnsley, Grondin, P. Deshaies, and L. P. Nault. The key to understanding this phenomenon is to look at sports where the effect ought to occur but doesn’t. A good example is professional football in North America (the NFL and CFL). Similar to baseball, hockey, and soccer, size matters in football. Yet we do not observe the RAE in professional football. I believe the reason for this is that there is no early streaming in youth football. Generally, players are not streamed until high school, at age 14–15, when they try out for high-school teams. In contrast, youth hockey players are streamed at very early ages. In Canada, AAA (the top level) hockey begins as early as age 7. Children who compete and win spots on these teams seem to gain an advantage that persists through to professional hockey. But, it wasn’t always like this. Streaming began in the late 1960s and
A More General Proof That the von Mises Probability Is a Lower Bound

Consider a group of $n$ people. The best way to calculate the probability that at least two have the same birthday is to calculate the probability that no two people have the same birthday. Let $A$ be the event that there are at least two people with the same birthday; let $A_0$ be the event that no two have a common birthday. Clearly
$$\Pr(A) + \Pr(A_0) = 1, \qquad (5)$$
and we can get $\Pr(A)$ by calculating $\Pr(A_0)$. To get $\Pr(A_0)$, let
$$X = (X_1, X_2, \ldots, X_{d_0}) \qquad (6)$$
be a vector of random variables where $X_k$ is the number of the $n$ people with a birthday on day $k$. Assuming independence and given the vector of probabilities
$$\rho = (p_1, p_2, \ldots, p_{d_0}), \qquad (7)$$
$X$ follows a multinomial probability law where the probability of the outcome $x = (x_1, x_2, \ldots, x_{d_0})$ is
$$f_0(x_1, x_2, \ldots, x_{d_0}) = \frac{n!}{x_1!\,x_2!\cdots x_{d_0}!}\; p_1^{x_1} p_2^{x_2} \cdots p_{d_0}^{x_{d_0}}. \qquad (8)$$
To get $\Pr(A_0)$, note that we must sum $f_0$ over all possible outcomes $x$ where at most one of the $n$ persons is born on each day. That is, we are interested in the set of outcomes
$$A_0 = \Big\{(x_1, x_2, \ldots, x_{d_0}) \;\Big|\; x_k \in \{0,1\}\ \forall k,\ \textstyle\sum_k x_k = n\Big\}. \qquad (9)$$
Hence
$$\Pr(A_0) = \sum_{x \in A_0} f_0(x_1, x_2, \ldots, x_{d_0}). \qquad (10)$$
Note that, for all $x \in A_0$,
$$\frac{n!}{x_1!\,x_2!\cdots x_{d_0}!} = n! \qquad (11)$$
since, for $x_k = 0$ or $1$, $x_k! = 1$. Moreover, for this same reason, each term $p_1^{x_1} p_2^{x_2} \cdots p_{d_0}^{x_{d_0}}$ for $x \in A_0$ is simply a product of a proper subset of $n$ probabilities, all raised to the power 1. Suppose these products are numbered $1, 2, \ldots$; let product $k$ be $S_k(p_1, p_2, \ldots, p_{d_0})$, and let the set of all products be $\Gamma$. Therefore
$$\Pr(A_0) = n! \sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0}). \qquad (12)$$
Hence the probability that at least two people have the same birthday is
$$\Pr(A) = 1 - n! \sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0}). \qquad (13)$$
The problem we seek to solve is
$$\min\; 1 - n! \sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0}) \quad \text{s.t.} \quad \textstyle\sum_k p_k = 1,\; p_k \in [0,1]\ \forall k, \qquad (14)$$
which is equivalent to
$$\max\; \sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0}) \quad \text{s.t.} \quad \textstyle\sum_k p_k = 1,\; 0 \le p_k \le 1 \text{ for all } k. \qquad (15)$$
The Lagrangean for this problem is
$$L(p_1, p_2, \ldots, p_{d_0}) = \sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0}) + \lambda\Big(1 - \textstyle\sum_k p_k\Big), \qquad (16)$$
and the first-order conditions are
$$\frac{\partial L}{\partial p_j} = \frac{\partial}{\partial p_j}\Big(\sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0})\Big) - \lambda = 0, \quad j = 1, 2, \ldots, d_0, \qquad (17)$$
and
$$\frac{\partial L}{\partial \lambda} = 1 - \sum_k p_k = 0. \qquad (18)$$
Note that $\partial S_k(p_1, p_2, \ldots, p_{d_0})/\partial p_j$ is now a product of $n - 1$ probabilities. Moreover, by symmetry,
$$\frac{\partial}{\partial p_i}\Big(\sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0})\Big) \qquad (19)$$
and
$$\frac{\partial}{\partial p_j}\Big(\sum_{k \in \Gamma} S_k(p_1, p_2, \ldots, p_{d_0})\Big) \qquad (20)$$
have the same number, $\eta_0$, of these products. To see that $p^* = (1/d_0, 1/d_0, \ldots, 1/d_0)$ solves these conditions, note that
$$\frac{\partial S_k(p_1, p_2, \ldots, p_{d_0})}{\partial p_j}\bigg|_{p^*} = (1/d_0)^{n-1} \text{ for all } j \text{ (for every product } S_k \text{ that involves } p_j\text{)}, \qquad (21)$$
and therefore
$$\sum_{k \in \Gamma} \frac{\partial S_k(p_1, p_2, \ldots, p_{d_0})}{\partial p_j}\bigg|_{p^*} = \eta_0 (1/d_0)^{n-1} \text{ for all } j. \qquad (22)$$
Hence, if we set $\lambda^* = \eta_0 (1/d_0)^{n-1}$, the first-order conditions in (17) and (18) are solved by $p^*$ and $\lambda^*$. Finally, note that
$$\sum_k p_k^* = \sum_k 1/d_0 = 1. \qquad (23)$$
Hence $p^*$ and $\lambda^*$ solve the first-order conditions. One can then use second-order information to show that $p^*$ is a unique global maximum. Hence the uniform distribution provides the lowest probability of a match. Stated another way, a nonuniform distribution of birthdays results in a match probability greater than the von Mises probability.
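As a quick numerical check of this conclusion (my own illustration, not part of the original sidebar), the match probability can be computed exactly from $\Pr(A_0) = n!\sum_{k\in\Gamma} S_k$ for a small calendar and compared between a uniform and a nonuniform distribution:

```python
import itertools
import math

def match_probability(p, n):
    """Pr(at least two of n people share a birthday) when the birthday
    probabilities are p = (p_1, ..., p_d0), computed via
    Pr(A0) = n! * sum over all size-n subsets of days of the product of
    their probabilities (requires n <= len(p))."""
    no_match = math.factorial(n) * sum(
        math.prod(subset) for subset in itertools.combinations(p, n)
    )
    return 1.0 - no_match

d0, n = 6, 3                                    # a tiny 6-day "year"
uniform = [1.0 / d0] * d0
skewed = [0.30, 0.20, 0.15, 0.15, 0.10, 0.10]   # any nonuniform alternative

print(match_probability(uniform, n))   # about 0.444, the minimum possible
print(match_probability(skewed, n))    # about 0.50, strictly larger, as the proof guarantees
```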
Table 4—NHL Teams Where There Is a Birthday Match for the 1987, 1997, and 2007 Seasons

Season    #Teams    #Matches
1987          21          16
1997          26          17
2007          30          14
Totals        77          47
early 1970s and gained momentum after our experience with the Russians in the 1972 Summit Series. The NHL did not exhibit an RAE in the 1960s or 1970s; the effect only started to appear in the mid-1980s, when the children streamed in the 1970s began to reach NHL age. Consequently, what are the necessary conditions for the RAE to exist at the professional level of a sport? In my view, there are two. First, the system of age categorization for the youth sport must be based on the calendar year. And second, there must be early streaming based on ability.
The Frequency of Birthday Matches on NHL Rosters I examined the rosters of 77 NHL teams over the 1987, 1997, and 2007 seasons. Table 4 reports the number of these teams where there was at least one birthday match. In all, there were 47 for a frequency of 47/77 = 0.610. But, what should the frequency be? If NHL birthdays were uniformly distributed, the von Mises probability of a match, 0.538, would apply. But they are not, and so 0.538 is, at best, a lower bound on the true probability. To get a better bound, I undertook the following calculation. Given an NHL roster of 24 players, suppose each player has a birthday in the first half of the year with probability p and, therefore, a birthday in the second half with probability 1 – p. Hence, the number of players with a birthday in the first half of the year is a binomial random variable with parameters n = 24 and probability p. Suppose that, for those players born in the first half of the year, each of the 181 possible birthdays (February 29 is ignored) is equally likely. Similarly, for those born in the second half, each of
the 184 possible birthdays is also equally likely. Based on the NHL data presented above, this assumption is not likely to be correct, but it ought to give a sharper bound. In detail, let $\lambda_1(m_1)$ be the von Mises probability that there is at least one match for $m_1$ players with birthdays in the first half of the year. In a similar way, I define $\lambda_2(m_2)$, the probability of a match for the $m_2$ players born in the second half. Given that $m$ of 24 players have a birthday in the first half, the probability of a match over the whole year is
$$\mu(m) = 1 - [1 - \lambda_1(m)][1 - \lambda_2(24 - m)]. \qquad (24)$$
Then, conditioning on the number of players with birthdays in the first half, the probability of a match on a team with 24 players is
$$L(p) = \sum_{m=0}^{24} \mu(m)\binom{24}{m} p^m (1-p)^{24-m}. \qquad (25)$$
Of the 77 teams I examined, the proportion of players born in the first half was 0.575. Hence, assuming p = 0.575, the lower bound is L(0.575) = 0.547. Assuming the true probability of a match is 0.547, the p-value for the sample outcome of 47 of 77 teams exhibiting a match is 0.192, which is not low enough to warrant the conclusion that the sample is inconsistent with the bound.
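The bound is straightforward to reproduce; a short script implementing the formulas above is given below. The 181/184-day split, p = 0.575, and the roster size of 24 come from the article; the function names are mine.

```python
from math import comb

def von_mises(m, days):
    """Pr(at least one shared birthday) among m people when each of
    `days` possible birthdays is equally likely."""
    no_match = 1.0
    for i in range(m):
        no_match *= (days - i) / days
    return 1.0 - no_match

def mu(m):
    """Match probability for a 24-player roster given that m players have
    first-half (181-day) birthdays and 24 - m have second-half (184-day)
    birthdays, each half treated as uniform."""
    return 1.0 - (1.0 - von_mises(m, 181)) * (1.0 - von_mises(24 - m, 184))

def lower_bound(p, n=24):
    """L(p): average mu(m) over the Binomial(n, p) count of first-half birthdays."""
    return sum(mu(m) * comb(n, m) * p**m * (1 - p)**(n - m)
               for m in range(n + 1))

print(round(von_mises(24, 365), 3))   # 0.538, the uniform-case bound
print(round(lower_bound(0.575), 3))   # roughly the 0.547 reported in the text
```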
An Improved Bound There are a number of ways one could get an improved bound. One would be to partition a year into a larger number of discrete periods, say quarters, and then make the same calculation as above, albeit with a multinomial distribution. Another approach is the following Monte Carlo simulation. I first generated a data set comprising all players who began their careers in the NHL in the 1985 season or after. There were 2,717 such players. From this data set, I sampled the birthdays of 24 players without replacement and determined whether at least two birthdays matched. I repeated this sampling 100,000 times. Of these 100,000 repetitions, 56,537 had at least two birthdays the same. Hence, based on the period 1985–2006, I estimate the probability of a match on an NHL team to be 0.565. And, not surprisingly, the sample proportion, 0.610 (47 of 77 teams), is consistent with this probability (the p-value is 0.29).
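The resampling scheme is also easy to replicate. Here is a sketch that assumes a list `birthdays` of (month, day) pairs for the post-1985 players, which is not provided with the article; the synthetic pool at the end is only a stand-in so the code runs.

```python
import random

def simulate_match_rate(birthdays, roster_size=24, reps=100_000, seed=1):
    """Estimate Pr(at least two of `roster_size` players share a birthday)
    by sampling rosters without replacement from a pool of player birthdays."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(reps):
        roster = rng.sample(birthdays, roster_size)
        if len(set(roster)) < roster_size:   # some (month, day) pair repeats
            matches += 1
    return matches / reps

# Stand-in pool so the sketch runs: 336 dates, each appearing 8 times.
# With the actual post-1985 NHL birthdays, the article reports about 0.565.
pool = [(month, day) for month in range(1, 13) for day in range(1, 29)] * 8
print(simulate_match_rate(pool))
```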
Pedagogical Lessons I have shown that the von Mises probability is a lower bound on the true probability of a match in the case where birthdays are nonuniformly distributed. I then used NHL birthdates to illustrate the idea. The argument may have some pedagogical value, as the use of populations whose birthday distributions are known (e.g., NHL players) is a nice way to illustrate that the von Mises probability is a lower bound. Interested readers who teach might consider using other professional sports. The two that come to mind are baseball in North America and Japan and soccer just about anywhere.
Further Reading Barnsley, R.H., Thompson, A.H., and Barnsley, P.E. (1985) "Hockey Success and Birthdate: The Relative Age Effect." Journal of the Canadian Association for Health, Physical Education, and Recreation, 51:23–28. Bellows, Alan. (2006) "The Birthday Paradox." www.damninteresting.com/?p=402. Daniel, T.E., and Janssen, C.T.L. (1987) "More on the Relative Age Effect." Journal of the Canadian Association for Health, Physical Education, and Recreation, 53:21–24. DasGupta, Anirban. (2005) "The Matching Birthday and the Strong Birthday Problem: A Contemporary Review." Journal of Statistical Planning and Inference, 130:377–389. Gini, C. (1912) "Contributi Statistici ai Problemi dell'Eugenica." Estratto dalla Rivista Italiana di Sociologia, Anno XVI, Fasc. III–IV. Grondin, S., Deshaies, P., and Nault, L.P. (1984) "Trimestres de Naissance et Participation au Hockey et au Volleyball." La Revue Quebecoise de l'Activite Physique, 2:97–103. Huntington, Ellsworth. (1938) Season of Birth: Its Relation to Human Abilities. John Wiley & Sons: New York. Musch, J., and Grondin, S. (2001) "Unequal Competition as an Impediment to Personal Development: A Review of the Relative Age Effect in Sport." Developmental Review, 21:147–167. Pintner, R., and Forlano, G. (1933) "The Influence of Month of Birth on Intelligence Quotients." Journal of Educational Psychology, 24:561–584.
Poker Superstars: Skill or Luck? Similarities between golf — thought to be a game of skill — and poker Rachel Croson, Peter Fishman, and Devin G. Pope
“Why do you think the same five guys make it to the final table of the World Series of Poker every year? What are they, the luckiest guys in Las Vegas?” — Mike McDermott (Matt Damon in the 1998 film “Rounders”)
The popularity of poker has exploded in recent years. The premier event, the World Series of Poker Main Event, which costs $10,000 to enter, has increased from a field of six in 1971 to 839 in 2003 and 5,619 in 2005. Broadcasts of poker tournaments can frequently be found on television stations such as ESPN, Fox Sports, the Travel Channel, Bravo, and the Game Show Network. These tournaments consistently receive high television ratings. Poker also has garnered the attention of many influential academics. It served as a key inspiration in the historical development of game theory. John von Neumann and Oskar Morgenstern claim that their 1944 classic, Theory of Games and Economic Behavior, was motivated by poker. In the text, they described and solved a simplified game of poker. Other famous mathematicians/economists such as Harold Kuhn and John Nash also studied and wrote about poker. For all its popularity and academic interest, the legality of poker playing is in question. In particular, most regulations of gambling in the United States (and other countries) include poker. In the United States, each state has the authority to decide whether it is legal to play poker for money, and the regulations vary significantly. In Indiana, poker for money is legal only at regulated casinos. In Texas, poker for money is legal only in private residences. In Utah, poker for money is not legal at all. The popularity of online poker for money has raised further questions about the right (or ability) of states to regulate this activity. At the national level, the U.S. Department of Justice recently stated that the Federal Wire Act (the Interstate Wire Act) makes online casino games illegal (in addition to sports wagering), although the U.S. Fifth Circuit Court of Appeals subsequently ruled that interpretation incorrect. That said, there are heated arguments on both sides of the regulation debate. Those in favor of regulating argue that poker is primarily a game of luck, such as roulette or baccarat, and that it should be regulated in a manner similar to those games. Those in favor of lifting regulations argue that it is primarily a game of skill—a sport such as tennis or golf—and it should not be regulated at all. So, is professional poker a game of luck or skill? Several ‘star’ poker players have repeatedly performed well in high-stakes poker tournaments. While this suggests skill differentials, it is far from conclusive. In how
many poker tournaments have these stars participated in which they did not do well? Furthermore, even if poker competition among top players were random, we would expect a few players to get lucky and do well in multiple tournaments. We use data from high-stakes poker and golf tournaments and identify the rates at which highly skilled players are likely to place highly. We use golf as a comparison group, as it is an example of a game thought to be primarily skill-based. If the data from golf and poker have many similarities, especially in terms of repeat winners, those data could suggest poker is equivalently a game of skill.
Data In a large poker tournament, individuals pay an entry fee and receive a fixed number of chips in exchange. These chips are valuable only in the context of the tournament; they cannot be used elsewhere in the casino or exchanged for money. Players are randomly assigned to tables, typically including nine players and one professional dealer. Players remain in the tournament until they lose all their chips, at which point they are eliminated. Some tournaments include a “rebuy” option, where players can pay a second entry fee and receive more tournament chips. Others include an “add-on” option, where they can pay a small extra fee (often used to tip the dealers) and receive more tournament chips. At some point during the tournament, these options disappear. As players are eliminated, tables are merged to maintain a roughly equal number of players per table.
Table 1—Descriptive Statistics

                                    Poker    Golf
% with 1 top 18 finish               70.1    24.3
% with 2 top 18 finishes             14.7    14.7
% with 3 top 18 finishes              6.9    18.9
% with 4 or more top 18 finishes      8.3    42.2
Number of tournaments                  81      48
Number of individuals                 899     218

Note: Poker summary statistics represent data from all high-stakes ($3,000 or greater buy-in) limit and no-limit Texas Hold’em tournaments between 2001 and 2005 from the World Series of Poker, the World Poker Tour, or World Poker Open. The 899 players represent those who finished in the top 18 of at least one of these 81 tournaments. The golf summary statistics represent data from all Professional Golfers’ Association tournaments in 2005. The 218 players represent those who finished in the top 18 of at least one of these 48 tournaments.
[Figure 1. Descriptive statistics: percent of players with 1, 2, 3, or 4 or more top 18 finishes, for poker (899 individuals; 81 tournaments) and golf (218 individuals; 48 tournaments).]
Identifying skill discrepancies among top poker players is complicated by the lack of precise tournament data. The lists of entrants for large poker tournaments are not available, and outcomes are typically only recorded for players who finish in the final two or three tables. Thus, it is not possible to know the total number of tournaments in which a given player has participated. In our data, we have 899 poker players who finish in the top 18 of a high-stakes tournament at least once. The average tournament has between 100 and 150 entrants. Thus, a given person has an 11%–17% chance of entering a given tournament. Due to the lack of data on tournament attendance, it is impossible to know if players who frequently show up at final tables are more skilled than other players, or if they simply play in more tournaments. To circumvent this selection issue, we employ a strategy that focuses on individuals who finished in the top 18 in high-stakes tournaments (the two final tables). As data are typically available for all players who finish in the top 18 of a given tournament, we can overcome the selection issue by focusing on just these individuals. Thus, while we are unable to identify the number of tournaments an individual has played in, we are able to identify the number of times a player has played in a tournament of 18 players. We can analyze whether certain players consistently outperform other players conditional on being in the top 18, or whether the outcomes appear to be random. We use data from limit or no-limit Texas Hold’em tournaments that are part of the World Series of Poker, World Poker Tour, or World Poker Open. Texas Hold’em is a variant of poker in which all players are given two personal cards and there are five community cards that apply to all players’ hands.
The goal is to make the best five-card hand from the two personal cards and the five community cards. Betting occurs after each player receives his or her cards, again after three of the five community cards are revealed, again after the fourth community card, and finally after the fifth community card. In limit Texas Hold’em, the bet amounts each round are fixed; whereas, in no-limit Texas Hold’em, a player can wager as many chips as he or she wants above a set minimum wager. Using information gleaned from pokerpages.com, we record outcomes for the top 18 finishers of tournaments since 2001 that had at least a $3,000 buy-in. For a small number of tournaments after 2001 (and for all tournaments prior to 2001), the top 18 finishers were not recorded or not available and, thus, were not included in the analysis. A total of 81 separate poker tournaments fit these criteria. Table 1 presents summary statistics for the poker players in these tournaments. We similarly collect data for all 48 Professional Golfers’ Association (PGA) tournaments in 2005. We record the name and final rank of each player who finished in the top 18 in each tournament. In golf, there are often ties. We record an average rank for these situations (i.e., if two players tie for third place, each player is given a rank of 3.5). Table 1 provides summary statistics for the golf players in these tournaments. Empirically, we are interested in using information about past performance to predict the outcome of individuals in a given tournament, conditional on them being among the final 18 contestants. Our main outcome variable will be the individual’s rank in this tournament of 18 (1 through 18), with lower ranks being better. If we are able to predict an individual’s rank in this tournament of 18 based on their past performance, this implies that outcomes are not random. We also will compare our predictive ability between golf and poker (see Figure 1).
Methods We fit the data using ordinary least squares regression. Thus, given standard notation, the coefficients $\hat{\beta}$ are estimated as $\hat{\beta} = (X'X)^{-1}X'y$. The variance of this estimator is $(X'X)^{-1}X'\Sigma X(X'X)^{-1}$, where $\Sigma = E[(y - Ey)(y - Ey)']$. Typical OLS estimation assumes homoscedasticity and the independence of error terms across observations. These assumptions imply that $\Sigma = \sigma^2 I$; thus, the variance of the OLS estimator can be represented as $(X'X)^{-1}\sigma^2$. One might worry that one or more of these assumptions will fail in our case. For example, in many cases, we have observations for the same player across different tournaments in our data set. Thus, the error terms on these observations may not be independent. While we present typical OLS coefficient estimates, we adjust the standard errors in our model to account for the possibility of heteroscedasticity and for the possibility that error terms on observations from the same player are not independent of each other. In other words, the standard errors we present are “robust” and “clustered” at the player level. Mathematically, this implies that, instead of assuming $\Sigma = \sigma^2 I$, we allow $\Sigma$ to have off-diagonal terms that are not zero and diagonal terms that differ from each other. These terms are simply represented by the appropriate products of the residuals $(y - Ey)(y - Ey)'$ when calculating the standard errors on our OLS coefficients. The classic 2002 econometric text by Jeffrey Wooldridge supplies an even more detailed description of this process.
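As an illustration of the variance formula just described, here is a minimal sketch of the cluster-robust ("sandwich") calculation in the setting the authors describe. It is not the authors' code, and it omits the finite-sample correction factors most statistical packages apply.

```python
import numpy as np

def ols_cluster_robust(y, X, clusters):
    """OLS point estimates with cluster-robust 'sandwich' standard errors.

    y: (n,) outcomes; X: (n, k) design matrix including an intercept column;
    clusters: (n,) labels (here, player IDs). No finite-sample correction."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y                      # beta-hat = (X'X)^{-1} X'y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        score = X[idx].T @ resid[idx]           # X_g' u_g for cluster g
        meat += np.outer(score, score)
    cov = bread @ meat @ bread                  # (X'X)^{-1} X' Sigma X (X'X)^{-1}
    return beta, np.sqrt(np.diag(cov))
```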
Table 2—OLS Regressions with Robust Standard Errors: Rank (1st–18th)

                            Poker                              Golf
                  (1)        (2)        (3)         (4)        (5)        (6)
Experience      -0.781                            -1.420
                [.278]*                           [.382]*
Finishes                   -0.222                            -0.225
                           [.089]*                           [.098]*
Previous Rank                          0.203                             0.033
                                      [.050]*                           [.056]
Constant         9.810      9.707      7.189      10.270      9.743      8.566
                [.173]*    [.166]*    [.490]*     [.331]*    [.253]*    [.595]*
R-Squared         0.5%       0.9%       2.8%        1.6%       1.2%       0.1%
Observations      1494       1494        595         811        811        586

Note: Columns (1)–(6) present coefficients and robust standard errors clustered at the player level from regressions with finishing rank (1st–18th) as the dependent variable. Experience is an indicator that equals one if the player had previously finished in the top 18 of a tournament in our sample (0 or 1). Finishes is the number of times the individual has previously appeared in the top 18 of a tournament in our sample (ranges from 0 to 10 for poker and 0 to 14 for golf). Previous Rank indicates the average rank for all previous tournaments in which the player finished in the top 18 in our sample (ranges from 1 to 18). * significant at 5%
Our baseline econometric specification is $\text{Rank}_i = \alpha + \beta X_i + \varepsilon_i$, where $\text{Rank}_i$ is the rank at the end of a tournament for player $i$ and $X_i$ is a measure of previous tournament performance for player $i$. We will examine three measures of previous tournament performance to see how well they explain current rank. Our first measure is called “experience,” and it records whether a player has previously finished in the top 18 of another tournament prior to the one whose rank we are predicting (thus, it takes the value of either 0 or 1). Our second measure is called “finishes,” and it records the number of times a player has previously finished in the top 18 of another tournament prior to the one whose rank we are predicting (this variable ranges from 0 to 10 for poker and 0 to 14 for golf). Our third measure is called “previous rank,” and it records the average rank of a player in all previous tournaments in which the player finished in the top 18. To assess the sensitivity of results to the chosen model, the analysis is repeated using an ordered probit model, a regression format designed to handle situations where the dependent variable has several discrete categories ordered in some way (such as rank). In comparison with least squares, the ordered probit is more robust, but also more computationally intensive. Results from the ordered probit are the same as those we find using OLS. We present OLS coefficients in this paper for purposes of clarity and ease of interpretation. (Results from the ordered probit are available from the authors.) We will conduct two types of statistical tests. The first focuses on only the poker data. If there are no skill differentials among poker players, we would expect the coefficients on experience, finishes, and previous rank to be statistically insignificant. This would indicate that, conditional on making it to the final 18, one’s final rank is not influenced by previous tournament performance. However, if some players are more skilled than others, we would expect to find statistically significant and negative coefficients for experience and finishes in the above specifications (past experience and success should be associated with a reduction in rank [e.g., from 7th place to 6th place]) and a positive coefficient for previous rank (a higher rank in previous tournaments of 18 should be associated with a higher rank in this one).
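To make the three measures concrete, one way to construct them from a long table of top-18 finishes is sketched below; the column names and the toy records are hypothetical, not the article's data.

```python
import pandas as pd

# Toy long-format data: one row per (player, tournament) top-18 finish,
# sorted chronologically within player.
df = pd.DataFrame({
    "player": ["A", "B", "A", "C", "A", "B"],
    "date": pd.to_datetime(["2001-05-01", "2001-05-01", "2002-06-01",
                            "2002-06-01", "2003-07-01", "2003-07-01"]),
    "rank": [3, 11, 7, 1, 14, 2],
}).sort_values(["player", "date"])

g = df.groupby("player")
df["finishes"] = g.cumcount()                        # number of prior top-18 finishes
df["experience"] = (df["finishes"] > 0).astype(int)  # any prior top-18 finish?
prev_sum = g["rank"].cumsum() - df["rank"]           # sum of *previous* ranks only
df["previous_rank"] = (prev_sum / df["finishes"]).where(df["finishes"] > 0)
```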
A second test we use is a comparison of the results between golf and poker tournaments. We compare the size of the coefficients of interest. If golf has statistically larger coefficients than poker (in absolute value), then there is more skill in golf than in poker. If the coefficients in golf are not statistically different than those in poker, we will conclude that poker has similar amounts of skill (and luck) as golf.
Results Table 2 presents the results. Robust standard errors are presented in brackets below the coefficient values. Our first analysis involves simply looking at the poker data and identifying whether previous success predicted current success. Clearly it does. The coefficient on experience (whether a player has previously finished in the top 18) is significantly and negatively correlated with a player’s rank in the given tournament, suggesting an improvement in finishing position (-.78 ranks, p<.01). The coefficient on finishes (the number of times a player has previously finished in the top 18) is significantly and negatively correlated with a player’s rank in the given tournament, suggesting an improvement in finishing position as well (-.22 ranks, p<.05). The coefficient on previous rank (the average rank for the player in previous tournament finishes) is significantly and positively correlated with a player’s rank in the given tournament (.20 ranks, p<.01). These results clearly suggest poker is, at least somewhat, a game of skill. But, how much skill? A comparison with golf can illuminate this question. If we compare the estimated coefficients on the experience variable, we find that these coefficients are not statistically different from each other (t = 1.35, p>.05). Similarly, there are no statistically significant differences between the estimated coefficients on finishes (t = 0.10, p>.05). For the final measure of previous performance, previous rank, the coefficient for poker is statistically larger than the coefficient for golf (t = 2.24, p<.05). Figures 2a and 2b show two of these relationships graphically. Figure 2a depicts the average rank in a given tournament as a function of finishes. Figure 2b depicts the average rank in a given tournament as a function of previous rank. Both show the average rank, as well as a linear fit of the data. These figures
[Figure 2a. Relationship between rank and number of previous top 18 finishes. Figure 2b. Relationship between rank and average previous rank. Both panels plot the average rank in the current tournament (y-axis) for poker and golf; the x-axis is the number of previous top 18 finishes (Figure 2a) or the average rank of previous tournaments (Figure 2b).]
Note: This figure depicts the average rank for poker and golf players who finish in the top 18 for a given poker or golf tournament. The number of previous top 18 finishes (finishes in the analyses above) is the total number of previous top 18 tournament finishes for each player in our sample (0, 1, 2, 3, or 4 or more). The straight lines indicate linear fits of the data. Note that the slope of these lines is not exactly the same as the slope from the regressions, as we have simplified the variable finishes for ease of display.
Note: This figure depicts the average rank for poker and golf players who finish in the top 18 for a given poker or golf tournament. The average rank of previous tournaments (previous rank in the analyses above) is the average rank the individuals achieved in previous tournaments in which they made the top 18. The straight lines indicate linear fits of the data.
visually depict our regression results from Table 2: Both poker and golf show a significant negative relationship between current rank and finishes. Poker, but not golf, shows a significant positive relationship between current rank and previous rank. That said, the R-squared values for the regressions we report for both poker and golf are extremely low (ranging from .1%–2.8%). This suggests that, in general, it is very difficult to predict the ordering of a given set of poker or golf players who finish in the top 18 of a given tournament. Although our measures of previous performance are statistically significant predictors of current performance, they still only explain a small amount of the overall variation that exists in poker and golf, as one might expect to be the case in many sports and games, especially those with explicit randomization such as poker.
While we provide evidence for the impact of skill on poker outcomes, we cannot provide insight regarding the cause of this result. We do not know, for example, if poker players are skilled because they are good at calculating pot odds and probabilities, good at reading their opponents’ tells (subtle physical cues that signal the strength of a player’s hand), or simply better at bluffing or intimidating the rest of the table. Similarly, we cannot identify the source of skill differentials at golf. Are these due to better driving skills, better putting skills, or better strategies? Further research (with more data) is clearly needed to identify which skills are at play. However, our evidence argues that at least some portion of poker outcomes are due to skill, and we hope this will illuminate the raging regulatory debate in the United States and elsewhere.
Discussion and Conclusion We present evidence of skill differentials among poker players finishing in one of the final two tables in high-stakes poker tournaments. We show two main results. First, there appears to be a significant skill component to poker: Previous finishes in tournaments predict current finishes. Second, we find the skill differences among top poker players are similar to skill differences across top golfers. While our analysis provides evidence for skill being a factor in poker (significant regression coefficients), the current evidence needs further support from other analyses (primarily because of the small R-squared). Thus, this analysis should be considered a first attempt to answer this question, and we hope this article will stimulate further efforts. A second limitation of the present study is that the models do not specifically account for repeated observations from some players or for the correlation of results for different players within the same tournament. These aspects of the data would impact standard errors in the analyses, but perhaps not too strongly. First, most players appear in just a few tournaments, so they are used only a few times. In poker, this is especially true. Second, few pairs of players appear in the same pairs of tournaments. Thus, the amount of information that could be learned by modeling ranks for pairs of players is quite limited. This is especially true in poker.
Further Reading Emert, John and Umbach, Dale (1996) “Inconsistencies of ‘Wild-Card’ Poker.” CHANCE, 9(3):17–22. Haigh, John (2002) “Optimal Strategy in Casino Stud Poker.” Journal of the Royal Statistical Society, Series D: The Statistician, 51:203–213. Heiny, Eric. (2008) “Today’s PGA Tour Pro: Long but Not so Straight.” CHANCE, 21(1):10–21. Kuhn, H.W. (1950) “A Simplified Two-Person Poker.” Contributions to the Theory of Games, I. H.W. Kuhn and A.W. Tucker (eds.). Annals of Mathematics Studies, Number 24. Princeton, New Jersey: Princeton University Press. Nash, J.F. and Shapley, L.S. (1950) “A Simple Three-Person Poker Game.” Contributions to the Theory of Games, I. H.W. Kuhn and A.W. Tucker (eds.). Annals of Mathematics Studies, Number 24. Princeton, New Jersey: Princeton University Press. Von Neumann, J. and Morgenstern, O. (1944) Theory of Games and Economic Behavior. Princeton, New Jersey: Princeton University Press. Wooldridge, J.M. (2002) Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
The Value of Statistical Thinking: A Basketball Gambling Rookie Conquers March Madness Phillip Price
“The critical investment factor is determining the intrinsic value of a business and paying a fair or bargain price.” — Warren Buffett, successful investor and world’s richest man
The National Collegiate Athletic Association (NCAA) men’s basketball tournament takes place each March. Sixty-four teams are chosen from across the country—some of them according to predetermined rules and some by a selection committee—and these teams compete in a single-elimination tournament to determine the national champion. Teams are grouped into four regions: East, Midwest, South, and West. As usual in single-elimination tournaments, the tournament is organized so the best teams will not face each other until late in the tournament. In the first round, the good teams in each region play the weaker teams. The diagram showing who plays whom in the tournament is called the tournament “bracket.” Figure 1 shows the tournament bracket for the Midwest region in 2008, with the winner shown for each game. Of course, before the tournament, only the names on the extreme left side are known.
[Figure 1. Tournament bracket for the Midwest in 2008 with seedings and game winners shown. Before the tournament, only the leftmost column is known.]
Table 1—Historical NCAA Basketball Tournament Win-Loss (W-L) Performance, by Seed

Seed   Round of 64   Round of 32   Sweet Sixteen   Elite Eight   Final Four   Finals   Total     %   Points per Year
 1        92-0          80-12          65-15           38-27        21-17       13-8    309-79    24       1,872
 2        88-4          58-30          43-15           21-22        10-11        4-6    224-88    16       1,268
 3        77-15         44-33          21-23           12-9          8-4         3-5    165-89    12         904
 4        74-18         40-34          14-26            9-5          2-7         1-1    140-91     9         727
 5        63-29         34-29           5-29            4-1          2-2         0-2    107-92     7         539
 6        63-29         35-28          12-23            3-9          2-1         1-1    116-91     8         593
 7        57-35         17-40           6-11            0-6                              80-92     5         379
 8        42-50          9-33           6-3             3-3          1-2         1-0     62-91     4         313
 9        50-42          3-47           1-2             0-1                              54-92     3         240
10        35-57         17-18           6-11            0-6                              58-92     4         284
11        29-63         11-18           4-7             2-2          0-2                 46-92     3         227
12        29-63         14-15           1-13            0-1                              44-92     3         209
13        18-74          4-14           0-4                                              22-92     1         100
14        15-77          2-13           0-2                                              17-92     1          76
15         4-88          0-4                                                              4-92     0          17
16         0-92                                                                           0-92     0           0
Note: Points per Year shows the number of points accumulated, using the point allocation system discussed in the text, by all four teams with a given regional seed, averaged over 23 years.
Similar to the Super Bowl, the NCAA basketball tournament attracts a great deal of attention—and wagering—from people who do not normally follow the sport. Commonly, office betting “pools” are conducted based on tournament brackets: Each contestant makes a prediction for each of the 63 games in the tournament and is awarded points for each correct prediction using a scale that assigns more points to games in later rounds. At the end of the tournament, the contestant with the most points wins a prize; sometimes there are also prizes for
second place, etc. (Editor’s Note: Betting pools are against the law in some states.) In “Contrarian Strategies for NCAA Tournament Pools: A Cure for March Madness?,” which appeared in Vol. 21, No. 1 of CHANCE, Jarad Niemi, Brad Carlin, and Jonathan Alexander discuss why the best bracket is one that has a fairly high chance of occurring, but that is not similar to many other entries. If your entry is similar to many others, you have a small chance of winning the prize, even if your predictions are good, as you must beat all the similar entries. You have
a better chance of winning if you predict good performances for one or two teams that are moderate underdogs, because their reduced chance of winning is more than counteracted by the fact that few other contestants will predict good performances by those specific teams. In 2008, I was invited to participate in a pool that did not follow the conventional bracket-based format. Instead, each contestant was given 100 ‘shares’ to invest in the teams, so each contestant owned a fraction of each team in which he invested (with a minor exception,
discussed below). Each winning team earns points, which are distributed among the pool contestants according to their fractional ownership. For instance, first-round wins are worth 100 points, so someone who owns 5% of a team that wins in the first round receives five points for that win. (Second-round wins are worth 125 points, third-round wins are worth 150, fourth-round wins are worth 175, fifth-round wins are worth 200, and winning the championship game is worth 250.) The pool winner is the person who accumulates the most points over the course of the tournament. The minor exception to the share allocation rules is that shares cannot be allocated to individual teams with seeds from 13–16. Instead, shares allocated to these teams are spread across all the teams in a region in this seeding range. For example, you could place shares on the 13–16 seeds in the East Region, in which case all the points earned by these teams are divided between you and the other contestants who allocated shares to those teams. Importantly, there is no way of knowing in advance what fraction of a team you will buy with your shares, as all share allocations are made secretly and simultaneously. You may find that the 10 shares you placed on Xavier University bought you a 50% interest in the team or a 10% interest, depending on how many shares others invested in Xavier. One’s points depend not only on team performances, but also on the behavior of other contestants, thereby complicating the sort of statistics-cum-game-theory analyses that have led to proposed optimal strategies for conventional pools. People who follow college basketball often enjoy making tournament predictions based on knowledge, enthusiasm, or dislike for particular teams. But what if you don’t know much about college basketball and have little knowledge of any of the teams or their performances during the preceding season? Is there anything you can do with readily available data that will help you win the pool?
Valuing the Teams In this pool, contestants allocate shares to individual teams. Unless there is specific knowledge that one 3-seed should be better than the others, an initial exploration based on historical performance can lump all teams with a given
seeding together. Table 1 summarizes the tournament performance for each seed, obtained from www.sportsline.com, for the 23 years ending with the 2007 tournament (with one minor error corrected). The final column shows the resulting number of points earned, on average, by a team of a given seed using the points system of the betting pool. The top-seeded teams combined to earn 1,872 points on average, but the 15-seeds earned an average of only 17 points. Almost every 15-seed lost its first game, which is always against a 2-seed, and each of the few 15-seeds that managed to advance lost its second game. Although 23 years is a long time and each year gives data on four teams with each seeding, there is still evidence of stochastic variability in the table. For instance, 8-seeds have a losing record against 9-seeds, in spite of presumably being slightly better; an 8-seed has won the championship, while no 7-seed or even 5-seed has done so; and two 11-seeds have advanced all the way to the semifinals (known as the Final Four), but no other team worse than an 8-seed has ever made it so far. It is possible that some of the peculiarities of the points distribution are systematic, rather than being due to stochastic variation. For example, there are rules concerning the seeding of teams that qualify by winning a “conference” championship (a conference is a subset of teams that is, in some ways, similar to a sports league), so teams sometimes cannot be seeded as the selection committee might desire. Also, anecdotally, teams from some conferences used to be routinely under-seeded. For example, Salon.com sportswriter King Kaufman notes that “It wasn’t that long ago that the committee was routinely seeding big-conference mediocrities No. 5 or 6 while hanging a double-digit on fine teams like Gonzaga and Southern Illinois.” If the practices related to seeding have changed, then historical data on performance as a function of seeding will differ systematically from future results.
earned by the teams of that seed, divided by the total number of shares allocated to the team by all the pool contestants. If, in contrast to the actual situation, shares could be freely traded, and if there were no information about teams other than what is implicit in their seedings, then expected points per share should be roughly the same for each seed. If investing in some seeds is expected to pay more than investing in others, shares would be shifted from the seeds with expected low payouts to those with expected high payouts until the expected payout for each seed is approximately even. So, in the hypothetical case of freely traded shares, we would expect about 24% of shares to be allocated to 1-seeds, 16% to 2-seeds, and so on. With 7,750 points available in the tournament as a whole, and 3,100 shares (100 for each of the 31 contestants in the pool), shares should be distributed so each seed is expected to pay 2.5 points. The importance of points per share, rather than just the number of points each team accumulates, is made clear when we consider a hypothetical 32nd competitor in the 2008 pool who knew in advance that Kansas was going to win the championship (Kansas did) and invested all their shares in that team. As it happens, this would have purchased only 24.6% of the team’s 1,000 overall points, so our hypothetical competitor would have scored only 246 points— almost 200 points short of the eventual
[Figure 2: histogram panels for seeds 1, 4, 7, 10, and 13; x-axis: number of wins; y-axis: number of years.]
Figure 2. Histogram showing the number of times the teams with a specified seeding won a given number of games, from 1985–2007. For example, one year (1993) the 1-seeded teams won 18 games (one short of the maximum possible). Another year (2004), they combined to win only nine games. In nine of the years, they won exactly 13 games.
pool winner. Allocating shares to teams that win some games is a necessary, but very far from sufficient, condition for winning the pool. Of course, we might expect some variation in points allocation, away from the theoretical value of 2.5 points per share, because the pool contestants sometimes have information not factored into the seedings. They might know a top-seeded team has had an injury to a key player after the seedings were assigned, for example. Or, they might know one top-seeded team is much better than another, or that the 1- and 2-seeds are almost equally skilled in one region while there is a large disparity in other regions. Also, any individual or group might deliberately deviate from the expected allocation because winning the pool is not the same as maximizing the expected number of points won: The portfolio with the highest expected value of points per share will not necessarily win the pool in any particular year, or indeed any year. In the actual pool, share allocations are made blindly, so we would not expect the average number of points per share to be equal. Perhaps too many shares will be allocated to some seeds, so their expected points per share will be below 2.5. If this happens, then necessarily too few shares will be allocated to other seeds, which will then become relative bargains. In short, a way of increasing your expected number of points, and thus your chance of winning the pool, is to place shares on teams with high values of expected points per share … if you know which teams those are.
Variability Allocating shares to seeds (or teams) that are expected to do well is no guarantee of success. In any given year, teams might perform better or worse than expected. Figure 2 shows one way of looking at the year-to-year variation in performances by seeding. Top-seeded teams have combined to win as many as 18 games and as few as 9; 4-seeds have won as few as two games and as many as 11. It is clear that putting all your shares into one of the seeds is a risky strategy, even if you divide your shares among all four teams with that seeding and even if you succeed in choosing the seeding with the highest expected value per share. If the teams with that seeding have a bad year,
they might win many fewer games, and thus earn fewer points in the pool than expected. Spreading the shares over two seedings (eight teams) would reduce the variability, and spreading them over three or four seedings would reduce the variability even more. Even so, variability can be substantial. Of course, shares could be spread over more teams, or even over all of them, but this will inevitably lead to having to allocate shares to teams that do not provide a high value of expected points per share. In another analogy to the stock market, if you spread your investment among many companies—a strategy called “diversifying”—you’ll pick some winners and some losers. Your portfolio probably won’t do much worse than the market as a whole, but you’re also unlikely to score a really big win. To win a pool, it’s not enough to do well; you have to beat all your fellow contestants, and that won’t happen if you diversify too much. Someone else who was lucky enough, or good enough, to put shares on a few high-value teams will beat you. Figure 2 doesn’t tell the whole story when it comes to inter-year variability, even for the seeds shown, as it summarizes each seed’s performance separately. Actually, there are complicated relationships between performances of different seeds in any single year. For instance, if (as is likely) a 4-seed and a 5-seed from a region each win their first-round game, they face each other in the second round. So, for the 5-seeds as a group to perform better than average, the 4-seeds as a group have to perform worse than average. The relationships between team (or seed) performances introduce opportunities for “hedging”: investing so a poor performance by one class of investments is likely to be offset by a good performance by another class. The simplest example is a contestant who invests his shares in both 8-seeds and 9-seeds. 8-seeds play 9-seeds in the first round, so half these teams are guaranteed to lose … but the other half are guaranteed to win. As with a more generalized diversification strategy, hedging
Table 2—2008 Betting Pool Share Allocation Compared to the Efficient Allocation That Would Give Each Seed an Expected Value of 2.5 Points per Share

                           Actual Allocation        Expected Points   Efficient Allocation
Seed    Expected Points   Shares    % of Shares        per Share      Shares    % of Shares
1            1,872         1,153        37.3              1.6            750        24.2
2            1,268           545        17.6              2.3            510        16.5
3              904           315        10.2              2.9            360        11.6
4              727           252         8.1              2.9            290         9.4
5              539           205         6.6              2.6            215         6.9
6              593           130         4.2              4.6            240         7.7
7              379           100         3.2              3.8            145         4.7
8              313            52         1.6              6.0            125         4.0
9              240            55         1.8              4.4             95         3.1
10             284            58         1.9              4.9            115         3.7
11             227            60         1.9              3.8             90         2.9
12             209            58         1.9              3.6             85         2.7
13-16          193           117         3.8              1.6             75         2.4
Total        7,750         3,100       100                             3,100       100

Note: Percentages and expected points don’t add to the exact totals shown in the final row due to rounding.
can guard against poor performance, but can also prevent great performance, as many of your teams (and shares) are guaranteed to be eliminated from contention.
Placing Your Bets How should you allocate your shares? Because the answer depends on how other people allocate their shares, this question takes us outside the realm of pure statistics and into the field of “game theory,” a discipline aptly named in this
sports-related case but also applied to all sorts of serious competition. If you think you know how everyone else is going to allocate their shares, you can figure out how to allocate yours. For starters, you would want to look for teams whose fraction of expected points is higher than the fraction of shares allocated to them. If a team is expected to win 10% of the points, but has only been allocated 5% of the shares, that’s a good team to bet on. Without the ability to predict behavior of other contestants, and without data (which are not available for this pool) on how contestants have wagered in the past, it might seem hopeless to try to predict expected points per share.
The expected points part is easy, but how can anyone predict how the other contestants will spread their shares among teams? The answer is that a definite prediction would be difficult, but we might speculate that the favored teams—the 1- and 2-seeds—will be over-bet (i.e., too many shares will be allocated to them) because, according to Niemi et al., favorites in traditional bracket-based pools tend to be predicted as tournament winners by too many contestants. We might also speculate that the 13–16 seeds may be over-bet because, unless contestants have bothered to create Table 1, they may not realize just how few points these teams are expected to earn and the ability to allocate shares so they are automatically spread over four teams may be seductive. If favorites and 13–16 seeds are over-bet, then middle seeds will be under-bet, making them relative bargains.
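To make the arithmetic concrete, here is a minimal Python sketch of the points-per-share bookkeeping described above, using the 2008 figures that appear in Table 2 (the dictionaries and variable names are mine, not part of the pool's records):

```python
# Expected points per seed (historical averages) and the shares actually bet
# on each seed in the 2008 pool, both taken from Table 2.
expected_points = {1: 1872, 2: 1268, 3: 904, 4: 727, 5: 539, 6: 593, 7: 379,
                   8: 313, 9: 240, 10: 284, 11: 227, 12: 209, "13-16": 193}
shares_bet = {1: 1153, 2: 545, 3: 315, 4: 252, 5: 205, 6: 130, 7: 100,
              8: 52, 9: 55, 10: 58, 11: 60, 12: 58, "13-16": 117}

total_points = sum(expected_points.values())   # about 7,750 (rounding)
total_shares = sum(shares_bet.values())        # 3,100

for seed, pts in expected_points.items():
    pts_per_share = pts / shares_bet[seed]
    # A seed looks under-bet when its fraction of the expected points exceeds
    # its fraction of the shares wagered on it.
    label = "under-bet" if pts / total_points > shares_bet[seed] / total_shares else "over-bet"
    print(f"seed {seed}: {pts_per_share:.1f} expected points per share ({label})")
```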
The 2008 Contest

Table 2 shows the average number of points by seed—the same as the last column of Table 1, except for seeds 13–16—and the number of shares contestants bet on teams with those seedings in this year's pool. Seeds 13–16 are combined because, as discussed above, those teams are lumped together in this betting pool. In contrast to the theory that shares should be allocated such that each seed is expected to pay about the same, we see in Table 2 that the expected points per share varied widely in the actual 2008 contest. If this year's tournament performances matched the historical average, then investing a given number of shares in an 8-seed would pay 3.5 times as well as the same investment in a 1-seed.

What was wrong with the argument that the expected payout should be about the same for each seed? Why is there such variation in expected payout? In this case, it's probably attributable to three factors. First, the contestants did not know the expected value of each seed. Although the values are readily calculated from information available online, finding and using this information takes effort and still does not come close to guaranteeing a win in the contest, so perhaps people were not sufficiently motivated to make the effort.
Table 3 — Expected and Actual Points by Seed for the 2008 Tournament

Seed  | Expected Points | Actual Points | Points per Share: Expected | Points per Share: Actual
1     | 1872 | 2850 | 1.6 | 2.5
2     | 1268 |  800 | 2.3 | 1.5
3     |  904 | 1200 | 2.9 | 3.8
4     |  727 |  325 | 2.9 | 1.3
5     |  539 |  325 | 2.6 | 1.6
6     |  593 |  300 | 4.6 | 2.3
7     |  379 |  425 | 3.8 | 4.3
8     |  313 |  200 | 6.0 | 3.9
9     |  240 |  200 | 4.4 | 3.6
10    |  284 |  375 | 4.9 | 6.5
11    |  227 |  100 | 3.8 | 1.7
12    |  209 |  450 | 3.6 | 7.8
13–16 |  190 |  200 | 1.6 | 1.7
Total | 7750 | 7750 |     |

Note: Points are allocated according to pool rules. Expected points are based on the allocation of shares in the pool. Actual points are determined by the performance of the teams.
Second, and probably more important, contestants do not know how other contestants are allocating their shares, so the expected value per share can only be calculated post facto—too late to help choose ones’ own share allocation. In this sense, the system is unlike the stock market, where you know what fraction of a company you will own if you invest d dollars. But, in this betting pool, you do not know what fraction of a team you will ‘own’ if you invest s shares. And finally, past experience with conventional bracket-based pools—and with the shares-based system in earlier years— has encouraged contestants to think in terms of picking winning teams, rather than looking for classes of teams (such as 8–10-seeds) that are undervalued. If the share allocation were efficient—if there were no systematic advantage to betting on particular classes of teams (such as teams with certain seedings)— then picking individual winning teams would be important. The situation is somewhat analogous to the stock market. If there is an identifiable class of stocks likely to perform well, such as mid-size companies with stocks trading at below $10 per share, there is no need to think in terms of individual compa-
nies. But, if no such class exists, winning a stock-picking contest would require looking in detail at individual stocks (or, more likely, being lucky).
My Experience

The question of how to allocate shares in the absence of data on how other people would allocate theirs held more than an academic interest for me, as the entry fee was large and the prizes valuable. (The entry fee was a bottle of wine worth more than $30, with 14, eight, five, three, and one bottles to be awarded to the top five contestants.) I followed an informed judgment (i.e., a hunch) that seeds 9–12 would be most under-bet. I put 36 of my 100 shares on the four 9-seeds, 28 shares on the 10-seeds, 20 shares on the 11-seeds, and 16 shares on the 12-seeds, divided equally among the regions. As the expected points per share values in Table 2 show, this was a fortunate choice. In fact, my allocation had the highest expected total number of points among the 31 competitors. My shares are included in the shares column, so if I had allocated my shares differently, the expected points per share value would have been different,
too. In fact, the 9–12 seeds would have had even higher expected points per share. If the tournament games resulted in an 'average' year, my allocation would have earned 406 points, compared to the 160 points that would be earned on average by someone putting all their shares on the top-seeded teams. Of course, the winner of the contest is not the one who has the most expected points; it's the one with the most actual points. How did I do? Table 3 shows the number of points earned by teams of each seeding in the 2008 tournament, compared to the expected number. Stellar performances by teams seeded 10 and 12 were more than enough to counteract subpar performances by 9- and 11-seeds, so I did a bit better than expected: I got 444 points, rather than the 'expected' 406. This put me far ahead of the second-place competitor's 331.
Just Wait 'til Next Year

This year's winning share allocation probably won't win next year. One complication is that next year's scoring system is likely to be different. The number of games played goes down by half in each round: 32 in the first round, 16 in the next, then eight, and so on. But in this year's scoring system, the number of points each game is worth increased only slowly by round, so the total number of points available in a round decreased quickly: 3,200, 2,000, 1,200, 700, etc. Under this system, the later rounds are nearly irrelevant. The pool organizer already has indicated that he intends to change the scoring system next year so there will still be meaningful numbers of points available in the later rounds. A scoring system that awards more points for late-round victories would increase the expected number of points for the top-seeded teams.

It will be hard to anticipate how contestants will allocate their shares next year. The teams in the tournament will be different, the point system may be different, and, perhaps most important, most of the contestants in my pool will read this article! All of this may lead
to share distributions very different from 2008's. Still, having one year of data is much better than having none. For example, the 2008 data may be useful to help judge whether certain teams, conferences, or regions seem to garner more shares than others. We might expect that pool contestants tend to preferentially wager on their alma mater, on local teams, etc., and people may continue to behave that way, even if it hurts their chances a bit. An anonymous reviewer of this article pointed out that baseball's New York Yankees are systematically under-bet outside New York because, the reviewer believes, many people outside New York don't like the Yankees.

Unlike bracket-based pools, shares-based pools such as the one discussed here are rare. Without historical data on share allocations, it is not possible to perform an analysis such as that discussed by Niemi et al. or David J. Breiter and Bradley P. Carlin in the CHANCE article "How to Play the Office Pools if You Must," which looked at the statistics of how people fill out their brackets in bracket-based pools, based on large, multi-year databases. However, another feature of the analyses of Niemi et al. and Breiter and Carlin is that they performed and analyzed simulations of tournaments, and this idea could be useful even with the limited data available.

The 2008 data may be useful for investigating two issues mentioned earlier: how much to diversify and whether and how to hedge. The easiest way to address these would be through simulating tournaments, using historical tournament data, or both. In practice, using simulations would probably be better than relying entirely on historical data because so many outcomes that could easily happen next year have never happened in the past. Just because a 5-seed has never won the tournament doesn't mean it can never happen. Simulation can help answer questions such as the following: If the share distribution were fixed at the 2008 level for each seed in each region, would allocating shares among the 9–12 seeds (as I did this year) win in 40% of years, or 75%, or 90%? Would it increase the winning probability to allocate shares among more teams, or fewer teams?
Would hedging by betting on (for instance) both 4-seeds and 5-seeds increase or decrease the winning probability? Even if people don’t allocate their shares exactly the same way next year as last year, insights into hedging strategies and diversification should carry over to some extent. Another possibly fruitful area of research is to look at (and allocate shares based on) individual teams, not just seedings. Evaluations of individual teams are available—Las Vegas gambling odds are a popular source of information for this sort of thing—and can be compared to the 2008 wagering patterns to see if people over-bet or under-bet specific teams. Perhaps rather than allocating shares equally among all teams of a given seeding, they should be preferentially allocated to the better teams with that seeding, or to the worse teams, whichever is under-bet. Admittedly, it will be impossible to draw firm conclusions on any of the key points—whether some teams, conferences, or regions are systematically over- or under-bet; how much to diversify; and how to hedge—but one does have to make decisions somehow, and it makes sense to use what data you have. It always takes luck to win a pool, but a bit of statistical thinking helps improve the odds.
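As a starting point for that kind of exercise, here is a rough Monte Carlo sketch in Python. It holds the field's share distribution at the 2008 levels from Table 2 and asks how often the 9–12-seed allocation would out-score a rival who puts everything on the 1-seeds. The normal model for per-seed points and the 25% spread are invented placeholders, and treating the Table 2 totals as the rival field (even though they already include the author's shares) is a simplification; a serious version would simulate actual brackets or resample historical seed totals.

```python
import numpy as np

rng = np.random.default_rng(2008)

mean_pts   = {1: 1872, 2: 1268, 3: 904, 4: 727, 5: 539, 6: 593, 7: 379,
              8: 313, 9: 240, 10: 284, 11: 227, 12: 209, "13-16": 193}
field_bets = {1: 1153, 2: 545, 3: 315, 4: 252, 5: 205, 6: 130, 7: 100,
              8: 52, 9: 55, 10: 58, 11: 60, 12: 58, "13-16": 117}

def pool_score(my_shares, seed_pts):
    # Each seed's points are split pro rata over all shares bet on that seed.
    return sum(seed_pts[s] * n / (field_bets[s] + n) for s, n in my_shares.items())

middle_seeds = {9: 36, 10: 28, 11: 20, 12: 16}   # the allocation used in 2008
all_on_ones  = {1: 100}                          # a chalk-only rival

n_sims, wins = 10_000, 0
for _ in range(n_sims):
    pts = {s: max(0.0, rng.normal(m, 0.25 * m)) for s, m in mean_pts.items()}
    wins += pool_score(middle_seeds, pts) > pool_score(all_on_ones, pts)

print(f"9-12 allocation beats the chalk allocation in {wins / n_sims:.0%} of simulated years")
```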
Further Reading

Breiter, D.J., and Carlin, B.P. (1997) "How to Play the Office Pools if You Must." CHANCE, 10(1):5–11.

Clair, B., and Letscher, D. (2007) "Optimal Strategies for Sports Betting Pools." Operations Research, 55:1163–1177.

Metrick, A. (1996) "March Madness? Strategic Behavior in NCAA Basketball Tournament Betting Pools." Journal of Economic Behavior and Organization, 96:159–172.

Niemi, J., Carlin, B., and Alexander, J. (2008) "Contrarian Strategies for NCAA Tournament Pools: A Cure for March Madness?" CHANCE, 21:35–41.

Leonhardt, D. (2007) "Top Seeds Can Often Mislead in NCAA Bracket." The New York Times, March 12.
When Is the Lead Safe in a College Basketball Game?
Brian Schmotzer
Have you ever watched a sporting event, seen your favorite team slowly pull away from or fall behind its opponent, and wondered when the proverbial fat lady would sing? Have you ever predicted a game was essentially over because you thought the lead was insurmountable only to have to eat your words? Is there such a thing as a lead large enough that the game is, for all intents and purposes, over? To gain insight into questions such as these, I have attempted to quantify the safety of leads in men's college basketball.

Interestingly, the motivation for this study came from Bill James. An ever-increasing number of people interested in quantitative sports analysis are familiar with James and the pioneering work he has done in assessing the value of baseball players (as well as a slew of other interesting baseball-related work). However, in this case, James wrote an interesting column, titled "The Lead Is Safe: How to Tell When a College Basketball Game Is Out of Reach." In it, James offered a heuristic for determining whether a game is out of reach for the trailing team based on the margin (the difference in the two teams' scores), the time remaining, and who had the ball. He stated the following:

Take the number of points one team is ahead. Subtract three. Add a half-point if the team that is ahead has the ball, and subtract a half-point if the other team has the ball (numbers less than zero become zero). Square that. If the result is greater than the number of seconds left in the game, the lead is safe.

To help visualize this rule, consider Figure 1, which shows the boundary between safe and not-safe leads based on the time remaining. Above the lines represents safe leads; below the lines represents not-safe leads. The lower line represents the currently winning team having possession of the ball, whereas the upper line represents the losing team having the ball.

Figure 1. Bill James suggested boundaries between safe and not-safe leads based on margin of lead and time remaining. The end of the game is on the left with no time remaining. The lower line represents the boundary when the currently leading team has possession of the ball, while the upper line represents the boundary when the currently trailing team has the ball.

So, if your team is ahead by 12 points with 80 seconds left (1.33 minutes), the lead is safe if you have the ball: (12 − 3 + 0.5)² = 90.25 > 80. If, however, you do not have the ball, (12 − 3 − 0.5)² = 72.25 < 80, and the lead is not safe. Plotting the point (1.33, 12) would fall between the two lines on the graph. Similar to many of James' algorithms, this one is simple; easy to calculate; and based on his experience (he reports attending Kansas home games since 1967), intuition, and unique brilliance at formalizing his notions. But is it true?
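For reference, James' heuristic is easy to write down as a small Python function (a direct transcription of the verbal rule above, not code from the article):

```python
def james_lead_is_safe(margin, seconds_left, leading_team_has_ball):
    """Bill James' rule: (margin - 3 +/- 1/2)^2 must exceed the seconds left."""
    adjusted = margin - 3 + (0.5 if leading_team_has_ball else -0.5)
    adjusted = max(adjusted, 0.0)      # "numbers less than zero become zero"
    return adjusted ** 2 > seconds_left

# The worked example: up 12 with 80 seconds left.
print(james_lead_is_safe(12, 80, leading_team_has_ball=True))    # True:  9.5**2 = 90.25 > 80
print(james_lead_is_safe(12, 80, leading_team_has_ball=False))   # False: 8.5**2 = 72.25 < 80
```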
The Idea and a Definition My basic premise is that if a team is up by 10 points with two minutes remaining and goes on to win, then that lead at that time must have been safe for that game. To get an idea of whether a 10-point lead with two minutes remaining is safe in general, one can look at lots and lots of games that had the same lead at the same time and see what proportion went on to win. If 100% resulted in wins, then that would be a safe lead in anyone’s book. If 95% resulted in wins, we might still consider it quite a safe lead. So, all I have to do is get an estimate of this proportion for every combination of time remaining and margin. Perhaps that seems like a large task, but we’ll be able to handle it … after giving a definition of a “safe” lead. Definition: A lead is safe at a given time remaining if the team that is leading at that time goes on to win. This definition is certainly simple, but does it behave the way we want it to? An obvious example that seems to ‘break’ this definition is when the leading team loses the lead, regains it, and goes on to win. For example, your 10-point lead with two minutes remaining turns into a three-point deficit with a minute left, but you score the last several points of the game and win. Well, would you consider that initial 10-point lead safe? Or take the logic to the extreme. If you score the first points of the game and then engage in a back-and-forth struggle for the remainder, would you look back on your two-point lead with merely 2,360 seconds remaining as safe just because you end up winning? I’m not sure most people would, but I do, and I’ll give you two reasons why. First, what’s the alternative? A more conservative notion of safe would be if you had the lead and never relinquished it right through to the end of the game. I actually find this definition more problematic. It would imply that if an opponent launches a furious comeback to turn a 20-point deficit into a one-point lead, but then runs out of energy and goes on to lose, the initial 20-point lead was not safe. However, if the opponent fights back from a 20-point deficit and gets within one point, but never quite gets over the hump, then the lead was safe. These two scenarios are treated differently (oppositely, in fact), but, in my opinion, a 19-point comeback and a 21-point comeback are almost exactly the same. The world of statistics offers a good analogy: A p-value of 5.1% is ‘not significant’; whereas, a p-value of 4.9% is ‘significant,’ according to the 'magic number' of 5%. But, in reality, those p-values offer almost exactly as much evidence against the null and should be treated more similarly than differently. So the ‘never relinquished’ definition of a safe lead is problematic, and I have thought of no other simple definition to consider. I do have some thoughts on a fairly complicated definition that I’ll hold off on presenting until later. The second reason to go with my simple definition of safe is that, on average and in the long run, it should work out fine. Our intuition tells us that a two-point lead in the first minute of a game is not really safe. Our definition tells us we should call it safe anyway if that team goes on to win. But, at the end of the day, our estimate will be based on hundreds of games, not a single game. If such a small lead so early is really not safe, as our intuition suggests, then over the course of hundreds of
games, we will see the opening scorer win some games and lose some games—probably close to 50%. On the other hand, if a 12-point lead with a couple minutes remaining is relatively safe, then we’ll see only relatively few comebacks of that magnitude and a correspondingly high percentage of safe leads will be found. And, of course, a 20-point lead with 30 seconds left will never be surmounted, and we’ll see 100% of such games end in victory. In short, the use of this definition, averaged over many games, will (hopefully) give us a nice smooth transition from about 50% safe (a toss-up game situation) to 100% safe (a truly insurmountable lead). It is worth noting that James used safety percentages differently in his article. His usage was on the scale of points: If a team needed 20 points at a particular time to be safe and it was only leading by 15, then it had a 75% safe lead. My usage is the more familiar probability scale: If a team is leading by 15 points at a particular point in time, and teams in that situation go on to win 75% of the time, then the lead is 75% safe. With a formal definition of safe settled, let us see how we can generate an estimate of safety for any lead at any time.
Getting Some Data There is a Scoreboard section at www.sportsline.com for men’s college basketball that gives scores for all the Division I games for the whole season and play-by-play transcripts for some games. Out of 5,540 total games in the 2007–2008 season, there were 1,357 transcripts available (24.5%). The total game count is by my own hand as I clicked through the site collecting transcripts, so it could be wrong by some small amount. Although I attempted to collect every game that had play-by-play available, I also possibly missed some small number of these. Because I want my results to extrapolate to new college basketball games that I watch, something must be said about sampling. Clearly, I have not taken a random sample. What are the limitations we might want to put on any conclusion? First, it would seem prudent to limit ourselves to Division I men’s basketball. Professional basketball, high-school basketball, women’s basketball, and even other divisions of men’s college basketball could perhaps follow different underlying truths regarding safety of leads (although I suspect none would be all that different). Second, I suspect the games for which I have play-by-play data are somewhat higher-profile games than all games from the season on average. So, when the Northeast South Dakota State Teachers College Molehillers play the Western Arkansas Seminary School of Mines Gravediggers, I figure that game is not getting play-by-play attention from CBS. On the other hand, whenever Duke and North Carolina wander near each other, the Earth stands still and we get a nonstop media circus to go with our play-by-play transcript. If there is any difference in safety of leads among strata of Division I, it would be difficult to find in the data I have, and it would be better to consider the conclusions from this study to apply to the more prominent programs. (All kidding aside, some low-wattage schools get play-by-play coverage for some of their games, but some games of big-time schools go without, so any effect here would likely be small.)
Last, all the data comes from the 2007–2008 season. If we want to extrapolate results into our viewing of next year’s games (and beyond), we have to assume the phenomenon of safety of leads is relatively stationary or consistent from year to year.
Preparing the Data for Analysis

The raw data in the transcript includes the score of the game, the time remaining in the period, and a text description of the event that occurred at that time. The data set I created for each game contains the current point margin for every second remaining and notes which team won the game. To create this data set, a number of assumptions and preprocessing steps were necessary.

First, there were problems with some of the reported times remaining in the raw data. It appears the mechanism used to generate the play-by-play transcripts (machine, person, etc.) was prone to typos. For example, there might be a sequence of times such as 17:45, 17:37, 19:34, 17:15, 17:01. It is clear 19:34 cannot be correct. There were 217 games that saw problems of this type. The vast majority were minor, and all of them could be corrected. The logic for deciding how to fill in the proper time sequence is straightforward. Consider two time points in the sequence of times remaining for a game: 3:15 followed by 4:15. This cannot be true, so either the 3:15 is too small or the 4:15 is too large. Which is it? We should decide based on the neighboring times of the potential culprits. If 4:15 is less than the time value listed before 3:15, then pulling 3:15 up would instantly solve the problem—we must have a 'down spike' in time (3:15 is the culprit). Here is an example of how that might look:

4:20, 4:19, 3:15, 4:15, 4:09, 4:05, 4:01
On the other hand, if 3:15 is greater than the time value listed after 4:15, then pushing 4:15 down would instantly solve the problem—we must have an 'up spike' in time (4:15 is the culprit). That might look like the following:

3:35, 3:21, 3:18, 3:15, 4:15, 3:10, 3:04
The same idea extends to the case where there are multiple times in error listed in a row and to the case where you need to go more than one listing backward and forward through the listings to determine whether it is an up spike or down spike. Basically, the criterion I used is to make the fix that requires changing the fewest number of time points, as in these examples where the correct fix requires changing only one time point, rather than (at least) four. This algorithm worked great in practice. Second, there were problems with some of the reported scores in the raw data. Again, it appears these were typos. For example, there might be a sequence of scores such as 17–20, 17–22, 170–24, 19–24, 21–24. It is clear that 170–24 cannot be correct (nor can its implied margin of 146). There were 20 games that saw problems of this type. The vast majority was easily corrected, but there were two games where a combination of score problems and severe time problems made it impossible to reconstruct a reasonable play-by-play. These games were thrown out of the data set, so the final data set contains data from 1,355 games.
Third, how should we deal with the margin changing while the clock is stopped? This happens in the obvious case of free throws (by definition, a player is getting a free attempt to score with the clock stopped). But, it also effectively happens in a situation where you score twice in less than a second, which can happen if you score, steal the ball back from your opponent, and score again in quick succession. In either case, the margin for that time is simply based on the last event associated with that time.

Fourth, what to do with buzzer beaters (shots attempted before time expires, but that go in the basket after time has expired)? If scoring takes place with 'zero' seconds remaining, that can change the outcome of the game, as when one team goes from one point down to one point up due to the buzzer-beating shot. So, the winning team is based on the final margin. But, no effort is made to analyze the safety of a lead with zero seconds remaining; for the purpose of assessing safety, the events with zero seconds remaining are not considered.

Fifth, what if regulation time ends with the score tied and the game goes to overtime? I thought long and hard about what to do with overtime and finally decided to discard any overtime periods. My rationale was that I couldn't include overtime without messing up the timeline. It is obvious that 750 seconds remaining means something different with respect to a regulation game compared to an overtime game. When assessing the safety of a lead with a certain time remaining, I think that assessment has to refer to a constant end point—and I made that end point the end of regulation. Basically, any lead at any time in a game that ends regulation in a tie is, by definition, not safe. If you're sitting at five minutes out with a 10-point lead and asking if the lead is safe, and if you look to the end of regulation and see the game is tied, then the answer must be that the lead was not safe. Thus, the definition of safe from earlier should be amended by adding the words "in regulation" to the end: A lead is safe at a given time remaining if the team that is leading at that time goes on to win in regulation.
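To illustrate the time-repair step (the first preprocessing issue above), here is one way the "change the fewest time points" criterion could be coded in Python; the article does not show the author's actual cleaning code, so this is only a sketch of the idea using a longest non-increasing subsequence:

```python
def longest_nonincreasing_indices(times):
    """Indices of a longest non-increasing subsequence (simple O(n^2) dynamic program)."""
    n = len(times)
    best, prev = [1] * n, [-1] * n
    for i in range(n):
        for j in range(i):
            if times[j] >= times[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(n), key=best.__getitem__)
    keep = set()
    while i != -1:
        keep.add(i)
        i = prev[i]
    return keep

def flag_time_spikes(times):
    """Entries outside the longest non-increasing run are the presumed typos."""
    keep = longest_nonincreasing_indices(times)
    return [i for i in range(len(times)) if i not in keep]

# The 'down spike' example from the text: 4:20, 4:19, 3:15, 4:15, 4:09, 4:05, 4:01
seconds = [260, 259, 195, 255, 249, 245, 241]
print(flag_time_spikes(seconds))   # [2]: the 3:15 entry is the culprit
```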
Data from a Single Game

Each game offers 2,400 pieces of information about the safety of leads—one for each second of the contest (two halves * 20 min/half * 60 sec/min = 2,400 sec.). For example, consider a game where a team scores to go up by 10 with 120 seconds remaining (two minutes). Further, say the next score happens with 90 seconds left. Last, assume the leading team goes on to win. What do we learn about safety of leads? Well, the 10-point lead with 120 seconds remaining is a safe lead by our definition. Even though the next transcript entry doesn't occur
until 90 seconds, we still know something about the intervening period. We know the 10-point lead with 119 seconds left is a safe lead. This is similarly so for 118, 117, … all the way down to 91 seconds left. Not until the next score at 90 seconds left does the situation change. In this example, because there are no 10-point plays in basketball, the new situation would involve a safe lead at 90 seconds left for the eventual winning team, just at a new margin.

Figure 2 shows the resulting information for one game. This was a closely contested game, with the margin in single digits for the vast majority of the time. The maximum margin was 14, and that team went on to lose. So their 14-point lead with about three minutes left in the first half was not safe. The winning team took the lead with about 11 minutes left in the game; all of their leads are safe, even though the game is tied up for a period of time. From about 10 minutes to about eight minutes, they hold a four-point lead that is safe and, for the next two minutes, hold a one-point lead that is safe. From about six minutes to about three minutes, the teams are tied and no one has a safe or not-safe lead as no one has the lead at all. Finally, the winning team regains the lead with about three minutes remaining and holds it until the end of the game. So, for example, we now have evidence that a four-point lead is safe with 10 minutes left, based on one game. We need to include information from other games to come up with an estimate of how safe that lead is with that much time left in general.

Figure 2. Information from one basketball game. The margin is plotted versus time remaining. The margin is the absolute value of the difference between the two teams' scores. The leading team is indicated by shading. The end of the game is on the left with no time remaining.

Figure 3. Proportion of safe leads at each point of margin versus time remaining for data from 1,355 games. Lighter colors represent safer leads.
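A minimal sketch of the per-second expansion described above, assuming each cleaned game is a list of (seconds remaining, score, score) events plus the known regulation winner (the field names and structure are mine, not the article's):

```python
def per_second_records(events, winner):
    """events: (seconds_remaining, team1_score, team2_score) tuples for one game,
    regulation only; winner: 'team1' or 'team2' (in regulation).
    Returns (margin, seconds_remaining, lead_was_safe) for every second with a lead."""
    records = []
    events = sorted(events, reverse=True)        # most time remaining first
    idx, s1, s2 = 0, 0, 0
    for t in range(2399, 0, -1):                 # zero seconds is excluded, as in the text
        while idx < len(events) and events[idx][0] >= t:
            _, s1, s2 = events[idx]              # last event at or before this second wins
            idx += 1
        margin = abs(s1 - s2)
        if margin == 0:
            continue                             # no one leads, so nothing to record
        leader = "team1" if s1 > s2 else "team2"
        records.append((margin, t, leader == winner))
    return records
```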
Data from All the Games

Conceptually, what I do now is overlay many plots such as the one in Figure 2. Each game gives dichotomous (safe/not-safe) information for some combinations of margin and time remaining. When all the games are considered, we can calculate the proportion of safe leads at every combination of margin and time remaining. Figure 3 shows this raw proportion of safe leads. At each point on the graph, the number of games that went through that point was tabulated, the number of those times it was a safe lead was tabulated, and those two numbers were divided to get the proportion.

As you can see in Figure 3, the raw proportion is rather rough, but it basically conforms to the shape we would expect. Low margins have low safety; high margins have high safety; and the later in the game, the safer the margin. Notice also the lack of information in the upper right portion of the graph. It is difficult to build a 30-point margin in the first five minutes of a game. To give you an idea of the amount of information available to calculate each proportion, Figure 4 shows the sample size at each point. The sample size conforms to what we would expect, with a large number of games showing a relatively small margin early in the game and few games showing big leads toward the end of the game.
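Turning those per-second records into the raw proportions of Figure 3 is then a tallying exercise; a short sketch building on the function above (again with invented names):

```python
from collections import defaultdict

def safe_proportion_grid(games):
    """games: iterable of (events, winner) pairs, as used by per_second_records().
    Returns {(margin, seconds): proportion safe} and the matching sample sizes."""
    counts, safe = defaultdict(int), defaultdict(int)
    for events, winner in games:
        for margin, t, was_safe in per_second_records(events, winner):
            counts[(margin, t)] += 1
            safe[(margin, t)] += was_safe
    proportions = {cell: safe[cell] / counts[cell] for cell in counts}
    return proportions, counts
```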
A Two-Dimensional Kernel Smooth of Safe Proportions

Our overall goal is to define a boundary between safe and not-safe leads. There is an inkling of such a boundary in Figure 3, but it is not that clear. One technique we could use to improve the visualization is a two-dimensional kernel smoother. The goal of a smoother is just that: to smooth out a scatterplot, to take out the bumps and wiggles and show a smooth picture. We need a two-dimensional smoother here because we are smoothing the safety proportion over two dimensions (margin and time remaining). The smoothness is created by taking the weighted average of values in a neighborhood surrounding the value of interest. So, we want to replace a raw value with a smooth version that incorporates information from surrounding values. The term "kernel" refers to the type of function used to determine what the weights should be.
Figure 4. Number of data points at each point of margin versus time remaining for data from 1,355 games. Lighter colors represent smaller sample sizes.

Figure 5. Two-dimensional, kernel-smoothed plot of the proportion of safe leads at each point of margin versus time remaining. The kernel is a weighted average of points in a neighborhood, where weights are proportional to inverse distance and proportional to sample size. The neighborhood is five points away from the estimation point in margin and five seconds away from the estimation point in time remaining. Lighter colors represent safer leads.
Consider the example in the tables below (simplified compared to the real data) with the raw data first.
Notice that the example data is not smooth. As you progress across margins for a fixed time remaining, the proportion safe sometimes goes up and down (similarly across time remaining for a fixed margin). To smooth this picture, we need to define our smoother algorithm. For simplicity, define the neighborhood to be the points one cell away (including diagonal)—that is, a square with side of length three. For simplicity, define the kernel function to be uniform. This means we will take the average of the points in our neighborhood with equal weights—equivalent to a simple arithmetic average of those points. For example, the smoothed value at (240, 14) is based on the nine points in its neighborhood in the raw table, resulting in a smoothed value of 0.93. The full picture is shown in the smoothed table. As you can see, this is much smoother. Across the rows and columns, there are fewer peaks and valleys and the magnitudes of the jumps are considerably smaller. The smoothness of the result is highly dependent on the size of the chosen neighborhood. As the neighborhood gets larger, the result gets smoother because more of the same points are being used in neighboring calculations.

I applied the smoother to the real data by first defining the neighborhood to be a square with side of length 11 (similar to the example, but points up to five cells away are included, instead of just one). Second, I defined the kernel function to be an inverse distance and sample size weighted average. Logically, a point two cells away should contribute more to the average than a point four cells away, so the inverse distance defines the weight to be ½ for the former and ¼ for the latter. Also logically, a point calculated based on a sample size of 20 should contribute twice as much to the average as a point based on 10. The resulting smoothed plot can be seen in Figure 5 (compare to the roughness of Figure 3). The kernel-smoothed picture has the obvious advantage of being smoother and showing the general trends in the data more clearly.
Raw proportions (time remaining in seconds across, margin down):

Margin | 238 | 239 | 240 | 241 | 242
18     | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
17     | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
16     | 1.0 | 0.8 | 1.0 | 1.0 | 1.0
15     | 1.0 | 1.0 | 1.0 | 1.0 | 0.9
14     | 0.8 | 1.0 | 1.0 | 0.9 | 1.0
13     | 0.9 | 0.9 | 0.8 | 0.8 | 0.9
12     | 0.6 | 0.6 | 0.7 | 1.0 | 0.9
11     | 0.3 | 0.5 | 0.5 | 0.5 | 0.6
10     | 0.6 | 0.3 | 0.4 | 0.3 | 0.7

Smoothed proportions (same layout):

Margin | 238  | 239  | 240  | 241  | 242
18     | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
17     | 0.97 | 0.98 | 0.98 | 1.00 | 1.00
16     | 0.97 | 0.98 | 0.98 | 0.99 | 0.98
15     | 0.93 | 0.96 | 0.97 | 0.98 | 0.97
14     | 0.93 | 0.93 | 0.93 | 0.92 | 0.92
13     | 0.80 | 0.81 | 0.86 | 0.89 | 0.92
12     | 0.63 | 0.64 | 0.70 | 0.74 | 0.78
11     | 0.48 | 0.50 | 0.53 | 0.62 | 0.67
10     | 0.43 | 0.43 | 0.42 | 0.50 | 0.53
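The uniform 3 × 3 example can be checked directly; here is a short NumPy sketch that reproduces the 0.93 entry of the smoothed table (rows run from margin 18 at the top down to margin 10, columns from 238 to 242 seconds):

```python
import numpy as np

raw = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0],   # margin 18
    [1.0, 1.0, 1.0, 1.0, 1.0],   # 17
    [1.0, 0.8, 1.0, 1.0, 1.0],   # 16
    [1.0, 1.0, 1.0, 1.0, 0.9],   # 15
    [0.8, 1.0, 1.0, 0.9, 1.0],   # 14
    [0.9, 0.9, 0.8, 0.8, 0.9],   # 13
    [0.6, 0.6, 0.7, 1.0, 0.9],   # 12
    [0.3, 0.5, 0.5, 0.5, 0.6],   # 11
    [0.6, 0.3, 0.4, 0.3, 0.7],   # 10
])

row, col = 4, 2                            # margin 14, 240 seconds remaining
neighborhood = raw[row - 1:row + 2, col - 1:col + 2]
print(round(neighborhood.mean(), 2))       # 0.93, matching the smoothed table

# The version applied to the real data uses an 11 x 11 neighborhood and weights
# each cell by inverse distance and by its sample size instead of averaging equally.
```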
Figure 6. Duplicate of Figure 3, but with a 95% safety contour overlaid (solid line) and a LOWESS-smoothed version of the contour (dashed line).

Figure 7. Several LOWESS-smoothed safety contours. From the top to the bottom, the safety levels are 100%, 95%, 90%, 80%, 70%, and 60%.
However, it suffers from a considerable disadvantage when trying to identify boundaries between safe and not-safe leads. To see this, consider the example data again and imagine finding a boundary defining when a lead is 100% safe. Based on the raw data, it would appear the lead is 100% safe somewhere around 15 points. Based on the smoothed data, the conclusion would change to around 17 or 18 points. This represents a biased estimate of the boundary, which will always tend to happen (it is not particular to this example). There is a hard upper boundary of 1.0 on proportions, of course. When you widen your net and include data points from the surrounding neighborhood, their values can only bring the smoothed value down from 1.0. For every 0.9 or 0.8 that is included, there can be no corresponding 1.1 or 1.2 to balance out and leave the mean at 1.0. Therefore, the only way to get a smoothed value of 1.0 is for the entire neighborhood to be composed of 1.0 values. If the true boundary is at a certain margin, the estimated boundary will tend to be found at a larger margin because you will need to proceed to larger margins to find an entire neighborhood of all 1.0 values.

This effect is found when searching for boundaries other than 100%, although the effect lessens the further you get from the hard upper limit of 1.0. The reason is similar to that just described. So for a 95% boundary, you will see a bias if a significant proportion of the values in the neighborhood is below 0.9. (There isn't enough 'room' above 0.95 to balance out these low values.) The details of when bias will appear depend on the specific data set. Rather than try to solve this problem, I will pursue a different option for defining safety boundaries.
Defining Safety Boundaries Using LOWESS

Rather than apply a kernel smoother to the raw proportions, one can estimate boundaries first, and then smooth the boundaries. Estimating a boundary directly will lead to a rough boundary, but at least it will be unbiased. Then, we can apply a smoother to the boundary and get a smooth and unbiased estimate of the boundary between safe and not-safe leads. What boundary would be of interest? I'll surprise no statistician by focusing on the 95% safety line. (The method is general and will be applied to other safety proportions.) We want the margin/time-remaining combinations where the probability of the leading team going on to win is 95%.

To help find this boundary, I'm first going to note that for any particular time remaining, as the margin increases, the level of safety should increase. I speak here of the underlying phenomenon driving the data; the underlying pattern should be monotone. It would not make sense for a 10-point lead to be safe, a 12-point lead to be not-safe, and then a 14-point lead to be safe again. As we have sample data, such dips and jumps are to be expected, but in our final analysis, they should be discouraged.

First, at each time remaining, I find the smallest margin such that the raw proportion estimate of safety is at least 95% for that margin and every larger margin. So, there may be smaller margins that show a 95% estimate, but I don't consider the matter settled until we've gotten to a margin where no larger margins ever dip below 95%. This represents a conservative approach to estimating the 95% safety contour. A plot of this set of points is understandably rough, as it follows a rough contour of Figure 3. So, the second step is to smooth this contour directly. To do so, I use a "LOWESS" smoother, which is a common scatterplot smoother that calculates locally weighted polynomial regression. This would be considered a one-dimensional smoother because we are smoothing margin (specifically, those margins that correspond to 95% safety) over one dimension (time remaining). The method is local because it uses data in a neighborhood. It is weighted because nearby points are weighted more heavily than distant points. It is polynomial because it allows for the fit to be based on polynomial regression, although local linear fits are most common. The assessment of the fit of the regression is based on the size of the residuals (the vertical distance between data point y-values and the fitted line), although it is based on absolute value distances rather than squared distances, as in simple linear regression. The advantage of LOWESS is its flexibility. Fitting a linear or quadratic equation to scatterplot data assumes the form of the underlying phenomenon corresponds to your chosen equation form. The LOWESS fit can follow the true form
better because it allows the local data to dictate the fit, rather than being constrained by a global form. The results are plotted in Figure 6, where the rough contour and its LOWESS-smoothed version have been overlaid on the raw proportions. As expected, the 95% safety contour is quite rough, but the LOWESS does an admirable job of smoothing the contour so you can see the general shape. The same procedure can be used to make contours at any safety level desired. In Figure 7, I show several smoothed contours without the clutter of the raw data or the unsmoothed contours. The basic shape of the smoothed contours is what you would expect. As the game draws to a close, a large lead becomes safer. However, there are other intriguing possibilities in the shape of the boundaries. Notice there appears to be humps and bumps in the contours around halftime (refer to Figure 7 and 20 minutes remaining). The halftime break in a basketball game is well known as a time for teams to regain their composure or change their strategy. It is conceivable we’re seeing the effects of runs at the end of the first half, runs to start the second, and/or halftime adjustments on the safety of leads. Even more tantalizing, there appears to be a hump only four minutes into the game. This can be seen in the 80% safety line in Figure 7, and it can be seen as a smattering of dark spots on Figure 3 (about nine to 14 margin). Perhaps this represents games where there was an early run by one team to a large margin that would seemingly be safe, but then the other team ’wakes up’ and that lead actually isn’t that safe after all. More study and probably more data would be required to assess whether these sorts of interpretations are grounded in underlying truth or are fanciful over-interpretations of limited data.
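For readers who want to try this, here is a sketch of the two-step contour construction in Python, using the lowess routine from statsmodels; the handling of empty cells and the frac setting are my guesses rather than the article's choices:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def raw_safety_contour(proportions, margins, times, level=0.95):
    """For each time, find the smallest margin from which the raw proportion stays
    at or above `level` for every larger margin (empty cells don't break the run)."""
    contour_t, contour_m = [], []
    for t in times:
        boundary = None
        for m in sorted(margins, reverse=True):
            p = proportions.get((m, t))
            if p is None or p >= level:
                boundary = m
            else:
                break                      # dipped below the level: stop here
        if boundary is not None:
            contour_t.append(t)
            contour_m.append(boundary)
    return np.array(contour_t), np.array(contour_m)

# t, m = raw_safety_contour(proportions, margins=range(1, 41), times=range(1, 2400))
# smoothed = lowess(m, t, frac=0.3)        # returns sorted (time, smoothed margin) pairs
```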
Simplifying the LOWESS Boundary

The LOWESS function does a good job of smoothing the contours, but does not result in a nice, simple, closed-form algorithm for determining the safety of a lead. To that end, I offer the following algorithms that are similar in spirit to James' suggestion. I arrived at these 'by eye' by trying to find integer values that would still lead to good fits with the smoothed contours.

For a 100% safe lead:
• If there are 12 or fewer minutes remaining: subtract 5 from the margin, square that value, and multiply the result by 3. If the result is greater than the time remaining in seconds, then the lead is 100% safe.
• If there are more than 12 minutes remaining: if the margin is greater than 20, the lead is 100% safe.

For a 95% safe lead:
• If it is the second half: subtract 4 from the margin, square that value, and multiply the result by 6. If the result is greater than the time remaining in seconds, then the lead is 95% safe.
• If it is the first half: if the margin is greater than 18, the lead is 95% safe.
For those who prefer an algebraic representation, consider a current margin m with a time remaining in seconds of t; then the lead is safe if:

• James: (m − 3 − ½)² > t when the trailing team has possession; (m − 3 + ½)² > t when the leading team has possession
• 100% safe: 3(m − 5)² > t when t ≤ 12(60) = 720; m > 20 when t > 720
• 95% safe: 6(m − 4)² > t when t ≤ 20(60) = 1,200; m > 18 when t > 1,200
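The same rules written as small Python functions (m is the current margin, t the seconds remaining); like James' rule, they are meant for sizeable leads, since the contours were not estimated for tiny margins early in a game:

```python
def safe_100(m, t):
    """100% safe lead per the simplified boundary above."""
    if t <= 12 * 60:
        return 3 * (m - 5) ** 2 > t
    return m > 20

def safe_95(m, t):
    """95% safe lead per the simplified boundary above."""
    if t <= 20 * 60:                  # second half
        return 6 * (m - 4) ** 2 > t
    return m > 18

# Up 12 with five minutes (300 seconds) remaining:
print(safe_100(12, 300), safe_95(12, 300))   # False True  (6 * 8**2 = 384 > 300)
```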
All these boundaries are shown in Figure 8. I've fudged a little bit and shown the boundaries as continuously connected to keep them 'pretty'; whereas, with the algorithms above, the piecewise parts 'miss' each other by a little bit. As an example of these formulas in action, you can verify that with five minutes remaining, James would consider a lead safe if the margin was 20 or 21 (depending on possession); whereas, the new boundaries would suggest a 100% safe lead with a margin of 15 and a 95% safe lead with a margin of only 12.

Based on James' original article, it is clear he is considering a 100% boundary. Even still, based on this work, it appears his boundary is substantially conservative. Near the end of the game, however, he is right on the mark. If I may be so bold as to guess, I would suppose he formulated his method based primarily on the end of games. For example, in the era of the three-pointer, no three-point lead should ever be considered safe, hence the "margin minus 3" portion of his formula. Also, adding or subtracting a half point based on possession seems more suited to the end of the game—it is hard to believe possession affects the safety of a lead with 10 or 15 minutes left. But in any case, his method seems solid at the end of the game; it gradually becomes too conservative with more time remaining; and with 10 or more minutes left, it is increasingly inaccurate.

Note that I propose a flat top to the boundaries. On one hand, it makes sense for the boundaries to be ever increasing with more time remaining because, from the point of view of a comeback, the more time left, the larger a comeback that could conceivably be made. On the other hand, it makes no sense for the boundaries to keep increasing because, from the point of view of building a lead, it is increasingly impossible to build a gigantic lead when less time has elapsed. This leads to a lack of the sample size necessary to estimate what the boundary would look like early in the game (refer to Figure 4 to see the dwindling sample size as time remaining increases for large margins). Until North Carolina starts spotting 20 points to The S. Freud Institute of Art Education (the Shameful Inkblots, I think), we won't be able to resolve this issue. That is, we would need an artificial situation where the margin was non-zero at the start of the game to begin to fill out the small sample sizes for large margins early in the contest.
Figure 8. Plot of safety boundaries comparing James' suggestions (solid), LOWESS-smoothed contours of 100% and 95% safety (dashed), and simplified algorithmic boundaries of 100% and 95% safety (heavy solid).
Future Directions

There are myriad directions to go. Regarding refining the current study, the first improvement would be to include more data. With a larger sample size—encompassing other seasons with any luck—we could have more confidence that we identified the correct contours of safety. Also, more work could be done in developing simplified boundary algorithms. I chose the quadratic form James suggested, although I added a piecewise constant term for earlier parts of games. I think piecewise linear or other approaches could be fruitful. The key, of course, is to end up with something easy enough to remember so it has some chance of being used.

Another area of interest would be refining the definition of safety. The definition used here is simple: Every lead is either safe or not safe depending on whether the leading team goes on to win the game (in regulation). But reality suggests there is a degree of safety to every lead. A 20-point lead is safer than a 10-point lead at a particular time. An iterative approach could be envisioned where the naïve definition of safety is used to estimate contours of relative safety, and then the relative safety measure would be used to update the estimate in the next iteration. Perhaps this would convey more information at each margin/time-remaining point due to using a continuous measure of safety, rather than a dichotomous one.

Another important point is that the data used in this study are correlated. For an individual game, it is obvious that the margin with 59 seconds left is highly dependent on what the margin was with 60 seconds left. Therefore, when looking at all games, the information found at 59 seconds remaining shares much of the same information as that found at 60 seconds remaining. For correlated data in general, the concern is with inaccurate estimation of standard errors, over-representation of the effective sample size, and the effect on resulting hypothesis tests and confidence intervals. In general, there is not a concern with the accuracy of point estimates (unbiasedness). The methods used in this study do not rely on tests or intervals, so they should be relatively unaffected by this issue. One could consider extending this study to attempt to find confidence bands for the safety boundaries. In that case, the correlated data issue would likely need to be addressed.

The last area of future work I'd like to suggest is moving into other sports. This same type of analysis would be interesting in every major sport, and each sport would bring unique challenges. For example, in football, there are big jumps in score (2, 3, 6, 7, or 8 points can be scored at any one time point if extra point tries are not timed) and relatively low scoring. In hockey or soccer, there are very low scores in general. In golf and baseball, there is no "time remaining" and the unit until the end of the game would be holes and outs, respectively. I think all these obstacles would be surmountable and the results would be as fascinating as the basketball results have been.
Conclusion

In this study, I have attempted to quantify how safe a lead is in college basketball games. I estimated boundaries between safe and not-safe leads based on the time remaining in a game, and I reported simple algorithms for estimating 100% safe leads and 95% safe leads. Now, based on my analysis, it appears that overcoming a 20-point deficit with 12 minutes remaining is not possible (it is a 100% safe lead). This is my study, and even I don't believe that. As sure as the sun will rise tomorrow, someone somewhere is reminiscing about the time the heroic Holy-Moleys from Moldy Old U came back from 80 points down with only seven shakes of a lamb's tail remaining against the nefarious Nostalgia State Ne'er-Do-Wells while fielding only half a squad of convicts and third-graders back when men were men and you needed a ladder to get the ball out of the peach basket (Did you get all of that?). I don't dispute the point.

The lesson, I think, is that it is difficult (impossible, really) to estimate when a lead is 100% safe. For just about any rule you could devise, someone could find an exception—until you devise a rule so conservative as to be useless. At the end of the day, it is safer to stick with 95% safety. That boundary is almost certainly estimated more accurately, and it has a more comforting and believable interpretation. If a lead is 95% safe, then we would expect 19 out of 20 games that find themselves in such a situation to resolve themselves as expected. And if the unexpected happens, don't blame me, just celebrate the mystery of sport.
Further Reading

Cleveland, W.S. (1979) "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association, 74:829–836.

Cleveland, W.S. (1981) "LOWESS: A Program for Smoothing Scatterplots by Robust Locally Weighted Regression." The American Statistician, 35:54.

James, Bill. (2008) "The Lead Is Safe." Slate, www.slate.com/id/2185975.

Simonoff, Jeffrey S. (1996) Smoothing Methods in Statistics. New York: Springer.
Light It Up
Predicting the winner of an NBA game before the end
Katherine McGivney, Ray McGivney, and Ralph Zegarelli
The idea for this article occurred near the end of a closely fought basketball game at the University of Hartford. "Was there," one of the authors wondered, "a way to know the eventual winner at that moment and thus spare him the nail-biting moments he had endured in the past?" In other words, where was Red Auerbach when we needed him?

Longtime followers of the Boston Celtics may recall a practice of their brilliant, but controversial, former coach. While sitting on the bench late in a game, Auerbach would ceremoniously reach into his coat pocket, take out a cigar, and slowly light it as a signal that his team was assured a victory. As far as anyone can recall, including veteran Boston sportswriters, Auerbach was never wrong, mainly because he didn't light up until the Celtics held a substantial lead with little time to play. Consequently, the cigar did, indeed, signal a 'sure bet.' While gamblers dream about a sure bet, we provide the next best thing—easy-to-apply tests that predict, with surprisingly high probability, the winner of a basketball game before it ends.
The First Test

To be practical for a fan watching a game in an arena or on TV, the test should satisfy two conditions: it should depend on only data observed during a game (e.g., not records of past performances) and it should be easy to apply. An obvious source of relevant data is the scoreboard, which records the current scores of each team and the time remaining in a period (either a half or a quarter, depending on the level of play). Test 1 satisfies both the stated conditions. Let D represent the absolute value of the difference in the scores and T represent the number of minutes left in the game (seconds are disregarded).

Test 1: Select the team that is ahead when D > T is first satisfied.

For example, consider the play-by-play data from the fourth quarter of a
professional basketball game between the Los Angeles Lakers and the Phoenix Suns played on April 23, 2006. (Similar data can be found at http://scores.espn.go.com/nba/scoreboard.) Values for T can be found in Column 1. Corresponding values for D can be calculated from the data in Column 3. By checking the entire play-by-play sheet, it was established that neither team satisfied the conditions of Test 1 until the time frame in Table 1. With 5:48 left to play, Leandro Barbosa made a two-point shot that gave the Suns a six-point lead with five minutes to play. At this moment, the projected winner—the Suns—was selected. And, indeed, the Suns did go on to win, 107–102.

Table 1 — Part of the Play-by-Play Account of a Game on the ESPN Web Site

Time Left in the Game | Los Angeles Lakers Have the Ball | Score (Lakers–Suns) | Phoenix Suns Have the Ball
6:20 | Bryant personal foul (Barbosa draws the foul) | 82–86 |
6:20 | | 82–86 | Leandro Barbosa defensive rebound
6:20 | | 82–87 | Leandro Barbosa makes free throw 1 of 2
6:20 | | 82–88 | Leandro Barbosa makes free throw 2 of 2
6:03 | Kobe Bryant makes driving lay-up | 84–88 |
5:48 | PREDICT THE WINNER HERE | 84–90 | Leandro Barbosa makes two-point shot
5:35 | | 84–90 | Boris Diaw shooting foul (Bryant draws the foul)
5:35 | | 84–90 | Phoenix full timeout
5:35 | Kwame Brown enters the game for Brian Cook | 84–90 |

Before continuing, it is important to underscore two aspects of Test 1. First, the team that is ahead when Test 1 is first satisfied is the projected winner. That team may never satisfy Test 1 again. Furthermore, the other team may satisfy Test 1 at a later stage of the game. Whatever the case, the criterion is met at most once. Second, it is possible that Test 1 may never be applied. This case occurred in the first round of the 2006 NCAA Division I Men's Basketball Tournament in a game between Tennessee and Winthrop. The score was tied in the last minute of a game in which the test criterion was never satisfied. Tennessee won when one of its players made a basket as time ran out.
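Test 1 is equally easy to automate; here is a small Python sketch that scans a simplified play-by-play (the tuple layout is mine, not ESPN's format):

```python
def test1_prediction(plays):
    """plays: (minutes_left, score_a, score_b) tuples in game order, seconds dropped.
    Returns 'A' or 'B' for the first team to satisfy D > T, or None if it never happens."""
    for minutes_left, score_a, score_b in plays:
        if abs(score_a - score_b) > minutes_left:
            return "A" if score_a > score_b else "B"
    return None

# The Table 1 example, with the Lakers as team A and the Suns as team B:
plays = [(6, 82, 86), (6, 82, 87), (6, 82, 88), (6, 84, 88), (5, 84, 90)]
print(test1_prediction(plays))   # 'B': the Suns, flagged at 5:48 when D = 6 > T = 5
```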
Early Results
Table 2 — Results of the 192 Games Charted Test 1
Winners/Predicted
Percentage of Winners
All Games
179/192
93.2
Home Team
104/109
95.4
Away Team
75/83
90.4
Table 3 — Games Won per Quarter When Test 1 Was Invoked Test 1
3rd Quarter 4th Quarter Only Only
Last Six Minutes of the Game
Last Three Minutes of the Game
Wins/ Predictions
73/73
103/116
50/60
12/17
Percentage
100
88.8
83.3
70.6
20
Number of times test1 was invoked
18 16 14 12 10 8 6 4 2 0 0
5
10
15 20 Time remaining in the game
25
30
During the 1987–1988 NBA season, long before ESPN kept play-by-play records, a friend of one of the authors agreed to chart 25 Boston Celtics games basket-by-basket before he asked for relief. His data showed Test 1 selected the eventual winner in 22 of the 25 games (88%). In March of 2006, we applied Test 1 to 60 of the 63 games played in the NCAA Men’s Division 1 Basketball Tournament, better known as March Madness. (At that time, ESPN did not retain play-by-play data as long as they now do, and we could not reclaim scores of three games.) Test 1 predicted the winner in 55 of these 60 games, a 91.7% record. In March of 2007, we charted 63 games of March Madness (excluding only the “play-in” game that determines the 64th seed). Four of these games went into overtime. Of the 59 games that did not go into overtime, Test 1 chose the winner in 55 (93.2%). If you count the four overtime games as failures for Test 1, the percentage is 87.3% (55/63). The only first round ‘losers’ involved Virginia Tech, which had a 12-point lead over Illinois with 9:54 left and lost, and Virginia Commonwealth University, which had an 11-point lead over Duke with 10:54 to play and lost. The remaining two games were in the second round, where Ohio State had an eight-point lead over Xavier with 7:54 to play, and Texas A&M, which had a six-point lead over Louisville with 5:59 remaining. Both Ohio State and Texas A&M eventually lost in regulation time.
Figure 1. A frequency distribution for Test 1. Each ordered pair (T, N) represents the number of times (N) Test 1 was invoked with T minutes left in a game. For example, Test 1 was invoked only once when T = 0 (during the last minute of play) and 19 times with five minutes to play. The earliest it was invoked was with 25 minutes to play.
and Winthrop. The score was tied the last minute of a game in which the test criterion was never satisfied. Tennessee won when one of its players made a basket as time ran out.
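For readers who would like to trace the rule on data such as Table 1, the short sketch below (ours, in Python; it is not part of the authors' analysis) applies the Test 1 criterion, choosing the team that is ahead the first time its lead D exceeds the number of whole minutes T remaining.

```python
# Minimal sketch of Test 1 on a few fourth-quarter rows transcribed from Table 1.
def minutes_left(clock, quarter=4, quarter_minutes=12):
    """Whole minutes left in regulation; seconds are dropped (5:48 -> 5)."""
    minutes, _seconds = (int(part) for part in clock.split(":"))
    return (4 - quarter) * quarter_minutes + minutes

def test1_pick(rows):
    """Return (team, clock) the first time D > T, or None if never satisfied."""
    for clock, lakers, suns in rows:
        if abs(lakers - suns) > minutes_left(clock):
            return ("Suns" if suns > lakers else "Lakers"), clock
    return None

fourth_quarter = [
    ("6:20", 82, 87),   # Barbosa free throw 1 of 2
    ("6:20", 82, 88),   # Barbosa free throw 2 of 2
    ("6:03", 84, 88),   # Bryant driving lay-up
    ("5:48", 84, 90),   # Barbosa two-point shot
    ("5:35", 84, 90),   # Diaw shooting foul
]

print(test1_pick(fourth_quarter))   # ('Suns', '5:48'): a six-point lead with T = 5
```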
Questions

A number of questions come immediately to mind. How ‘good’ is Test 1; that is, what is the probability the selected team wins the game?
Does the home team win a significantly higher percentage of the games in which it is the predicted winner? Predictions made ‘late’ in a game are probably less reliable than those made earlier. How much so? What percentage of times will the prediction be made in the last quarter? In the third quarter? What is the distribution of the prediction times?
Data for This Article

Although March Madness is an exciting spectacle, we decided to base this article on professional basketball games for several reasons. First, as March Madness allows only the top 64 teams in the country to compete, it does not represent a good cross-section of college basketball teams. Furthermore, because there are more than 30 conferences playing NCAA men’s basketball, finding a reasonable sampling method was daunting. Also, ESPN keeps NBA play-by-plays for the entire year, which enabled us to double-check data as necessary. Finally, in 1992, Harris Cooper, Kristina M. DeNeve, and Frederick Mosteller published a similar article, “Predicting Professional Game Outcomes from Intermediate Game
Scores,” in CHANCE, and we thought it would be interesting to compare some of our results with theirs. A professional basketball game consists of four 12-minute quarters. Between October 14, 2006, and March 28, 2007, we analyzed ESPN play-by-plays of 212 NBA regular season games before deciding to preserve what was left of our collective sanity. On most days NBA games were played, we recorded every fourth score on the ESPN web site, moving down from the upper left corner of the screen. Of these 212 games, 20 ended in a tie at the end of regulation and were deleted from further analysis. The remaining 192 games represented 15.6% of the games played during the regular season that year. Several values were collected for each game: the date, the home and away teams, which team scored first, the predicted winner using each of our tests, the actual winner, scores at the end of three quarters, and values for D and T. In 2006–2007, the NBA was organized in two conferences, each having three divisions. Our data were weighted a bit toward the Eastern Conference teams, which appeared in 57% of the games, as opposed to Western Conference teams. The three eastern divisions (Atlantic, Central, and Southeast) appeared in 19%, 20%, and 19% of the games, respectively, and the western divisions (Northwest, Pacific, and Southwest) appeared in 12%, 15%, and 16% of the games, respectively.
Figure 2. Each ordered pair (T, P) represents the probability that Test 1 successfully predicted the winner with T minutes left in a game. The logistic model for these data is P = 0.987 / (1 + 0.960e^(−0.597T)).
Results
Of the 192 games charted, Test 1 selected the eventual winner in 179 games; that is, 93.2% of the time. (See Table 2.) This figure is consistent with earlier data we collected. Test 1 had an even higher accuracy when the home team first satisfied the criterion. This occurred in 109 games, of which the home team won 104 (95.4%). The away team first satisfied the criterion in 83 games. It went on to win 75 of these games—an accuracy of 90.4%. Figure 1 shows the frequency distribution for the time when the criterion for Test 1 was satisfied. The ordered pair (T, N) represents the number of times (N) that Test 1 was invoked with T minutes left in a game. The mean is 9.95 minutes, the mode is five minutes, and the range is 0–25 minutes. This scatterplot has a decreasing trend, suggesting that few bets are
Figure 3. A frequency distribution for Test 2. Each ordered pair (T, N) represents the number of times (N) that Test 2 was invoked with T minutes left in a game. For example, Test 2 was invoked most frequently (19 times) with 10 minutes to play in the game. The earliest it was invoked was with 41 minutes to play.
made early in a game (large values of T) and more bets are made later in a game (small values of T). Note that Test 1 was applied in the first half only once (T = 25). Furthermore, previous experience suggested Test 1 would be invoked more frequently in the last quarter than in the third. Table 3 verifies this. Test 1 was invoked 73 times in the third quarter and 116 times in the fourth quarter. On the other hand, we might expect the accuracy of Test 1 would be an increasing function of T, as ‘early’ predictions (large values of T) indicate large leads, which are generally difficult to overcome. Figure 2 plots the ordered pairs (T, P), where P is the probability that Test 1 successfully predicts the winner with T minutes left in a game. A logistic model for these data also is graphed. One last note: The team that scored first in the game won 53.6% (103/192) of the games. Home teams enjoyed a great advantage in this area. Home teams scoring first went on to win 61.4% (62/101) of the time; away teams 45.1% (41/91). We’ll return to these statistics later.
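The fitted curve in Figure 2 can be read off directly. A small sketch (ours) evaluates it for a few values of T; the coefficients are simply those reported in the caption.

```python
# Sketch: evaluate the Figure 2 logistic fit for the accuracy of Test 1,
# P(T) = 0.987 / (1 + 0.960 * exp(-0.597 * T)).
import math

def p_test1(T):
    return 0.987 / (1.0 + 0.960 * math.exp(-0.597 * T))

for T in (1, 5, 10, 20):
    print(T, round(p_test1(T), 3))
# Accuracy climbs quickly with T, since early picks rest on large leads.
```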
Figure 4. Each ordered pair (T, P) represents the probability that Test 2 successfully predicted the winner with T minutes left in a game. The logistic model for these data is P = 0.930 / (1 + 3797e^(−2.80T)).
Figure 5. A composition of the scatterplots in Figures 2 and 4 comparing the relative accuracy of Test 1 and Test 2. The boxes represent the data for Test 1 (0 ≤ T ≤ 20), and the circles represent the data for Test 2 (0 ≤ T ≤ 41). Once again, each ordered pair (T, P) represents the probability that the given test successfully predicted the winner with T minutes left in a game. Test 1 is the more accurate predictor in all but one case (T = 10).
A Second Test

We then wondered how drastically the results for Test 1 would change if we made predictions even earlier in a game using Test 2. Test 2: Select the team that is ahead when D > T/2 is first satisfied. For example, a team with a six-point lead could now be chosen with 11 minutes to play, instead of five. Given that the team that is behind has roughly twice the time to catch up, it seems Test 2 would not be nearly as reliable as Test 1. Yet, the corresponding data for the two sets are remarkably similar (see Table 4). Of the 192 games we charted, Test 2 predicted the eventual winner in 171 games—an accuracy rate of 89.1%. (Test 1 accuracy was 93.2%.) In addition, the home team satisfied the condition of Test 2 in 107 games and was the eventual winner in 99 of them, an accuracy of 92.5% (compared with 95.4% using Test 1). The away team satisfied the criterion of Test 2 in 85 games and went on to win 72 of them, an accuracy of 84.7% (compared with 90.4% using Test 1). Figure 3 shows the frequency distribution for the time when the criterion for Test 2 was satisfied. The ordered pair (T, N) represents the number of times
(N) Test 2 was invoked with T minutes left in a game. The mode and mean for T using Test 2 are about twice those using Test 1, which is not surprising. The mean is 17.9 minutes, the mode is 10 minutes, and the range is 0–41 minutes. Test 2 can often be invoked much earlier than Test 1 (see Figure 4). On March 23, 2007, the Dallas Mavericks led the Atlanta Hawks by 21 points when there were 41 minutes left to play in the game (that is, after only seven minutes had been played). Dallas went on to win the game by seven points. Of the 48 minutes in a game, the time interval from 12 to 24 minutes is referred to as the “second quarter.” The data in Table 5 show that in 42 games, Test 2 was invoked in this quarter, and, in each case, the team predicted to win did win. Finally, the predictive accuracy of Test 1 is compared with that of Test 2 in Figure 5. The light boxes represent the probabilities that Test 1 picks the correct winner; the dark circles represent the probability that Test 2 predicts the correct winner. (Note that for the games we charted, Test 1 was invoked only for values of T ≤ 25, while Test 2 was invoked for values of T ≤ 41.) In every case but one (T = 10), Test 1 proved to be the more accurate predictor, which is what we would expect.
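The two criteria belong to one family: select the leader the first time D > T/k, with k = 1 for Test 1 and k = 2 for Test 2. The sketch below (ours, with a made-up score trajectory rather than the authors' data) shows how the looser criterion fires earlier in the same game.

```python
# Sketch: first trigger time of the family of tests D > T/k on one game track.
def first_trigger(track, k):
    """track: (whole minutes remaining, signed margin) pairs in game order.
    Returns (T, margin) the first time |margin| > T/k, or None."""
    for T, margin in track:
        if abs(margin) > T / k:
            return T, margin
    return None

game = [(40, 3), (35, 8), (30, 12), (24, 10), (18, 11), (11, 6), (5, 6)]

print("Test 1:", first_trigger(game, 1))   # fires late, on a lead bigger than T
print("Test 2:", first_trigger(game, 2))   # fires earlier, on a smaller relative lead
```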
Article by Cooper et al.

In the 1992 CHANCE article by Cooper, DeNeve, and Mosteller, the authors studied how teams leading “late in the game” eventually fared in professional basketball, baseball, football, and hockey games. For basketball, “late in the game” meant after three quarters. They found that teams leading at that point went on to win 79.4% of the time. They further determined that home teams had a huge advantage: home teams leading after three quarters won 90% of the time, while visiting teams that led won only 68% of the time. Our data yielded similar results. Excluding games tied at the end of the third quarter and several games whose data were irretrievable, teams ahead at the end of three quarters won 84.3% of the time (151/179), and home teams in this group won 87.4% of these games. Our result for away teams was somewhat higher (80.3%) than the authors’ result.
More to Think About

Other tests to explore are limited only by one’s imagination. It would be interesting to analyze web sites in more detail to estimate shooting percentages for home teams and away teams and investigate what effect this information would have on the accuracy of the three tests. For the truly ambitious, shooting percentages might be estimated for the third and fourth quarters separately. Suppose the winner would not be predicted until the test criterion is satisfied for two consecutive minutes. For example, using Test 1, we would require that for some T, D > T for the first time in the game and that D > T – 1 still holds one minute later. The accuracy of Test 1 should be improved. By how much? One might also investigate a test where the criterion is D > T ± K for different natural numbers K. A similar set of tests might be designed for baseball, where the first team to be ahead by more runs than there are innings left to play is selected as the winner.
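The two-consecutive-minutes refinement is easy to prototype. The sketch below (ours, with illustrative margins) makes the pick only when D > T holds and D > T − 1 still holds a minute later for the same team.

```python
# Sketch of the proposed confirmation rule for Test 1.
def confirmed_test1(track):
    """track: (whole minutes remaining, signed margin), one entry per minute,
    in decreasing order of T. Returns the confirming (T, margin) or None."""
    for (T, margin), (T_next, margin_next) in zip(track, track[1:]):
        same_leader = (margin > 0) == (margin_next > 0)
        if abs(margin) > T and abs(margin_next) > T_next and same_leader:
            return T_next, margin_next
    return None

track = [(8, 7), (7, 8), (6, 9), (5, 4), (4, 6), (3, 5)]
print(confirmed_test1(track))   # (6, 9): the lead held up for a second minute
```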
Table 4 — Results of the 192 Games Charted, Test 2

          | Winners/Predicted | Percentage of Winners
All Games | 171/192           | 89.1
Home Team | 99/107            | 92.5
Away Team | 72/85             | 84.7
Table 5 — Results of 42 Games Charted, Test 2

                 | 2nd Quarter Only | 18–23 Minutes Left | 12–17 Minutes Left | 6–11 Minutes Left
Wins/Predictions | 42/42            | 44/50              | 38/47              | 33/40
Percentage       | 100%             | 86%                | 80.9%              | 82.5%
The authors also found that home teams won roughly 64% of their games. Using our data, the home team won 112 of the 192 games (58.3%). This corresponds nicely with information from the Association for Professional Basketball Research that home teams in the NBA won 59.7% of the games in 2006–2007.
Further Reading

Caudill, Steven B. (2003) “Predicting Discrete Outcomes with the Maximum Score Estimator: The Case of the NCAA Men’s Basketball Tournament.” International Journal of Forecasting, 19:313–317.
Caudill, Steven B. and Godwin, Norman H. (2002) “Heterogeneous Skewness in Binary Choice Models: Predicting Outcomes in the Men’s NCAA Basketball Tournament.” Journal of Applied Statistics, 29:991–1001.
Cooper, Harris; DeNeve, Kristina M.; and Mosteller, Frederick. (1992) “Predicting Professional Sports Game Scores from Intermediate Game Scores.” CHANCE: New Directions for Statistics and Computing, 5(3–4):18–22.
Hu, Feifang and Zidek, James V. (2002) “Forecasting NBA Playoff Outcomes Using the Weighted Likelihood.” In A Festschrift for Herman Rubin, 385–395. IMS Press: Fountain Hills, Arizona.
Monte Carlo Simulation

One of the authors wrote a program to simulate NBA basketball games using Java 6.0, which can be downloaded at www.java.com/en. The program can be found at http://athena.cs.hartford.edu/basketball. To obtain source code for the program, contact the authors at
[email protected]. Each game was played under the following conditions:
• The total playing time was 48 minutes.
• Each team scored 0, 1, 2, or 3 points on each possession. (A team can score 4 or more points on a possession, but this is extremely rare.) Surveying a dozen or so NBA games, we estimated the probabilities of these outcomes were 0.507, 0.046, 0.370, and 0.077, respectively.
• We also estimated the average possession for each team was 15 seconds. Consequently, each team had 96 possessions per game. In the NBA, each team can have possession of the ball for 24 seconds at most.
• A random number generator was used to determine the points scored on each possession.
• Selections of projected winners were made only during regulation time, and games that were tied at the end of regulation time were eliminated.
Each simulation consisted of 1 million games, which is equivalent to more than 800 seasons of NBA play. Table 6 lists data for one simulation using Test 1. It is interesting to note that the crucial variable (percent winners of games bet) compares favorably with the result we found earlier using data for 192 NBA games (92.2% to 93.2%). Out of longtime rooting interests of the authors, we labeled one team Celts and the other Mavs, but any questions about home teams versus away teams were not answered by our simulation. The mean for T in the simulation is 11.54. The mean using NBA games is 9.95. Test 1 was invoked in 999,734 games, and the predicted winner won 921,255 of these—an accuracy rating of 92.2%. Note that 266 games were sufficiently close throughout that no prediction could be made. TL represents the time left in a game when the bet (prediction) is made. On average, predictions were made when there were 11 minutes left on the clock; that is, early in the fourth period. Table 7 provides minute-by-minute analysis of the simulation. Note that one team had a 33-point (or more) lead with 32 minutes to play and yet lost.
Table 6 — Statistical Results for 1 Million Simulations of NBA Basketball Games Using Test 1
Figure 6. A regression model for the data in Table 7: P = 0.930 / (1 + 3797e^(−2.80T)).
Table 7 — Minute-by-Minute Statistical Results for 1 Million Simulations Using Test 1
Note: “Time” represents the time left in a game when the prediction is made. “Prob” represents the probability that the predicted winner actually won.
Impossible? Not to those of us who suffered through a game on Christmas Day in 1986 when the New York Knickerbockers, after falling behind by 30 points at halftime, rallied to beat the Celtics. The probabilities versus time data in Table 7 (columns 5 and 1) have a strongly increasing trend. Figure 6 graphs these data and the corresponding logistic model. We also simulated games using Test 2 and found that 85% of the predicted winners actually won, which again compares favorably with the result of 89.1% in the 192 NBA games. Because of the close correlations between the simulated and empirical data using Tests 1 and 2, we used the program to simulate the outcomes using an even more extreme third test for which we did not gather data from NBA games.
Test 3: Select the team that is ahead when D > T/4 is first satisfied. For example, a team with a six-point lead could now be chosen with 23 minutes to play. This seems like ample time for the team that is behind to catch up and win, and, indeed, it may. But the team selected by Test 3 still won 75.1% of the games. It is interesting that Test 3 allows most predictions to be made midway through the first half of the game. We finally considered the test criterion D > T/48. Since an NBA game lasts for 48 minutes, the condition is satisfied when either team first scores. Therefore, this test dictates that the team scoring first be selected as the winner. A simulation showed this test predicts the winner 55% of the time. Our empirical result for the 192 NBA games was 53.6%.
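The simulation itself is straightforward to reproduce. The sketch below is our own re-implementation in Python of the setup described above (the authors' program was written in Java); it checks the Test 1 criterion after each pair of 15-second possessions, a small simplification, and discards tied games.

```python
# Sketch of the possession-level Monte Carlo described above.
import random

POINTS = (0, 1, 2, 3)
PROBS = (0.507, 0.046, 0.370, 0.077)    # estimated per-possession outcomes

def simulate_game(rng):
    """96 alternating 15-second possessions per team; return the score track."""
    a = b = 0
    track = []                           # (seconds elapsed, score A, score B)
    for pair in range(96):
        a += rng.choices(POINTS, PROBS)[0]
        b += rng.choices(POINTS, PROBS)[0]
        track.append((30 * (pair + 1), a, b))
    return track, a, b

def test1_pick(track):
    """Leader the first time the lead exceeds whole minutes remaining, else None."""
    for seconds, a, b in track:
        if abs(a - b) > (48 * 60 - seconds) // 60:
            return "A" if a > b else "B"
    return None

def run(n_games=100_000, seed=1):
    rng = random.Random(seed)
    bets = wins = 0
    for _ in range(n_games):
        track, a, b = simulate_game(rng)
        if a == b:
            continue                     # tied games are eliminated, as in the article
        pick = test1_pick(track)
        if pick is None:
            continue                     # no prediction was ever possible
        bets += 1
        wins += pick == ("A" if a > b else "B")
    return wins / bets

print(round(run(), 3))   # the authors' Java simulation reported about 92% for Test 1
```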
Here’s to Your Health Mark Glickman, Column Editor
‘We All Survived’ and Other Failings of Risk Perception Stephanie R. Land
The following prose has been making the rounds of widely circulated email:

Congratulations to all the kids who were born in the 1930s 40s, 50s, 60s, and 70s!
>First, we survived being born to mothers who smoked and/or drank while they carried us.
>They took aspirin, ate blue cheese dressing, tuna from a can, and didn’t get tested for diabetes.
>Then, after that trauma, our baby cribs were covered with brightcolored, lead-based paints.
>We had no child-proof lids on medicine bottles, doors, or cabinets, and when we rode our bikes, we had no helmets, not to mention the risks we took hitch-hiking.
>As children, we would ride in cars with no seat belts or air bags. Riding in the back of a pick-up on a warm day was always a special treat.
>We drank water from the garden hose and NOT from a bottle.
>We shared one soft drink with four friends from one bottle and NO ONE actually died from this.
>We ate cupcakes, white bread, and real butter and drank soda pop with sugar in it, but we weren’t overweight because …
>WE WERE ALWAYS OUTSIDE PLAYING!!
>We would leave home in the morning and play all day, as long as we were back when the streetlights came on. No one was able to reach us all day. And we were O.K.
>We would spend hours building our go-carts out of scraps and then ride down the hill, only to find out we forgot the brakes. After running into the bushes a few times, we learned to solve the problem. We fell out of trees, got cut, broke bones and teeth, and there were no lawsuits from these accidents.
>We ate worms and mud pies made from dirt, and the worms did not live in us forever.
>We were given BB guns for our 10th birthdays, made up games with sticks and tennis balls, and, although we were told it would happen, we did not put out very many eyes.
>We had freedom, failure, success, and responsibility and we learned HOW TO DEAL WITH IT ALL! And YOU are one of them! CONGRATULATIONS!
>You might want to share this with others who have had the luck to grow up as kids, before the lawyers and the government regulated our lives for our own good. Kind of makes you want to run through the house with scissors, doesn’t it?!
The Selectivity of Risk Perception

What are the statistical arguments around these issues? One is selection bias, that those who perished undertaking risky behavior are not adequately represented in the present discussion. If we judge by the experiences of our current friends and acquaintances, it would seem the probability of surviving childhood is extremely high. Not surprisingly, Google found no sites with the title of this prose and the phrase “selection bias,” nor did any of the discussions I read hint at this reasoning. There are social forces at work here too, because tragic events in the past are not often recalled in ordinary conversation. The cause of a child’s death is an even greater taboo than the death itself, especially when benign negligence of a parent is a possible cause. The matter is handled differently, however, when the cause of death is a crime. A local public radio news representative acknowledged a few years ago, after being asked about the radio station’s sensational content, that they report a death when it reaches the police blotter. Murders, abductions, and Halloween candy tampering are the subject of national news. We need to know. But, if harm comes to a child because of a well-intentioned parent’s failure to take ordinary safety measures, that news causes too much discomfort for the listener. So, there is selection bias at multiple levels affecting the risk information we receive, both in who is present to discuss childhood antics and what information will be shared about those who have been victims. This selective information, where those who survive are discussed while those who succumb are not, plays into another feature of human risk perception: We are more responsive to our experiences than we are to
probabilistic information. For that reason, we are more influenced by the survival experiences of our friends than by the epidemiologic data regarding a risky behavior. Authors have suggested that this is explained by evolution. The availability of epidemiologic data is modern. Our ancestors learned by witnessing what happened to them and their neighbors when exposed to dangers such as wild animals and poisonous plants. In the Internet discussion of the “congratulations” piece, there were those rational discussants who disputed the conclusions, but their arguments were often based on individual accounts of children hurt or lost. Our experiential risk perception leads to over-estimates of personal risk for those who have witnessed harm and under-estimates for those who have not. Selectivity of information is also pertinent in a discussion of child abduction. A recent U.S. Department of Justice study found that the sort of abduction we fear most, “stereotypical kidnapping” (i.e., a kidnapping by a stranger or slight acquaintance that lasts overnight), was perpetrated on an estimated 115 children in the United States in one study year (0.0002% of the U.S. child population in that year). Of the other types of abductions, children were almost always (>99%) returned without physical injury. This example raises another challenge in risk communication. When presented with small probabilities, people tend to either overestimate them or truncate them to zero. The overestimation of risks among those who have witnessed harm was evidenced last year in Pittsburgh when a woman and her young children were kidnapped and held at gunpoint for two hours. The perpetrator wanted only money and left them untouched, but the incident shocked the community of mothers who frequented the place where the kidnapping occurred. Those women did not literally witness the event, but were affected much more profoundly because of the familiarity of its circumstances.
Good Practice in Risk Communication

How can we convey risk information in a way that enables people to make informed decisions? Behavioral
researchers have found that communicating risk is more effective when provided in terms of incidence, rather than probabilities or proportions, and with comparative information to provide context. The following example from www.quitsmokingsupport.com includes these features: Smoking kills over 400,000 people a year—more than one in six people [who die] in the United States—making it more lethal than AIDS, automobile accidents, homicides, suicides, drug overdoses, and fires combined. Comparative information might also be personalized. In one study, researcher Isaac Lipkus and colleagues gave college student smokers their comparative lung age based on lung function tests. (For example, a student might be told he has the lung function of an average 30-year-old.) However, with tailored risk information, another bias appears: Information is downplayed when it is not favorable. With increasing lung age, smokers rated the feedback as less relevant and reported breathing more easily while undergoing lung function testing than smokers with lower lung age. Isaac Lipkus also found, in another study, that in conveying to smokers their personalized genetic susceptibility to lung cancer, those who received good news were more likely to remember and accurately interpret the result than those who were told they were at increased risk. Some authors have found that risk information is better communicated graphically, rather than numerically, whereas some have found that graphs do not necessarily improve accurate interpretation. Authors J. Ancker and I. Lipkus have provided systematic reviews of the graphical formats being used to display risk. Icon arrays, in which an image such as a stick figure is repeated to show a discrete count, make up one example that seems to be effective. This might be for reasons already discussed: The icons are interpreted as people, so the information is closer to being experiential. If our minds interpret the images as a group of neighbors or peers, we will respond more strongly to the risk information. Icon arrays may include icons for the number of subjects at risk, but with some icons darkened to display the number of victims. In some icon arrays, the darkened icons are randomly
placed. This is more effective at conveying randomness, but keeping the darkened icons clustered better enables the reader to interpret the probability. Bar charts also can be effective. As with icon arrays, the choice to display the number of victims as part of a larger bar with the number at risk, or to display only the number of victims, will affect the interpretation of comparative risk information. Figure 1 is an icon array illustrating the 10-year risk of lung cancer and “hard” coronary heart disease (myocardial infarction and coronary death) for a 67-year-old woman who has been smoking since age 17; typically smokes 25 cigarettes per day; and has borderline measures of blood pressure, HDL, and total cholesterol. This graph displays the first two major causes of death due to smoking (illustrated as mutually exclusive events). The U.S. Surgeon General reports that, since 1964, there have been 12 million smoking-related deaths in the United States, comprising 4.1 million deaths from cancer, 5.5 million from cardiovascular diseases, 1.1 million from respiratory diseases, and 94,000 fetal and infant deaths. Despite the literature about communicating risk and a wealth of numerical information, I found no source in my own Internet search that graphically conveyed the multiple health risks attributed to smoking. That is particularly unfortunate given that the same selection biases described for the congratulations piece are at play when discussing the risks of smoking. We hear loud proclamations about the grandfather who smoked until he was 90 years old, but a loved one’s suffering from emphysema is a sensitive subject. A person’s difficulty running to catch a bus or a man’s loss of erectile function are not subjects for conversation at all. There is an opportunity for the medical research community to present risk information in a format that will better inform the public and better counter the biases of anecdotal information. Breast cancer is another leading cause of death in which preventive measures are available, and much work has been done regarding the communication of risk in this setting. The task is even more challenging when multiple risks and benefits must be combined to make a decision about breast cancer prevention. For example, the drug tamoxifen reduces
Woman who will not develop lung cancer or heart disease within 10 years
Woman who will develop heart disease within 10 years
Woman who will develop lung cancer within 10 years
Source of estimates: www.mskcc.org/mskcc/html/12463.cfm (Bach model tool to compute lung cancer risk) and http://hp2010.nhlbihin.net/atpiii/calculator.asp?usertype=prof#hdl (National Heart Lung and Blood Institute tool for calculating the risk of coronary heart disease)
Figure 1. Icon array illustrating the 10-year risk of lung cancer and “hard” coronary heart disease (myocardial infarction and coronary death) for a 67-year-old woman who has been smoking since age 17; typically smokes 25 cigarettes per day; and has borderline measures of blood pressure, HDL, and total cholesterol. (Events are portrayed as mutually exclusive.)

a. Among 1000 women with your risk factors, who do not receive tamoxifen, the following diagram shows the expected number of cases in the next 5 years.
Life-threatening events
Other severe events
b. Among 1000 women with your risk factors, who do receive tamoxifen, the following diagram shows the expected number of cases in the next 5 years. Life-threatening events
Other severe events
Legend: Invasive breast cancer; In situ breast cancer; Uterine cancer; Stroke; Blood clot in lung; Blood clot in large vein; Hip fracture
Figure 2. (a) Benefits and risks associated with the breast cancer prevention drug tamoxifen, as presented to a hypothetical woman being recruited to the National Surgical Adjuvant Breast and Bowel Project (NSABP) STAR Trial. (b) The same benefits and risks in an icon array. To see this graph in color, go to www.amstat.org/publications/chance.
the risk of breast cancer in women and helps prevent osteoporosis. However, it also has some dangerous side effects, including blood clots, strokes, and endometrial cancer. For this reason, tamoxifen is approved for the prevention of breast cancer only in women who are at high risk of breast cancer. The National Cancer Institute sponsored a workshop in 1998 to develop materials to help women decide whether
to use tamoxifen. One of those materials was a table providing the number of various life-threatening events that would be expected among women either taking tamoxifen or not taking tamoxifen. The table was adapted for women being recruited to enter the National Surgical Adjuvant Breast and Bowel Project (NSABP) STAR trial, a breast cancer prevention study in which one treatment group would receive
[Figure 3 (plots a and b): deaths per 100,000. Source of data: Centers for Disease Control and Prevention, www.cdc.gov/nchs/datawh/statab/unpubd/mortabs/hist290.htm for data from 1930–1998; www.cdc.gov/nchs/datawh/statab/unpubd/mortabs/gmwk23a.htm for data from 1999–2005.]
Figure 3. All-cause mortality from 1930–2005 for infants (a) and children (b)
tamoxifen. When a woman’s risk factor information was entered, a table was generated for her based on the risks and benefits of tamoxifen. Figure 2 displays the same information in an icon array. Unlike Figure 1, this icon array displays icons for events only, not as a proportion of a large number of women at risk. It remains to be explored whether such a graphic would better enable women to interpret the multiple benefits and risks of their options. Communicating risk information is also difficult when the risks are small, such as in the kidnapping example. Risks presented as incidences are problematic because people have trouble differentiating between 1 in 100,000 and 1 in 1,000,000. No bar graph or icon array is feasible on this scale. Some authors have introduced magnifier risk scales
to graphically display small risks, in which the lowest portion of a linear scale is magnified in an inset, but this has been shown to distort the interpretation of risk. Comparative information can be useful, but is subject to the reader’s beliefs about the comparator. For example, the risk of a child being stereotypically kidnapped in one year is about double the risk of a person dying in a bathtub. As we’re talking about rare events, the reader is unlikely to have accurate knowledge of the risk of the event being used for comparison. Such comparisons can easily be manipulated to allay or raise fears.
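For readers who want to try the icon-array format, the following sketch (ours; it does not generate Figure 1 or Figure 2, and the counts are placeholders) draws a simple array of 1,000 squares with the affected cases clustered so the proportion is easy to read.

```python
# Sketch of a clustered icon array: darkened icons are grouped, not scattered.
import matplotlib.pyplot as plt

def icon_array(n_total=1000, n_event=45, cols=40, label="event"):
    rows = -(-n_total // cols)                     # ceiling division
    fig, ax = plt.subplots(figsize=(8, rows / 5))
    for i in range(n_total):
        r, c = divmod(i, cols)
        color = "crimson" if i < n_event else "lightgray"
        ax.add_patch(plt.Rectangle((c, rows - 1 - r), 0.9, 0.9, color=color))
    ax.set_xlim(0, cols); ax.set_ylim(0, rows)
    ax.set_aspect("equal"); ax.axis("off")
    ax.set_title(f"{n_event} of {n_total} expected to experience the {label}")
    return fig

icon_array(label="event of interest").savefig("icon_array.png", dpi=150)
```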
Are ‘Congratulations’ in Order?

How, then, can we convey the danger associated with the behaviors in the
congratulations piece? Let’s start with the original premise: We all survived. How many of our cohort did not? Figure 3 displays infant and childhood death rates from 1930–2005. Being a child became much safer between 1930 and 1950 (about when mass production of penicillin began) and has continued to become progressively safer since 1950. Figure 4 displays the trends in death rates per major cause of death from 1968–1992. The selected causes of death include most of the deaths that might result from behaviors applauded in the congratulations piece. Accidents not involving motor vehicles include, among other things, accidental poisoning from medicine bottles or cleaners kept in unlocked cabinets, deaths attributable to bicycle or go-cart accidents, and deaths from falling out of trees. Motor vehicle accidents might be the cause of death for a child allowed to roam free during the day, to hitchhike, or to ride in the car without a safety restraint. Hitchhiking or playing outside unsupervised also might result in homicide. Sharing a soft drink might result in a death due to influenza or pneumonia. Congenital anomalies would include those caused by maternal tobacco and alcohol use. The most dramatic decrease was in nonmotor vehicle accidents, and there were substantial decreases in all other causes as well, with the exception of homicide. Similar trends were seen for children in other age groups (data not shown). Are these improvements in child safety attributable to medicine safety caps, child safety car seats, and other ways our society regulates the lives of children? At least in part, yes. The National Highway Traffic Safety Administration estimated in 2006 that child safety seats reduce motor vehicle fatalities among infants by 71% and toddlers by 54%.
[Figure 4 (plot): rate per 100,000 population, 1968–1992, for motor vehicle accidents (MVA), other accidents (non-MVA), congenital anomalies, cancer, homicide, and pneumonia and influenza (P&I).]
Source: Singh, G. K. and S. M. Yu (1996). “US childhood mortality, 1950 through 1993: Trends and socioeconomic differentials.” Am J Public Health 86(4): 505–12. Reprinted with permission.
Figure 4. Mortality for children ages 1–4 due to selected major causes of death from 1968–1992
Maternal smoking during pregnancy, which has decreased nationwide from a prevalence of 25% in 1985 to 11.4% in 2002 according to data from the Centers for Disease Control and Prevention, results in babies that have 30% higher odds of being born prematurely and are 1.4 to 3.0 times as likely to die of sudden infant death syndrome. The Consumer Product Safety Commission credits child-resistant medicine caps with saving the lives of 700 children since the early 1970s, when these products became required. Congratulations are in order, not to those of us who survived the dangers of our youth, but to those parents then and now who are taking advantage of the many opportunities to make our children safer. Some of the improvements in childhood safety must also be due to factors such as changes in medical care and traffic safety. Perhaps the challenges of estimating and conveying the overall improvement in childhood survival that can be attributed to child-safety regulations are too great. Perhaps we will not succeed in communicating the comprehensive improvement in health outcomes to a person who ceases to smoke cigarettes. At the very least, we must continue to improve our understanding of biases in risk perception and our means of communicating risk, and we must ensure that those who did not survive are not lost in the discussion.
Further Reading

Risk communication and perception
Ancker, J.S., Senathirajah, Y., Kukafka, R., and Starren, J.B. (2006) “Design Features of Graphs in Health Risk Communication: A Systematic Review.” J Am Med Inform Assoc, 13(6):608–18.
Klein, W.M.P. (2002) “Comparative Risk Estimates Relative to the Average Peer Predict Behavioral Intentions and Concern About Absolute Risk.” Risk, Decision and Policy, 7:193–202.
Lipkus, I.M., McBride, C.M., Pollak, K.I., Lyna, P., and Bepler, G. (2004) “Interpretation of Genetic Risk Feedback Among African American Smokers with Low Socioeconomic Status.” Health Psychology, 23(2):178–188.
Lipkus, I.M. (2007) “Numeric, Verbal, and Visual Formats of Conveying Health Risks: Suggested Best Practices and Future Recommendations.” Med Decis Making, 27(5):696–713.
Lipkus, I.M. and Prokhorov, A.V. (2007) “The Effects of Providing Lung Age and Respiratory Symptoms Feedback on Community College Smokers’ Perceived Smoking-Related Health Risks, Worries and Desire to Quit.” Addict Behav, 32(3):516–32.
Slovic, P., Monahan, J., and MacGregor, D.G. (2000) “Violence Risk Assessment and Risk Communication: The Effects of Using Actual Cases, Providing Instruction, and Employing Probability versus Frequency Formats.” Law Hum Behav, 24(3):271–96.
Slovic, P., Finucane, M.L., Peters, E., and MacGregor, D.G. (2004) “Risk as Analysis and Risk as Feelings: Some Thoughts About Affect, Reason, Risk, and Rationality.” Risk Anal, 24(2):311–22.

Tamoxifen risk
Gail, M.H., Costantino, J.P., Bryant, J., Croyle, R., Freedman, L., Helzlsouer, K., and Vogel, V. (1999) “Weighing the Risks and Benefits of Tamoxifen Treatment for Preventing Breast Cancer.” J Natl Cancer Inst, 91(21):1829–46.
Tobacco use and effects
Surgeon General’s Report, 2004: www.cdc.gov/tobacco/data_statistics/sgr/sgr_2004/00_pdfs/SGR2004_Whatitmeanstoyou.pdf
Maternal smoking in 1985: www.cdc.gov/nchs/images/hp2000/childhlt/8mch.gif
Maternal smoking in 2002: www.cdc.gov/mmwr/preview/mmwrhtml/mm5339a1.htm
Outcomes for smoking during pregnancy: www.cdc.gov/reproductivehealth/TobaccoUsePregnancy

Childhood mortality
Singh, G.K. and Yu, S.M. (1996) “U.S. Childhood Mortality, 1950 Through 1993: Trends and Socioeconomic Differentials.” Am J Public Health, 86(4):505–12.
Consumer Products Safety Commission report regarding medicine child-resistant packaging: www.cpsc.gov/cpscpub/pubs/5019.html
“Here’s to Your Health” prints columns about medical and health-related topics. Please contact Mark Glickman (mg@bu.edu) or Cindy Christiansen (cindylc@bu.edu) if you are interested in submitting an article.
Visual Revelations Howard Wainer, Column Editor
Looking at Blood Sugar Howard Wainer and Paul Velleman
It is estimated that there are 20.8 million children and adults in the United States, or 7% of the population, with diabetes. Of these, 14.6 million have been diagnosed and 6.2 million (or nearly one-third) are unaware they have the disease (www.diabetes.org/about-diabetes.jsp). The large proportion of undiagnosed diabetics adds uncertainty to the estimates of the total number, but through the use of multiple sources and statistical adjustment, we can obtain a rough view (see Figure 1). Using adjusted estimates of the prevalence of this disorder, we can see that, throughout the past 70 years, the growth of diabetes in the United States has been exponential. There have been many causes proposed for this explosion, including increased obesity, decreased physical activity, a shift toward more processed foods, and an aging population. Though there are many other risk factors, we would like to focus on making the treatment of the disease more efficacious through improved communication of the consequences of treatment choices and behaviors. Left untreated, diabetes has many serious consequences, including—but not restricted to—liver and kidney damage and circulatory problems leading to blindness, neuropathy, and loss of limbs. Effective treatment requires close cooperation between the physician and the patient. The physician can prescribe
Figure 1. Estimates of the prevalence of diabetes in the United States shown as the number per 1,000 in the population. The fitted curve is an exponential function.
drugs and changes in diet and exercise regimes, but only the patient can implement those changes. Often, the patient is given both general and specific guidelines, but the effects of following or not following those guidelines can only be seen after implementation. The record cannot be described as an unqualified success. Indeed, as Jinan Saaddine and coauthors describe in “Improvements in Diabetes Processes of Care and Intermediate Outcomes: United States, 1988–2002,” “Despite the many improvements, two in five people with diabetes still have poor cholesterol control, one in three have poor blood pressure control, one in five have poor glucose control.” Why? Surely the consequences of not controlling cholesterol, glucose, and blood pressure are both dire and clearly understood. Just as surely, the answer to the causal “Why?” has many parts. We would like to consider just one of them here: The communication of current status to the patient is not as effective as it could be.
What Is

Once a patient is diagnosed with diabetes, a number of actions take place. Among them are extensive counseling about the importance of keeping blood
sugar under tight control and the role diet and exercise play in so doing. To aid the patient in controlling blood sugar, s/he is given a small device that (i) reads a small blood sample and indicates its glucose content within seconds; (ii) records the reading in (i), along with the time and date it was taken; (iii) allows the inputting of various classifying characteristics (e.g., before a meal, after a meal, or none); (iv) calculates mean blood sugar levels for the past seven days, 14 days, and 30 days; and (v) shows these averages for each of the subclassifications indicated in (iii). All in all, this small instrument is a remarkable device. As remarkable as it is, however, it could be more useful still with a few minor modifications, namely modifying how it summarizes and displays the data.
What Might Be

Before we reconsider how to compute and display summaries of blood sugar data, let us first consider the key questions these data may help answer. By our estimation, there are three:
• How am I doing overall? This is a long-term question that focuses on the blood sugar levels over extended periods of time to evaluate the efficacy of the various strategies being employed to control it.
• How large are the variations in blood sugar level that take place over the course of the day due to normal daily events such as eating meals, exercising, and sleeping?
• Are there any unusual excursions in blood sugar? How large are they? What causes them?
Each question has obvious medical implications. The first is the overall evaluation of the therapy. If the therapy has the character desired, whatever is being done is working. But, if all we find is that the mean blood sugar is too high, we have no immediate clues to aid in remediation. The second question is primarily intermediate in character. We must know how much variation is usual before we can answer the third question: What is unusual? Management of blood sugar means more than keeping its typical amount in a particular range. It also is important to keep its variation within prespecified bounds (i.e., between 80 and 140 mg/dl). The answer to this question can be both descriptive and prescriptive. If blood sugar shows unusually large variation before and after dinner, then one should consider eating less at dinner. Such a result is important, but unavailable solely from the answer to question one. The third question is of least importance to the physician, but of potentially greatest import to the patient, for large excursions from what is normal will invariably have an associated story (e.g., a large cookie, an ice cream cone,
Table 1 — A Sequence of Seven Blood Sugar Measurements (mg/dl) and a Running Mean of Three Smooth of These Measurements

Blood Sugar | Smooth
104         | •
117         | 110.7
111         | 114.3
115         | 116.3
123         | 116.7
112         | 117.3
117         | 101.7
Figure 3. The residuals of the data points from the smooth curve shown in Figure 2.
too much beer, or too many pretzels). Because each excursion has a specific story, it also suggests an obvious remediation: Don’t do it any more. It is the immediate, clear, and specific feedback from the answer to question (iii) that has the greatest likelihood of aiding the patient in shaping his/her behavior and thus better controlling blood sugar. In addition, as we will see, these are not independent questions. By estimating the pattern of daily variation, we should be able to obtain a more reliable estimate of long-term trends. Even an isolated large excursion can have a significant effect on both the estimated daily variation and the long-term trend when these are represented by averages. We propose simple methods to address both of these problems with the result of providing better information to both doctor and patient. What is the best way to answer these three questions and convey the answers to the patient? The current approach is a listing of numbers—either the actual
Figure 2. Glucose tolerance test. Thirty-seven data points taken over 10 days. The smooth connecting function is a running mean every three points.
readings or three different averages (e.g., 7-, 14- and 30-day averages). Neither a list of numbers nor the use of averages is the best we can do, and, in combination, they are worse still. Our approach to summarization uses resistant statistical methods, and our approach to presentation is graphical. In the latter, we join with the brothers Farquhar in their belief, written in Economic and Industrial Delusions, in the following: The graphical method has considerable superiority for the exposition of statistical facts over the tabular. A heavy bank of figures is grievously wearisome to the eye, and the popular mind is as incapable of drawing any useful lessons from it as of extracting sunbeams from cucumbers. What is a resistant method? In short, it is a method not affected by a few unusual points. A median is resistant; a mean is not.
Why resistant?

The current method of summarizing blood sugar results is taking averages. An average is fine for some things, but because it uses all the information equally, it has some weaknesses. The idea of a summary is that it essentially says “This is typical of the data, most of them lie nearby.” The arithmetic mean satisfies this heuristic when the data follow a bell-shaped curve. But, it does not follow it when there are only a few very unusual points. For example, consider the following blood sugar readings:
90, 93, 95, 102, 210
The mean is 118. Thus, the average is not in the middle of the data, nor is it near any of the readings. In fact, if we subtract the mean from each reading, we get a vector of residuals:
-28, -25, -23, -16, 92
Using the mean as the summary statistic has had two unfortunate results. First, it located the middle where there were no scores nearby. Second, it distributed the mislocation among all the observations. A better characterization would have told us that most of the scores were near 95, but one was very far away at 210. That latter piece is of critical importance, for it is by having that one outlying observation pointed out that the patient can then look for a cause. If a probable cause can be identified, the patient can modify his or her behavior in hopes of avoiding similar excursions in the future. A much more resistant alternative to the mean is the median, 95. The
Smoothing

When we have data arranged over time, it is often helpful to find a smooth trace through an otherwise scattered or choppy plot of the data. A smooth trace can show the overall pattern, free of the ‘noise’ of point-to-point variation. And—often more important—it can provide a central summary from which to notice isolated exceptions to the overall pattern. A smooth trace through a sequence of values serves much the same function as a summary of the middle of a single batch of values: It provides a central summary and facilitates identifying exceptions. And, for reasons that will be clear presently, smooth traces can suffer from the same sensitivity to isolated excursions as we saw with the mean. One common way to find a smooth trace is to take local averages of values in the sequence. The result is called a running mean, or sometimes a moving average. For example, Table 1 shows a sequence of blood sugar measurements and a running mean of three of these measurements. The first smooth value, 110.7, is found as the mean of 104, 117, and 111. But running means suffer from the same sensitivity to outlying values as we saw for means. Figure 2 shows a sequence of blood sugar measurements with a single outlying value (due to a glucose tolerance test) and the smooth found by running means of three.
residuals are then -5, -2, 0, 7, 115, leaving no doubt which observation is the extreme outlier. The median and mean are the extremes of a continuum of summaries of the middle. Although the mean averages all the values, the median classifies them only as “big” and “small” to pick out the middle one. A measure intermediate in its calculation between the median and the mean is the trimmed mean, which symmetrically lops off some of the largest and smallest values and averages those middle ones that remain. By adjusting the percentage of values trimmed off, we can ‘tune’ a trimmed mean to be like the median (by trimming just less than 50%) or like the mean (by trimming very few). In our toy example, a 20% trimmed mean would trim off the largest (210) and smallest (90) values and average the three remaining ones to obtain 96.67.
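The toy calculation is easy to check. The snippet below (ours, not part of the column) computes the three summaries for the same five readings.

```python
# Sketch: mean, median, and 20% trimmed mean of the toy readings above.
import statistics

readings = [90, 93, 95, 102, 210]

mean = statistics.fmean(readings)                    # 118.0
median = statistics.median(readings)                 # 95
trimmed = statistics.fmean(sorted(readings)[1:-1])   # drop low and high: 96.67

print([round(x - mean, 1) for x in readings])        # the outlier is smeared over all residuals
print([x - median for x in readings])                # the outlier stands out: ..., 115
print(round(trimmed, 2))
```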
Figure 4. Fifteen years of fasting blood sugar tests showing both a pre-diabetic condition indicated by steady increases and the obvious onset of diabetes in 2006
The data points are measured blood sugar levels. The connecting curve represents the smooth values. Note the contamination of three smooth values by a single outlying value. If we used a larger smoothing window, say a running mean every five points instead of three, the excursion caused by the one outlying point would be smaller, but more points would be affected. Because each smooth value averages three data values, the extreme value contaminates three of the values in the smooth trace. What may be of greater concern is that the residuals—the difference between the data and the smooth—are also contaminated. It is the residuals a patient would look at to be alerted to a deviation from the overall trend that might require attention. But, as Figure 3 shows, this approach could raise false alarms. When looking at the residuals from a running average of successive sets of three blood sugar levels, if one original value is unusually high, the residuals show the spike, but also suggest two blood sugar levels that only appear to be low because the spike has contaminated their smooth values. The solution is to use a resistant smoothing method. Smoothers can be based on running medians or running trimmed means. Resistant smoothers often give a less smooth trace, so special methods (beyond the scope of this article) can be used to improve the smoothness of the trace.
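The contrast between the two smoothers is easy to see on a short series. The sketch below (ours, with made-up values standing in for the tolerance-test spike) compares a running mean of three with a running median of three.

```python
# Sketch: running mean of three versus running median of three around a spike.
import statistics

def running(values, window, summary):
    half = window // 2
    return [summary(values[i - half:i + half + 1])
            for i in range(half, len(values) - half)]

series = [104, 117, 111, 115, 240, 112, 117, 109]   # one outlying value

print(running(series, 3, statistics.fmean))    # the spike leaks into three smoothed values
print(running(series, 3, statistics.median))   # the spike distorts none of them
```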
An Example

The subject is a 63-year-old male in generally good health. He is 6’4” and weighed 230 pounds at the time of diagnosis. He exercises robustly five days a week, and has done so for his entire adult life. Shown in Figure 4 is a plot of his fasting blood sugar taken over the past 15 years. From 1992 until 2005, it was under 140 mg/dl, but, even in 2005, was trending upward portending a prediabetic condition. In 2006, the annual result increased profoundly to 183 and the patient was told to lose five pounds and come back in six months. Six months later, the patient returned 10 pounds lighter with fasting blood sugar of 249. He was diagnosed at that point as a type 2 diabetic and the following corrective actions were taken:
• A one-gram-a-day dosage of Metformin was prescribed and then titrated up to two grams a day over a two-week period.
• The patient limited his food intake to 2,700 calories a day—divided into 40% carbohydrates, 30% fats, and 30% proteins—and the caloric intake was spread more evenly across the day.
• Exercise was increased from an hour to 90 minutes a day.
• He was given a blood sugar sensing meter and began to check glucose levels four to five times a day.
[Figure 6 (plot): hourly blood sugar effects (mg/dl) versus time of day, annotated at Breakfast, Noon, Lunch, Exercise, Snack, and Dinner.]
Figure 5. Taken over two months, 215 blood sugar readings beginning at diagnosis. Superimposed over the readings is a curve indicating the outcome of the treatments being used to control blood sugar. Blood sugar levels declined after treatment began and seem near asymptote after the first month.
Figure 6. The blood sugar readings collapsed over days and summarized to show the typical daily variation. The plot is annotated to explain the variations. Data from March 8 until April 23, 2007.
The readings, taken using the device described earlier in this article over a period of two months, are displayed in three plots. The first, shown in Figure 5, reflects all the readings shown sequentially with a resistantly smoothed curve superimposed over them. It shows a steep decline of typical blood sugar as treatment began with a leveling off after the first month. We also note that the variation around the plotted curve continues to diminish beyond the initial month.
The decline in blood sugar has four plausible causes: medication, change in diet, change in exercise, and the patient’s loss of 25 more pounds. Because all these possible causes took place simultaneously, we cannot partition the improvement in blood sugar levels among these four changes. While we can see the variation across days, and roughly within each day, the within-day variation is more obvious if we make a plot that has the 24 hours of the day on the horizontal axis and the
hourly blood sugar effects on the vertical. Blood sugar effects, in this instance, are what result after we subtract the resistant curve in Figure 5 from the actual blood sugar levels—thus what we plot has been adjusted for the long-term trend. We then aggregate across days and fit a resistant curve. Such a plot is shown as Figure 6. The curve shown in Figure 6 indicates clearly that the typical range of variation over the course of a day is about 40 mg/dl. We see obvious spikes after each meal and a big drop when blood sugar is measured after noon-time exercise. The third descriptive plot is of the residuals. This is a plot of blood sugar levels after removing the long-term trend effects shown in Figure 5 and the daily effects shown in Figure 6. What remains are the unusual changes in blood sugar not accounted for by those other two effects. The residual plot is shown as Figure 7. One can easily see how it is the most immediately useful diagnostic plot for the patient. Each large residual should have a story associated with it. We have indicated just a few. We note that most large residuals occurred early in treatment before the patient’s blood sugar had stabilized. The unusually low readings were invariably due to exercise. The two large positive residuals that occurred more than two weeks after diagnosis were due to specific eating events. This plot makes it clear that these are to be avoided in the future.
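In code, the three-way split amounts to subtracting a smoothed trend, then a typical time-of-day effect, and inspecting what is left. The sketch below (ours) uses a running median as a stand-in for the resistant smoother; the record format is hypothetical.

```python
# Sketch: readings = long-term trend + time-of-day effect + residual.
import statistics
from collections import defaultdict

def running_median(values, window=5):
    half = window // 2
    padded = values[:half][::-1] + values + values[-half:][::-1]   # reflect the ends
    return [statistics.median(padded[i:i + window]) for i in range(len(values))]

def decompose(records):
    """records: list of (day, hour, reading) in time order."""
    readings = [reading for _, _, reading in records]
    trend = running_median(readings)
    detrended = [r - t for r, t in zip(readings, trend)]

    by_hour = defaultdict(list)                       # typical effect for each hour of day
    for (_, hour, _), d in zip(records, detrended):
        by_hour[hour].append(d)
    hour_effect = {h: statistics.median(v) for h, v in by_hour.items()}

    residuals = [d - hour_effect[hour] for (_, hour, _), d in zip(records, detrended)]
    return trend, hour_effect, residuals              # large residuals deserve a story
```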
How Were the Data Summarized?

The underlying ideas behind robust smoothing are not strictly based on some form of mathematical optimization. They are meant to be far more rough-and-ready than that, although the sweetness of elegance lingers on. Hence, many alternative approaches will serve well. The one that we have used here was developed by John Tukey in a preliminary edition of his now classic Exploratory Data Analysis and dubbed by him “53h twice.” This odd title is completely descriptive. The original ordered data are smoothed by taking running medians every five (the 5 in 53h). Next, these medians are smoothed by taking running medians every three (the 3 in 53h). Next, these twice-smoothed values are “hanned,” which means they too are smoothed three at a time with a weighted linear combination (smoothed value of
x(i) = [x(i-1) + 2x(i) + x(i+1)]/4). This yields an initial smoothing, which is then subtracted from the original values to produce residuals. Next, the residuals are smoothed by 53h again. The sum of the two smooths is the final smooth, and subtracting the final smooth from the original data yields the final residuals. Tukey was fond of describing the result as an equation in which data = smooth + rough. The smooth is the result of 53h twice, and the rough is the vector of residuals. Less arcane methods may work just as well, but for reasons of trust (borne of prior use) and sentiment, we chose this one.
This method yielded the smooth shown in Figure 5. The residuals from the first smooth were then ordered by time of day, collapsing over days, and smoothed again using 53h twice; the resulting smooth is shown in Figure 6. Residuals from this second smooth were then ordered by date and plotted as Figure 7. Large residuals, in either direction, were interpreted.
Figure 7. A plot of the residuals in blood sugar readings (in mg/dl, by day) after subtracting the long-term trend and daily variation. Some unusual points, such as an Indian restaurant meal, a pretzel, and extra exercise, are annotated to indicate their probable causes.
Last, it seems sensible to discuss implementation of this approach. The current blood meters are wondrous devices that have in their innards the hardware necessary to implement our methods. They have plenty of storage, and even if they didn’t, adding more would be very cheap. Their screens can show statistical graphs as easily as they now show letters and other icons. To boost their capabilities in the ways we suggest, they want only some programming.
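To give a sense of how little programming that would be, here is a minimal Python sketch of "53h twice," assuming a deliberately simple treatment of the end values (Tukey's end-value rules are omitted). The function names and the toy numbers are ours, not the authors' code or data.

```python
from statistics import median

def running_median(x, k):
    """Running medians of span k (k odd); end values are left unchanged."""
    h = k // 2
    y = list(x)
    for i in range(h, len(x) - h):
        y[i] = median(x[i - h:i + h + 1])
    return y

def hann(x):
    """Hanning: y[i] = (x[i-1] + 2*x[i] + x[i+1]) / 4; end values unchanged."""
    y = list(x)
    for i in range(1, len(x) - 1):
        y[i] = (x[i - 1] + 2 * x[i] + x[i + 1]) / 4
    return y

def smooth_53h(x):
    """One pass of 53h: medians of 5, then medians of 3, then hanning."""
    return hann(running_median(running_median(x, 5), 3))

def smooth_53h_twice(x):
    """'53h twice': smooth, smooth the rough, and add the two smooths."""
    first = smooth_53h(x)
    rough = [xi - fi for xi, fi in zip(x, first)]
    return [fi + si for fi, si in zip(first, smooth_53h(rough))]

# data = smooth + rough
readings = [230, 245, 220, 210, 190, 205, 180, 175, 185, 160, 170, 155]  # toy values, not the patient's data
smooth = smooth_53h_twice(readings)
rough = [r - s for r, s in zip(readings, smooth)]
```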
Conclusions
Diabetes is a dangerous disease that is spreading quickly. To be fully effective, its treatment requires the full participation of the patient. Moreover, it typically requires the patient to lose weight and exercise regularly, two things that are difficult for most people to do. It is our contention that providing immediate and easily understood feedback on how the patient’s diet and exercise affect blood sugar will help to reinforce proper behavior. We suggest that the use of average blood sugar has flaws that are easily ameliorated through the use of resistant/robust methods such as the one we have described here. In addition, by providing the results in graphic form, the feedback is more vivid. Note that the word “graphic” in ordinary speech means “literal lifelikeness,” but when used to describe an iconic visual display of data, its meaning is almost exactly the opposite. We also contend that by partitioning the blood sugar results into three pieces, we highlight information that both clinician and patient need. Specifically, the overall trend curve illustrated in Figure 5 provides a clear and accurate summary of the efficacy of the treatment vis-à-vis blood sugar. At the same time, the residual plot illustrated in Figure 7 allows the patient to see immediately the extent to which a specific action on his or her part has affected blood sugar, implicitly suggesting changes in future behavior. The depiction of residuals from a resistant smooth makes explicit the importance of accurately recording the details of eating and exercise, for only through careful records can the stories of the residuals be told and acted upon.
Further Reading
Farquhar, A.B. and Farquhar, H. (1891) Economic and Industrial Delusions. New York: G. P. Putnam’s Sons.
Saaddine, B., Cadwell, B., Gregg, E., Engelglau, M., Vinicor, F., Imperatore, G., and Narayan, V. (2006) “Improvements in Diabetes Processes of Care and Intermediate Outcomes: United States, 1988–2002.” Annals of Internal Medicine, 465. Available for download at www.cdc.gov/diabetes/news/docs/diabetescare.htm.
Tukey, J.W. (1977) Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Data sources for Figure 1
Early release of 2006 data in Chapter 14 of the National Health Interview Survey, prepared under the auspices of the Centers for Disease Control and Prevention. Available at www.cdc.gov/nchs/data/nhis/earlyrelease/earlyrelease200703.pdf.
Kenny, S.J., Aubert, R.E., and Geiss, L.S. (1994) “Prevalence and Incidence of Non-Insulin-Dependent Diabetes.” Chapter 4 in Diabetes in America (2nd ed.). http://diabetes.niddk.nih.gov/dm/pubs/america.
LaPorte, R.E., Matsushima, M., and Chang, Y-F. (1994) “Prevalence and Incidence of Insulin-Dependent Diabetes.” Chapter 3 in Diabetes in America (2nd ed.). http://diabetes.niddk.nih.gov/dm/pubs/america.
Column Editor: Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; hwainer@nbme.org
CHANCE Graphic Display Contest: Burtin’s Antibiotic Data
The year 2008 marks the 100th anniversary of the birth of Will Burtin (1908–1972). Burtin was an early developer of what has come to be called scientific visualization. In the post-World War II world, antibiotics were called “wonder drugs,” for they provided quick and easy cures for what had previously been intractable diseases. Data were being gathered to aid in learning which drug worked best for which bacterial infection. Being able to see the structure of drug performance from outcome data was an enormous aid for practitioners and scientists alike. In the fall of 1951, Burtin published a graph showing the performance of the three most popular antibiotics on 16 bacteria. The data used in his display are shown in Table 1. The entries of the table are the minimum inhibitory concentration (MIC), a measure of the effectiveness of the antibiotic. The MIC represents the concentration of antibiotic required to prevent growth in vitro. The covariate “Gram staining” describes the reaction of the bacteria to Gram staining: Gram-positive bacteria are stained dark blue or violet, whereas Gram-negative bacteria do not react that way.
Contest: Submit a graphical illustration of these data and an accompanying written description of the graph. The graphs are due January 1, 2009. The three best entries will be published in CHANCE, and the authors will receive a complimentary one-year subscription (extension) to CHANCE. If multiple authors submit a winning entry, one author will receive the subscription. Winners will be announced in Volume 22, Issue 2. Entries will be judged by representatives of CHANCE’s editorial board based on clarity, insightfulness, succinctness, originality, and aesthetic appeal. Please email your entries to Howard Wainer (hwainer@NBME.org), preferably as a PDF file, by the due date given above. Include “CHANCE graphics contest submission” in the subject line.
Table 1—Burtin’s Data

Bacteria                            Penicillin   Streptomycin   Neomycin   Gram Staining
Aerobacter aerogenes                   870            1            1.6       negative
Brucella abortus                         1            2            0.02      negative
Brucella anthracis                       0.001        0.01         0.007     positive
Diplococcus pneumoniae                   0.005       11           10         positive
Escherichia coli                       100            0.4          0.1       negative
Klebsiella pneumoniae                  850            1.2          1         negative
Mycobacterium tuberculosis             800            5            2         negative
Proteus vulgaris                         3            0.1          0.1       negative
Pseudomonas aeruginosa                 850            2            0.4       negative
Salmonella (Eberthella) typhosa          1            0.4          0.008     negative
Salmonella schottmuelleri               10            0.8          0.09      negative
Staphylococcus albus                     0.007        0.1          0.001     positive
Staphylococcus aureus                    0.03         0.03         0.001     positive
Streptococcus fecalis                    1            1            0.1       positive
Streptococcus hemolyticus                0.001       14           10         positive
Streptococcus viridans                   0.005       10           40         positive

(The Penicillin, Streptomycin, and Neomycin columns give each antibiotic’s MIC.)
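For contestants who would rather start from a machine-readable copy of Table 1, here is the same data transcribed as a small Python dictionary. The name burtin and the tuple layout are our own choices, not part of the contest materials.

```python
# MIC values from Table 1: (penicillin, streptomycin, neomycin, Gram staining)
burtin = {
    "Aerobacter aerogenes":            (870,   1,    1.6,   "negative"),
    "Brucella abortus":                (1,     2,    0.02,  "negative"),
    "Brucella anthracis":              (0.001, 0.01, 0.007, "positive"),
    "Diplococcus pneumoniae":          (0.005, 11,   10,    "positive"),
    "Escherichia coli":                (100,   0.4,  0.1,   "negative"),
    "Klebsiella pneumoniae":           (850,   1.2,  1,     "negative"),
    "Mycobacterium tuberculosis":      (800,   5,    2,     "negative"),
    "Proteus vulgaris":                (3,     0.1,  0.1,   "negative"),
    "Pseudomonas aeruginosa":          (850,   2,    0.4,   "negative"),
    "Salmonella (Eberthella) typhosa": (1,     0.4,  0.008, "negative"),
    "Salmonella schottmuelleri":       (10,    0.8,  0.09,  "negative"),
    "Staphylococcus albus":            (0.007, 0.1,  0.001, "positive"),
    "Staphylococcus aureus":           (0.03,  0.03, 0.001, "positive"),
    "Streptococcus fecalis":           (1,     1,    0.1,   "positive"),
    "Streptococcus hemolyticus":       (0.001, 14,   10,    "positive"),
    "Streptococcus viridans":          (0.005, 10,   40,    "positive"),
}
```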
Goodness of Wit Test Jonathan Berkowitz, Column Editor
This issue’s puzzle is a “variety cryptic,” so named because the puzzle has a gimmick or theme. Not only do you need to solve the cryptic clues, but you have to discover the thematic mystery within. Two of the leading constructors of variety cryptics are Emily Cox and Henry Rathvon, whose creations appear in The Atlantic Monthly. I am a big fan of their work and rate their puzzles among the most satisfying to solve. My puzzle is based on a theme of theirs. Variety cryptics generally look different from the usual black-square grids in most cryptics. Bar cryptics use bars to indicate the beginnings and ends of words, as well as unchecked squares (i.e., letters with no crossing word). In black-square grids, crossing words give only every second letter on average. In bar grids, there is much more crossing: about two-thirds of the letters in every answer are crossed by letters of intersecting words. This can be a big aid in solving. As well, the shape of a bar grid does not have to be square or rectangular. It can be a circle, heart, or anything else. However, bar cryptics are more difficult to construct because there are more constraints on the words that can fit into the grid. When constraints due to the gimmick are added, construction is even harder.
This puzzle is harder than recent ones and will provide, I hope, a good challenge for seasoned solvers. But that shouldn’t frighten novice solvers; you can crack the clues too. (A guide to solving cryptic clues appeared in the last issue of CHANCE.) By the way, you are encouraged to use solving aids such as electronic dictionaries, Google, anagram-finders, etc. It is not considered cheating! Crosswords are meant to teach and entertain. You may even learn a new word or two, or a new definition of a word you already know. Have fun and good luck. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by the column editor by January 15, 2009. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC Canada V6N 3S2, or send him a list of the answers by email at [email protected], to arrive by January 15, 2009. Please note that winners of the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.
Winners of Statistical Cryptic #22, by Thomas Jabine
Jacqueline M. Gough earned her bachelor’s and master’s degrees in mathematics at the University of Waterloo, where she learned to do cryptic crossword puzzles (including during convocation). She has worked as a statistician in the pharmaceutical industry for several years and is currently a data scientist for the oncology team at Eli Lilly and Company. What little spare time she has left is spent with her husband (a high-school math teacher) and two children (yes, they are good at math).
Mark Joseph works a meaningless office job, but is in his last year of school at California State University at Northridge for his master’s degree in mathematics, focusing on probability and statistics. When he graduates, he plans to teach at the community college level. Joseph has been married to his wife, Charlene, for 31 years and has two daughters, Nicky and Lisa. In his free time, Joseph enjoys reading, studying, playing baseball, and doing cryptic crosswords. His main social concern is the environment.
Solution to Statistical Cryptic Puzzle No. 22
This puzzle appeared in CHANCE, Vol. 21, No. 2, Spring 2008, pp. 63–64.
Across: 1. MEASUREMENT [rebus with container: mea(sure + men)t] 7. TOO [homophone: two] 9. WEALTHIER [anagram: either law] 10. AMASS [rebus: a + mass] 11. COHORTS [rebus: coho + RTs] 12. RIPENER [hidden word: gRIPE NERvously] 13. EASED [hidden word: oversEAS EDward] 15. ORCHESTRA [hidden word: blowtORCHES TRAp] 17. FREEMASON [rebus: free + mason] 19. PEPSI [rebus with reversal: pep + si(is)] 20. CODDLES [double definition: cooks, babies] 22. CARRION [homophone: carry-on] 24. EVOKE [container: ev(OK)e] 25. ABOMINATE [rebus with container: a + bo(min)at + e] 27. TEE [homophone: tea] 28. REFUSAL RATE [anagram: restful area].
Down: 1. MOW [initial letters: man of war] 2. AWASH [rebus: a + wash] 3. UTTERED [hidden word: aflUTTER EDgar] 4. EMISSIONS [rebus: E-missions] 5. ERROR [rebus: terror – t] 6. TRAMPLE [reverse hidden word: hELP MARTin] 7. TRAIN STOP [rebus with reversal: train + stop(pots)] 8. OBSERVATION [anagram: Borneo vista] 11. COEFFICIENT [anagram: ionic effect] 14. SPEED ZONE [rebus: speed + zone] 16. CONSCIOUS [rebus: cons + C + IOUs] 18. MALHEUR [hidden word: norMAL HEURistics] 19. PARTIAL [double definition: biased, type of derivative] 21. STAFF [double definition: aides, fill vacancies] 23. IVANA [rebus: Ivan + a] 26. EVE [container: r(eve)rie].
Goodness of Wit Test #2 (with thanks to Emily Cox and Henry Rathvon) Every clue answer discards one letter before entry into the diagram. These 34 thrown-away letters are not wasted, however. Taken in order, with Acrosses preceding Downs, they combine to spell out a complete cryptic clue for the otherwise unclued word in the center of the diagram at 20 Across (which is entered intact). All discarded letters (except for the one at 1 Across) occur at squares with crossing words.
ACROSS
1 For example, follows every sea mammal swimming west at a somewhat lively pace (10)
10 Reversing tide harbors a tribute song (5)
11 Policeman’s report back about nothing (7)
12 Bed made from pad and hair (8)
13 Starts on my eastern Cape Cod annual vacation heaven (5)
14 So intimate, complex way to get a statistic (10)
16 NIOSH reorganized atomic particles (5, 2 words, abbrev.)
17 Silly sport on 16 Across (7)
20 See instructions
21 Hotel convenience ordered anytime (7)
23 Defeat or attach by the sound of it (5)
26 Still normal after beer adjusted the Spanish kind of curve (10)
28 Coordinate replacing start of episodes of 60 Minutes – not mine (5)
29 Morsel of lean meat makes a small noise? (8)
31 Derivative coarse and high (7)
32 Starts to key input on South Korean information booth (5)
33 Fools beginning to send older boys to student’s test evaluation (10)

DOWN
1 One-way axis indifference (6)
2 Fluke pair in the middle of farm has left (8)
3 Flat in Oxford houses Cuban-American immigrant (6)
4 Attendee Gore confused (4)
5 Formal documents about small arm pieces (6)
6 Sailboats decorate with plastic wrap top to bottom (9)
7 Place in routine floral emblem (7)
8 Strange optic in the work quoted (5, 2 words, abbrev.)
9 Most average? Most contemptible (7)
15 Lotions more frequently found in steamships (9)
18 Obvious single harmonic (8)
19 Scanner label misread brocade (7, 2 words)
20 Second new work unit between two clubs’ combined operation (7)
22 Recognize fish in Scottish river (6)
24 Short introduction from runner rising to pressure (6)
25 Buy shares in sleeveless clothing item (6)
27 Northern Europe has nerve prefix (5)
30 Capacities for memory performances (4)