About the Authors

Jim Albert is a professor at Bowling Green State University. His research interests are in Bayesian modeling, statistical education, and the application of statistics in sports (especially baseball). He is a former editor of The American Statistician, enjoys playing tennis, and is an avid baseball fan. His dreams recently came true when the Philadelphia Phillies won the 2008 World Series.
Brian Schmotzer is an associate faculty member at Emory University’s Rollins School of Public Health in Atlanta, Georgia. He teaches introductory statistics and statistical computing to graduate students in the school. He also works on a variety of medical research studies and public health studies, helping to design them from a statistical point of view as well as analyzing the data results. In his spare time he tackles interesting problems in applied statistics, especially in sports such as basketball and baseball.
Don M. Chance, PhD, CFA, is the William H. Wright, Jr. Endowed Chair for Financial Services at Louisiana State University. His teaching, research, and consulting are primarily in the area of financial markets and instruments.

Todd Schwartz is a research assistant professor of Biostatistics and Nursing at the University of North Carolina at Chapel Hill, where he maintains collaborative relationships in the Schools of Public Health, Nursing, Medicine, and Pharmacy. He received his DrPH in 2004 from the Department of Biostatistics at UNC-CH.

Patrick Kilgo is an associate faculty member at Emory University's Rollins School of Public Health in Atlanta, Georgia. In addition to teaching and research activities, he has a particular interest in the game of baseball and has presented his findings at national meetings. He is an active member of the Society for American Baseball Research (SABR) and enjoys a variety of baseball-related topics.

Jeff Switchenko is currently a third-year PhD student at Emory University in Atlanta, Georgia. He teaches introductory SAS programming and is engaged in research ranging from disease mapping and nutrition studies to baseball statistics. While growing up in the Boston area, he graduated from St. John's Prep in 2002 and later attended Bowdoin College in Brunswick, Maine, where he earned his B.A. in mathematics.
Herbert Pang is an assistant professor in the Department of Biostatistics & Bioinformatics at Duke University School of Medicine and a faculty statistician in the Cancer and Leukemia Group B (CALGB) Statistical Center. He earned his PhD in Biostatistics from Yale University in 2008.
Xiaofei Wang is an assistant professor in the Department of Biostatistics & Bioinformatics at Duke University School of Medicine and a faculty statistician in the Cancer and Leukemia Group B (CALGB) Statistical Center. He earned his PhD in Biostatistics from the University of North Carolina at Chapel Hill in 2003.
Editor's Letter
Mike Larsen, Executive Editor
Dear Readers,

The two themes of this issue of CHANCE are baseball and graphics. Mark Glickman's Here's to Your Health column, Jonathan Berkowitz's Goodness of Wit Test puzzle column, and a couple of letters to the editor complete the issue.

For the baseball lover, there are three articles. For statisticians in general, issues of defining measures of performance, selecting a comparison group, adjusting for confounders, and modeling also appear in these articles. Jim Albert compares the career trajectories of pitchers. The rate at which pitchers allow hits and walks, adjusted for the pool of pitchers and number of batters, is modeled using piecewise quadratic functions. Flexible, multilevel modeling is used to create individual trajectories that adjust for important factors and allow nuanced comparisons across pitchers.

Did drugs play a role in Roger Clemens' later career performance? Brian Schmotzer, Pat Kilgo, and Jeff Switchenko estimate the effect of performance-enhancing drugs on offensive production in baseball. The outcome measure is runs created per 27 outs. Information about drug use comes from the "Mitchell Report." To address sensitivity to the statistical model, several models are estimated and contrasted. Finally, Don Chance takes a look at Joe DiMaggio's 1941 56-game hitting streak. Adjusting for hit opportunities and making other assumptions, does DiMaggio have the highest probability of ever having such a streak?

The three winners of the Will Burtin graphics contest (announced in Volume 21, Issue 4) are Mark Nicolich, Dibyojyoti Haldar, and Brian Schmotzer. Their graphs and biographies begin on Page 43. Howard Wainer's Visual Revelations column presents several additional graphs and examines the lessons learned through the contest. We wish to thank the 64 entrants to the contest for their efforts. Based on the positive response to the contest, we plan to conduct a similar one in the near future.

In Mark Glickman's Here's to Your Health column, Xiaofei Wang, Herbert Pang, and Todd Schwartz study cancer biomarkers: What are they? How are they used for prediction and classification? What role can statisticians/biostatisticians play in design, modeling, and validation?
In his Goodness of Wit Test puzzle column, Jonathan Berkowitz gives us a bar-type cryptic puzzle with one additional solving requirement.

Last, but not least, two readers submitted letters. Susanne Aref comments on Stephanie Land's article (Volume 21, Number 4), "'We All Survived' and Other Failings of Risk Perception," and confirms that it is, indeed, safer to be born now than in the past. Ray Stefani discusses efforts to reproduce results from tables discussed in articles, also in Volume 21, Number 4, by Brian Clauser and Stephen Stigler. The details presented here could be of use to instructors and others working to understand and explain the tables.

In other news, CHANCE, working through the American Statistical Association (ASA), conducted a survey of lapsed subscribers. Approval for the survey was obtained from the ASA's Survey Review Committee. An email message with a link to a web survey was sent to individuals who had let their subscription to CHANCE expire in the last few years. The purpose of the survey was to learn something about why people stopped subscribing and what they liked or disliked about CHANCE. Forty-four former subscribers responded. According to them, the technical level is about right and the amount and quality of graphics are acceptable, although there appears to be some room for improvement. Topics of highest interest were current events, education, environment, and health/medical. Other topics of strong interest were clinical trials, economics, government, graphics, legal issues, and surveys. One respondent added the following comment: "I want to see interesting applications. The area doesn't matter."

In summary, we interpret the results of the survey to support the mission of CHANCE: provide accessible and interesting articles on diverse topics in which probability and statistics play critical roles. I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
Letters to the Editor

Dear Editor,

I enjoyed the article "'We All Survived' and Other Failings of Risk Perception" by Stephanie R. Land very much [CHANCE, Volume 21, Issue 4]. I vaguely recall seeing the email about all the childhood fun and freedom we oldies (I am from the '50s) had and simply dismissed it as absurd. It was nice to see the graphs in Figure 3 solidly rejecting the myth about how we ALL survived so well without any new-fangled restrictions on our lives.

However, there is a slight problem in the graph in Figure 4 on Page 55. In the graph in the original paper (G. K. Singh and S. M. Yu (1996), U.S. childhood mortality, 1950 through 1993: Trends and socioeconomic differentials, American Journal of Public Health 86(4):505–12), both NON-MVA and Homicide were labeled with a solid line, though the NON-MVA should have been dashed. In the copy in CHANCE, the label for Homicide got "dashed," leading one to look at the wrong curve.

Looking closely at the graph in Figure 4, the homicide rate shows a rather steady increase between 1968 and 1992, even though the other rates in Figure 4 decreased and the overall child death rates between 1930 and 2005 (Figure 3 a and b) decreased. That disturbed me. All other causes of child deaths improved except for the one of killing kids. There is an accompanying graph in Singh and Yu for years five through 14 to go with the one for years zero to four reproduced in Figure 4. That graph shows the same trends for the different causes of death. In that graph, there is also a suicide cause, which behaves much like the homicide cause at half the rate. Both homicide rates and the suicide rate in the two graphs in Singh and Yu were increasing between 1968 and 1992. So, not only did kids get killed more, they also became so depressed that they committed suicide at a higher and higher rate.

I wondered what happened after 1992. The data on the Maternal and Child Health Bureau web page, www.mchb.hrsa.gov/mchirc/chusa_04/pages/0436cm.htm, show that the three rates for homicide and suicide have all decreased from the 1992 levels. Adding the 2002 rates to the 1968–1992 rates shows the P&I (pneumonia and influenza) cause to have bottomed out, while all other rates keep decreasing. So it continues to be safer and safer to be born now, rather than in the fun, free, good-old days …

Susanne Aref
Aref Consulting Group LLC
Stephanie Land responds: I thank the reader for her interesting observations. Regarding the graph, the unfortunate aspect of the original image is that the legend does not distinguish between the dashed line (non-MVA) and solid (homicides).
Dear Editor,

In CHANCE Volume 21, Issue 4, the articles "War, Enmity, and Statistical Tables" by Brian Clauser and "Fisher and the 5% Level" by Stephen Stigler provided insight into the dysfunctional relationship between R. A. Fisher and Karl Pearson. Hidden within those articles was an equally interesting interaction between William Gosset and Karl Pearson. Gosset had the enviable position of brew master at Guinness Brewery, which objected to him publishing his statistical work under his own name, hence his pseudonym "Student."

I thought I would use the Excel TDIST command to duplicate the probabilities in Clauser's Figure 1, showing a fragment of Gosset's (Student's) table from "The Probable Error of a Mean," Biometrika, 6(1), published March 1908. Gosset's table is parameterized using z = x/s, where x is the difference from the mean and s is the standard deviation of n observations. I assumed that Gosset used the unbiased s², found by dividing by n–1 when estimating the variance of n independent observations. To find t, as is common practice today, I multiplied z by the square root of n. With n–1 degrees of freedom, TDIST did not duplicate Gosset's probabilities. For example, in Table 1 with z = .1 and n = 4, t would be the square root of 4 times .1, or .2, with n–1 = 3 degrees of freedom. The cumulative probability using TDIST is 0.5729, not 0.5633. Also, with z = .5 and n = 6, t would be the square root of 6 times .5, or 1.225, with n–1 = 5 degrees of freedom. The cumulative probability from TDIST is 0.8624, not 0.8428.

I realized that Gosset must have used the biased s², found by dividing by n; hence, it was necessary to find t by multiplying z by the square root of n–1. With n–1 degrees of freedom, TDIST duplicated Gosset's probabilities. For n = 4 observations, the values in column one are multiplied by the square root of 3 to get t, and using 3 degrees of freedom, we get all the values in column 2. Similarly, for the n = 5 column, the values in column one are multiplied by the square root of 4 to get t, and using 4 degrees of freedom, we get all the values in column 3.

I downloaded a copy of Gosset's 1908 paper, and indeed, on page 3, the variance s² was found by dividing by n; but why? The answer is contained in "Student's z, t, and s: What If Gosset Had R?" by Hanley, Julien, and Moodie in The American Statistician, 62(1), February 2008. Here is what they wrote: "Gosset defined s² as the sum of squared deviations divided by n, rather than n–1 (suggested in Airy's textbook) that yields an unbiased estimator of s², a decision influenced by his professor Karl Pearson. Gosset would have preferred to use n–1: he wrote to a Dublin colleague in May 1907, 'when you only have quite small numbers I think the formula with the divisor of n–1 we used is better.' Even in 1912 Karl Pearson, still a large sample person, remarked to him that it made little difference whether the sum of squares was divided by n or n–1 'because only naughty brewers take n so small that the difference is not of the order of the probable error' (Pearson 1939)."

True to his pseudonym, Gosset was the dutiful student to his professor, Karl Pearson. It is noteworthy that "Student" effectively parameterized his own t different from today's practice.

Ray Stefani
California State University, Long Beach
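The letter's arithmetic is easy to check with any t-distribution CDF. Here is one way to reproduce the four probabilities quoted above, using scipy in place of Excel's TDIST (my substitution, not the letter writer's tool):

```python
import math
from scipy.stats import t

# Gosset parameterized his table by z = x/s with the biased s (divisor n),
# so t = z * sqrt(n - 1) on n - 1 degrees of freedom reproduces his values.
print(t.cdf(0.1 * math.sqrt(3), df=3))  # ~0.5633, Gosset's entry for z = .1, n = 4
print(t.cdf(0.5 * math.sqrt(5), df=5))  # ~0.8428, Gosset's entry for z = .5, n = 6

# With the modern unbiased s (divisor n - 1), t = z * sqrt(n) instead:
print(t.cdf(0.1 * math.sqrt(4), df=3))  # ~0.5729
print(t.cdf(0.5 * math.sqrt(6), df=5))  # ~0.8624
```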
Correction In Volume 21, Issue 3, part of the “Children 2–5 year olds” graph for Figure 6 is missing from the article “Healthy for Life: Accounting for Transcription Errors Using Multiple Imputation— Application to a study of childhood obesity.”
Is Roger Clemens' WHIP Trajectory Unusual?

Jim Albert

In 2006, George Mitchell was appointed by Major League Baseball (MLB) Commissioner Allan H. Selig to investigate the use of performance-enhancing drugs. The "Mitchell Report" was submitted in December of 2007, and one of the ballplayers who allegedly used steroids was Roger Clemens of the Houston Astros. A general question of interest is whether Clemens had an unusual pitching career that might suggest the use of performance-enhancing substances. Hendricks Sports Management provided a detailed look at the statistical record of Roger Clemens' career, concluding that his pattern of pitching performance was not unusual and resembled the performance patterns of other pitchers who had long careers. In Volume 21, Issue 3, of CHANCE, authors Eric Bradlow, Shane Jensen, Justin Wolfers, and Abraham Wyner re-examined Clemens' statistics using a broader selection of pitching measures and a wider comparison set of pitchers,
reaching the different conclusion that Clemens’ career was atypical with respect to his peer group. How do you decide if a specific pitcher’s career is unusual? You begin by collecting career pitching data for pitchers comparable to that pitcher. As Clemens was a pitcher whose career spanned the years 1984–2007, it seemed reasonable to consider the careers of other recent pitchers with long careers. So, we collected pitching statistics for all pitchers who debuted between 1948 and 2007 and faced at least 5,000 batters, which is equivalent to pitching (approximately) 1,111 innings. We wanted to use a measure that reflects the ability of the one pitcher and is not heavily affected by chance variation, so we used the WHIP statistic— which measures the rate at which pitchers allow hits and walks—although the ERA (earned run average)— the rate at which pitchers allow runs—would also be a reasonable choice of pitching measure.
Roger Clemens, playing for the Double-A Corpus Christi Hooks, throws a pitch during the third inning of a baseball game against the San Antonio Missions on June 11, 2006, at Whataburger Field in Corpus Christi, Texas. AP Photo/Paul Iverson
Figure 1. Graph of talent WHIP distribution of pitchers for years 1934 to 2007. The solid line corresponds to the mean WHIP probability m, and the light lines correspond to the lower and upper quartiles of the WHIP probabilities.
Figure 2. Illustration of adjustment method for Bob Gibson. The points correspond to the season WHIP values. The solid line corresponds to the mean of the predictive distribution of WHIPs, and the dashed lines correspond to the mean plus and minus one standard deviation.
Can we compare the WHIP statistics of pitchers who played in different eras? Actually, no, as the ability of batters to get hits and walks changed significantly from 1948 to 2007. We will describe a method for adjusting a pitcher’s WHIP statistic in a particular season for the talent pool of pitcher abilities and the number of batters faced that season. After this adjustment procedure is applied for all pitchers for all seasons, we can obtain a collection of standardized scores. Also, this model produces smooth estimates for the individual pitcher trajectories. Using these fitted trajectories, we will learn about the general shapes of pitchers’ careers. When studying a trajectory, two features of interest are the pitcher’s peak ability and the age at which the pitcher achieves this peak ability. We were able to select 34 great pitchers who had the greatest peak abilities with respect to the standardized WHIP measure. In this group, we categorized the trajectory shapes and determined which pitchers had trajectories that differ substantially from the typical shape. By this work, we were able to see whether Clemens really had an unusual career trajectory.
The Analysis

Trajectory models

Using Sean Lahman's database (www.baseball1.com), pitching data were collected for many seasons. We limited our analysis to pitchers who debuted between 1948 and 1997 and who faced at least 5,000 batters (that is, BFP ≥ 5,000). There were 462 pitchers in this group. For each pitcher, we collected {yj}—the numbers of hits and walks (H + BB)—and {nj}—the counts of BFP—for all seasons of his career. In addition, we collected the pitchers' ages {xj} for each season. We are interested in the pattern of the "hit and walk rates" {yj/nj} as a function of the ages {xj}. With a slight abuse of notation, we will call this hit and walk rate WHIP, although WHIP is actually defined in baseball as the number of hits and walks allowed per inning.

Let pj denote the probability that the pitcher allows a hit or walk during the jth season. To summarize the pattern of WHIP probabilities {pj} as a function of age, one could use the quadratic model β0 + β1xj + β2xj². What is the motivation for the use of a quadratic model in describing the career trajectory of a pitcher's WHIP probability? First, the parabolic shape of this model corresponds to the common belief that a ballplayer increases in ability until midcareer (typically in the late 20s or early 30s), and then declines in ability until retirement. Second, by use of a quadratic model, there are simple parameters that describe the key aspects of a career trajectory. The peak ability of the pitcher—the minimum value of the quadratic function—is given by PEAK = β0 − β1²/(4β2), and the estimate at the age where the player achieves this peak ability is given by PEAK AGE = −β1/(2β2). Although this quadratic model is attractive for its simplicity, it does make several restrictive assumptions that are questionable in this situation. First, the quadratic model assumes the rate of increase in ability is the same as the rate of decline for a ballplayer. Second, the quadratic model is not flexible enough to allow 'interesting'
trajectories, such as a pitcher who has a constant ability for many years and sharply declines toward the end of his career. One way to generalize a quadratic model is by means of a spline function, a collection of piecewise-defined quadratic functions that meet at specified points in the domain, called knots. Here, we consider the use of the spline regression model

FIT = β0 + β1xj + β2xj² + β3(xj − b1)²I(xj ≥ b1) + β4(xj − b2)²I(xj ≥ b2) + β5(xj − b3)²I(xj ≥ b3),

where b1, b2, and b3 are fixed values of the knots. We choose b1 = 25, b2 = 30, and b3 = 35 because 25, 30, and 35 correspond to early, middle, and late periods of a pitcher's career. This model allows flexible fitting of career trajectories of pitchers. Also, as this is a linear regression model, it is a straightforward generalization of the quadratic model. If the regression coefficients β3, β4, and β5 are set equal to zero, you get the simpler quadratic model.

Adjustment of WHIP rates

Our goal is to simultaneously fit WHIP career trajectories for all pitchers in the years of our study who faced at least 5,000 batters. Over time, baseball has gone through many changes for the players and the ballpark dimensions, and these changes have caused substantial changes in the ability of the players to get on-base. So, we cannot simply compare one pitcher's WHIP of 0.300 in one season with a second pitcher's WHIP of 0.320 in a different season at face value. We can understand the size of a WHIP of 0.320, say, by comparing this WHIP value relative to the WHIP values of other pitchers who played the same season. It is necessary to adjust the raw WHIP rates {yi/ni} for all seasons that these pitchers played.

To understand how to adjust pitchers' WHIP rates, suppose we look at a collection of WHIP values for all pitchers for a particular season. There are two principal causes for the variation we see in these values. First, pitchers possess different talents for preventing batters from getting on-base. We measured these talents by the probabilities of the pitchers allowing baserunners. So, part of the variability in the WHIP values is because of the different WHIP probabilities of the pitchers. Second, even if a few pitchers have the same WHIP probability, there will be differences in the season WHIPs for these pitchers due to luck, or chance variation. For a given season, we want to determine how much of the total variation in the observed WHIP rates results from differences in the players' abilities (or differences in the pitcher probabilities) and how much is because of chance variation.

We applied a familiar random effects model to separate the two types of variation. Suppose there are N pitchers, and for the ith pitcher, we observe yi on-base events in ni plate appearances. We assume yi is binomial (ni, pi), where pi is the probability that this pitcher allows a batter to either hit or walk. The probabilities p1, …, pN represent the talents of the N pitchers in preventing baserunners. We assume these probabilities follow a beta (a, b) distribution with density

g(p | a, b) ∝ p^(a−1) (1 − p)^(b−1), 0 < p < 1.

It is convenient to reparameterize the beta parameters a and b by the mean m = a/(a + b) and the precision K = a + b. The mean m is the average talent of the pitchers that season, and K measures the precision or spread of the pitching talents. The variance of the pitcher probabilities is given by m(1 − m)/(K + 1).
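To make the spline model defined above concrete, here is a minimal Python sketch of the basis it implies and of the peak calculations; the function names, the age grid, and the use of NumPy are mine, not the article's.

```python
import numpy as np

KNOTS = (25.0, 30.0, 35.0)  # b1, b2, b3 in the spline model above

def spline_basis(ages, knots=KNOTS):
    """Design matrix with columns 1, x, x^2, and (x - b)^2 I(x >= b) for each knot b."""
    x = np.asarray(ages, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.where(x >= b, (x - b) ** 2, 0.0) for b in knots]
    return np.column_stack(cols)

def quadratic_peak(b0, b1, b2):
    """Closed-form peak of the plain quadratic b0 + b1*x + b2*x^2 (with b2 > 0,
    the minimum of the curve is the pitcher's best WHIP level)."""
    return b0 - b1**2 / (4.0 * b2), -b1 / (2.0 * b2)  # (PEAK, PEAK AGE)

def spline_peak(beta, ages=np.linspace(19.0, 49.0, 601)):
    """The full spline has no simple closed form, so scan a fine age grid;
    lower fitted values are better, so the peak is the minimum."""
    fit = spline_basis(ages) @ np.asarray(beta, dtype=float)
    j = int(np.argmin(fit))
    return fit[j], float(ages[j])  # (peak ability, peak age)
```

Setting the last three coefficients to zero reduces the spline to the quadratic case, and spline_peak then matches quadratic_peak up to the grid spacing (for a quadratic whose minimum falls inside the grid).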
Figure 3. Illustration of adjustment method for Mike Mussina. The points correspond to the season WHIP values. The solid line corresponds to the mean of the predictive distribution of WHIPs, and the dashed lines correspond to the mean plus and minus one standard deviation.
Figure 4. Illustration of adjustment method for Roger Clemens. The points correspond to the season WHIP values. The solid line corresponds to the mean of the predictive distribution of WHIPs, and the dashed lines correspond to the mean plus and minus one standard deviation.
For each season from 1948 through 1997, we fit this random effects model to the data for all pitchers and got estimates of m and K. Figure 1 graphs these fitted talent distributions against the season. The solid line corresponds to the mean WHIP probability, and the lighter lines correspond to the middle 50% of the talent distribution. This graph illustrates the substantive differences in pitcher talents in preventing baserunners across these seasons.

We used these fitted random effects models to make suitable adjustments to the WHIP rates. Specifically, we looked at a pitcher's observed WHIP relative to the predictive distribution of WHIPs for players during the same season facing the same number of batters. Suppose a pitcher allows yi on-base events in ni batters faced in a particular season for a WHIP rate of yi/ni. Suppose the talent distribution for the WHIPs in that season is estimated to be a beta distribution with parameters m and K. Then, the predictive distribution of yi/ni can be shown to have a mean m and standard deviation

SD = √( m(1 − m)(1/(K + 1) + 1/ni) ).

Using these predictive moments, a reasonable adjustment of a pitcher's WHIP is based on the z-score

zi = (yi/ni − m)/SD.

This z-score is a simple statistic that can be interpreted as the number of standard deviations above or below the mean for a hypothetical group of pitchers in the same season facing the same number of batters. For the great pitchers in our study, the season z-scores at their peak are smaller than −1.5, indicating the WHIP values for these pitchers were more than one and a half standard deviations below the typical WHIP for the season in which the peak performance was achieved.

Consider the application of this adjustment procedure for three pitchers: Bob Gibson, Mike Mussina, and Clemens. Figures 2 and 3 show the WHIP values for Gibson and Mussina. In a quick examination, the pitchers seem to have similar abilities to prevent baserunners, as both pitchers had WHIP values in the neighborhood of 0.280. But, we have to be cautious in this assessment because these pitchers played in different baseball eras. In each graph, the solid line represents the mean m of the predictive distribution and the dashed lines represent the mean plus and minus one standard deviation. In Figure 2, we see that Gibson's WHIP values were approximately one standard deviation below the mean in the middle of his career. In contrast, Mussina's WHIP values are between one and a half and two standard deviations below the mean. Mussina appears to be a better pitcher, relative to his peers, than Gibson, based on the adjusted WHIP criteria. Clemens' WHIP values and comparison predictive distribution lines are displayed in Figure 4. Like Mussina, Clemens was a superior pitcher to Gibson, with many WHIP values more than one standard deviation below the mean.
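As a concrete illustration of this adjustment step, here is a minimal sketch in Python. The article does not say how m and K were estimated, so the maximum likelihood routine below, and all names and starting values, are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_season_talent(y, n):
    """Estimate (m, K) for one season by maximum likelihood under the
    beta-binomial model yi ~ BetaBin(ni, a = mK, b = (1 - m)K)."""
    y, n = np.asarray(y, dtype=float), np.asarray(n, dtype=float)

    def neg_loglik(params):
        m, K = params
        a, b = m * K, (1.0 - m) * K
        # binomial coefficients are dropped: they do not depend on (m, K)
        return -np.sum(betaln(y + a, n - y + b) - betaln(a, b))

    res = minimize(neg_loglik, x0=[0.32, 200.0], method="L-BFGS-B",
                   bounds=[(1e-4, 1.0 - 1e-4), (1.0, 1e6)])
    return res.x  # (m_hat, K_hat)

def whip_zscore(y_i, n_i, m, K):
    """Standardize a pitcher-season WHIP rate against the predictive
    distribution for that season and that number of batters faced."""
    sd = np.sqrt(m * (1.0 - m) * (1.0 / (K + 1.0) + 1.0 / n_i))
    return (y_i / n_i - m) / sd
```

With made-up numbers, whip_zscore(250, 900, 0.33, 250.0) is about −1.6: a pitcher allowing 250 hits plus walks in 900 batters faced, in a season with m = 0.33 and K = 250, sits roughly 1.6 standard deviations better than that season's typical pitcher.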
A three-level model to simultaneously estimate career trajectories

The adjustment procedure was applied for the WHIP rates for all pitchers for all seasons, and all WHIP rates were re-expressed to z-scores. For the ith player in our data set, we obtained a set of z-scores {zij} with corresponding ages {xij}, where the subscript j ranged over the career seasons of the player. We learned about a player's 'true' career trajectory by fitting the regression spline model

zij = βi0 + βi1xij + βi2xij² + βi3(xij − b1)²I(xij ≥ b1) + βi4(xij − b2)²I(xij ≥ b2) + βi5(xij − b3)²I(xij ≥ b3) + εij,

where I(A) is the indicator function equal to one if A is true and the errors {εij} are assumed to be normally distributed with mean 0 and standard deviation σ. Let βi = (βi0, βi1, βi2, βi3, βi4, βi5) denote the vector of regression coefficients for the ith player. This model assumes each player possesses a unique career trajectory, which means the pitchers can exhibit different peaks, different peak ages, and different rates of improvement and decline. The regression spline defines the sampling or first level of the model.

We then used a two-stage prior model to simultaneously estimate the career trajectories of the z-scores from a collection of pitchers. For a particular year, we considered the collection of pitchers (with at least 5,000 BFP) who debuted within two years of the given year. We assumed each pitcher in this collection followed a trajectory regression spline model where the regression parameters of the ith pitcher are given by βi. If we have N pitchers in our collection, the first stage of the prior assumes the regression vectors β1, ..., βN follow a common normal distribution with mean μβ and variance-covariance matrix Σβ. The parameter μβ represents the average, or typical, career trajectory, and the variance-covariance matrix Σβ represents the variation of the collection of true player trajectories about the average. As we had little knowledge about the locations of these parameters, we assigned μβ and Σβ vague prior distributions. To model vague beliefs, the mean vector μβ was assigned a multivariate normal distribution with mean vector 0 and a precision matrix (inverse of the variance-covariance matrix) with small entries. Likewise, the unknown parameter matrix Σβ was assigned a Wishart distribution, where the inverse of the scale matrix is assigned small values.

This three-level model is an effective way of combining the career trajectory data for a group of pitchers. If we fit the spline model separately for the N pitchers, we obtain 'rough' estimates at the career trajectories due to the high variation in the WHIP data for pitchers. By simultaneously fitting the trajectories for the N pitchers by the multilevel model, the individual trajectory estimates are smoothed toward the average trajectory.
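The smoothing toward the average trajectory can be understood as precision weighting. The sketch below shows only that basic normal-normal calculation with known covariance matrices; it is an intuition aid, not the full Bayesian three-level fit the article describes.

```python
import numpy as np

def shrink_toward_mean(beta_ols, cov_ols, mu_beta, sigma_beta):
    """Combine a pitcher's own least-squares spline coefficients with the
    group mean trajectory, weighting each by its precision (inverse covariance).
    Noisy individual fits are pulled strongly toward the group average."""
    P_i = np.linalg.inv(np.asarray(cov_ols, dtype=float))      # individual precision
    P_g = np.linalg.inv(np.asarray(sigma_beta, dtype=float))   # population precision
    rhs = P_i @ np.asarray(beta_ols, dtype=float) + P_g @ np.asarray(mu_beta, dtype=float)
    return np.linalg.solve(P_i + P_g, rhs)
```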
Average and individual trajectory estimates

The mean trajectory fits

The estimated multilevel model can be used in several ways. First, by estimating the mean regression vector μβ, we learn about the typical career trajectory of the standardized WHIP values for pitchers (with at least 5,000 BFP) who debuted during a particular five-year interval. Second, this multilevel model allows smooth estimates of the individual pitcher trajectories {βi}, and from these estimates one can obtain estimates at the pitchers' peak performances and ages at which they achieved these peak performances.
Figure 5. Graph of the mean trajectory fits in the multilevel model fitting for the pitchers who debuted in the years 1948, 1954, 1960, 1966, 1972, 1978, 1984, 1990, and 1996
Figure 6. Graph of the mean peak age fits in the multilevel model fitting. A given point represents the mean peak age fit among all players within two years of the given debut year. A lowess fit is drawn to show the general pattern of the graph.
Figure 5 displays the posterior mean of the mean trajectory fit for pitchers who debuted in the years 1948, 1954, 1960, 1966, 1972, 1978, 1984, 1990, and 1996. It is interesting to see that all the mean trajectories of standardized WHIP have the same basic shape. The average pitcher slowly increases in ability until the late 20s, and then rapidly decreases in pitching ability toward retirement. This figure demonstrates that the use of the quadratic model, with a symmetric shape, would be unsuitable in this application.

To look for subtle changes in the mean trajectory across years, suppose we focus on the average peak age of players who debuted in different years. Figure 6 displays the mean peak age against the debut year for the years 1948 to 1997. (Remember, in each multilevel model fitting, we used data from pitchers who debuted within two years of the particular year.) There is an interesting pattern in this graph. A smoothing (lowess) curve is placed on the graph to pick up the main pattern. The mean age that pitchers peaked (with respect to the standardized WHIP measure) declined steadily from 1950 through 1960, increased until 1970, decreased until 1975, increased until 1985, and then declined in recent years. It is interesting that the large values of peak age occurred during the so-called "steroids era."
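The smoothing curve in Figure 6 is a standard lowess fit. A minimal sketch of producing one follows; the peak-age values below are placeholders, not the article's estimates.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

debut_years = np.arange(1948, 1998)
# Placeholder values only; the real inputs are the mean peak ages from the multilevel fits.
mean_peak_ages = 28.0 + 0.5 * np.sin((debut_years - 1948) / 8.0)

# Result is an (n, 2) array: column 0 = debut year, column 1 = smoothed mean peak age.
smoothed = lowess(mean_peak_ages, debut_years, frac=0.5)
```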
The individual trajectory fits

The remaining graphs focus on the individual pitcher trajectory estimates. After we fit a set of multilevel models, we estimated the individual trajectories {βi} for all 462 pitchers in the study, and each trajectory estimate gives an estimate at the pitcher's
peak age and the peak ability value. We focused on the 34 pitchers whose peak standardized score is smaller than z = −1.5. The WHIP abilities of these 34 pitchers, at their peak, were one and a half or more standard deviations below the mean. Table 1 displays players sorted by the peak value. For each pitcher, the table displays the debut year, the estimate at the pitcher's peak ability, and the estimate at the peak age. Figure 7 displays the debut year and peak age for these great pitchers. Note that most of these pitchers peaked in their late 20s, although there are a number of pitchers—Gibson, Gaylord Perry, Kevin Brown, John Smoltz, and Dick Hall—who peaked in their early 30s. Fernando Valenzuela and Dwight Gooden's best abilities were at the beginning of their careers, and Hoyt Wilhelm (famous knuckleball pitcher) and Nolan Ryan peaked near 40 years old. Figure 8 displays the career trajectories for four representative pitchers: Pedro Martinez, Randy Johnson, Andy Messersmith, and Clemens. The points in each graph correspond to the WHIP values (on the z scale), and the smooth curve corresponds to the estimated trajectory from fitting the multilevel model. From viewing these sets of trajectories, we see a variety of trajectory shapes. It seems these trajectories can be distinguished with respect to (1) the age at which peak ability is achieved, (2) the pattern of decline, and (3) 'unusual' shapes that deviate greatly from the general pattern.
Table 1—Top 34 Players With Respect to Peak WHIP

Rank  Pitcher               Debut Year  Peak Ability*  Peak Age*
1     Pedro Martinez        1992        −3.11          28.1
2     Greg Maddux           1986        −2.80          29.1
3     Sandy Koufax          1955        −2.69          30.0
4     Dennis Eckersley      1975        −2.52          33.6
5     J.R. Richard          1971        −2.37          30.0
6     Robin Roberts         1948        −2.24          27.4
7     Randy Johnson         1988        −2.19          35.7
8     Kevin Brown           1986        −2.13          33.5
9     Catfish Hunter        1965        −2.09          27.5
10    Hoyt Wilhelm          1952        −2.09          41.0
11    Tom Seaver            1967        −2.08          28.7
12    Juan Marichal         1960        −2.03          27.8
13    Curt Schilling        1988        −2.02          32.0
14    Sid Fernandez         1983        −1.93          27.9
15    Don Sutton            1966        −1.91          29.1
16    Fergie Jenkins        1965        −1.90          29.0
17    Mike Scott            1979        −1.88          32.1
18    Rich Gossage          1972        −1.86          28.4
19    Dwight Gooden         1984        −1.79          19.0
20    Dick Hall             1952        −1.76          33.1
21    John Smoltz           1988        −1.76          32.1
22    Roger Clemens         1984        −1.72          27.7
23    Don Drysdale          1956        −1.70          26.7
24    Dennis Martinez       1976        −1.67          36.8
25    Bob Gibson            1959        −1.67          32.9
26    Don Newcombe          1949        −1.66          27.7
27    Andy Messersmith      1968        −1.65          26.6
28    Mike Mussina          1991        −1.60          28.9
29    Nolan Ryan            1966        −1.60          39.9
30    Fernando Valenzuela   1980        −1.58          20.0
31    Gaylord Perry         1962        −1.58          33.3
32    Jim Palmer            1965        −1.54          29.0
33    Pascual Perez         1980        −1.54          31.6
34    Jim Bunning           1955        −1.50          29.7

*Peak ability is the number of standard deviations below average on the standardized scale, and peak age is the age where one estimates that this peak ability was attained.
Figure 7. Scatterplot of debut year and peak age for 34 great pitchers
Figure 8. Trajectories of z-scores and multilevel quadratic fits for Pedro Martinez, Randy Johnson, Andy Messersmith, and Roger Clemens
What about Roger Clemens?
Recall from the introduction that we were interested in the unusual nature of Clemens' career trajectory. To see how a pitcher's trajectory differs from an 'average' pitcher's trajectory, it is useful to define a residual trajectory. We first used the three-level model to estimate simultaneously the trajectories for the 34 great pitchers. In the fit, we have individual pitcher trajectory estimates {β̂i} and a mean trajectory estimate μ̂β. If xij denotes the vector of covariates for the ith pitcher for the jth season, we define the residual trajectory Ri for the ith pitcher as the difference between the individual and mean trajectory fits,
Rij = xijβ̂i − xijμ̂β.

The residual trajectory Ri measures the deviation of a pitcher's trajectory from a typical trajectory on the z scale. By graphing the residual trajectory against the player's age, we are interested in seeing if the residual displays unusually large negative values for large ages. (We are most interested in the pattern for large ages because Clemens was accused of using steroids toward the end of his career.) This pattern would indicate that the pitcher's ability is deteriorating much less than would be expected by the average trajectory for great pitchers. After inspection of the residual trajectories for the 34 pitchers, Figure 9 displays the residuals for a group of pitchers, including Clemens, who displayed an unusual pattern for mature ages. The knuckleball pitcher, Wilhelm, clearly stands out because he had an unusual career that spanned from age 29 to 49. Also, Ryan had an unusually strong pitching performance into his 40s. Clemens does display an unusual residual trajectory, but his trajectory is pretty typical of the trajectories of the seven pitchers displayed in Figure 9.

Figure 9. Residual plots for seven pitchers (Gaylord Perry, Don Sutton, Roger Clemens, Nolan Ryan, Dennis Eckersley, Randy Johnson, and Hoyt Wilhelm) who displayed unusual ability for mature ages. The residual is defined as the difference between a pitcher's fitted trajectory and the mean trajectory fit for the 34 great pitchers.
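A residual trajectory like those in Figure 9 is simply the difference of two fitted curves evaluated on a common age grid. A minimal sketch follows; the basis construction repeats the spline defined earlier, and all names are mine.

```python
import numpy as np

def spline_design(ages, knots=(25.0, 30.0, 35.0)):
    """Piecewise-quadratic spline design matrix used throughout the article."""
    x = np.asarray(ages, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.where(x >= b, (x - b) ** 2, 0.0) for b in knots]
    return np.column_stack(cols)

def residual_trajectory(beta_i, mu_beta, ages):
    """Pitcher's fitted z-WHIP curve minus the group mean trajectory.
    Large negative values at old ages flag a pitcher aging unusually well."""
    X = spline_design(ages)
    return X @ np.asarray(beta_i, dtype=float) - X @ np.asarray(mu_beta, dtype=float)
```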
1. Standard shapes
Pedro Martinez, Greg Maddux, Robin Roberts, Catfish Hunter, Tom Seaver, Juan Marichal, Sid Fernandez, Fergie Jenkins, Goose Gossage, Don Newcombe, Mike Mussina, Jim Palmer, Pascual Perez, and Jim Bunning all exhibited similar trajectories. They all had significant ascents to their peak, peaked about age 29, and showed substantial decline toward retirement.

2. Mature peak ages
Dennis Eckersley, Randy Johnson, Kevin Brown, Hoyt Wilhelm, Curt Schilling, Mike Scott, Dick Hall, John Smoltz, Dennis Martinez, Bob Gibson, and Gaylord Perry all peaked at relatively old ages. Wilhelm (the knuckleball pitcher) was especially unusual in that he peaked at age 41 and pitched until he was 49.

3. Slow declines
Don Sutton, John Smoltz, and Roger Clemens all exhibited slow declines after their peak.

4. Early bloomers
Don Drysdale and Andy Messersmith peaked a little early in their careers, but otherwise exhibited a normal career trajectory. Dwight Gooden and Fernando Valenzuela both peaked very early, and their trajectories showed a constant decline until retirement.

5. No decline
Sandy Koufax and J. R. Richard had abrupt retirements at the peak of their abilities, so exhibited no decline.

6. Strange shapes
Nolan Ryan was distinctive in that he displayed a general increase in ability from ages 20 to 40.
What Did We Learn?

There are several innovative aspects of our study of pitcher trajectories that distinguish it from the analysis of "A Statistical Look at Roger Clemens' Pitching Career." First, we adjust all pitchers' WHIP rates for the years they played. Michael J. Schell, in his book Baseball's All-Time Best Hitters, makes a persuasive argument for the need to adjust player statistics. In evaluating hitters, Schell adjusts raw hitting rates of players for late career declines, seasons of hitting feasts and famines, pool of batting talent, and the ballparks. He is implicitly recognizing hitters' career trajectories by adjusting for late career declines. Although the authors of the CHANCE article do not directly make any adjustment, they implicitly make an adjustment by considering only pitchers in the 'modern era' with a lower pitching mound. Here, we adjust a pitcher's WHIP for a particular season by fitting a model to the pitching data and
computing a z-score based on the predictive distribution of the WHIP rate. Second, we provide a flexible approach for estimating pitcher career trajectories. In the Journal of Quantitative Analysis of Sports, Ray Fair, in the article “Estimated Age Effects in Baseball,” assumed baseball players all peaked at the same age and exhibited the same rates of ascent and decline. Likewise, the authors in the CHANCE article model trajectories by means of quadratic functions. In contrast, we assume each player has a unique career trajectory, modeled by a general spline function. This approach
allows much flexibility, and the fitted trajectories have a variety of shapes corresponding to different aging patterns.

Last, this study has illustrated the benefits of multilevel modeling that assumes players born in similar years have similar trajectories. This modeling produces attractive smooth estimates at the individual trajectories and allows the estimation at the average trajectories for different seasons. In our analysis of 34 top pitchers, we see that many pitchers have trajectories where the peak ability is achieved at about age 29. Other pitchers, such as Clemens, show unusual trajectories that are unusually flat or have peaks achieved at young or old ages. By use of residual trajectories, we saw that Clemens displayed unusually strong pitching ability toward the end of his career, but there were six other pitchers in our study who also displayed similar residual trajectories, and none of these pitchers has been accused of using steroids.

More Trajectories

Plots of the career trajectories of all 34 great pitchers can be found at www.amstat.org/publications/chance. In addition, the above adjustment and multilevel fitting procedure were applied using the ERA (earned run average) pitching statistic, instead of WHIP, and fitted trajectories for the great pitchers can also be found on the web site. The ERA trajectories are similar in shape to the WHIP trajectories described here.

Further Reading for "Is Roger Clemens' WHIP Trajectory Unusual?"

There have been some statistical studies of career trajectories of athletes.

The Bill James Baseball Abstract, Ballantine Books
In a baseball context, Bill James explains that there is a bias in simply averaging player performances over age, since only the better hitters and pitchers are playing at advanced ages. James provides evidence that baseball players generally peak at age 27.

"Bridging Different Eras in Sports," Journal of the American Statistical Association, S. M. Berry, C. S. Reese, and P. D. Larkey
In 1999, the authors performed an extensive study in which they estimated the career trajectories for athletes in baseball, hockey, and golf. They used a nonparametric aging function in their modeling and rated the top 25 hitters of all time using the criteria of batting average and home run rate.

"Career Trajectories of Baseball Hitters," Jim Albert
In this 2002 technical report, Jim Albert performed an extensive analysis of career trajectories of hitters. A linear weight measure of batting performance was used together with a multilevel model with a quadratic trajectory function. Estimates of the peak ages were made for players from different eras; this analysis suggested that players peaked at about age 28–29.

"Estimated Age Effects in Baseball," Journal of Quantitative Analysis of Sports, R. C. Fair
In the 2008 article, Fair performs an extensive analysis of career trajectories of pitchers (ERA measure) and hitters (OBP and OPS measures) in baseball. No adjustments were made to the performance measures, and each player was assumed to have a "quadratic shape" trajectory where the improvement and decline were described by separate parameters. The improvement, decline, and peak age parameters were assumed to be the same across players, and each player had a unique intercept parameter.

Semiparametric Regression by David Ruppert, M. P. Wand, and R. J. Carroll
In this 2003 book, the authors provide an extended discussion of the use of the class of spline regression models that are used in this modeling.

Data Analysis Using Regression and Multilevel/Hierarchical Models, Andrew Gelman and Jennifer Hill
In this book, authors Andrew Gelman and Jennifer Hill provide a number of illustrations of the benefits of multilevel modeling when one is fitting a regression model over groups, where the regression parameters can change over the grouping variable.
Further Reading

Bradlow, E., Jensen, S., Wolfers, J., and Wyner, A. (2008), "A Statistical Look at Roger Clemens' Pitching Career," CHANCE, Vol. 21, 24–30.

Fair, R. (2008), "Estimated Age Effects in Baseball," Journal of Quantitative Analysis of Sports, Vol. 4.

Hendricks, R., Mann, S., and Larson-Hendricks, B. (2008), "An Analysis of the Career of Roger Clemens," www.rogerclemensreport.com.

Schell, M. (1999), Baseball's All-Time Best Hitters, Princeton University Press.

Schell, M. (2005), Baseball's All-Time Best Sluggers: Adjusted Batting Performance from Strikeouts to Home Runs, Princeton University Press.
'The Natural'? The Effect of Steroids on Offensive Performance in Baseball

Brian Schmotzer, Patrick D. Kilgo, and Jeff Switchenko
Oh, somewhere in this favored land the sun is shining bright; The band is playing somewhere, and somewhere hearts are light, And somewhere men are laughing, and somewhere children shout; But there is no joy in Mudville — mighty Casey has struck out. “Casey at the Bat” — Ernest Thayer
First published in the San Francisco Examiner on June 3, 1888, "Casey at the Bat" captured the spirit of the national pastime, and to this day, many baseball fans know by heart this famous last stanza. Earlier in the poem, Casey is introduced as a giant among men, in complete control, and incapable of failure. As the poem continues, the tension builds, while Casey's confidence—his cockiness—grows in step with the magnitude of the moment. It rises to a crescendo as a triumphant success is foreshadowed for the protagonist in the penultimate stanza. That Casey strikes out in the end is a classic sucker punch. The fans in Mudville, and by proxy the reader, are left with a shattered view of their hero, their team, their season, and their sport.
The parallel with the modern game is striking. Today’s baseball players, like many professional athletes, are cast as giants among men. But just like Casey, they are capable of profound moral failure. In recent years, there has been rampant speculation that professional baseball players have used illegal steroids to increase their performance. With the evidence of abuse mounting, there has been no joy in the proverbial Mudville, as fans and non-fans, alike, confront the sullying of America’s national pastime. For decades, anabolic steroids have been used by some athletes to gain an edge in performance. As steroids are thought to increase protein synthesis and, therefore, muscle mass and strength, it is logical that some aspects of performance in sports would be enhanced. Indeed, there is little doubt that athletes in sports that rely on explosive power—such as track and field, swimming, and bicycling—can improve their performance through the careful administration of steroids into their training regimens. However, it is not immediately clear that a baseball player would see an improvement in performance by using steroids. On the one hand, the logic of increased strength implies that a hitter in baseball could bat the ball with more force. On the other hand, the hitter must first make contact with the pitched ball, and increased muscle mass may inhibit his ability to do so.
Therefore, the purpose of our study was to assess whether there was an observable increase in offensive performance due to steroid use. We note here that steroids are just one type of drug to which athletes may turn. The larger class of so-called performance enhancing drugs (PEDs) includes human growth hormone (HGH). We have extended our study to include HGH and will briefly describe the results below. The focus of this article, however, is steroids.
The 'Mitchell Report'

In the spring of 2006, the commissioner of Major League Baseball, Bud Selig, appointed former Sen. George Mitchell to investigate the extent to which PEDs, including steroids, had proliferated throughout baseball. On December 13, 2007, the "Mitchell Report" to the commissioner of baseball was released, following more than a year and a half of investigation. The 409-page report summarized the alleged abuse of PEDs among major league baseball players. In total, 89 current and former players were identified as alleged PED users. The report drew heavily on the testimony and paper trail of trainers who purportedly were involved as intermediaries between PED distributors and players. The evidence included detailed information about specific seasons and types of PED abuse allegedly undertaken by the accused players.
We used this information to compile a new database containing the following:

• The offensive players alleged to have abused PEDs
• The seasons during which the players allegedly abused PEDs
• The types of PEDs allegedly abused by the players (steroids, HGH, or both)
• Other ancillary items, including the source of the allegations and whether there is a paper trail of evidence
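One plausible layout for such a database is sketched below with entirely made-up rows; the column names and values are mine, not the report's, and no real player is implied.

```python
import pandas as pd

mitchell_db = pd.DataFrame(
    [
        {"player_id": "playerA01", "season": 2001, "ped_type": "steroids",
         "source": "trainer testimony", "paper_trail": True},
        {"player_id": "playerB01", "season": 2003, "ped_type": "HGH",
         "source": "purchase records", "paper_trail": True},
    ],
    columns=["player_id", "season", "ped_type", "source", "paper_trail"],
)

# Merging with a season-level performance table (e.g., Lahman's batting data)
# attaches the alleged-use flags to each player season:
# batting = pd.read_csv("Batting.csv")
# merged = batting.merge(mitchell_db, how="left",
#                        left_on=["playerID", "yearID"],
#                        right_on=["player_id", "season"])
```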
The “Mitchell Report” is surprisingly complete with respect to this information. To ensure an objective analysis, our approach was to treat the “Mitchell Report” as inerrant and to closely record its contents without regard to their veracity or accuracy.
Every attempt was made to use conservative measurements for alleged steroid use and non-use by players. Conservative, in this sense, means that even when the “Mitchell Report” listed anecdotal evidence or implied guilt by association, these seasons were not labeled as PED seasons unless there was a definitive statement in the report (specific dates and paper trails of evidence). A more liberal reading of the report would have led to more player seasons labeled as PED seasons. We did not attempt this because it would lead to a slippery slope of which seasons to label as PED seasons without firm criteria. Such an attempt could open the study to investigator bias.
Previous Studies

Prior to the "Mitchell Report," investigations into steroid use amounted to speculation about use by individual players and ad hoc analyses of their performance. The release of the "Mitchell Report" prompted increased scrutiny. Ten days after the release, J. R. Cole and Stephen Stigler, in a 2007 non-peer-reviewed newspaper editorial, checked whether player performance changed in the season steroid abuse supposedly started, compared to previous years. Using 48 hitters and 23 pitchers, they concluded that steroid use did not positively impact offensive or pitching performance—and may have hurt it.

The release of the report also prompted more pseudoscientific research into individual players. For example, on January 28, 2008, Roger Clemens, a prominent pitcher and one of the accused players, released a 45-page report (www.rogerclemensreport.com) via his legal team that concluded his sustained performance into the latter years of his career was not aberrant and thus served as insufficient grounds for allegations of steroid abuse. In a February 2008 New York Times article, a group of professors—Eric Bradlow, Shane Jensen, Justin Wolfers, and Adi Wyner—countered Clemens' claims on the basis that the comparison group chosen in the Clemens report exhibited selection bias. This study claimed that Clemens' performance, when compared to a proper control group, was indeed aberrant. The common denominator in most studies is a heavy focus on one individual. The Cole-Stigler editorial
notwithstanding, no comprehensive study of the alleged players and the effect of steroids on their performance was undertaken prior to our work. The key improvement in our study is the inclusion of a full and proper control group against which to make an appropriate comparison.

Former Sen. George Mitchell calls on a reporter during a New York news conference, Thursday, December 13, 2007, about his report on the illegal use of steroids in baseball. AP Photo/Richard Drew
The Data

Our study includes all offensive seasons from 1995 to 2007 with at least 50 plate appearances (PAs) in a season. A PA is recorded every time a hitter comes to the plate to bat, regardless of the outcome. This study period is chosen because it conforms to what is typically referred to as the "steroids era," in which steroid abuse was considered to be prevalent. Further, this time period exhibited overall offensive production that is relatively constant (compared to earlier eras in baseball's past when offensive production varied widely from where it is today). Pitchers were excluded from the analysis. Pitchers have been accused of using steroids, including players mentioned in the "Mitchell Report," but we have not attempted to quantify the steroids effect for pitchers in this study.
By including all seasons for all offensive players, we can compare the performance in seasons denoted as steroid seasons against all other nonsteroid seasons. We believe this is the most appropriate basis for comparison because it uses all the available information, rather than an arbitrarily selected subsample of the data. To assess the impact of steroids on offensive performance, we must have a measure of offensive performance. The primary measure we chose is runs created per 27 outs (RC27). This is a statistic that measures a player’s overall offensive performance. The simplest interpretation of this statistic is that RC27 represents the average number of runs a team would be expected to score if the batter in question exclusively batted for his team. The specific equation used to calculate RC27 varies slightly, and the version we chose is the one found at www.espn.com. In addition to RC27, we considered other measures of offensive performance: home runs, isolated power, on-base percentage, and stolen bases. All the offensive performance data is available for all players by season in the Lahman database (2008). This database was merged with our newly created Mitchell database (containing the steroid abuse information) to give the final analytic database.
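The study uses the RC27 variant posted at www.espn.com, which is not reproduced here. As a rough illustration only, the basic (Bill James) runs-created formula expressed per 27 outs looks like the sketch below; the authors' exact formula may differ.

```python
def rc27_basic(h, bb, ab, tb):
    """Rough sketch: basic runs created RC = (H + BB) * TB / (AB + BB),
    scaled to a per-27-outs rate using AB - H as the out count.
    The ESPN variant the study uses includes additional terms."""
    rc = (h + bb) * tb / (ab + bb)
    outs = ab - h
    return 27.0 * rc / outs

# Example: 180 hits, 60 walks, 550 at-bats, 300 total bases -> about 8.6 runs per 27 outs
# rc27_basic(180, 60, 550, 300)
```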
●
●
Raw data
●
Average at each age 20
Overall average ● ●
Runs created per 27 outs (RC27)
● ●
15
●
●
● ●
●
Figure 1. Runs created per 27 outs (RC27) versus age in years for 6,657 player seasons from 1995 to 2007
Age Effects It is generally accepted that players (on average) improve early in their careers, maintain peak or near-peak performance for several years, and then tail off their performance at the end of their careers. This is termed the “age effect,” or sometimes when referring to an individual, his “career trajectory.” It is necessary to make an adjustment for age because otherwise we might mistakenly conclude a change in performance is due to steroids when a change in age is a competing explanation. Making an age adjustment for all player seasons removes age as a possible confounder in the steroids results. Figure 1 shows RC27 plotted against player age. As you can see by looking at the mean at each age (heavy solid
line), the observed age effect is quite mild compared to the variability in the data. While it does seem the younger players are performing at below peak levels, the conventional wisdom that performance tails off at the end of careers is not evident in this plot. One reason for this is the influence of a single player, Barry Bonds, who posted remarkable seasons in his late 30s that changed the nature of the tail of the distribution. Another explanation for the relatively flat age effect in Figure 1 is the impact of selection bias. As mediocre players age and get worse, they are less likely to be employed by a major league team. Hence, when we look back at all 37-year-olds, for example, we find a relatively high mean RC27 because only
the players capable of maintaining that high level are allowed to continue to play. A similar argument says that only the very good young players are allowed to play and the curve is artificially flat at the lower end. It is an open area of baseball research to determine what the true underlying average age effect curve is. In a presentation at the 2008 Joint Statistical Meetings, Phil Birnbaum reviewed the age effect problem. He suggested the curve from Figure 1 is too flat to be the truth (because of the selection effect). He further posited that a popular, but biased, estimate of the age effect based on the “paired seasons” method is too curved to be the truth, and reasonably concluded that the truth must lie somewhere in between.
To check whether the results are sensitive to different age adjustments, we applied the “paired seasons” approach from Birnbaum’s presentation and a naive adjustment based on Figure 1. We found that the steeper curve resulted in larger estimates of the steroid effect. Therefore, we report the smaller, more conservative results in this article. If researchers ever arrive at a consensus age effect curve, it will almost certainly be steeper than that shown in Figure 1, and the corresponding steroid effect estimates would be larger than those reported here. The naive method adjusted individual player seasons to the overall average (the horizontal reference line in Figure 1). The adjustment was the simple difference between the reference line and the line of averages, added to each individual player season. For example, the average RC27 at age 24 was 4.49 and the overall average was 4.87, so the age adjustment for 24-year-old players was 0.38. Any player season by a 24-year-old would be increased by 0.38. The same type of adjustment is made for each age. Because other measures of offensive performance (not just RC27) were considered, the age effect had to be estimated for each measure and an age adjustment made in a similar manner for each. The apparent age effect for the other variables was as mild as, or milder than, that seen for RC27 (figures not shown). Again, assuming the true age effect curve is steeper, the results presented here should be conservative. Whenever RC27, HR (home runs), or other baseball statistics are used in this article, we are referring to the age-adjusted statistic.
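To make the naive adjustment concrete, here is a minimal sketch in Python; it assumes the player-season data sit in a pandas DataFrame with hypothetical columns named age and rc27 (the column names, and the DataFrame itself, are our assumptions, not the authors’).

```python
import pandas as pd

def age_adjust(seasons: pd.DataFrame, stat: str = "rc27") -> pd.DataFrame:
    """Naive age adjustment described in the text: shift each season's
    statistic by the gap between the overall mean and the mean at that age."""
    overall_mean = seasons[stat].mean()
    age_means = seasons.groupby("age")[stat].transform("mean")
    seasons[stat + "_adj"] = seasons[stat] + (overall_mean - age_means)
    return seasons

# For example, if the age-24 mean is 4.49 and the overall mean is 4.87,
# every season by a 24-year-old is shifted up by 0.38, as in the text.
```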
Estimating the Steroids Effect To estimate the effect of steroid use on offensive performance, one must consider the appropriate statistical model for the data at hand. Let us start with two simplified modeling strategies that would be suitable if our data structure was a little different. First, consider a data set with n1 players on steroids and n2 players not on steroids, where each of the players contributes one season of information. To judge whether steroids are associated with increased performance in this example, you would need to perform a two-sample independent t-test, comparing the average RC27 between the two groups.
Second, consider a data set with n players, where each player contributes two seasons of information: one on steroids and one not on steroids. To judge the effect of steroids in this example, you would need to perform a paired t-test, comparing the average change in RC27 between the steroid and nonsteroid seasons. The actual data is somewhere between these two conditions. Similar to the first example, there is a steroid group and a nonsteroid group (independent data). And like the second example, some players contribute both a steroid season and a nonsteroid season (paired data). However, the real data is more complicated than either example, because many players contribute several seasons of data and some players contribute several steroid seasons and several nonsteroid seasons. To use all this data, we must turn to linear mixed effects modeling. A linear mixed effects model is an extension of the traditional linear model. One purpose of such a model is to allow the analysis of data where the observations are correlated with each other, such as when one player contributes more than one observation to the data set. This is accomplished by identifying so-called “fixed effects” and “random effects” portions of the model. The fixed effects portion is the traditional component that identifies a response or dependent variable of interest and attempts to explain its variability by regressing it on one or more predictors or independent variables. The random effects portion is the component that accounts for the variability due to a random sample of players in the data and the variability within players (i.e., season to season variability). The interpretation of the fixed effects portion of a linear mixed effects model is the same as the interpretation of the traditional linear model after taking account of the random effects. We will focus on the fixed effects interpretation because it contains the estimate of the steroid effect. The simplest model has only the fixed effect of steroids. As the variable steroids is dichotomous (each season is denoted as 0=nonsteroid or 1=steroid), this model has the flavor of
a t-test. However, as mentioned, there are repeated measures for some players that need to be accounted for, so this model needs to be fit using the mixed models framework. The fixed effects portion of the model is of the form: RC27 = b0 + b1 STEROIDS, where b0 represents the intercept and b1 represents the slope associated with the steroids predictor. The resulting model fit for our data is: RC27 = 4.62 + 0.83 STEROIDS. This model estimates that the average RC27 in a nonsteroid season is 4.62, whereas the average in a steroid season is 4.62 + 0.83 = 5.45, an increase in offensive production of 18% attributable to steroid use. Note that this is an average effect. We would expect some players to see a larger effect, some to see a smaller effect, and even some to see no effect at all. This interpretation needs to be applied to all the effects presented in this article.
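For readers who want to reproduce this kind of fit, the following is a minimal sketch using the statsmodels mixed-model interface. The DataFrame seasons and its columns (player, rc27, steroids) are assumptions on our part; the article does not specify what software was used.

```python
import statsmodels.formula.api as smf

# A random intercept for each player accounts for the repeated measures;
# the fixed-effect coefficient on `steroids` plays the role of b1.
model = smf.mixedlm("rc27 ~ steroids", data=seasons, groups=seasons["player"])
result = model.fit()
print(result.params)  # intercept (~4.62) and steroids effect (~0.83) in the article's fit
```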
Changing the Model Specifications An objection to the simple model is that the players mentioned in the “Mitchell Report” may not be like the rest of the players in the league. Since there are only steroid seasons for “Mitchell Report” players (by definition), the effect that appears to be due to steroids may actually be an artifact of the Mitchell players being different (better) on average. To account for this effect, we considered three adjustments to the model. First, we included a fixed effect for whether a player is one of the “Mitchell Report” identified players (0=no, 1=yes). This is the simplest and most logical way to account for this set of players being different. Note that the Mitchell variable and the steroids variable are not the same because those mentioned in the report have both steroid years and nonsteroid years, so some seasons that are “yes” Mitchell will be “yes” steroids, while others will be “no” steroids.

Second, we tried centering each player’s performance on his own mean. That is, for each player, we calculated his average RC27 across all his seasons, then for each individual season, we subtracted his overall average from his season RC27 to get a new outcome variable, Centered RC27. If the Mitchell players are different on average, this effect will be eliminated by removing the effect of the overall level of each player. In fact, this adjustment goes further than the first because it removes any average difference among all the players, not just the Mitchell players. As an aside, this action drives one component of the random effects portion of the model to zero, as there is effectively no variability of average level among the players. But a mixed effects model is still necessary to account for the correlated data within players.

Third, we reduced the sample of players to include only the “Mitchell Report” players. Again, any systematic difference between Mitchell players and others would be eliminated by the direct method of simply removing the other players from consideration. Obviously, this is not the preferred adjustment because we want to include all relevant data and the non-Mitchell players certainly have information to impart.

In addition to adjustments to the modeling to account for the “Mitchell Report” effect, we also considered one last adjustment to investigate the effect of a single player. Bonds had seasons that were so exceptional that his effect, alone, was influential in measuring the overall effect of steroids. We removed Bonds’ seasons from the data set to determine the steroids effect with him versus without him. Note that this exercise is not intended to judge whether the steroid use allegation against this individual player is true, but merely to assess the magnitude of the effect more clearly. Note also that of all Bonds’ seasons (he played in all of the seasons in the study), only two (2003 and 2004) are denoted as steroid seasons in this study based on the strict criteria we established for reading the “Mitchell Report.”

To get a comprehensive view of how these adjustments affect the results, we performed all possible combinations of the four adjustments (see the list of the 12 models below). Looking at the results of all 12 models allows us to assess the results in the context of a variety of assumptions. If the results are widely different, then we must examine what assumptions led to the differences and what assumptions are most appropriate and meaningful for the question at hand. If the results are largely the same, then we conclude that they are robust to various modeling assumptions.

Including all the player seasons
Including Bonds’ data
1. Centered RC27 = b0 + b1 STEROIDS + b2 MITCHELL
2. RC27 = b0 + b1 STEROIDS + b2 MITCHELL
3. Centered RC27 = b0 + b1 STEROIDS
4. RC27 = b0 + b1 STEROIDS
Excluding Bonds’ data
5. Centered RC27 = b0 + b1 STEROIDS + b2 MITCHELL
6. RC27 = b0 + b1 STEROIDS + b2 MITCHELL
7. Centered RC27 = b0 + b1 STEROIDS
8. RC27 = b0 + b1 STEROIDS
Including only seasons from players listed in the “Mitchell Report”
Including Bonds’ data
9. Centered RC27 = b0 + b1 STEROIDS
10. RC27 = b0 + b1 STEROIDS
Excluding Bonds’ data
11. Centered RC27 = b0 + b1 STEROIDS
12. RC27 = b0 + b1 STEROIDS
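A sketch of how the centering and the model grid might be set up, continuing with the same assumed DataFrame as in the earlier sketch; the mitchell and name columns are again hypothetical stand-ins for however the indicator and player identity are actually stored.

```python
import statsmodels.formula.api as smf

# Centered RC27: each season minus that player's own career mean
seasons["rc27_centered"] = (
    seasons["rc27"] - seasons.groupby("player")["rc27"].transform("mean")
)

def fit_model(df, outcome, with_mitchell):
    rhs = "steroids + mitchell" if with_mitchell else "steroids"
    return smf.mixedlm(f"{outcome} ~ {rhs}", data=df, groups=df["player"]).fit()

# Model 2 of the list: RC27 = b0 + b1 STEROIDS + b2 MITCHELL, all seasons
fit_2 = fit_model(seasons, "rc27", with_mitchell=True)
# Model 6: the same specification with Bonds' seasons excluded
fit_6 = fit_model(seasons[seasons["name"] != "Barry Bonds"], "rc27", with_mitchell=True)
```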
Results The results of the models for RC27 are summarized graphically in Figure 2. The model specifications are given below the plot. The point estimate for the percent increase in offensive production due to steroids is given, as well as a 95% confidence interval on this estimate. As you can see, the estimated effect of steroids ranges from about a 4% to about an 18% increase in offensive performance. The high estimate comes from the model where no adjustment is made for Mitchell players, and the low estimates come from models that don’t include Bonds. The strength of evidence from all the models supports the notion that the steroids effect is in excess of 5%. If 5% seems like a small increase in offensive production, the use of RC27 allows us a simple interpretation that shows how large 5% really is. A common baseball rule of thumb allows us to estimate a team’s winning percentage based on how many runs it scores (RS) and how many runs it allows (RA). The so-called Pythagorean Theorem of Baseball suggests a team’s winning percentage is well estimated by RS^2 / (RS^2 + RA^2). Because our statistic (RC27) is based on runs, we can estimate that a completely average team—not on steroids—would score about 4.6 runs per game
Figure 2. Results of the effect of steroids on runs created per 27 outs (RC27) from linear mixed effects models under 12 sets of assumptions. The assumptions are noted below the plot. The point estimate for the percent increase in performance due to steroids from each model is plotted with a 95% confidence interval. The models where Bonds is excluded are shaded to aid visual comparison.
Figure 3. Results of the effect of steroids on home runs (HR) from linear mixed effects models under 12 sets of assumptions. The assumptions are noted below the plot. The point estimate for the percent increase in performance due to steroids from each model is plotted with a 95% confidence interval. The models where Bonds is excluded are shaded to aid visual comparison.
and allow about 4.6 runs per game for a winning percentage of 0.500 and 81 wins in a 162-game season. If the steroids effect were indeed 5%, then a team composed of hitters on steroids would score about 4.6*1.05 = 4.83 runs per game and continue to allow 4.6 runs per game. Its winning percentage would be 4.83^2 / (4.83^2 + 4.6^2) = 0.524. Over the course of a 162-game season, this amounts to 85 wins, or an excess of four wins over expected. To give some perspective, in the 2007 season, four of the six divisions of baseball were decided by a margin of fewer than four games between the first- and second-place teams. Clearly, an advantage of 5% is substantial. Furthermore, based on some of the models, one could make
an argument that the advantage is more than 10%, which would translate to almost eight extra wins per year by a similar calculation. We think the model with an indicator variable for players mentioned in the “Mitchell Report” is the best model. Therefore, we peg the steroid effect at 12% based on all the players and 7% if Bonds is excluded. The other model results vary somewhat around those estimates, as seen in Figure 2.
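The win calculation is easy to verify; here is a short sketch of the arithmetic (plain Python, assuming nothing beyond the 4.6 runs-per-game baseline used in the text).

```python
def pythagorean(rs, ra, games=162):
    """Pythagorean Theorem of Baseball: winning percentage RS^2 / (RS^2 + RA^2)."""
    pct = rs**2 / (rs**2 + ra**2)
    return pct, pct * games

print(pythagorean(4.6, 4.6))          # 0.500, 81 wins: the average team
print(pythagorean(4.6 * 1.05, 4.6))   # ~0.524, ~85 wins: a 5% steroid effect
print(pythagorean(4.6 * 1.10, 4.6))   # ~0.547, ~89 wins: a 10% effect, ~8 extra wins
```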
Further Results Having shown strong evidence that steroid use confers an overall advantage to offensive performance on average, we next set out to better understand
what part of the offensive game is most affected. To do so, we looked at different measures of offensive performance. One popular notion is that HRs are inflated due to steroid use because the extra power gained from steroids causes the ball to travel farther, turning long fly ball outs into just-barely home runs. The results for HR are shown in Figure 3. First, note that the estimated steroids effect is similar to (if not larger than) that shown in RC27. However, the confidence intervals are substantially wider. The reason for the wider intervals is that HR is a considerably more variable measure than RC27. A simple way to show this is with the coefficient of variation (CV). We find that the CV for HR is 1.05, whereas the CV for RC27 is 0.39.
Figure 4. Results of the effect of steroids on Isolated Power (IsoP) from linear mixed effects models under 12 sets of assumptions. The assumptions are noted below the plot. The point estimate for the percent increase in performance due to steroids from each model is plotted with a 95% confidence interval. The models where Bonds is excluded are shaded to aid visual comparison.
This means the standard deviation of HR is 5% larger than its mean, whereas the standard deviation of RC27 is 60% smaller than its mean. In conclusion, it appears HR shows the same sort of steroids effect as RC27, but this conclusion must be tempered by the fact that it is made in the context of a much more variable environment. Because HRs are a crude measure of power, we next considered the statistic Isolated Power (IsoP), which attempts to more accurately measure the power aspect of offensive performance. IsoP is calculated by subtracting a player’s batting average from his slugging percentage. The batting average measures the rate a player gets a hit out of his at-bats. For example, if you get 100 hits in 400 at-bats,
your batting average would be reported at 0.250 (for the 2007 season, the mean batting average was 0.270). The slugging percentage measures the weighted batting average with the weights equal to the number of bases per hit. For example, if your 100 hits in 400 atbats were all singles (one base achieved), then the slugging percentage would be the same as batting average. If the 100 hits were 50 singles, 30 doubles (two bases achieved), and 20 home runs (four bases achieved), then the slugging percentage would be (50*1 + 30*2 + 20*4)/400 = 0.475 (the mean in 2007 was 0.423). Because it takes power to hit the ball hard enough or far enough to capture extra bases, slugging percentage is a good measure of power. But, since IsoP
subtracts the batting average (which can be achieved without power), it is an even better measure of power — it isolates the effect of power from the effect of just getting hits (hence the name). The results for IsoP are shown in Figure 4. The widths of the confidence intervals are similar to those seen in RC27. This reflects the fact that IsoP has a similar level of variability (CV=0.45). The effect of steroids on IsoP is substantially more than 5%. Given the results of all the models, the effect is probably comfortably around 10%. This suggests that while there appears to be a significant effect on overall offensive performance due to steroids (likely in excess of 5%), there is an even larger and more clear
Figure 5. Results of the effect of steroids on stolen bases (SB) from linear mixed effects models under 12 sets of assumptions. The assumptions are noted below the plot. The point estimate for the percent increase in performance due to steroids from each model is plotted with a 95% confidence interval. The models where Bonds is excluded are shaded to aid visual comparison.
Table 1 — Point Estimates of Percent Increase in Offensive Performance Due to Steroids

Model # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
RC27 | 9.1 | 12.6 | 7.2 | 18.0 | 4.9 | 7.0 | 3.9 | 11.9 | 7.7 | 11.3 | 4.2 | 6.2
HR | 11.9 | 12.1 | 7.8 | 26.1 | 9.8 | 9.3 | 5.8 | 22.3 | 7.3 | 6.7 | 9.3 | 8.3
IsoP | 6.2 | 15.9 | 7.3 | 20.8 | 8.0 | 14.3 | 6.1 | 18.8 | 7.6 | 13.0 | 6.7 | 11.8
effect on the power aspect of offensive performance (likely in excess of 10%). Table 1 shows the estimated increase in power-specific performance is generally larger than the estimated increase in overall offensive performance. There are other aspects of a player’s ability, such as speed, that may be affected by steroid use. To this end, we next considered the number of stolen bases (SB) by the players. Since a player has only a limited amount of time to dash from first base to second and arrive safely for a SB, a SB is a good measure of player speed. The results for SB are shown in Figure 5. The confidence intervals are again wide due to the tremendous amount of variability in the measure (CV=1.57). Strikingly, there is a clear decrease (negative increase) in SB attributable to steroids. This effect appears to be in excess of 20%.
Figure 6. Results of the effect of human growth hormone (HGH) on runs created per 27 outs (RC27) from linear mixed effects models under 12 sets of assumptions. The assumptions are noted below the plot. The point estimate for the percent increase in performance due to HGH from each model is plotted with a 95% confidence interval. The models where Bonds is excluded are shaded to aid visual comparison.
The world of sports suggests speed is one of the primary athletic attributes that can be enhanced by steroids (consider Ben Johnson in the 1988 Olympic 100m dash). The most likely explanation for steroids in baseball showing a marked decrease in speed (as measured by SB) is that the players (or their teams) value power hitting more. Therefore, steroid abusers likely tailor their training to increase muscle mass and power in such a way as to improve their hitting, while neglecting to train to increase their speed. A 20% drop in SB seems large (we’ve already shown that a 5% increase in runs is substantial), but SBs are not an important part of the modern game. The league-wide average is only 5.66 SB per player season, and a 20% reduction would drop this to 4.53.
Last, we considered the effect of HGH, rather than steroids. The results for RC27 for HGH are shown in Figure 6. As you can see, the magnitude of the effect is quite close to zero, with one model even showing a negative effect. There is no evidence that HGH is associated with an increase in offensive performance. The results for the other offensive measures for HGH are not shown — they are similar to Figure 6 (with different widths of confidence intervals as expected from different CVs). A recent review article in the Annals of Internal Medicine concluded that, “Claims that growth hormone enhances physical performance are not supported by the scientific literature.”
Limitations Our study does have some limitations. First, the “Mitchell Report” was not intended as a data source for statistical analysis. We applied a strict reading of the report to decide whether a season was to be labeled a steroid season. This can lead to two types of errors. A “false positive” labeling would come about when the “Mitchell Report” lists a player season as an alleged steroid season when no abuse occurred. As the report is merely allegation, this type of error must be acknowledged. A “false negative” labeling would come about when the “Mitchell Report” fails to list a player season as an alleged steroid season when the player was abusing
steroids that year. Given anecdotal evidence that steroid use was prevalent during the Steroids Era, it seems likely that this sort of error is commonplace. What effect would these mis-measurements have on the results? Of course it is impossible to know, but given that a significant steroids effect was observed, it is likely that more accurate data would strengthen the conclusion. We would expect performance in false negative seasons to be inflated due to steroid use. Currently, those seasons are allocated to the nonsteroid column, and switching them to the proper steroid column would strengthen the results. Similarly, we would expect performance in false positive seasons to be non-inflated due to nonsteroid use, and reallocating them to the proper category would again strengthen the results. A second limitation is the possibility that steroid users took the drugs in response to an injury. Thus, there may be concern that the effect being attributed to steroid abuse may actually be more accurately attributed to injury recovery or regression to the mean. There is no way, in a study such as this, to decide which is the true cause. However, the anecdotal evidence is that HGH is the preferred performance-enhancing drug for injury recovery, and our work suggests that HGH has no effect on performance, leaving steroid abuse as the more likely cause of the observed effect. A third limitation is related to steroid dosing. While the “Mitchell Report” is comprehensive in many respects, it is simply not possible to know the quantity of steroids that actually made its way into the players’ bodies. Therefore, the estimates obtained in this study are most properly interpreted as average effects. It is undeniable that some players would have seen larger and some smaller (or no) advantages compared to the averages seen here. And it is likely that some of the variation is attributable to dosing.
Conclusion Given the body of evidence from the “Mitchell Report,” it appears that use of performance enhancing drugs in the form of steroids (but not HGH) does confer an advantage to offensive performance. That advantage is estimated to be a 12% increase in run production, given the players in the league during the Steroids
Era. Even discounting one remarkable player in that time frame, the advantage is still estimated to be 7%. Further investigation shows that power as measured by home runs or isolated power is positively affected by steroids, likely beyond a 10% increase, but speed as measured by stolen bases is negatively affected, likely beyond a 20% decrease. While offensive performance appears to be enhanced by the use of steroids, there can be no mistaking the negative impact the scandal has had on the reputation of the once-gilded sport of baseball. Indeed, these players—these mighty Caseys—have struck out. Editor’s Note: One of the top players in baseball (Alex Rodriguez) admitted to steroid use this spring. In our study, all his seasons are labeled as “nonsteroids” because he does not appear in the “Mitchell Report.”
Further Reading
Birnbaum, P. “Studying the Effects of Aging in Major League Baseball.” Joint Statistical Meetings Presentation, 2008.
Bradlow, E., Jensen, S., Wolfers, J., Wyner, A. “Report Backing Clemens Chooses Its Facts Carefully.” The New York Times (Keeping Score Section), February 10, 2008.
Cole, J.R., Stigler, S.M. “More Juice, Less Punch.” An editorial published in The New York Times, December 22, 2007.
Hendricks, R.A., Mann, S.L., Larson-Hendricks, B.R. “An Analysis of the Career of Roger Clemens.” Available at www.rogerclemensreport.com, 2008.
Lahman, S. The Lahman Database (www.baseball1.com). 2008.
Liu, H., Bravata, D.M., Olkin, I. “Systematic Review: The Effects of Growth Hormone on Athletic Performance.” Annals of Internal Medicine, 148:10, 2008.
Mitchell, George J. “Report to the Commissioner of Baseball of an Independent Investigation Into the Illegal Use of Steroids and Other Performance Enhancing Substances By Players in Major League Baseball.” DLA Piper US LLP, December 13, 2007.
Schmotzer, B., Switchenko, J., Kilgo, P. “Did Steroid Use Enhance the Performance of the Mitchell 89? The Effect of Performance Enhancing Drugs on Offensive Performance from 1995–2007.” Journal of Quantitative Analysis in Sports, 4(3).
What Are the Odds? Another look at DiMaggio’s streak Don M. Chance
Joe DiMaggio lines a single to left field in the seventh inning of the second game of a doubleheader at Washington on June 29, 1941, to set a record for hitting safely in 42 consecutive games. In the first game, DiMaggio tied George Sisler’s record of 41 games, set in 1922. The catcher is Jake Early of the Washington Senators. Yankees won both games, 9–4, 7–5. AP Photo
One of the most amazing athletic feats is the celebrated 56-game hitting streak of Joe DiMaggio in the 1941 baseball season. Popular opinion is that the streak was an extremely unlikely event, and discussion of the streak has been widespread. Calculating the probability that DiMaggio would hit in 56 consecutive games is a common exercise for a probability class, but there are many other related questions that arise. The probability of a specific person winning a lottery is clearly
not very high, but the chance that someone will win the lottery could be quite high. When we allow the possibility that some person, that is, any person, will win the lottery, the probability is much higher. If we conduct the lottery many times, the likelihood of there being a winner increases even further. The analogy to a hitting streak in baseball is that the chance of someone having such a long hitting streak is considerably higher than the chance that this person is Joe DiMaggio or any particular player of interest. Given enough players and
enough opportunities, why are streaks of this length so rare? How unusual is it, given the large number of players and of games in the history of the sport? According to www.baseball-reference.com, by the end of the 2007 season, there were 382,852 games played. Considering that almost every game consists of 20–25 participants, each with an opportunity to start or continue a hitting streak, there were quite a few opportunities for 56-game hitting streaks. As long as someone gets a hit in at least 56 games in a row, a DiMaggio-like streak will occur. When considering the probability of such a streak, we should not care who did it. That the streak was achieved by one of the most famous and popular baseball players in history is as irrelevant as who won the lottery, as long as someone did.

Definition of a Streak and Other Long Streaks
There are some subtleties in how a consecutive-game hitting streak is defined. On the Major League Baseball web site, the official rules state:
Rule 10.23 (b) Consecutive-game Hitting Streak: A consecutive-game hitting streak shall not be stopped if all of a batter’s plate appearances (one or more) in a game result in a base on balls, hits batsman, defensive interference, or obstruction, or a sacrifice bunt. The streak will end if the player has a sacrifice fly and no hit. A player’s individual consecutive-game hitting streak shall be determined by the consecutive games in which such player appears and is not determined by his club’s games.
Thus, a streak is established only if the player has at least one plate appearance that does not involve one of the abovementioned outcomes. At the end of the 2007 season, there were 43 streaks of at least 30 games, achieved by 41 players, with two players who did it twice—Ty Cobb and George Sisler. There have been 20 streaks in the American League and 23 in the National League. Two streaks have spanned seasons, the most recent being a 38-game streak in 2005–2006 by Jimmy Rollins of the Philadelphia Phillies. Interestingly, not all streak hitters were outstanding hitters. Streak hitting may well reflect the ability to make effective contact at pitches outside the strike zone, the tendency of which can lower one’s average over the long run. For example, in 1987, Benito Santiago hit in 34 consecutive games, but batted only 0.300 that year and only 0.263 in his career. Ken Landreaux batted only 0.281 in the year of his 31-game streak and only 0.268 in his career. The list contains another DiMaggio, Joe’s brother Dom, who compiled a 34-game hitting streak for the Boston Red Sox in 1949 and had a lifetime batting average in 11 seasons of 0.298.

Ty Cobb, outfielder for the Detroit Tigers, is shown in action during practice in March of 1921. AP Photo

A (Very) Simplified Estimate of the Probability of a Streak
Consider a player who has four official at-bats per game and gets one hit. For now, I will treat batting average—the ratio of hits to official at-bats—as indicative of the player’s probability of getting a hit. Intuition suggests it would be difficult to hit in a large number of consecutive games if the player will make an out twice as often as he makes a hit. Define b as batting average and A as the number of official at-bats per game. An initial estimate of the probability of getting a hit in a game, p*, is one minus the probability of not getting a hit in a game, or p* = 1 – (1 – b)^A, assuming b is constant and outcomes of at-bats are independent of one another. With a batting average of 0.333, we get p* = 1 – (1 – 0.333)^4 = 0.802. Thus, a 0.333 hitter with four official at-bats per game has a more than 80% chance of getting a hit in a game. For even lower averages, these probabilities may seem surprisingly large. A mediocre 0.250 hitter with four official at-bats has a more than two-thirds chance of getting a hit in a game (1 – (1 – 0.250)^4 = 1 – 81/256 = 0.684). In fact, even if a player has a pitcher-like batting average of 0.160, he is more
likely than not to get a hit in a game if he has four official at-bats (1 – (1 – 0.160)^4 = 0.502). The probability of getting a hit in s consecutive games, p(s)*, is found by multiplying this value (p*) by itself s times, that is, raising it to the s power, p(s)* = (p*)^s. Thus, for a 0.333 hitter, the probability of getting a hit in 56 consecutive games would appear to be p(56)* = (0.703)^56 = 0.0000000027, or about 1-in-400 million. In DiMaggio’s case, his 0.409 batting average and 3.98 at-bats per game during the streak lead to a probability of 0.00063, or about 1-in-1,585. The asterisk in the notation above is used because there are some problems with these measures. For one, batting average is not the probability of a hit. Although baseball rules allow that a streak is not stopped if a player does not receive an official at-bat, a streak is stopped if a player has received an official at-bat during a game, has so far failed to get a hit, and then walks in his last at-bat. In fact, walks are a strong determinant of the likelihood that a player will have a streak. The old expression, “A walk is as good as a hit,” is not true for a player trying to extend a streak. So what we need is not batting average, but the probability that a player will get a hit when he has an opportunity to get a hit. Denote the number of hits as h and the number of hitting opportunities as H. Then, the probability that he will get a hit when he has an opportunity is estimated as h/H. Now, we need an estimate of H. Baseball statisticians use a measure called plate appearances, which are official at-bats plus unofficial at-bats, the latter including walks, sacrifice hits, sacrifice flies, and hits-by-pitch. This measure comes closer to hitting opportunities than at-bats, but it is not exact. Suppose in a game, a player has three official at-bats with one hit, plus one walk, and one sacrifice bunt. It can be argued that his probability of a hit is 1-in-4, because on the sacrifice bunt, he was not attempting to get a hit. He was, however, attempting to get a hit when he walked. If instead of the sacrifice bunt, he had a sacrifice fly, however, we should count the sacrifice fly as an attempt to get a hit because the player was swinging the bat. If the walk was intentional, we should not count it because the player did not have a chance to get a hit. There are some hitting opportunities in which a player receives a few ordinary pitches and then is intentionally walked, but these are not common. There also are plate appearances in which a player walks but does not see any hittable pitches, thereby resulting in a nonintentional walk, but a lost opportunity to extend the streak. These types of walks are not tallied in official baseball statistics, however, and are unlikely to affect the overall figures by much. If the player is hit by a pitch, it is not definitive as to whether he had a chance to get a hit, but we will assume that such an outcome should not be counted against the player. Fortunately, hit batsmen are relatively few compared to total plate appearances. For example, Joe DiMaggio had 7,671 plate appearances in his career and was hit only 46 times. Recall that the initial objective is to estimate the probability in a game that when the player was attempting to get a hit, he succeeded. Thus, we should calculate plate appearances minus hits-by-pitch, sacrifice bunts, and intentional bases on balls.
Unfortunately, hit batsmen were not recorded until 1887, records on sacrifice bunts were not kept until 1895, and intentional bases on balls were not tallied until 1955, so the official baseball statistics do not reflect these factors for
some players over all or a portion of their careers. (For hit batsmen, this will have no effect because it is not counted as a plate appearance.) This overall figure for adjusted plate appearances is H, hitting opportunities. Thus, the probability of a hit in a game is 1 minus the probability of no hit in a single opportunity (1 – h/H) raised to the power of the number of hitting opportunities per game. We define the latter as H/g, where g is the number of games. Thus, p is 1 – (1 – h/H)^(H/g), assuming again that attempts have a constant probability and are independent. Note also that H/g might not be an integer. For DiMaggio, during the streak, there were 246 hitting opportunities with 91 hits for a probability of a hit of 0.370 per hitting opportunity. The number of hitting opportunities per game is, therefore, 246/56 = 4.39. An estimate of the probability of DiMaggio getting a hit in a game during the streak is therefore p = 1 – (1 – 0.370)^4.39 = 0.868. His probability of getting a hit in every game for a single run of 56 straight games is p(56) = (0.868)^56 = 0.00036, or about 1-in-2,772. Of course, we have to acknowledge that these measures assume independence from one hitting opportunity to another and do not reflect how pressure can affect the player or even his teammates or opponents. But that is a problem with any analysis of a seemingly rare human event.
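These game-level and streak-level probabilities are straightforward to recompute; the sketch below redoes both versions for DiMaggio using the figures quoted above (small discrepancies from the published numbers are rounding).

```python
def p_hit_in_game(p_per_try, tries_per_game):
    """Chance of at least one hit in a game, assuming independent tries
    with a constant success probability."""
    return 1 - (1 - p_per_try) ** tries_per_game

# Batting-average version: a .409 hitter with 3.98 at-bats per game (the streak)
p_naive = p_hit_in_game(0.409, 3.98)
print(p_naive, p_naive ** 56)      # ~0.877 per game, ~0.00063 for 56 straight games

# Hitting-opportunity version: 91 hits in 246 opportunities over 56 games
p_opp = p_hit_in_game(91 / 246, 246 / 56)
print(p_opp, p_opp ** 56)          # ~0.869 per game; the article reports 0.00036
```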
Probability of a Streak in a Career Another problem with this approach is that it gives the player only a single run of 56 games in which to obtain a hit in each game. Over the course of a season in which DiMaggio played 139 games, there are 139 – 56 + 1 = 84 possible (overlapping) 56-game periods. Considering that a streak can span seasons, we also should allow the possibility that the streak might have started in 1940 or ended in 1942. Carrying that argument further, however, we should consider the possibility of his having such a streak during any 56-game period in his career of 1,736 games. But, if we consider his entire career as a possibility, we can hardly use his performance during the 56-game period or in the 1941 season. Instead, his career performance would be more appropriate. Adapting a formula from the classic An Introduction to Probability Theory and Its Applications, by William Feller, the probability of a streak of length s during n games with a probability of hit p per game is p(n,s) = 1 – [(1 – px) / ((s + 1 – sx)(1 – p))](1/x^(n+1)), where x is approximately 1 + (1 – p)p^s + (s + 1)((1 – p)p^s)^2. The measure x is technically found by iterative solution as x_(n+1) = 1 + (1 – p)p^s x_n^(s+1) with x_0 = 1. The above specification is a quadratic approximation found at MathForum.org. In virtually all cases in this study, x is essentially 1.0 to several decimal places. Of course, if n were small, then exact calculations could be made. In this application, the approximation is very useful. In DiMaggio’s career, using hitting opportunities to determine the probability of hitting in a game, p is 0.778, and using the above formula, the probability of a streak of 56 games in his career is 0.000295, or about 1-in-3,394. G. Warrack, in a 1995 CHANCE article titled “The Great Streak,” undertook a similar analysis and reported a value of p = 0.777 and a streak probability of 0.000274, or about 1-in-3,650. M. Freiman, in a Baseball Research Journal article, estimates the probability for DiMaggio over his lifetime at 0.00121, or 1-in-826.
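The career-level calculation can be sketched the same way; the function below follows the Feller-based formula quoted above, with the quadratic approximation for x, and with DiMaggio's career figures from the text (p = 0.778 over 1,736 games) it returns roughly the 0.0003, or about 1-in-3,400, reported.

```python
def prob_streak(p, n, s):
    """Approximate probability of at least one run of s consecutive games with
    a hit, out of n games, when the chance of a hit in any one game is p
    (assumed constant and independent across games)."""
    q = 1 - p
    # quadratic approximation to the root x of x = 1 + q * p**s * x**(s + 1)
    x = 1 + q * p**s + (s + 1) * (q * p**s) ** 2
    no_streak = (1 - p * x) / ((s + 1 - s * x) * q) * x ** -(n + 1)
    return 1 - no_streak

print(prob_streak(0.778, 1736, 56))   # ~0.0003 for DiMaggio's career
```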
Table 1 — Probabilities of a 56-Game Hitting Streak by the Top 50 Hitters of All Time

Player | Batting Average | Hitting Opportunities | Probability of a Hit in a Game (p) | Probability of the Streak | Likelihood (1-in) | Streak Rank
Ty Cobb | 0.366 | 12,777 | 0.812 | 0.004890 | 204 | 1
Rogers Hornsby | 0.358 | 9,259 | 0.790 | 0.000842 | 1,187 | 16
Joe Jackson | 0.356 | 5,559 | 0.798 | 0.000869 | 1,151 | 15
Lefty O'Doul | 0.349 | 3,620 | 0.756 | 0.000036 | 27,916 | 66
Ed Delahanty | 0.346 | 8,340 | 0.816 | 0.003806 | 263 | 2
Tris Speaker | 0.345 | 11,679 | 0.777 | 0.000434 | 2,302 | 21
Ted Williams | 0.344 | 9,700 | 0.742 | 0.000031 | 32,440 | 70
Billy Hamilton | 0.344 | 7,544 | 0.798 | 0.000988 | 1,013 | 14
Dan Brouthers | 0.342 | 7,656 | 0.804 | 0.001610 | 621 | 11
Babe Ruth | 0.342 | 10,503 | 0.738 | 0.000027 | 37,015 | 72
Dave Orr | 0.342 | 3,411 | 0.822 | 0.002246 | 445 | 7
Harry Heilmann | 0.342 | 8,683 | 0.772 | 0.000244 | 4,099 | 34
Pete Browning | 0.341 | 5,315 | 0.811 | 0.001700 | 588 | 10
Willie Keeler | 0.341 | 9,244 | 0.810 | 0.002967 | 337 | 3
Bill Terry | 0.341 | 6,974 | 0.783 | 0.000419 | 2,385 | 23
George Sisler | 0.340 | 8,787 | 0.808 | 0.002477 | 404 | 5
Lou Gehrig | 0.340 | 9,554 | 0.772 | 0.000251 | 3,989 | 32
Jake Stenzel | 0.339 | 3,381 | 0.797 | 0.000428 | 2,338 | 22
Jesse Burkett | 0.338 | 9,525 | 0.806 | 0.002202 | 454 | 8
Tony Gwynn | 0.338 | 9,984 | 0.787 | 0.000752 | 1,329 | 17
Nap Lajoie | 0.338 | 10,239 | 0.792 | 0.001101 | 909 | 13
Riggs Stephenson | 0.336 | 5,043 | 0.747 | 0.000026 | 38,332 | 74
Al Simmons | 0.334 | 9,404 | 0.795 | 0.001142 | 876 | 12
John McGraw | 0.334 | 4,894 | 0.750 | 0.000026 | 38,071 | 73
Ichiro Suzuki* | 0.333 | 5,046 | 0.819 | 0.002738 | 365 | 4
Paul Waner | 0.333 | 10,588 | 0.770 | 0.000246 | 4,069 | 33
Eddie Collins | 0.333 | 11,525 | 0.749 | 0.000066 | 15,116 | 59
Mike Donlin | 0.333 | 4,186 | 0.768 | 0.000085 | 11,708 | 53
Cap Anson | 0.333 | 11,292 | 0.801 | 0.001950 | 513 | 9
Todd Helton* | 0.332 | 6,590 | 0.754 | 0.000050 | 20,158 | 63
Albert Pujols* | 0.332 | 4,620 | 0.767 | 0.000084 | 11,861 | 54
Stan Musial | 0.331 | 12,550 | 0.757 | 0.000125 | 7,978 | 48
Sam Thompson | 0.331 | 6,497 | 0.813 | 0.002360 | 424 | 6
Bill Lange | 0.330 | 3,570 | 0.786 | 0.000227 | 4,407 | 36
Heinie Manush | 0.330 | 8,230 | 0.777 | 0.000322 | 3,110 | 26
Wade Boggs | 0.328 | 10,531 | 0.766 | 0.000187 | 5,357 | 40
Rod Carew | 0.328 | 10,278 | 0.769 | 0.000235 | 4,249 | 35
Honus Wagner | 0.327 | 11,518 | 0.766 | 0.000205 | 4,872 | 39
Tip O'Neill | 0.326 | 4,720 | 0.789 | 0.000369 | 2,710 | 24
Bob Fothergill | 0.325 | 3,491 | 0.683 | 0.000000 | 5,807,999 | 100
Jimmie Foxx | 0.325 | 9,599 | 0.737 | 0.000023 | 44,000 | 78
Earle Combs | 0.325 | 6,433 | 0.780 | 0.000282 | 3,545 | 29
Joe Dimaggio | 0.325 | 7,657 | 0.778 | 0.000295 | 3,394 | 28
Vladimir Guerrero* | 0.325 | 6,596 | 0.767 | 0.000131 | 7,652 | 47
Babe Herman | 0.324 | 6,134 | 0.751 | 0.000040 | 25,087 | 64
Hugh Duffy | 0.324 | 7,733 | 0.789 | 0.000621 | 1,611 | 19
Joe Medwick | 0.324 | 8,098 | 0.774 | 0.000252 | 3,974 | 31
Edd Roush | 0.323 | 7,900 | 0.762 | 0.000114 | 8,748 | 50
Sam Rice | 0.322 | 10,033 | 0.771 | 0.000259 | 3,866 | 30
Ross Youngs | 0.322 | 5,214 | 0.765 | 0.000086 | 11,631 | 52

The top 100 players are analyzed, but only the top 50 are shown here. The highest values among the top 100 hitters are in bold. A more detailed table including the top 100 players is available at www.amstat.org/publications/chance/supplemental.cfm. *Indicates the player was active as of the end of the 2007 season
Figure 1. Probabilities of streaks based on 30 to 70 games by the top 100 hitters of all time
Several other researchers estimate probabilities for a single season and for various other players over single seasons, their best seasons, and their lifetimes (see Further Reading).
Probability of a Streak Among the Top 100 Hitters It is not feasible to analyze data for every player in the history of the game. The most likely candidates for streaks would seem to be the best hitters. I focus initially on the top 100 hitters of all time through the 2007 season, based on at least 3,000 plate appearances as listed at www.baseball-reference.com/leaders/BA_career.shtml. This group consists of nine current and 91 retired players. The batting averages range from Cobb’s 0.366 to Bug Holliday’s 0.311. The results are presented in Table 1. Only the top 50 players by batting average are shown, but the complete table is available at www.amstat.org/publications/chance/supplemental.cfm. Note a few interesting results. The player most likely to get a hit in a game is Dave Orr, who played in the 1880s and is 11th in all-time batting. Of modern players, the most likely to get a hit in a game is Ichiro Suzuki, although his lifetime batting average of 0.333 puts him only 25th in all-time batting. Ed Delahanty is third in likelihood of getting a hit in a game, Sam Thompson is fourth, and Cobb—the all-time best hitter—is fifth. But Cobb is the player most likely to have had a 56-game hitting streak at least once in his career at a probability of 0.00489, or 1-in-204. The second-best hitter of all time, Rogers Hornsby, ranks only 16th in likelihood of having the streak. Thompson, only the 33rd-best hitter, ranks sixth in streak likelihood. Ted Williams, the seventh-best hitter, ranks only 70th in streak likelihood, and Babe Ruth—the 10th-best hitter—ranks only 72nd. And what about Joe DiMaggio, the 43rd-best hitter? He ranks 28th in streak likelihood at a probability of 0.000295, or 1-in-3,394. There is something to be said for DiMaggio ranking much higher in streak likelihood than in hitting. Clearly, there is more to achieving a streak than merely batting average. In fact, games played is a major determinant, as it is a significant driver of hitting opportunities. Consider Jake Stenzel
and the aforementioned Orr. Stenzel ranks 22nd in streak probability, but played only 766 games. Orr ranks seventh with only 791 games. If we extrapolate to a career three times as long, which puts their career longevity much closer to the other top hitters, Stenzel moves up to 12th and Orr surpasses Cobb as the most likely player to have achieved the streak. Of course, we do not know whether they would have maintained the same level of performance, but we can clearly see playing a lot of games is a major factor. It is interesting to consider why some of the best hitters rank so low and some of the lowest-ranked hitters, although still outstanding hitters, rank so high in streak probability. One explanation is the number of walks. Williams averaged one walk every 4.84 plate appearances, and Babe Ruth averaged one every 5.15 plate appearances. Thompson, in contrast, averaged one walk every 14.5 plate appearances. Orr walked only once every 34.8 plate appearances, which was partly a result of a rule change I will discuss later. Nonetheless, even after accounting for this effect, Orr walked with about the same infrequency. He ranked 11th in batting, but seventh in streak likelihood. Players who received few walks are likely to be either players who preceded powerful hitters in the batting order or players capable of getting hits on bad pitches.

Now let us estimate the overall probability that there will be at least one 56-game hitting streak over the careers of these 100 players. Let p_i(n_i, s) be the probability for player i, where i = 1, 2, …, 100; n_i be the number of games in the player’s career; and s be the streak of interest, which is 56 in this case, but we will change that figure later. To estimate the overall probability, we cannot simply add the probabilities for the individual players because achievement of the streak is not mutually exclusive. More than one player can have the streak. Thus, we must find the probability that no player has the streak and subtract that figure from 1. The probability that player i does not have a streak is 1 – p_i(n_i, s). The overall probability of at least one streak is 1 minus the probability of no streaks among all the 100 players. The probability of no streaks for all 100 players is (1 – p_1(n_1, s)) × (1 – p_2(n_2, s)) × … × (1 – p_100(n_100, s)), or in other words, the product of the probabilities of no streak for each player. Then, we subtract this number from 1 to get the probability of at least one streak. The answer is 0.0442, or about 1-in-23. This means that if the entire history of baseball could be played 23 times, we would expect at least one 56-game hitting streak from these 100 players.

It is simple to examine the likelihood of other hitting streaks. Figure 1 shows the probability of each hitting streak from 30 to 70 games. Note that the probability of at least one streak up to about 40 games is quite high and then drops off rapidly. In the history of baseball, there have been 41 streaks of 30 to 39 games. The probability of at least one 40-game streak is about 0.821 (1-in-1.2), while the probability of a 45-game streak is only about half that, at 0.419 (1-in-2.4). There have been only five streaks of 40 to 45 games. The probability of a 50-game hitting streak is 0.16 (1-in-6.3), and the probability of a 60-game hitting streak is 0.018 (1-in-55.6). It is also interesting to note that the probability of at least one 79-game hitting streak among these 100 players is approximately the same as the probability that DiMaggio would have at least one 56-game hitting streak in his career.

It is also easy to bump up the lifetime batting averages of these players to see how much a given collective increase in hitting ability would increase the chance of a streak. Adding 10 points (0.01) to each player puts the overall probability at 0.089 (1-in-11.2), and adding 20 points puts the overall probability at 0.168, or about 1-in-6. These estimations are somewhat unrealistic in that they assume that all players have significantly higher lifetime averages, but they do give an idea of how much better these players would have had to play to raise the probability of a streak to a given level. Of course, we also know that if a player is hitting much better than his lifetime average, the probability of a streak is much higher, but it is not possible to definitely say how much higher the probability is because we cannot hypothesize how long the player’s average would remain high.

Rule Changes and Their Effect on Streaks
Rule changes over the years have altered the interpretation of a streak. For example, the number of balls required for a “base-on-balls” has changed. In addition, at one time, foul balls with less than two strikes did not count as strikes. In fact, the second-longest streak, 45 games by Willie Keeler, was achieved during that era. Another player from that era with a high probability of a long streak is Cal McVey. McVey was a versatile player, who played catcher but also all infield positions as well as outfielder and pitcher. During his brief career in the 19th century, he was indeed a good hitter. He played for the Boston Red Stockings of the National Association for four years, a league that only lasted those four years and later became the modern-day National League. McVey then played two years for the Chicago Cubs and two years for the Cincinnati Reds. Seasons were shorter at that time, and McVey played only 530 games in his nine-year career, batting 0.346. But that average does not officially place him in the top 100 hitters because he did not have at least 3,000 plate appearances. In fact, he had the second-fewest games played of all of the players we examined. McVey achieved his streak of 30 games during the period of June 1 to August 8, 1876. To put that summer in perspective, it was during McVey’s streak that General George Custer and his 7th Cavalry were defeated at the Battle of Little Big Horn, the first transcontinental train ride was completed, the United States celebrated its centennial, and Colorado became a state. During his career, McVey walked only 30 times in 2,543 plate appearances. Walks, however, were not common then because seven balls were required for a walk. In 1887, the number of balls for a walk was reduced to five, and in 1888, it was reduced to the current rule of four. Thus, it is not clear that McVey’s record should count. We can alter his bases-on-balls to a reasonable number and determine if he were really a threat to establish a long streak.

Cal McVey—Catcher (Boston Red Stockings) 1874. Courtesy of New York Public Library Digital Gallery
Table 2 — Probabilities of a 56-Game Hitting Streak by the Players Who Achieved a Streak of at Least 30 Games and Are Not Among the Top 100 Hitters of All Time*

Player | Batting Average | Hitting Opportunities | Probability of a Hit in a Game (p) | Probability of the Streak | Likelihood (1-in)
Pete Rose | 0.303 | 15,638 | 0.752 | 0.000102 | 9,764
Bill Dahlen | 0.272 | 10,235 | 0.683 | 0.000000 | 2,396,802
Paul Molitor | 0.306 | 11,985 | 0.765 | 0.000190 | 5,272
Jimmy Rollins | 0.277 | 5,107 | 0.742 | 0.000015 | 65,539
Tommy Holmes | 0.302 | 5,496 | 0.737 | 0.000012 | 81,104
Luis Castillo | 0.294 | 6,140 | 0.736 | 0.000012 | 82,721
Chase Utley | 0.300 | 2,409 | 0.725 | 0.000002 | 455,352
George McQuinn | 0.276 | 6,467 | 0.691 | 0.000000 | 2,048,046
Dom Dimaggio | 0.298 | 6,421 | 0.751 | 0.000038 | 26,565
Benito Santiago | 0.263 | 7,437 | 0.654 | 0.000000 | 31,328,349
George Davis | 0.295 | 9,976 | 0.729 | 0.000013 | 76,272
Hal Chase | 0.291 | 7,723 | 0.733 | 0.000013 | 74,227
Willie Davis | 0.279 | 9,664 | 0.706 | 0.000002 | 412,069
Rico Carty | 0.299 | 6,250 | 0.694 | 0.000001 | 1,621,939
Ken Landreaux | 0.268 | 4,431 | 0.632 | 0.000000 | 328,544,907
Cal McVey | 0.346 | 2,543 | 0.866 | 0.019723 | 51
Elmer Smith | 0.310 | 5,362 | 0.747 | 0.000024 | 41,470
Ron LeFlore | 0.288 | 4,843 | 0.742 | 0.000015 | 65,162
George Brett | 0.305 | 11,369 | 0.745 | 0.000045 | 22,039
Jerome Walton | 0.269 | 1,742 | 0.555 | 0.000000 | 830,003,617,282
Sandy Alomar, Jr. | 0.273 | 4,802 | 0.646 | 0.000000 | 92,246,664
Eric Davis | 0.269 | 6,085 | 0.633 | 0.000000 | 228,299,708
Luis Gonzalez | 0.284 | 9,985 | 0.691 | 0.000001 | 1,353,552
Willy Taveras | 0.293 | 1,605 | 0.715 | 0.000001 | 1,472,527
Moises Alou | 0.303 | 7,759 | 0.723 | 0.000007 | 151,491

*A more detailed table can be found at www.amstat.org/publications/chance/supplemental.cfm.
Probability of a Streak Including Other Players Who Had Streaks The 100 best hitters would seem to be the most likely group from which to examine the overall probability of a 56-game streak. But of the 43 streaks of at least 30 games achieved by 41 players, only 16 are in the top 100 all-time hitters. There may be isolated cases of players not on this list who might have had significant probabilities of a streak, but those players would have had a remarkable number of plate appearances and hits without a high enough batting average to make the top 100. However, to extend the analysis to other possible candidates, we should consider those players who actually did achieve significant hitting streaks but were not in the top 100 batters. For example, there are 25 other players who obtained hitting streaks of at least 30 games. Let us take a look at this group. This set of 25 players contains only seven who batted at least 0.300 over their careers—including Pete Rose, whose 1978 streak of 44 games is third—and four who batted below 0.270. Table 2 shows the estimates for these 25 players. One player, Cal McVey, stands out. Of the remaining players, Paul Molitor had the highest probability of a streak at 0.00019, or 1-in-5,272. Rose was second at 0.000102, or 1-in-9,764. The average probability of the top 100 players is 0.00045, so both Rose and Molitor would rank below average in streak likelihood. None of the other players is close to Rose and Molitor in terms of likelihood of the streak, except McVey, whose likelihood is 1-in-51, the highest by far of all 125 players. Among the top 100 hitters, the average number of plate appearances per walk is 10.46. McVey averaged 84.77 plate appearances per walk. Suppose we change the number of McVey’s walks to the number equivalent to one every 10.46 plate appearances. In that case, McVey would have 266 walks. Using that figure, McVey’s probability would go down to 0.0144, or 1-in-69, which is still much higher than that of Cobb, the player who had the highest streak likelihood among the top 100 hitters. The player with the lowest ratio of plate appearances to walks is Williams at 4.84. Even changing McVey’s walks so he has one for every 4.84 plate appearances, we obtain a probability of 0.0095, or 1-in-105, which is still ahead of Cobb. In other words, when we penalize him by giving him an inordinately large number of walks, McVey is more than twice as likely as Cobb to obtain a 56-game hitting streak. We have no information about McVey’s sacrifice bunts, though this is not likely to dramatically alter the overall results. However, the historical requirement of seven balls for a walk could have reduced McVey’s batting average. If a pitcher can make more bad pitches without giving up a walk, the hitter is less likely to see as many good pitches. On the other hand, conventional baseball wisdom argues that the more pitches a hitter sees at a plate appearance, the more likely he is to get a hit. Another factor that helped McVey is hitting opportunity, as he had 4.798 per game, the highest of the 125 players examined here. Billy Hamilton, who was 14th in streak probability and 8th in batting average, was second at 4.742. Thus, McVey had such a high probability of a streak because of a combination of two effects: he came to the plate
often and he made the most of his opportunities. His probability of a hit in a game was 0.866, much higher than Orr's 0.822. And his effective average was 0.342, much higher than Cobb's. In fact, his batting average of 0.346 would have put him sixth all-time if his 2,543 plate appearances had been sufficient to qualify. His infrequency of walks might appear to have played a small role, but after the adjustment described above, we see this factor had very little effect. So, coming up often and having a very high probability of a hit were probably the key factors, and certainly these attributes affected the streak probability for other players. Not counting McVey and counting only the top 100 batters plus the 24 other players with streaks of at least 30 games, the overall probability comes to 0.0447, or 1-in-22. Counting McVey, the probability goes to 0.0635, or 1-in-16.
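To make the arithmetic behind such figures concrete, here is a minimal sketch, assuming a constant per-game probability of getting at least one hit and independence across games, of the run-length recursion that underlies Feller-type streak probabilities. The function name and the illustrative inputs are ours, not values taken from the article.

```python
def prob_streak(s, n, p):
    """Probability of at least one run of s consecutive hit games in n games,
    when each game independently yields at least one hit with probability p."""
    # state[k] = probability that the current run of consecutive hit games has length k
    state = [0.0] * s
    state[0] = 1.0
    achieved = 0.0  # probability mass absorbed once a run of length s has occurred
    for _ in range(n):
        nxt = [0.0] * s
        for k, mass in enumerate(state):
            if mass == 0.0:
                continue
            nxt[0] += mass * (1.0 - p)   # a hitless game resets the run
            if k + 1 == s:
                achieved += mass * p     # the run reaches s: streak achieved
            else:
                nxt[k + 1] += mass * p   # the run extends to length k + 1
        state = nxt
    return achieved

# Purely illustrative inputs: a per-game hit probability of 0.80 over a 2,000-game career
print(prob_streak(56, 2000, 0.80))
```

The complement, one minus this value, is the Feller-type probability that no 56-game streak occurs.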
Who Was Most Likely to Have Set the Record?
Given that we know a streak has occurred, an interesting question is the likelihood that DiMaggio, or any particular player, was the one who did it. Suppose we know only that at least one streak has been achieved, and by exactly one player. Given the assumptions we have made (constant probability of success, independent trials, using p instead of p*), for the sample of the top 100 batters, the probability of seeing at least one streak by one and only one player is
$$\sum_{i=1}^{100} p_i(n_i, s)\prod_{j=1,\,j\neq i}^{100}\bigl(1 - p_j(n_j, s)\bigr).$$
That is, each player's probability of a streak is multiplied by the joint probability that all the other players did not have a streak. This calculation is done over all players and summed to obtain the probability of one, and only one, player achieving at least one streak. The probability that player i had the streak is then

$$\frac{p_i(n_i, s)\prod_{j=1,\,j\neq i}^{100}\bigl(1 - p_j(n_j, s)\bigr)}{\sum_{k=1}^{100} p_k(n_k, s)\prod_{j=1,\,j\neq k}^{100}\bigl(1 - p_j(n_j, s)\bigr)}.$$
Of course, an important caveat is that we cannot rule out that these probabilities account for a person achieving more than one streak. The Feller formula is technically the probability of a streak not occurring, and we adapted it to obtain its complement—the probability of at least one streak occurring. Thus, the probability that a streak does occur accounts for the possibility of multiple occurrences. Rather than present a lengthy table, we show an abbreviated version, Table 3, with the 15 players with the highest probabilities. Part A is from the results for only the top 100 hitters, and Part B incorporates all 125 players. The players line up as they did before, but we now have information about the likelihood that the streak was achieved by a particular player. For the top 100 hitters, the probability of it being Cobb is about 11%. Using all 125 players, the most likely is, of course, McVey, at almost 31%. Cobb falls to about 7.5%. Not surprisingly, DiMaggio is not in the top 15, as he accounts for only 0.45%.
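As a sketch of the attribution calculation just described (illustrative code, not the author's actual program), suppose we have each player's career probability p_i of achieving at least one 56-game streak. The snippet below computes the probability that one and only one player achieves a streak and, conditional on that, the probability that it is player i.

```python
import numpy as np

def attribution_probabilities(p):
    """p: per-player probabilities of at least one 56-game streak (each < 1).
    Returns (P(exactly one player has a streak), conditional probability per player)."""
    p = np.asarray(p, dtype=float)
    no_streak = 1.0 - p
    prod_all = np.prod(no_streak)
    # term_i = p_i * product over j != i of (1 - p_j)
    terms = p * prod_all / no_streak
    total = terms.sum()
    return total, terms / total

# Illustrative values only, not the article's 125-player table
p_one, posterior = attribution_probabilities([0.0196, 0.00045, 0.0003])
```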
Table 3—The Probability That a Particular Player Is the One Who Achieved at Least One 56-Game Hitting Streak, Given That We Know at Least One Such Streak Occurred

A. 15 Highest from the Top 100 Hitters Only
Ty Cobb          0.1086
Ed Delahanty     0.0844
Willie Keeler    0.0658
Ichiro Suzuki    0.0606
George Sisler    0.0549
Sam Thompson     0.0523
Dave Orr         0.0497
Jesse Burkett    0.0487
Cap Anson        0.0432
Pete Browning    0.0376
Dan Brouthers    0.0356
Al Simmons       0.0253
Nap Lajoie       0.0243
Billy Hamilton   0.0218
Joe Jackson      0.0192

B. Top 100 Hitters Plus 25 Other Players Who Achieved Streaks of at Least 30 Games
Cal McVey        0.3054
Ty Cobb          0.0746
Ed Delahanty     0.0580
Willie Keeler    0.0452
Ichiro Suzuki    0.0417
George Sisler    0.0377
Sam Thompson     0.0359
Dave Orr         0.0342
Jesse Burkett    0.0335
Cap Anson        0.0297
Pete Browning    0.0259
Dan Brouthers    0.0245
Al Simmons       0.0174
Nap Lajoie       0.0167
Billy Hamilton   0.0150
Summary
I estimate that DiMaggio had a lifetime chance of 1-in-3,394 and is only the 28th most likely player among the top 100 all-time hitters to achieve the streak. The top 100 hitters collectively had a chance of 1-in-22. The most likely player among the top 100 hitters was Cobb, the best all-time hitter, who also had the most games played and plate appearances. But batting average and longevity are not the sole determinants. And of course, there is the intriguing case of McVey from the 19th century, who was far more likely than anyone else to achieve the streak. Counting McVey and the 124 other players analyzed here, the chance someone would achieve at least one 56-game hitting streak is about 1-in-16. Of course, there are limitations to any such estimates. Clearly a large number of other players have been omitted, but the marginal contributions of the omitted players should be
extremely small. Even the 100th best hitter of all time has a probability of only 0.00005. Also, the formulas assume each hitting opportunity is independent and the probability of hitting is based on the career average. The notion of a hot streak belies the principle of independence. If a player has what appears to be a hot streak, the likelihood of extending the streak is greater. Other factors could also affect the likelihood of a streak. A team playing well is likely to help a player achieve a streak, though a player in a streak is also likely to help a team play well, so causality is not clear. Teammates, opposing pitchers, and even umpires might behave differently during a streak. But some of these factors would increase the likelihood of extending the streak, and some would decrease it. Rarely in life are events completely independent, as we are all parts of a system of complex interacting factors. So the streak does seem fairly improbable, but perhaps not as improbable as we might have thought. And we should not act as though the remarkable thing about the streak is that it was DiMaggio who did it. Probability analysis removes our subjectivity and allows us to analyze without bias. It cares not about the mystical aura of a player like DiMaggio (who married movie stars Dorothy Arnold and Marilyn Monroe and ultimately became the historical persona of the Yankee franchise) in comparison to a player like Ed Delahanty, whose name would not be recognized by most Americans, and yet whose career was of similar length and was 13 times more likely to have produced the streak. And of course, almost no one has heard of McVey, who was far more likely than anyone to have done it, but who has virtually no name recognition even in the annals of baseball history.
Further Reading
Arbesman, S. and S. Strogatz (2008). "A Journey to Baseball's Alternative Universe." The New York Times, March 30.
Berry, S. (1991). "The Summer of '41: A Probability Analysis of DiMaggio's Streak and Williams' Average of .406." CHANCE, 4(4):8–11.
Brown, B. and P. Goodrich (2003). "Calculating the Odds: DiMaggio's 56-Game Hitting Streak." Baseball Research Journal, 32:35–40.
D'Aniello, J. (2003). "DiMaggio's Hitting Streak: High 'Hit' Average the Key." Baseball Research Journal, 32:31–34.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, 3rd ed., Vol. 1. New York: Wiley.
Freiman, M. (2002). "56-Game Hitting Streaks Revisited." Baseball Research Journal, 31:11–15.
Gould, S. J. (1989). "The Streak of Streaks." CHANCE, 2(2):10–16.
Levitt, D. (2004). "Resolving the Probability and Interpretations of Joe DiMaggio's Hitting Streak." By the Numbers, 14(2):5–7.
Seidel, M. (2002). Streak: Joe DiMaggio and the Summer of '41. Lincoln: University of Nebraska Press (originally published in 1988 by McGraw-Hill).
Short, T. and L. Wasserman (1989). "Should We Be Surprised at the Streak of Streaks?" CHANCE, 2(2):13.
Warrack, G. (1995). "The Great Streak." CHANCE, 8(3):41–43, 60.
Announcing the Burtin Graphic Contest Winners Mike Larsen
The graphics contest announced in Volume 21, Issue 4, of CHANCE asked readers to create a display of a 1951 data set on antibiotic effectiveness. The three winners of the contest are Mark Nicolich, Brian Schmotzer, and Dibyojyoti Haldar. The winners' graphs are different, yet communicate the same information—each compares the antibiotics and bacteria and succinctly summarizes the influence of the Gram staining factor.
Mark Nicolich
The data are plotted in a radar, or spider, plot. There are separate plots for the positive and negative Gram staining results. Each ray of the plot is a bacterium. The connected data lines represent the minimum inhibitory concentration (MIC) measures for the three antibiotics on a base-10 log scale. For the Gram-negative bacteria, the neomycin line lies within the lines for the other two antibiotics (except for a slight excursion with Aerobacter aerogenes), indicating it has the lowest MIC for the nine bacteria. Thus, neomycin is the best of the three antibiotics for inhibiting these Gram-negative bacteria. For the Gram-positive bacteria, none of the antibiotic lines lies totally within the other two. Penicillin is largely better (closer to the center) for four of these seven bacteria. In absolute terms, penicillin's MIC is only slightly larger than neomycin's for the other three bacteria. Therefore, penicillin is the best of the three antibiotics for inhibiting these Gram-positive bacteria.
Mark Nicolich is a consulting statistician in New Jersey. His main areas of interest are biological and market research data, and he has had a life-long interest in the understanding and presentation of data.
[Nicolich's winning display: two radar (spider) plots of MIC on a base-10 log scale (0.0001 to 1,000), one panel for the Gram-negative bacteria and one for the Gram-positive bacteria, with a connected line for each of streptomycin, neomycin, and penicillin.]
Brian Schmotzer The graph presents separate bar graphs, with a bar for each antibiotic for each bacterium. A cutoff of 0.1 is used for determining antibiotic resistance. Bacteria are grouped by resistance. Gram-positive bacteria are labeled as such.
Brian Schmotzer is an associate faculty member in the Department of Biostatistics and Bioinformatics at Emory University’s Rollins School of Public Health in Atlanta, Georgia. He teaches introductory statistics and statistical computing to graduate students. He also works on a variety of medical research studies and public health studies, helping to design them from a statistical point of view and analyzing the data that result. In his spare time, he tackles interesting problems in applied statistics, especially in sports such as basketball and baseball.
[Schmotzer's winning display: small-multiple bar charts of MIC (log scale, 0.001 to 1,000) for penicillin (P), streptomycin (S), and neomycin (N) for each bacterium, with the bacteria grouped by the antibiotics they resist (none, P, SN, PS, PSN) and Gram-positive bacteria labeled.]
Dibyojyoti Haldar The graph includes a description as part of the image. The Gram-positive and -negative bacteria are separated. The minimum inhibitory concentration (MIC) is log-transformed. The bacteria are ordered so that those most responsive to the three drugs are separated.
Dibyojyoti Haldar is a quantitative marketing professional currently working freelance in Singapore. He has spent the last four years with the consumer products analytics team at Marketics, one of Asia’s better-known analytics firms. Haldar has a keen interest in data and information visualization, as well as other forms of visual narratives. He likes to travel when not working.
[Haldar's winning display, titled "The Most Effective Antibiotic": minimum inhibitory concentration on a logarithmic scale for the three antibiotics across the 16 bacteria, with the Gram-positive (Gram staining) bacteria set apart and annotated conclusions, for example that neomycin is the most effective drug against the majority of the studied bacteria (the best cure for 11 of the 16), that penicillin is most effective against four of the bacteria, all Gram-positive, and that streptomycin is most effective only against the Aerobacter, and not by a large margin.]
Each of the winners will receive a year's subscription to CHANCE. Those interested in examining the submissions to this contest can read the article "Pictures at an Exhibition," which follows in this issue. We would like to thank all who submitted entries; many good displays are not shown here, and the task of selecting just three winners was difficult. According to the criteria we used, more than half of the entries were superior to Will Burtin's 1951 original design; this is high praise indeed.
Visual Revelations Howard Wainer, Column Editor
Pictures at an Exhibition Howard Wainer and Mike Larsen
In the 223 years since their formal invention, statistical graphics have become a mainstay. One cannot open a newspaper or magazine—or turn on a TV newscast—without being confronted by them. So, it is not surprising that we have become pretty good at both reading and constructing them, although just how good was something of a surprise. In the fall 2008 issue of CHANCE, we proposed a contest to readers. We republished a 1951 data set that was transformed by master designer Will Burtin into a statistical graphic and challenged readers to construct a display of those data. Sixty-four readers accepted the challenge. For the most part, the results are impressive. Despite the overall quality of displays submitted, however, we thought there was often room for improvement and decided not to let this teaching moment pass. The data are shown in Table 1. The entries of the table are the minimum inhibitory concentration (MIC), a measure of the effectiveness of the antibiotic. The MIC represents the concentration of antibiotic required to prevent growth in vitro. The covariate "Gram staining" describes the reaction of the bacteria to Gram staining. Gram-positive bacteria are those stained dark blue or violet, whereas Gram-negative bacteria do not react that way. Burtin's graphic solution was not published in the fall issue of CHANCE, although curious readers could have
found it without too much difficulty (www.nytimes.com/2008/06/01/books/ review/Heller-t.html?_r=2&oref=slogin), so one would think all the displays constructed would be improvements to Burtin’s design. Some were improvements. Arguably, some were not.
Purpose
Any design must begin with a statement of purpose. Graphs have the following four purposes:
1. Exploration—a private discussion between the data and graph maker about the character of the data. When this is the primary purpose, the greatest value of the display is when it forces the viewer to see what was never expected.
2. Communication—a discussion between the graph maker and reader about the aspects of the data that the graph maker feels are most critical to the issue at hand.
3. Calculation—a so-called nomograph in which the graphic performs some calculation to allow the user to see a result without doing the arithmetic.
4. Decoration—the graph is used as a visual element of a presentation to attract the viewers' attention.
Although graphs often must serve multiple purposes, the optimal display for one purpose is usually not the same for another, and so some hierarchy of purpose must be determined. This is the first decision that must be made by the graph maker. For the Burtin data, we expected the displays submitted would have communication as their primary goal, which presupposes that the graph maker has privately done some serious exploration to learn what messages are there to be communicated.
Questions to Be Answered
The second step is to decide what the key questions are that the data were gathered to answer. For these data, the primary questions must surely be the following: (i) What bacteria does this drug kill? (ii) What drug kills this strain of bacteria? (iii) Does the covariate Gram staining help us make decisions? At first blush, one would think (ii) is the key clinical question. It would naturally result when a patient presents with a particular infection and the physician wants to know which treatment is most efficacious. But, in 1951, when antibiotics were still new, question (i) might have assumed greater significance. A good display will answer one of these questions, whereas a great display will answer both (i) and (ii).
Table 1—Burtin's Data (MIC of each antibiotic, by bacterium)

Bacteria                          Penicillin   Streptomycin   Neomycin   Gram Staining
Aerobacter aerogenes                 870            1            1.6        negative
Brucella abortus                       1            2            0.02       negative
Brucella anthracis                     0.001        0.01         0.007      positive
Diplococcus pneumoniae                 0.005       11           10          positive
Escherichia coli                     100            0.4          0.1        negative
Klebsiella pneumoniae                850            1.2          1          negative
Mycobacterium tuberculosis           800            5            2          negative
Proteus vulgaris                       3            0.1          0.1        negative
Pseudomonas aeruginosa               850            2            0.4        negative
Salmonella (Eberthella) typhosa        1            0.4          0.008      negative
Salmonella schottmuelleri             10            0.8          0.09       negative
Staphylococcus albus                   0.007        0.1          0.001      positive
Staphylococcus aureus                  0.03         0.03         0.001      positive
Streptococcus fecalis                  1            1            0.1        positive
Streptococcus hemolyticus              0.001       14           10          positive
Streptococcus viridans                 0.005       10           40          positive
At the same time, the display should provide a clear overall picture of the battles between these bacteria and antibiotics and the relationship to known factors, such as the focus of question (iii).
Memorable A great display should be memorable, in that you should take away an overall picture of what is going, not just the details available from the original data matrix. To accomplish this, the display needs to convey coherence, a unity of perception that is greater than the sum of its parts. The best example of this is Minard’s famous plot of Napoleon’s failed Russian campaign, in which the river of the French army streams across the Russian
steppes in the warmth of the summer of 1812 and trickles back across the Polish border in the midst of the terrible winter. The memory of the graph transcends the series of data points from which it was constituted. To accomplish these three aims (i.e., have purpose, answer a question or questions, be memorable) requires thoughtful analyses and careful decisions about the scale of the graph, the visual metaphor used to represent the data, and the order in which the data are presented. We shall show some of the competition entries and indicate some of the choices made and what might have been tried to improve them. Before presenting examples, however, one issue should not be overlooked.
The Hound on the Night of the Murder
Before discussing any of these displays explicitly, it is sensible to mention what was, to us, the most remarkable characteristic of the submissions: all but two of them were in a log scale. Of course, when we see that the values of the MICs covered a range of 10^6, any statistician would immediately consider a transformation. So, it is easy to lose sight of how rare this insight is in the general population. When we tried this same exercise on undergraduates in an introductory statistics course, the proportion using a transformation was vanishingly small. So, our applause to all entrants (save two) who did the transformation pro forma.
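For readers who want to experiment with the data themselves, here is a minimal sketch, in Python, of the data entry and the log transformation; the dictionary simply re-enters the values from Table 1, and the variable names are ours.

```python
import math

# MIC values from Table 1: (penicillin, streptomycin, neomycin, Gram staining)
burtin = {
    "Aerobacter aerogenes":            (870,    1,    1.6,   "negative"),
    "Brucella abortus":                (1,      2,    0.02,  "negative"),
    "Brucella anthracis":              (0.001,  0.01, 0.007, "positive"),
    "Diplococcus pneumoniae":          (0.005,  11,   10,    "positive"),
    "Escherichia coli":                (100,    0.4,  0.1,   "negative"),
    "Klebsiella pneumoniae":           (850,    1.2,  1,     "negative"),
    "Mycobacterium tuberculosis":      (800,    5,    2,     "negative"),
    "Proteus vulgaris":                (3,      0.1,  0.1,   "negative"),
    "Pseudomonas aeruginosa":          (850,    2,    0.4,   "negative"),
    "Salmonella (Eberthella) typhosa": (1,      0.4,  0.008, "negative"),
    "Salmonella schottmuelleri":       (10,     0.8,  0.09,  "negative"),
    "Staphylococcus albus":            (0.007,  0.1,  0.001, "positive"),
    "Staphylococcus aureus":           (0.03,   0.03, 0.001, "positive"),
    "Streptococcus fecalis":           (1,      1,    0.1,   "positive"),
    "Streptococcus hemolyticus":       (0.001,  14,   10,    "positive"),
    "Streptococcus viridans":          (0.005,  10,   40,    "positive"),
}

# The MICs span roughly six orders of magnitude, so work with them on a log10 scale.
log_mic = {name: tuple(math.log10(v) for v in vals[:3]) for name, vals in burtin.items()}

# Question (ii): which drug has the lowest MIC for each bacterium?
drugs = ("penicillin", "streptomycin", "neomycin")
best = {name: drugs[min(range(3), key=lambda i: vals[i])] for name, vals in burtin.items()}
```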
A selection of entries Entry 1— An unusual combination of the purposes: data exploration and decoration that foreshadows changes in knowledge about bacteria, in a decorative format We listed three questions we believed captured the essence of what the data were gathered to answer. Given the time lapse between Burtin’s original analysis and the CHANCE contest, Jana Asher, the designer of the graph below suggests a fourth possible goal: How has our understanding of the relationship between the drugs and bacteria in the table changed since the original 1951 graphic was created? This graph and its associated narrative provide unique insights in this regard. The scale is logarithmic. Each optimal MIC by bacterium is marked with a small box. Asher researched the current scientific classification for each
bacterium and included that information via pictures grouped within colored boxes. In that process, she discovered that some of the bacteria have been reclassified since the original Burtin graphic was created. In those cases, the modern name for the bacteria is in parentheses. The display is predictive of the changes in classification. For example, the line representing Streptococcus faecalis is dissimilar to the other Streptococcus bacteria. It has since been reclassified as Enterococcus faecalis. In contrast, the reclassification of Diplococcus pneumoniae to Streptococcus pneumoniae was foreshadowed by seeing that its line pattern is similar to those of the other streptococcal bacteria. Using current bacteria classification—looking at the bacteria from the viewpoint of their phylum and class—one finds that Gram staining results are specific to particular phyla. In making this graph, Asher identified the bacteria
Brucella anthracis as being mislabeled: it should be Bacillus anthracis. If anthracis fell under the genus Brucella, it would be a noticeable exception to that rule. When it is properly classified under the genus Bacillus, the rule holds. Gram staining and optimal antibiotic are also associated (positive with penicillin, negative with neomycin). The graphic is ambitious in the quantity of information it attempts to communicate, which perhaps is its downfall. The inclusion of images of the bacteria would be attractive in a poster-style presentation, but they communicate little additional information about the data and overwhelm the graph in the center. One must study the graph carefully to find the message Asher attempts to communicate. The following entries address the data more specifically in terms of the first three criteria. All entries can be found in color at www.amstat.org/publications/chance.
Graphic created by Jana Asher of Carnegie Mellon University
Entry 2— Escapades looking beyond flatland In this display, each bacterium’s response to neomycin is plotted against that for streptomycin in a scatterplot on a log scale, oriented so that more effective is to the right/top. A fine idea, but it doesn’t scale up to more than two drugs very well. Penicillin’s effectiveness is shown by the size of the plotting point (bigger is better), and Gram staining is shown by the plotting point’s color. The display’s strength is that it shows the relationship between the effectiveness of neomycin and streptomycin directly, and by showing all bacteria, save one, below the diagonal line, it makes clear streptomycin’s limited appeal. But penicillin’s role is more difficult to discern.
Entry 3— Another attempt to escape flatland This display by Pierre Dangauthier of the United Kingdom uses a pseudo 3-D display with Gram staining shown by colors. As the data are shown without transformation, it is necessary to use an insert (the right panel) to show fine structure. The effects of antibiotics are labeled as weak, strong, or very strong. Some clustering of antibiotics is readily apparent. This idea does not scale to more than three drugs and requires an advantageous association with dimensions to see differences clearly. The static graph requires careful reading to answer questions, but a dynamic version that the viewer could rotate might prove easier to interpret.
Entry 4— Bars: a simpler approach This display does a good job of presenting a coherent picture of the data. The log scale allows us to make distinctions among the various efficacies, Gram-positive and Gram-negative bacteria are visually separated, and the metaphor "bigger = better" is employed. The bacteria are ordered by the efficacy of penicillin. It is not colorful or fancy, but it is rather effective. Indeed, the graph maker noted in his submission that he was assuming the graph would appear in a typical medical journal and thus not include color or be too large. Edward Tufte has taught us that we can often improve our displays by removing non-data figurations from our plots. In a bar chart, the bars serve no purpose except to hold up their top line, which contains all the information. If we delete the bars and show only a plotting point where the tops were, we can convey the same information with less ink. This approach was taken by many submitters.
Entry 5a— Dots In this design by Max Marchi of Italy, the Gram staining was visually separated, the dots representing a particular antibiotic were connected by having distinct shapes, and the bacteria were ordered alphabetically. A clean design, but are we really interested in “Alabama First” (or, in this case, ‘Aerobacter aerogenes’)?
Entry 5b— Dots This display by Charlotte Wickham presents a variation on the dots theme. The key improvement is that the bacteria are ordered by their efficacy. Among Gram-positive bacteria, they are ordered by penicillin; among Gram-negative, they are ordered by neomycin. Which dot represents which antibiotic is shown within a legend. This represents an inefficiency, as we must memorize the legend before we can read the graph. This problem was easily resolved by the next design.
Entry 5c— Dots By replacing the arbitrary dots with the obvious, evocative plotting symbols P, N, and S, this design by Phil Price of Lawrence Berkeley National Laboratory has made a legend unnecessary and the seeing of the data structure easier.
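A minimal sketch of the letters-as-plotting-symbols idea, using matplotlib; the few rows entered below are from Table 1 (the full `burtin` dictionary from the earlier sketch can be substituted), and the layout choices are ours, not Price's actual code.

```python
import math
import matplotlib.pyplot as plt

# A subset of Table 1: (penicillin, streptomycin, neomycin)
mic = {
    "Staphylococcus aureus":  (0.03, 0.03, 0.001),
    "Streptococcus viridans": (0.005, 10, 40),
    "Escherichia coli":       (100, 0.4, 0.1),
    "Klebsiella pneumoniae":  (850, 1.2, 1),
}
letters = ("P", "S", "N")  # evocative plotting symbols in place of a legend

fig, ax = plt.subplots(figsize=(7, 3))
names = list(mic)
for y, name in enumerate(names):
    for value, letter in zip(mic[name], letters):
        ax.text(math.log10(value), y, letter, ha="center", va="center")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names, fontsize=8)
ax.set_ylim(-0.5, len(names) - 0.5)
ax.set_xlim(-3.5, 3.5)
ax.set_xlabel("log10 MIC (smaller = more effective)")
plt.tight_layout()
plt.show()
```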
Entry 5d— Connected dots
There are two kinds of good displays: a strongly good display that tells you everything you want to know just by looking at it, and a weakly good display that tells you everything you want to know just by looking at it, once you know what to look for. You can change a weakly good display into a strongly good one through the inclusion of informative labels. The open space provided by substituting dots for bars can be used for more than just the luxuriousness of emptiness. In this design by Dibyojyoti Haldar of Bangalore, India, the space was filled with interpretative labels. In addition, the dots are connected and each antibiotic is color-coded. These two graphic elements give the entire display greater coherence. Another version of this display, with some advantages and disadvantages, can be found at http://peltiertech.com/WordPress/2009/01/05/antibiotic-effectiveness-a-study-of-chart-types, along with a detailed description of the pathway taken before arriving at the final version. It shows vividly the decisions made and confirms, for those who need confirmation, that good displays are rarely accidental.
[Entry 5d (Haldar): the annotated, connected-dot display described above, with MIC on a log scale for the three antibiotics, Gram-negative and Gram-positive bacteria separated, and interpretive labels identifying the most effective drug for each group.]
Entry 6— Small multiples
Another possibility explored was using a single small icon to represent the performance of the three drugs on one bacterium. In this entry by Georgette Asherman of Direct Effects LLC, the icon was a triangle in which the distance of each vertex from the center was proportional to the log of the amount of antibiotic needed; the bigger the icon, the more resistant the bacterium. Icons of similar shape represent bacteria of similar patterns of resistance. Bacteria are separated by Gram staining response. After sorting by gram status, Asherman sorted by penicillin, streptomycin, and neomycin inhibitory concentrations. Thus, one can compare readily the effect of penicillin on bacteria. An alternate ordering, such as according to total volume, or placing the icons onto the two-dimensional plane to reflect response to all three antibiotics could help viewers answer questions.
Entry 7— Small multiples
[Entry 7 (Schmotzer): the small-multiple bar charts of MIC for penicillin (P), streptomycin (S), and neomycin (N), grouped by the antibiotics each bacterium resists, as shown in the winners' article above.]
The display shown here by Brian Schmotzer of Emory University uses a different multivariate icon, one that is more easily decoded. It arranges the icons in a way that allows us to easily answer questions about bacterial resistance. A cutoff of 0.1 is used to separate resistance from nonresistance. If this were an important cutoff value, the graph would immediately aid decision making. Looking within an icon allows easy identification of the best drug for a particular bacterium. Gram staining is labeled, but it could have been more visually separated.
Submissions
Entry 1. Jana Asher
Entry 2. Anonymous
Entry 3. Pierre Dangauthier, London, UK
Entry 4. Anonymous
Entry 5a. Max Marchi
Entry 5b. Charlotte Wickham
Entry 5c. Phillip N. Price, Lawrence Berkeley National Laboratory
Entry 5d. Dibyojyoti Haldar, quantitative marketing professional currently working freelance in Singapore
Entry 6. Georgette Asherman, Direct Effects, LLC
Entry 7. Brian Schmotzer, Emory University
Entry 8. Mark Nicolich, Lambertville, NJ
Entry 8— Polygons This display by Mark Nicolich of New Jersey uses elements from entries 6 and 7. Like Entry 6, it uses a polygon icon to represent the efficacy of the antibiotics; however, each vertex is a bacterium. Also like Entry 6, it represents the efficacy of the antibiotic with a point on the radius from the center of the polygon to its vertex; the metaphor is smaller equals more effective. Again, like Entry 7, it orders the bacteria around the icon carefully so the polygon thus formed is smooth. The two panels of the display represent the two Gram stain conditions. Note that, at a glance, we see neomycin dominates for Gram-negative. Even when it is not absolutely the best, it is so close that we can certainly live by the rule “Gram-negative use neomycin.” When the bacteria is Gram-positive, the story is not quite so clear, but almost. Penicillin dominates except for two Staph and one Strep infection. Thus, a treatment rule for Gram-positive might be to use penicillin and neomycin in combination. Usually, graphs answer some questions well and others less so. Some let us answer “which drug is best for this bacterium?” Others are better at “what bacteria are not resistant to this antibiotic?” This entry allows us to answer both of these, as well as the more general question, “what’s going on?” Perhaps it could be improved with a bold regular polygon that separates practical from impractical dosages, but those are nits. This display presents a coherent, memorable picture of the data, and it can scale upward to include more bacteria and more antibiotics. Will Burtin would be impressed.
What We Have Learned We have enjoyed studying the graphs submitted and thank all participants. There are many ways to display even a relatively small data set. Myriad choices are required to produce a graph that accomplishes the three aims of having a purpose, answering a question or questions, and being memorable. Scale, visual metaphor including plotting symbols, the order of the data in one or more dimensions, and the directions to the reader are important. There are many
ways to make a good graph and nearly as many ways to improve a graph. Fortunately for us, the tools for graph making have come a long way. It is clear from entries to this contest that understanding has come a long way, too.
Further Reading
"CHANCE Graphic Display Contest: Burtin's Antibiotic Data." (2009). CHANCE, 21(4): 62.
Gregg, X. www.forthgo.com/blog/2009/01/11/burtin-antibiotic-illustrations
Peltier, J. http://peltiertech.com/WordPress/2009/01/05/antibiotic-effectiveness-a-study-of-chart-types
Heller, S. (2008). "Visuals." The New York Times (6/1/08): www.nytimes.com/2008/06/01/books/review/Heller-t.html?_r=2&oref=slogin
Wainer, H. (2009). "Visual Revelations: A Centenary Celebration for Will Burtin: A Pioneer of Scientific Visualization." CHANCE, 22(1): 51–55.
Here’s to Your Health Mark Glickman, Column Editor
Building and Validating High Throughput Lung Cancer Biomarkers Xiaofei Wang, Herbert Pang, and Todd A. Schwartz
Lung cancer is the deadliest form of cancer in the United States, and it causes more deaths each year than colon, breast, and prostate cancer combined. The deaths of television news anchor Peter Jennings and Christopher Reeve's widow, Dana Reeve, have recently put
lung cancer in the national spotlight. Using lung cancer as an example, we illustrate important statistical issues in discovering and validating high throughput cancer biomarkers for predicting a cancer patient’s risk of cancer recurrence.
Lung Cancer Is a Heterogeneous Disease Jennings was an older male with a long history of smoking, while Reeve was a younger female who never smoked. Yet, the difference among lung cancer patients can be even more striking than the apparent differences of its famous patients; lung cancer is a common term for several heterogeneous diseases. The heterogeneity of lung cancer patients lies at the levels of histology, pathological stage, molecular characteristics, and genetics. Accurate classification of lung cancer patients into subtypes is the key for effective clinical management and disease prognosis. In terms of histology, the vast majority of lung cancers are carcinomas—malignancies that arise from epithelial cells. There are two main types of lung carcinoma: non-small cell (80%) and small-cell (17%) lung carcinoma. The non-small cell lung carcinoma can be further classified into three main sub-types: squamous cell lung carcinoma, adenocarcinoma, and large-cell lung carcinoma. Besides histology, cancer staging is the most important factor in deciding a patient’s prognosis and treatment. The current staging system uses information about tumor size, lymph nodes, and metastasis to assess whether cancer has spread from a primary site and, if so, how extensive the spread is. For nonsmall cell lung cancer (NSCLC), for example, stage I cancer is confined to lung tissue, stage II cancer is confined to lung tissue and lymph nodes in the lung, stage III cancer is found in lung tissue and lymph nodes outside the lung, and stage IV cancer has distant spread to other sites, such as the liver, glands, bone, and brain. Patients with advanced NSCLC (stages III and IV) have a five-year survival rate of less than 5%, as compared to 50% for patients with stage I NSCLC. In current clinical practice, lung cancer patients who have the same histology and pathological stage may receive the same chemo- and radiation therapy. However, a substantial proportion of treated patients do not respond to the recommended treatment. It is now believed that lung cancer is indeed a combination of hundreds of distinct diseases. The heterogeneous response to treatment is due to the differences in patients’ genetic makeup, and the most effective treatment is tailored to each patient. In recent years, targeted agents have been developed to turn a specific genetic pathway on or off. The development of targeted lung cancer therapy has generated much interest in discovering markers to identify the subgroup of patients who benefit most from these therapies. The key questions are what biomarker reveals the distinct patient’s genetic profile, how can such a biomarker be developed, and how effective is the biomarker in classifying patients into different subgroups of genetic profiles?
Lung Cancer Biomarkers What exactly is a cancer biomarker? According to the 2001 NIH Biomarker Definitions Working Group, a biomarker is “a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.” A robust and reproducible biomarker that identifies patients with unique genetic or molecular features lays the foundation
for personalizing treatment for lung cancer patients. One may categorize biomarkers into the following three groups, depending on their intended use in cancer treatment: 1. Prognostic biomarkers, which predict the outcome of patients in terms of a clinical endpoint. A validated prognostic biomarker provides opportunity to identify patients at high risk and a possibility for early intervention. For example, several genomic-based classifiers have been developed recently to predict the risk of cancer recurrence for stage I NSCLC patients following surgical resection, and a randomized clinical trial is planned to evaluate whether chemotherapy would benefit high-risk patients. 2. Predictive biomarkers, which predict the effect of a specific treatment on a clinical endpoint for patients. As an example, over-expression of Cyclo-oxygenase-2 (COX-2) is associated with poor prognosis of overall survival in advanced NSCLC, and patients with overexpressed COX-2 benefited significantly more from receiving celecoxib (a COX-2 inhibitor) and standard chemotherapy relative to those receiving standard chemotherapy only. In other words, COX-2 serves as a biomarker that is both prognostic and predictive. 3. Surrogate biomarkers, which replace a clinical endpoint in clinical trials carried out to evaluate the effect of a specific treatment on patients. Surrogate biomarkers can be used as intermediate indicators of treatment efficacy in cancer treatment studies. For example, standardized uptake values (SUV) calculated from fluorodeoxyglucose–positron emission tomography (FDG-PET) are believed to measure tumor metabolism, and a decrease in SUV is being evaluated as a surrogate endpoint for progression-free survival in a clinical trial for advanced non-small cell lung cancer. Depending on the nature of a specific biomarker (prognostic, predictive, or surrogate), a lung cancer biomarker can be useful for risk stratification, early cancer detection, treatment selection, prognostication, or monitoring. In recent years, there have been significant advances in our understanding of the molecular and genetic changes involved in lung carcinogenesis, including circulating DNA, genetic mutations, gene hypermethylation, gene expression, and proteins. Meanwhile, there have also been considerable technological and scientific advances in fields such as gene expression profiling, proteomics, and molecular imaging. Progress as such has allowed efficient characterization of the changes underpinning disease progress and drug response. We are in an unprecedented time to discover and validate biomarkers and the corresponding targeted agents for lung cancer.
Development of Lung Cancer Biomarkers To illustrate some important statistical issues in discovery and validation of lung cancer biomarkers, we use as an example the development of a genomics-based prognostic biomarker for predicting cancer recurrence risk in stage I NSCLC patients.
Figure 1. The left panel shows the cancer recurrence-free Kaplan-Meier survival curves for the high- and low-risk groups. The right panel shows the ROC curve of the predicted risk probability, predicting five-year (60-month) cancer recurrence-free survival.
A refined prognostic biomarker for stage I NSCLC Standard cancer treatment is recommended according to a patient’s staging. For stage I patients, the standard treatment is surgical removal, while adjuvant chemotherapy is not recommended because of the toxicity and lack of proven survival benefits due to treatment. However, a considerable proportion of stage I NSCLC patients die after cancer recurrence within two years of surgery, and another considerable proportion survive for more than 10 years. In contrast, adjuvant chemotherapy has been shown to improve survival in patients with resected stage II-III NSCLC. It has long been hypothesized that some stage I patients who have tendencies for cancer recurrence may benefit from adjuvant chemotherapy. To give adjuvant chemotherapy to patients at high risk of cancer recurrence, the discovery of a classifier (i.e., biomarker) with high prediction accuracy, high reproducibility, and cost-effectiveness would be a first step. A genomics-based classifier In lung cancer, several single gene mutations—including EGFR, K-ras, and P53—have associated with a more aggressive form of the disease. However, because of the complexity and multifaceted nature of carcinogenesis, it is unlikely that any one of these markers alone is able to encapsulate all the genetic heterogeneity within lung cancer patients. A panel of biomarkers, or a profile of a patient’s whole genome, probably would be more effective in distinguishing patients with different risks of cancer recurrence. Microarray analysis, a high-throughput technology, can simultaneously measure gene expression of thousands of genes. The increased availability of this technology offers a great opportunity to decipher variability in the prognoses for
patients with stage I NSCLC. Developing biomarkers based on microarray data requires efficient and valid statistical methods that characterize the relationship between genetic signatures and clinical outcomes. Recently, several research teams used genome-wide profiling to predict stage I NSCLC patients who are at high risk of cancer recurrence. These results were published in several medical journals, such as “A Genomic Strategy to Refine Prognosis in Early-Stage Non-Small-Cell Lung Cancer,” in a 2006 issue of The New England Journal of Medicine and “Gene Expression–based Survival Prediction in Lung Adenocarcinoma: a Multi-site, Blinded Validation Study,” in a 2008 issue of Nature Medicine. Based on retrospectively collected specimens, these studies used statistical association tests to identify differentially expressed genes or meta-genes related to survival time. Machine learning algorithms, or statistical predictive models, may be subsequently employed to establish an optimal classifier based on the selected genes or meta-genes. To develop a signature or biomarker that is highly predictive of risk for cancer recurrence, the classifiers must undergo rigorous validation. Assessing prediction accuracy For each biomarker under consideration for prediction, one must assess how well it distinguishes between cases and controls, or patients with high risk of recurrence versus low risk of recurrence. It is not uncommon for clinical manuscripts on classification biomarker development solely to report statistical associations. Yet, a strong association between outcome and biomarker does not imply that the biomarker can adequately discriminate between patients who are likely to have cancer
recurrence and who are not. Alternative measures for evaluating the performance of a biomarker—such as sensitivity, specificity, and the ROC curve—have been greatly advocated by many authors, such as those of the 2003 monograph on evaluating medical tests, The Statistical Evaluation of Medical Tests for Classification and Prediction. The True-Positive Rate (TPR) and False-Positive Rate (FPR) are two statistical measures that summarize the performance of a biomarker related to a clinical outcome. For binary biomarker Y (e.g., 1 [positive] versus 0 [negative]) and binary clinical outcome D (e.g., 1 [high risk] versus 0 [low risk] for cancer recurrence), the TPR (or sensitivity), Pr(Y=1 | D=1), is the probability of having a positive biomarker given a subject having a positive outcome, while the FPR (or 1-specificity), Pr(Y=1 | D=0), is the probability of having a positive biomarker given a subject having a negative outcome. Unlike the Positive-Predictive Value (PPV), Pr(D=1 | Y=1), and the Negative-Predictive Value (NPV), Pr(D=0 | Y=0), the TPR and 1-FPR do not depend on the prevalence of the outcome. When a biomarker is measured on an ordinal or continuous scale, the Receiver Operating Characteristic (ROC) curve is a natural generalization of FPR and TPR. An ROC curve describes the entire set of (FPR(c), TPR(c)) combinations at different thresholds c for biomarker positivity, with TPR(c)=Pr(Y≥c | D=1) as the y-axis and FPR(c)=Pr(Y≥c | D=0) as the x-axis. Summary measures, such as the Area Under the Curve (AUC) or the partial AUC, can be derived from an ROC curve. These measures do not depend on the measurement scale or the prevalence of the outcome, which facilitates comparison of the discriminatory capacities of different biomarkers.

As an example, a genomics-based classifier was used to classify 57 stage IB NSCLC specimens into high risk versus low risk for cancer recurrence. The risk probability was computed from the predictive model underlying the classifier. Specifically, those patients with a predicted risk probability exceeding 0.55 were classified as high risk (26/57); otherwise, they were classified as low risk (31/57). In Figure 1, the left panel shows the cancer recurrence-free survival curves for the high- and low-risk groups. The median cancer recurrence-free survival is 40.7 months (95% CI: 24.5, 75.7) for the high-risk group and 74.2 months (95% CI: 68.6, NA) for the low-risk group. The two-sided p-value from the log rank test on the survival difference is 0.009, and the corresponding unadjusted hazard ratio is 2.417 (95% CI: 1.224, 4.772), favoring the low-risk group. When the predicted risk probability was fitted as a continuous linear predictor, the unadjusted hazard ratio was as large as 5.923 (95% CI: 1.307, 26.828). The right panel in Figure 1 shows the ROC curve of the predicted risk probability predicting the five-year cancer recurrence-free survival. The ROC curve was estimated using the method of Patrick J. Heagerty in the 2000 Biometrics paper "Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker," which was proposed for censored survival outcome data. At the chosen cutoff point of 0.55 (black circle), the predicted risk probability wrongly identified 27% of controls as positive (false positive rate) and correctly identified 63% of cases as positive (true positive rate). Overall, the AUC is 0.673. With such a performance, a classifier (biomarker) would usually not be considered useful for clinical use.
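As a hedged illustration of these definitions (simulated marker values, not the study's data), the sketch below computes the empirical (FPR, TPR) pairs over all thresholds and the AUC by the trapezoidal rule; the function and variable names are ours.

```python
import numpy as np

def roc_curve_and_auc(marker, outcome):
    """marker: continuous biomarker values; outcome: 1 = case (high risk), 0 = control.
    Returns false positive rates, true positive rates, and the AUC."""
    marker = np.asarray(marker, dtype=float)
    outcome = np.asarray(outcome, dtype=int)
    thresholds = np.unique(marker)[::-1]                    # sweep cutoffs from high to low
    tpr = [(marker[outcome == 1] >= c).mean() for c in thresholds]
    fpr = [(marker[outcome == 0] >= c).mean() for c in thresholds]
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    auc = np.trapz(tpr, fpr)                                # area under the ROC curve
    return fpr, tpr, auc

# Simulated example: cases tend to have larger marker values than controls
rng = np.random.default_rng(1)
outcome = np.repeat([1, 0], [30, 70])
marker = np.where(outcome == 1, rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100))
fpr, tpr, auc = roc_curve_and_auc(marker, outcome)
```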
In other words, a strong association as measured by hazard ratio does
Table 1—Hypothesis Testing Scenario

                              Not Rejected   Rejected   Total Number
True Null Hypothesis               U             V          m0
True Alternative Hypothesis        T             S          m – m0
Total Number                     m – R           R          m
not necessarily translate into high predictive accuracy as measured by TPR and 1-FPR. In evaluating the prediction accuracy of a biomarker, it is also important to assess the improvement of a new biomarker over and above existing standard predictors. For example, in evaluating a new biomarker for predicting high versus low risk for cancer recurrence in patients, one would need to compare its performance against the predictive model that uses only standard predictors, such as tumor size, performance status, and histology type. Finding differentially expressed genes High throughput technologies generate an immense number of molecular measures, which consist of a large pool of candidate biomarkers for predicting the risk of cancer recurrence. However, many of them may not be associated with cancer recurrence. In a microarray experiment, expression levels of thousands of genes are measured simultaneously. The observed clinical difference between high- versus low-risk stage I NSCLC patients is presumably caused by genes that are differentially expressed in tumor cells. For each gene represented on the array, the correlation of gene expression with cancer recurrence is statistically tested by comparing the gene’s average expression between the high- versus low-risk groups. Hundreds and thousands of simultaneous statistical tests of association generate an obvious multiplicity problem. When many hypotheses are tested, and each test has a specified false positive (type I error) rate, the chance of finding any false positives, or the family-wise error rate (FWER), increases sharply with the number of tests. To retain the same overall FWER, say 5%, in a microarray experiment, the standard for each test should be more stringent. Assuming the tests are independent, the well-known Bonferroni method reduces the size of the allowable error for each test according to the number of tests. For example, if tests of association are performed for 5,000 genes, the Bonferroni method would require p-values smaller than .05/5000=0.00001 to declare formal significance at the 0.05 level. Because simple techniques such as the Bonferroni method can be overly conservative, a small—but not extremely small— p-value, such as 0.001, might be used to select and/or rank biomarkers. While this may be a reasonable solution, the extent of the false positive rate is not well-understood and so is
not necessarily controlled. Recently, much attention has been paid to developing better techniques that maintain the FWER without unnecessarily inflating the rate of false negatives. Yoav Benjamini and Yosef Hochberg, in a 1995 Journal of the Royal Statistical Society, Series B article, argued that, in many situations, control of the FWER can lead to unduly conservative procedures. One may well be prepared to tolerate some false positives, provided their number is small in comparison to the number of rejected hypotheses. These considerations led them to propose a less conservative approach that calls for controlling the expected proportion of type I errors among the rejected hypotheses—the false discovery rate (FDR). The hypothesis testing scenario can be summarized by Table 1. The specific number of hypotheses, m, is known in advance. The number, m0, of true null hypotheses is an unknown parameter. R is an observable random variable, and S, T, U, and V are unobservable random variables. The FDR is defined as E(Q), where Q = V/R if R > 0 and 0 if R = 0 (i.e., FDR = E(V/R | R > 0) Pr(R > 0)). The challenge lies in finding a threshold that controls the FDR at the desired level across the entire set of genes. Let H0i, for i=1,…,m, be the null hypothesis of no association between the i-th gene and the clinical outcome. Let p1,…,pm be the corresponding p-values. According to Benjamini and Hochberg, one first orders the raw p-values as p(1) ≤ p(2) ≤ … ≤ p(m). For k=1,…,m, find the largest k (call it K) such that p(k) ≤ αk/m, then reject all the null hypotheses H0(k) for
k=1,…,K. This procedure controls the FDR at the α level for independent tests and certain dependency structures. A rigorous FDR threshold is determined from the observed p-value distribution, and hence is adaptive to the amount of signal in the data under study. Here is a simple illustration: A lung cancer study with more than 200 patients aimed to identify differentially expressed genes comparing patients' survival using a simple t-test. A total of more than 22,000 probe sets were tested and 1,932 were found to be nominally significant at the 0.05 level. After adjusting for multiple testing using the Bonferroni FWER, none of these nearly 2,000 probe sets was significant. However, when FDR was employed, 19 probe sets were identified as significant, and 14 of these were either found to be related to lung cancer in the literature or to well-known molecular classifiers. FDR narrowed the number to a small set of relevant genes for further biological testing without being overly conservative. There have been developments in other approaches too, such as resampling-based methods (e.g., bootstrap or permutation approaches), which avoid assumptions for the test statistic distribution.

Building a genomics-based classifier
Considering the complex relationship of prognosis and biomarker, one often believes a decision based on a pattern of multiple biomarkers is potentially more useful than individual biomarkers in predicting prognosis. To build a classifier of a
clinical outcome based on the pattern of thousands of biomarkers (genes), one often uses supervised learning methods to train the classifier with a data set in which true values of both the outcome and biomarkers are known. Classifiers using different supervised learning algorithms have been proposed, including linear discriminants, decision trees, nearest neighbor classifiers, and neural network classifiers. However, there is no consensus in the statistical and machine learning communities about what types of classifiers or variable selection methods are best. Internal validation uses the data set from the same batch of patients to assess the performance of the classifier. To ensure an unbiased evaluation, one must follow the principle that the data used for evaluating the predictive accuracy of the classifier must be distinct from the data used for selecting the biomarkers and building the supervised classifier. One straightforward approach is to base the evaluation on a separate test data set. In a split-sample procedure, the initial data set is randomly split into subsets. One part represents the training data set, and the remaining portion is the test data set. Another approach is the cross-validation iterative process, in which the data set is repeatedly partitioned into a left-out portion and a training portion. The training portion is used by the training algorithm to develop a classifier, and the left-out portion is used for prediction. This process is repeated for all partitions. The average prediction error based on the iterative process is an unbiased estimate of the true prediction error of the classifier. A special case of cross-validation is "leave-one-out cross-validation," where the left-out portion consists of one observation only.
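Before turning to validation, here is a minimal sketch of the Benjamini-Hochberg step-up rule described in the multiple-testing discussion above; the code and the simulated p-values are illustrative, not part of the study.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking the hypotheses rejected
    while controlling the false discovery rate at level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                       # p-values from smallest to largest
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()          # largest k with p_(k) <= alpha * k / m
        reject[order[: k + 1]] = True           # reject hypotheses 1, ..., k (ordered by p)
    return reject

# Toy example: 4,990 null genes plus 10 genes with genuine signal
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=4990), rng.uniform(high=1e-4, size=10)])
print(benjamini_hochberg(pvals).sum(), "probe sets declared significant at FDR 0.05")
```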
Validation of Lung Cancer Biomarkers Analytic validation Before a biomarker undergoes rigorous validation in a pivotal clinical trial, the pre-pivotal developmental phase should generally establish a threshold of positivity and analytical validation for the test. For many such diagnostic biomarkers, there is no gold standard, and hence analytical validation really means the test is reproducible and robust to sample handling and laboratory variation. If the test is repeated with different samples of the same tumor or repeated on different days in the same or different laboratories, then the resulting classification should be unchanged. The reproducibility and robustness should be evaluated with ROC curve analysis. Issues to be resolved during analytic validation include equipment evaluation, procedure standardization, quality assurance, reproducibility and replication, inter-rater and intra-subject variability, the measurement of reference standards, and sensitivity to interventions or disease progression. Clinical validation In many cases, seemingly impressive prediction accuracy is found for a classifier in the training data set, but its application to an independent data set performs poorly. In a sense, all classifiers built with supervised learning methods are comparable to multivariate predictive models. A statistical predictive model is subject to overfitting when there are too many model parameters (e.g., genes) for the limited number of patients in the training data set. Therefore, the utility of a classifier is ultimately determined by how well it predicts the
clinical outcome of patients who do not belong to the training data set. Over the last several years, research on lung cancer biomarkers has proliferated, and many studies have reported biomarkers with high sensitivity and specificity. These studies are, however, often plagued by small sample size, lack of reproducibility, or inadequate selection of controls. It is important to establish the superior performance of a biomarker in a cohort of patients independent of those used to build it. A large, single-arm trial of patients receiving the new drug may be sufficient to establish the validity of the biomarker in predicting clinical outcome, but it may not be sufficient to establish the effectiveness of the new drug and the clinical utility of the biomarker. For those objectives, a randomized phase III clinical trial is generally needed.
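When the clinical outcome is binary, performance in such an independent cohort is commonly summarized with an ROC curve and its area under the curve (AUC). Below is a minimal sketch on hypothetical data; the risk scores stand in for a classifier that was built and frozen on a separate training cohort, and scikit-learn is used only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
n = 150
outcome = rng.integers(0, 2, size=n)              # hypothetical validation-cohort outcomes (1 = event)
risk_score = 0.8 * outcome + rng.normal(size=n)   # hypothetical scores from a previously frozen classifier

fpr, tpr, _ = roc_curve(outcome, risk_score)      # points on the ROC curve (1 - specificity vs. sensitivity)
print(f"validation AUC: {roc_auc_score(outcome, risk_score):.2f}")
```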
Targeted design

With a targeted design, a biomarker is used to restrict eligibility for a randomized clinical trial comparing an experimental regimen to a control. Often, the experimental regimen is a target agent developed for the patients identified as marker positive by the biomarker. When evaluating the treatment efficacy of a target agent in a randomized phase III trial, a targeted design can be much more efficient than an untargeted design: when fewer than half of patients are marker positive and the drug has little or no benefit for marker-negative patients, the targeted design requires many fewer randomized patients. However, by restricting enrollment to marker-positive patients, a targeted design forgoes the chance to test the interaction between treatment and the biomarker prediction and to validate the performance of the predictive biomarker.
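A back-of-the-envelope calculation conveys the size of this efficiency gain. Assume a continuous outcome, a treatment benefit of delta_pos for marker-positive and delta_neg for marker-negative patients, and that the number of randomized patients scales inversely with the squared average treatment effect; this is only a rough sketch, not Simon and Maitournam's exact formulas.

```python
def untargeted_to_targeted_ratio(prevalence, delta_pos, delta_neg):
    """Approximate ratio of randomized patients needed by an untargeted vs. a targeted design."""
    # In an unselected trial the observed effect is diluted toward the marker-negative effect
    diluted_effect = prevalence * delta_pos + (1 - prevalence) * delta_neg
    return (delta_pos / diluted_effect) ** 2

# 25% marker positive, no benefit for marker-negative patients: ~16x more patients untargeted
print(untargeted_to_targeted_ratio(0.25, delta_pos=1.0, delta_neg=0.0))
# 25% marker positive, half the benefit for marker-negative patients: ~2.6x
print(untargeted_to_targeted_ratio(0.25, delta_pos=1.0, delta_neg=0.5))
```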
Biomarker stratified design

In a biomarker stratified randomized (BSR) design, by contrast, marker status is a stratification factor, and both marker-positive and marker-negative patients are randomized to the target agent versus placebo. A BSR design allows testing whether marker-positive patients benefit from the target agent compared to control, testing whether there is an overall treatment benefit for all patients, and evaluating the performance of the predictive classifier in identifying targeted patients. The drawback of a BSR design is the high cost of conducting the trial, because more patients require treatment, follow-up, and biomarker assays. If the overall treatment benefit is small and marker-negative patients predominate, such a design can be inefficient, costly, or even unethical for the marker-negative patients. To evaluate the utility of a classifier for identifying a subgroup of patients who would potentially benefit from additional chemotherapy, a multicenter randomized clinical trial is considered the gold standard, as it provides an opportunity to balance all potential biases in patient selection between the two practices, chemotherapy versus observation. The Cancer and Leukemia Group B (CALGB), a cancer cooperative group sponsored by the National Cancer Institute, recently started a randomized phase III trial to evaluate the ability of the lung genomic prediction model to direct the treatment of stage I NSCLC patients. Patients with suspected stage I NSCLC who have undergone a surgical resection are eligible for registration. This trial was originally proposed as a targeted design (see the gray area in Figure 2) and aimed to
randomize 350 high-risk patients; the study intended to screen 670 eligible patients and collect specimens. The trial has ultimately been approved as a BSR design. A total of 1,200 stage I NSCLC patients will be enrolled. For each patient, the risk of cancer recurrence will be predicted by the Duke University lung cancer genomic prediction model based on microarray data. Stratified by risk group (high vs. low) and pathological stage (IA vs. IB), eligible patients will be randomized with equal probability to either chemotherapy (Chemo) or observation (see Figure 2). This design allows testing the hypothesis that high-risk patients, as predicted by the genomic prediction model, will benefit from additional chemotherapy compared to the low-risk patients, as well as testing whether there is a benefit of adjuvant chemotherapy for patients in all risk groups. Moreover, this design allows for full evaluation of the genomic prediction model and of any other genomics-based classifiers, prognostic or predictive, that might be proposed in the future.

[Figure 2. A diagrammatic comparison of a biomarker stratified design (stratification variable: predicted risk category) versus a targeted design (gray area: predicted high-risk category). Stage I NSCLC patients undergo surgery and gene expression analysis; patients predicted to be high risk and those predicted to be low risk are each randomized to chemotherapy (Chemo) versus observation, with the targeted design comprising only the high-risk randomization.]
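As a concrete illustration of the stratified, equal-probability randomization described above, here is a generic sketch of permuted-block randomization within the four risk-by-stage strata. The block size and bookkeeping are invented for illustration; the article does not describe the actual CALGB allocation algorithm.

```python
import random
from itertools import product

random.seed(2009)
strata = list(product(["high risk", "low risk"], ["IA", "IB"]))
pending = {s: [] for s in strata}        # unused assignments remaining in each stratum's current block

def assign(stratum):
    """Next treatment assignment for a newly registered patient in the given stratum."""
    if not pending[stratum]:             # start a fresh permuted block of size 4 (2 Chemo, 2 Observation)
        block = ["Chemo", "Chemo", "Observation", "Observation"]
        random.shuffle(block)
        pending[stratum] = block
    return pending[stratum].pop()

# First three patients registered in the (high risk, IA) stratum
for i in range(3):
    print(f"patient {i + 1}, high risk / IA: {assign(('high risk', 'IA'))}")
```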
Enrichment design

An enrichment design lies somewhere between the targeted design and the BSR design. Like the BSR design, an enrichment design randomizes both marker-positive and marker-negative patients, but to reduce cost and improve study efficiency, only a subset of the marker-negative patients is randomized. CALGB 30801 used an enrichment design to maximize the efficiency gain. The process of selecting which patients to randomize may depend on the biomarker prediction, clinical outcome, or other baseline patient characteristics. The efficiency gain from an enrichment design can be substantial when marker-negative patients predominate in the unselected population and auxiliary variables exist to identify the informative patients.
An important implication of using such enrichment designs is the potential for estimation bias when standard statistical methods are used to analyze the pooled data.
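A toy simulation makes this point. In the hypothetical scenario below, 30% of patients are marker positive, the treatment benefits only them, and an enrichment-style selection randomizes all marker-positive but only 20% of marker-negative patients; a naive analysis of the pooled data then overstates the population-level treatment effect, while weighting by the inverse of the (known) selection probabilities recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
marker_pos = rng.random(n) < 0.30               # 30% marker positive
treated = rng.random(n) < 0.5
benefit = np.where(marker_pos, 2.0, 0.0)        # treatment helps only marker-positive patients
outcome = benefit * treated + rng.normal(size=n)

# Enrichment-style selection: keep every marker-positive patient, 20% of marker-negative patients
selected = marker_pos | (rng.random(n) < 0.20)
y, t = outcome[selected], treated[selected]
w = np.where(marker_pos, 1.0, 1 / 0.20)[selected]   # inverse selection probabilities

naive = y[t].mean() - y[~t].mean()
weighted = np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])
print(f"true population effect: 0.60, naive pooled estimate: {naive:.2f}, weighted estimate: {weighted:.2f}")
```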
Future Research Issues

Pathway to identify lung cancer biomarkers

As with many other cancers, biomarker discovery in lung cancer can help improve the staging of patients and predict patient prognosis with better accuracy, and it may eventually allow doctors to give more accurate diagnoses if such diagnostic assays can be developed. As is well known, biology is not about a single gene, but rather about sets of genes. Signaling and metabolic pathways are becoming more important in identifying molecular targets for treatment and diagnosis; these pathways can be drawn from pathway databases such as KEGG or the Gene Ontology. In recent years, researchers have tried to associate gene expression with metastasis or prognosis and to identify gene signatures. Such gene sets are ideally associated with a set of biological pathways that serve different cellular or physiologic functions, and the relevant pathways may help characterize different aspects of the molecular phenotype of the tumor. Statistical methods for pathway analysis based on machine learning and on parametric and nonparametric enrichment tests have been developed in the past few years. Further refinements to these approaches for performing pathway-based analysis are likely to be received favorably. Pathway-based approaches allow scientists to focus on a few sets of genes, select targets from multiple biomarkers, and gain insight into the biological mechanisms of the tumor.
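The simplest of these enrichment tests asks whether a pathway's genes are over-represented among the genes called significant, using a hypergeometric (Fisher-type) calculation. The counts below are invented for illustration.

```python
from scipy.stats import hypergeom

total_genes = 20000      # genes measured on the array
pathway_genes = 150      # genes annotated to the pathway (e.g., from KEGG)
significant = 500        # genes called differentially expressed
overlap = 18             # significant genes that fall in the pathway

# Probability of seeing at least `overlap` pathway genes among `significant` random draws
p_value = hypergeom.sf(overlap - 1, total_genes, pathway_genes, significant)
print(f"expected overlap by chance: {significant * pathway_genes / total_genes:.1f}, "
      f"enrichment p-value: {p_value:.2g}")
```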
In the near future, statisticians will face new challenges in integrating such pathway-based analyses with other proteomic and metabolomic data to identify changes associated with metastasis.

Finally, how to combine multiple biomarkers optimally to gain additional predictive accuracy, and how to validate such combinations in a pivotal trial with a codeveloped target agent, are topics of active statistical research.
Imaging biomarkers

Imaging methods—such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI)—allow the lungs to be visualized noninvasively in vivo at high spatial and temporal resolution and offer exciting opportunities to develop imaging biomarkers as valuable tools for diagnosis, tests of treatment efficacy, and disease monitoring. In recent years, the potential of using imaging biomarkers as surrogates for clinical endpoints has been investigated. For example, the epidermal growth factor receptor (EGFR) inhibitor, a target agent, induces cytostasis rather than tumor shrinkage, so traditional size-based CT and MRI response criteria are no longer appropriate for assessing the activity of an EGFR inhibitor. Instead, a specific imaging biomarker that reliably detects antitumor activity for the target agent is desirable. FDG-PET allows a quantitative measurement of tumor metabolism through the standardized uptake value (SUV), and a decrease in SUV from pretreatment values within two hours of treatment predicts better progression-free survival on EGFR inhibitor therapy in non-small cell lung cancer. Because of its noninvasiveness, speed, and sensitivity, FDG-PET is a potential surrogate endpoint for treatment efficacy. However, an imaging biomarker has to meet rigorous criteria before it can be used as a surrogate endpoint in a pivotal clinical trial. Ross Prentice, in a 1989 Statistics in Medicine paper, proposed a set of criteria for establishing the surrogacy status of a new biomarker: (1) the treatment must modify the surrogate, (2) the treatment must modify the clinical endpoint, (3) the surrogate and clinical endpoint must be significantly correlated, and (4) the effect of treatment on the clinical endpoint should disappear when statistically adjusting for its effect on the surrogate. Validating the surrogacy status of an imaging biomarker against these criteria would require a large, prospectively conducted trial. Novel study designs and further statistical research are much needed in this promising area.
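One common way to write these criteria schematically (a standard textbook-style restatement, not Prentice's original notation) uses the treatment $Z$, the surrogate $S$ (e.g., the early change in SUV), and the clinical endpoint $T$ (e.g., progression-free survival):

```latex
\begin{align*}
&\text{(1)}\quad S \not\perp Z               &&\text{treatment affects the surrogate}\\
&\text{(2)}\quad T \not\perp Z               &&\text{treatment affects the clinical endpoint}\\
&\text{(3)}\quad T \not\perp S               &&\text{surrogate and endpoint are associated}\\
&\text{(4)}\quad f(t \mid S, Z) = f(t \mid S) &&\text{the surrogate captures the entire treatment effect}
\end{align*}
```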
Codevelopment of biomarkers and target agents

Biomarkers that identify subgroups of patients with unique molecular signatures are often codeveloped with drugs designed to target a specific pathway or gene mutation. It is often impossible to have the candidate diagnostic completely specified and analytically validated prior to its use in pivotal clinical trials. Richard Simon and his colleagues proposed a study design and analysis plan for phase III trials that can be used when no classifier is available at the start of the trial. For the case in which a continuous diagnostic test is available at the start of the trial but the optimal cutoff for test positivity has not yet been established, they studied a biomarker-adaptive threshold design and developed a strategy for analyzing data from such a design. In addition, how to estimate the treatment effect and how to make an unbiased assessment of the performance of the biomarker in enrichment designs require further investigation.
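The general idea behind such an adaptive threshold analysis can be sketched as follows (this is only a schematic illustration, not the specific procedure or alpha-splitting rules proposed by Simon and colleagues): scan a set of candidate cutoffs, record the largest subgroup treatment-effect statistic, and calibrate that maximum by permuting the treatment labels.

```python
import numpy as np

def max_subgroup_stat(marker, outcome, treated, cutoffs):
    """Largest two-sample z-type statistic for the treatment effect over marker >= cutoff subgroups."""
    stats = []
    for c in cutoffs:
        in_group = marker >= c
        a, b = outcome[in_group & treated], outcome[in_group & ~treated]
        if len(a) > 1 and len(b) > 1:
            se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
            stats.append((a.mean() - b.mean()) / se)
    return max(stats)

rng = np.random.default_rng(4)
n = 300
marker = rng.random(n)
treated = rng.random(n) < 0.5
outcome = rng.normal(size=n) + 1.0 * (treated & (marker >= 0.6))  # benefit only above an unknown cutoff

cutoffs = np.quantile(marker, [0.0, 0.2, 0.4, 0.6, 0.8])
observed = max_subgroup_stat(marker, outcome, treated, cutoffs)
null_draws = [max_subgroup_stat(marker, outcome, rng.permutation(treated), cutoffs)
              for _ in range(999)]
p_value = np.mean([d >= observed for d in null_draws])
print(f"maximum subgroup statistic: {observed:.2f}, permutation p-value: {p_value:.3f}")
```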
Conclusions

The increased availability of high-throughput technologies has provided exciting opportunities and daunting challenges for researchers developing robust and useful biomarkers for clinical decision-making. Biostatisticians and bioinformaticians are trained to use statistical theory to tackle complex biological problems such as these. Meaningful success in these fields depends on how well these issues are addressed and on how well strong collaborations among biologists, oncologists, and statisticians are developed in the process.
Further Reading

Benjamini Y. and Hochberg Y. (1995) “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B, 57:289–300.

Biomarkers Definitions Working Group. (2001) “Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework.” Clinical Pharmacology & Therapeutics, 69:89–95.

Dudoit S., Fridlyand J., Speed T.P. (2002) “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data.” Journal of the American Statistical Association, 97:77–87.

Heagerty P.J., Lumley T., Pepe M.S. (2000) “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics, 56(2):337–344.

Pepe M.S. (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.

Potti A., Nevins J.R., et al. (2006) “A Genomic Strategy to Refine Prognosis in Early-Stage Non-Small-Cell Lung Cancer.” The New England Journal of Medicine, 355:570–580.

Prentice R. (1989) “Surrogate Endpoints in Clinical Trials: Definition and Operational Criteria.” Statistics in Medicine, 8:431–440.

Shedden K., Taylor J.M.G., Enkemann S.A., Tsao M-S., Yeatman T.J., et al. (2008) “Gene Expression-Based Survival Prediction in Lung Adenocarcinoma: A Multi-Site, Blinded Validation Study.” Nature Medicine, 14:822–827.

Simon R., Korn E.L., McShane L.M., Radmacher M.D., Wright G., Zhao Y. (2004) Design and Analysis of DNA Microarray Investigations. New York: Springer.

Simon R. and Maitournam A. (2004) “Evaluating the Efficiency of Targeted Designs for Randomized Clinical Trials.” Clinical Cancer Research, 10(20):6759–6763.

Simon R. (2008) “Designs and Adaptive Analysis Plans for Pivotal Clinical Trials of Therapeutics and Companion Diagnostics.” Expert Opinion on Medical Diagnostics, 2(6):721–729.
“Here’s to Your Health” prints columns about medical and health-related topics. Please contact Mark Glickman ([email protected]) if you are interested in submitting an article.
Goodness of Wit Test Jonathan Berkowitz, Column Editor
The global economic crisis started in large part with the home mortgage business. In honor of the central role of real estate, the title of this issue’s puzzle is the three-word guiding principle of real estate. The puzzle is a standard bar-type cryptic with one additional solving requirement. It’s definitely not your average puzzle!

Reminder: A guide to solving cryptic clues appeared in CHANCE 21(3). The use of solving aids (electronic dictionaries, the internet, etc.) is encouraged. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by June 15, 2009. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC, Canada V6N 3S2, or send him a list of your answers by email ([email protected]). Note that winners of any of the three previous issues are not eligible to win this issue’s contest.
Planning Your Summer Vacation Are you looking for something new and exciting to do this summer? Why not attend the annual convention of the National Puzzlers’ League in Baltimore, Maryland, July 9–12, 2009? It is the greatest collection of word game and puzzle enthusiasts imaginable. You’ll have a chance to meet and solve puzzles with some of the best crossword puzzle constructors and solvers in North America. Almost all the top performers and organizers of the American Crossword Puzzle Tournament (as featured in the movie Wordplay) are members of the NPL. Check us out at www.puzzlers.org.
Solution to Goodness of Wit Test #2

This puzzle appeared in CHANCE 21(4), p. 64. Deleted letters spell: Term coined by Tukey for waste recycling. Anagram of “for waste” = SOFTWARE.

Across: 1 ALLEGRET(T)O [rebus + reversal: all + eg + retto (otter)] 10 PA(E)AN [reversal + container: pae(a)n (neap)] 11 TROOPE(R) [reversal + container: tro(o)per (report)] 12 (M)ATTRESS [rebus: mat + tress] 13 MEC(C)A [initial letters: m+e+c+c+a] 14 ESTIMATI(O)N [anagram: so intimate] 16 H(I)ONS [anagram: NIOSH] 17 PROTO(N)S [anagram: sport on] 20 SOFTWARE [anagram: for waste] 21 AM(E)NITY [anagram: anytime] 23 UPEN(D) [homophone: append] 26 (B)REEZELESS [anagram + container + beheadment: beer + z + (us)eless] 28 (Y)OURS [rebus: hours – h + y] 29 NOISET(T)E [pun: noise + “ette”] 31 DR(U)GGED [rebus: d + rugged] 32 KIOS(K) [initial letters: k+i+o+s+k] 33 ASSESSM(E)NT [rebus: asses + s + men + t]

Down: 1 APATH(Y) [rebus: a + path + y] 2 (F)LATWORM [rebus + containers: f(l)a(two)rm] 3 LATIN(O) [hidden word: fLAT IN Oxford] 4 GOE(R) [anagram: Gore] 5 (W)RISTS [container: wri(s)ts] 6 TRIM(A)RANS [rebus: trim + arans (Saran)] 7 RO(S)ETTE [container: ro(set)te] 8 OP CI(T) [anagram: optic] 9 MEAN(E)ST [pun] 15 SOFTENE(R)S [container: s(oftener)s] 18 OVERTON(E) [rebus: overt + one] 19 BAR(C)ODE [anagram: brocade] 20 S(Y)NERGY [rebus + container: s(y)n + erg(y)] 22 DE(C)ODE [container: De(cod)e] 24 PRE(L)IM [rebus + reversal: p + relim (miler)] 25 I(N)VEST [rebus: in + vest] 27 (N)EURO [rebus: n + Euro] 30 (G)IGS [double definition]

Winners from Goodness of Wit Test #2—Waste Recycling

Lawrence Schwartz has been a statistician with Bayer HealthCare Pharmaceuticals since 1984. Married with one son and living in Connecticut, he enjoys sports, reading, music, playing games, and solving puzzles, especially cryptic crosswords.

Jeffrey Passel spent the first half of his career at the U.S. Census Bureau and the second half at the Urban Institute. He is now in the midst of the “third half” at the Pew Hispanic Center. His outside interests include word puzzles, baseball, food, wine, and a small bit of gardening.
Goodness of Wit Test #4: Location, Location, Location!

Instructions: Answer words indicated by asterisks form three sets of three words where each set has something in common. As part of your solution, give the three sets of three words and state what the words in each set have in common. Answers include five capitalized entries.
Across
1 Hailstorm swirls around middle of ocean, being equally warm (10)
10 Fertilizer component hidden in bureau (4)
12 Pursue switch of third and fourth trial parts (5)
*14 Willing but not able to give hearty approval (4)
15 Ms. Kournikova holds net back for receiver (7)
*16 Unmarried woman’s assistant in Minnesota (6)
17 Discourage resolve without explosive (5)
*20 Show off source of evil without end (4)
21 Old boats heading away from targets (4)
24 I termed possible fault (7)
26 Realism ruined posters (7)
*27 Select central piece of ornament (4)
29 Adversaries using flowers oddly (4)
34 God! Year following crazy diet (5)
*35 Lady holds one new father (6)
36 Judge northern conflagration (7)
38 Musicians’ hearing outlawed (4)
40 Former South African province of birth (5)
41 Is behind one red flower (4)
42 ER mandates new brands (10, 2 wds)
Down
1 Prayer leaders aim poorly at any woman (5)
2 Indonesian island edited Kama Sutra after a king departed (7)
3 Rich source of Danish money (3)
4 Takes care of latest of closest finishes (5)
5 Despised the ad remake (5)
6 Gives new labels re Mensa puzzles (7)
7 Live in most of region (3)
8 Jacket inside straight missing ace (5)
9 Wine cartel reorganized (6)
*11 Listen to principal with long hair (4)
13 An insect middle dropping out infectious disease (7)
*17 Sport stadium right out of racecourse (4)
18 Two-thirds of eleven exactly! (4)
19 Way to get up mountain if in small Scottish outfit (7, 2 wds)
22 Spot endless deadly sin (4)
23 Challenge cool club (4)
24 Overwhelmed scout pack promise on the rise (7)
25 Eminem’s strangely great (7)
*26 Crossing Ahmed in a holy city (6)
28 Write Gore about punishment (5)
*30 Leaders of each district of Moses’ biblical land (4)
31 Posed in luxurious material (5)
32 Rumba wildly in shaded area (5)
33 Range as well as transcendental numbers (5)
37 Hospital drama captivating a listener (3)
39 In retreat lost soldier’s intent (3)