Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

This issue of CHANCE contains articles about a variety of interesting topics. Three articles are associated by their concern with missing data and methods for imputation, or filling in the missing values. Tom Krenzke and David Judkins use data from the National Education Longitudinal Survey to illustrate a semiparametric approach to imputation in complex surveys. In the area of health, Michael Elliott describes a study of childhood obesity and some of its associated complications. Novel mixture model and multiple imputation approaches are used to address unusual observations, probable transcription errors, and missing data. His medical collaborator, Nick Stettler, comments on aspects of their interaction that enhanced the consulting experience for all parties involved. Mark Glickman brings us an article using multiple imputation in his Here's to Your Health column. In this issue, Yulei He, Recai Yucel, and Alan Zaslavsky model the relationship between cancer registry data and survey data and use multiple imputation to improve the quality and quantity of information available for analysis.

Two articles have sports themes. Eric Bradlow, Shane Jensen, Justin Wolfers, and Adi Wyner address the debate about baseball pitcher Roger Clemens and whether he used steroids. They examine a broad set of comparison pitchers on several dimensions. Read the article to learn their conclusions! Phil Everson, in his A Statistician Reads the Sports Pages column, examines the importance of offense versus defense in Women's World Cup Soccer. In particular, did the United States make a strategic mistake in the 2007 final?
Peter Freeman, Joseph Richards, Chad Schafer, and Ann Lee illustrate the mass of data coming from the field of astrostatistics. The quality and quantity of information will allow examination of fundamental questions. Simo Puntanen and George Styan discuss postage stamps with a probability and statistics theme. Actually, there are many more such stamps, and they will be described in upcoming articles. Peter Olofsson critiques arguments made by supporters of the idea of intelligent design on grounds of probability and hypothesis testing logic.

Two additional columns and a letter to the editor complete the issue. Grace Lee, Paul Velleman, and Howard Wainer take on claims by computerized dating services in Visual Revelations; Donald Berry comments on the previous Visual Revelations column in a letter to the editor; and Jonathan Berkowitz brings us his first puzzle as column editor of Goodness of Wit Test. Some guidance on solving this and other puzzles accompanies this first column.

In other news, CHANCE cosponsored eight sessions at the recent Joint Statistical Meetings. (See the online program at www.amstat.org/meetings/jsm/2008 and select "CHANCE" as the sponsor.) We hope to encourage submissions to CHANCE in diverse areas on significant issues such as those discussed in these sessions. Plans for current issues of CHANCE to go online (in addition to the print version) for subscribers and libraries in 2009 are moving along. I think this will be a positive development for readers and authors of CHANCE.

I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
Letter to the Editor
The Best Graph May Be No Graph

Dear Editor,

In CHANCE 21(2), Howard Wainer writes about "Improving Graphic Displays by Controlling Creativity." He makes good suggestions. In one example (Figure 4), he offers 10 improvements (Figure 5) on a report of "five-year survival rates from various kinds of cancer, showing the improvements over the past two decades" (from the National Cancer Institute). Indeed, the latter figure is neater. But, he missed the most important improvement: not showing the figure in the first place! It's terribly misleading and doesn't necessarily reflect any real improvement "over the past two decades."

The three cancers with survival improvement over time (i.e., breast, prostate, colorectal) are those with intensified screening programs over these two decades. Much, if not all, of the higher survival rates is due to what are called the lead time and length biases of screening. These biases are elementary and fundamental in cancer epidemiology.

Lead-time bias is the easier of the two to understand. Someone whose cancer is detected n years early in a screening program lives up to n years longer after her tumor is discovered. The pure bias of n years adds to the cancer survival time of everyone whose tumors were detected by screening. Because of the heterogeneity of cancer, the value of n is highly variable and unknown for any particular tumor. The average of n is also unknown, but it is substantial; it is commonly estimated to be 3–5 years in breast cancer.

The "length" in length bias refers to the tumor's pre-symptomatic period, when it is detectable by screening, called the sojourn time. Aggressive tumors have shorter sojourn times because they grow faster. Indolent tumors have longer sojourn times. Screening finds tumors in proportion to the lengths of their sojourn times. Screening preferentially selects tumors with longer sojourn times and, therefore, tumors detected through screening are slower growing and less lethal. An extreme form of length bias is overdiagnosis, in which some cancers are found by screening that would not have caused symptoms or death.

There are many analogues that may help one's intuition regarding length bias, and these should be familiar to statisticians. When you look into the sky and see a shooting star, it's more likely to be one with a longer arc, simply because it's the one you saw. Or, when you select a potato chip from a newly opened bag, it's more likely to be a bigger one, simply because bigger ones are more likely to be selected. Waiting time paradoxes are standard examples. Suppose the interarrival times of buses at a certain bus stop are independently exponentially distributed, all with mean m. You arrive at the stop at an arbitrary time and catch a bus. What is the mean
time between the arrival of the bus you caught and that of the previous bus? The answer is 2m.

I don't mean to suggest that we have not made important strides in treating cancer over the last two decades. We have. But, although Figures 4 and 5 are literally correct, they reflect mostly artifact and greatly exaggerate these strides. Similar figures have been misinterpreted by policymakers and the press and have led to inappropriate recommendations regarding screening, with potentially deleterious effects. The only good use of these figures is as an example for teaching, to demonstrate how easy it is to lie with statistics.

Donald Berry
Head, Division of Quantitative Sciences & Chair, Department of Biostatistics & Frank T. McGraw Memorial Chair of Cancer Research, The University of Texas M.D. Anderson Cancer Center
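Berry's 2m answer is the classic inspection (length-bias) paradox, and it is easy to confirm numerically. The short Python sketch below is ours, not part of the letter; the mean of 10 minutes is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10.0                      # mean interarrival time (arbitrary value for illustration)
arrivals = np.cumsum(rng.exponential(m, size=200_000))

# Drop an observer at a uniformly random time well inside the schedule and
# record the length of the interarrival gap that contains that time.
gaps = []
for t in rng.uniform(arrivals[0], arrivals[-2], size=5_000):
    i = np.searchsorted(arrivals, t)        # index of the bus the observer catches
    gaps.append(arrivals[i] - arrivals[i - 1])

print(np.mean(gaps))   # close to 2*m, not m: longer gaps are more likely to be sampled
```

Longer gaps are proportionally more likely to contain the observer's arrival time, which is exactly the length bias Berry describes for tumor sojourn times.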
Howard Wainer responds:

I am delighted that Professor Berry raised this issue. While I was preparing this column, I debated with myself (and my colleague, Brian Clauser) this very point and decided not to include it, for it seemed an aside from my main point (fixing graphs to communicate better) and confused the goals of description with those of causation. As a descriptive graph, the figures are correct—survival times are increasing. But the causal inference, why they are increasing, is what Professor Berry addresses. The issue is how much of the improvement is due to earlier detection and how much is due to improved treatment. This seems to me to be hard to partition. Perhaps adjusting survival rates by, say, the maturity of the tumor at the time of discovery would provide some help. I would be interested in other schemes that could help us measure the causal effect of the changes in treatment.
Correction
According to Steve Stigler of The University of Chicago, the picture represented on Page 29 of CHANCE, volume 21, number 2, is an 1842 posthumous painting of Laplace, not of Chevalier de Méré. We have not located a confirmed image of Chevalier de Méré.
Filling in the Blanks: Some Guesses Are Better Than Others
Illustrating the impact of covariate selection when imputing complex survey items
Tom Krenzke and David Judkins
Imputation is the statistical process of filling in missing values with educated guesses to produce a complete data set. Among the objectives of imputation is the preservation of multivariate structure. What is the impact of common naïve imputation approaches when compared to that of a more sophisticated approach?

Fully imputing responses to a survey questionnaire in preparation for data publication can be a major undertaking. Common challenges include complex skip patterns, complex patterns of missingness, a large number of variables, a variety of variable types (e.g., normal, transformable to normal, other continuous, count, Likert, other discrete ordered, Bernoulli, and multinomial), and both time and budget constraints. Faced with such challenges, a common approach is to simplify imputation by focusing on the preservation of a small number of multivariate structural features. For instance, a hot deck imputation scheme randomly selects respondents as donors for missing cases, and, similarly, a hot deck within cells procedure randomly selects donors within the same cell defined by a few categorical variables. To simplify the hot deck procedure, a separate hot deck with cells defined by a small common set of variables (e.g., age, race, and sex) might be used for each variable targeted for imputation. Another example, in the context of a longitudinal survey, might be to simply carry forward the last reported value for each target variable. Although such procedures are inexpensive and adequately preserve some important multivariate structural features, they may blur many other such features. Such blurring, of course, diminishes the value of the published data for researchers interested in a different set of structural features than those preserved by the data publisher's imputation process.

We have been working on imputation algorithms that preserve a larger number of multivariate structural features. Our algorithms allow some advance targeting of features to be preserved, but also try to discover and preserve strong unanticipated features in the hopes of better serving secondary data analysts. The discovery process is designed to work without human intervention and with only minimal human guidance. In this article, we illustrate the effect of our imputation algorithm compared to simpler algorithms. To do so, we use data from the National Education Longitudinal Survey (NELS), which is a longitudinal study of students conducted for the U.S. Department of Education's National Center for Education Statistics.
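To make these naïve options concrete, here is a minimal Python sketch on an invented toy file (the wave columns and values are ours, not NELS items): it carries the last reported value forward across waves and then fills any remaining holes with an unconditional hot deck.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy longitudinal file: one row per respondent, one column per wave (values invented).
df = pd.DataFrame({"wave1": [3, np.nan, 2, 4, 2],
                   "wave2": [3, 1, np.nan, np.nan, 1],
                   "wave3": [np.nan, np.nan, 2, 4, np.nan]})

# Naive option 1: carry the last reported value forward across waves.
filled = df.ffill(axis=1)

# Naive option 2: unconditional hot deck -- fill each remaining hole with a
# randomly chosen reported value from the same column.
for col in filled:
    donors = filled[col].dropna().to_numpy()
    miss = filled[col].isna()
    filled.loc[miss, col] = rng.choice(donors, size=miss.sum())

print(filled)
```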
The NELS provides data about the experiences of a cohort of 8th-grade students in 1988 as they progress through middle and high schools and enter post-secondary institutions or the work force. The 1988 baseline survey was followed up at two-year intervals, from 1990 through 1994. In addition to student responses, the survey also collected data from parents, teachers, and principals. We use parent data (family income and religious affiliation) from the second follow up (1992) and student data (e.g., sexual behavior and expected educational attainment) from the third follow up (1994), by which time the modal student age was 20 years. This results in
a sample size of approximately 15,000 student-parent dyads. We chose this set of variables for the variety of measurement scales and because the multivariate structure of this group of variables was not a central interest of the NELS. The example is, therefore, illustrative of what can happen when secondary analysts investigate new issues with existing data sets. There are interesting features in this subset of the NELS data. For example, there is a moderate correlation (0.31) between family income (reported by a parent) and the student’s expected education level at age 30 (self-reported). The statistics provided in this article are for illustration purposes and are not intended to be official statistics. Weights were not used in the generation of results, and response categories for some variables are collapsed to simplify the presentation. Family income is also moderately correlated (-0.28) with whether the student ever dropped out of school.
Challenges in Survey Data

Some survey data complexities are described here to explain why data publishers may abridge their imputation approach. Among the issues to address are skip patterns, which begin with a response to a trigger item (referred to as the "skip controller") and continue by leading the respondent through a certain series of questions dependent upon the response to the skip controller. For example, a "yes" response to the NELS item, "Have you had sexual intercourse?" would lead to a question about the date of first occurrence. A "no" response would result in the question about the date of first occurrence being skipped. Survey data become increasingly complex, as dozens or hundreds of items are nested within questions.

Figure 1 shows a designed skip pattern for nine observations and three generic variables—A, B, and C—that creates a monotone pattern of missing values. In the figure, a -1 is used to show an inapplicable value. For example, case 6 reports a 1 for question A, a 2 for question B, and then skips question C. The shaded cells represent missing values. Another complexity is the "Swiss cheese" (or nonmonotone) pattern of missing data.
Figure 1. Skip pattern illustration for sequential questions A, B, and C on nine cases. The shaded cells (shown here as ·) are missing values. Values of -1 indicate the question is not asked because it is not applicable.

Case | A | B  | C
1    | · | ·  | ·
2    | 1 | ·  | ·
3    | 1 | 1  | ·
4    | 1 | 1  | 1
5    | 1 | 1  | 2
6    | 1 | 2  | -1
7    | 1 | 2  | -1
8    | 2 | -1 | -1
9    | 2 | -1 | -1

Figure 2. Swiss cheese missing data pattern illustration for questions A, B, and C on six cases. Shaded cells (shown here as ·) represent missing values; x denotes a reported value.

Case | A | B | C
11   | x | · | x
12   | x | x | ·
13   | · | x | x
14   | x | · | ·
15   | x | x | x
16   | x | x | ·
The Swiss cheese pattern causes havoc in attempts to preserve relationships between variables. The number of distinct missing data patterns escalates as the number of survey items grows. Furthermore, Swiss cheese patterns may occur within a pool of items controlled by the same skip controller. Each item to be imputed may be associated with a different set of key covariate predictor variables. Figure 2 shows a Swiss cheese pattern for six observations and three generic variables—A, B, and C—where x denotes a nonmissing value and the shaded cells denote missing values. For instance, case 11 reports a value for each of questions A and C, but not for question B.

Managing different types of variables and retaining their univariate, bivariate, and multivariate distributions represents another set of challenges. Variables can be ordinal (e.g., year of first occurrence of sexual intercourse) or nominal (e.g., religious affiliation and race). Their distributions can be discrete, continuous, or semi-continuous (e.g., family income, where several modes appear in an otherwise continuous distribution due to rounding of reported values). The distribution of expected income at age 30 is semi-continuous; a portion of the distribution, extracted for values between $10,000 and $100,000, is displayed in Figure 3. Due to disclosure control, values lower than $10,000 and greater than $100,000 are suppressed, as are values between spikes with small percentages. Note the spikes caused by respondent rounding. Figure 4 shows the distribution for month of first occurrence of sexual intercourse. Perhaps others knew of the summer peak, but it is not a pattern we anticipated, and it is an example of the sort of fascinating discoveries that can be made in secondary analysis provided the data publisher has not blurred these features through poor imputation procedures.
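The proliferation of Swiss cheese patterns is easy to see by tabulating the distinct response/missingness patterns in a file; the sketch below uses simulated data with invented item names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Toy survey file: 1,000 respondents, 8 items, each value independently missing 15% of the time.
items = pd.DataFrame(rng.normal(size=(1000, 8)),
                     columns=[f"q{i}" for i in range(1, 9)])
items = items.mask(rng.random(items.shape) < 0.15)

# Encode each row's missingness as a string of x (reported) and . (missing),
# then count how many distinct patterns occur.
patterns = items.isna().apply(lambda row: "".join("." if m else "x" for m in row), axis=1)
print(patterns.value_counts().head())
print("distinct patterns:", patterns.nunique())   # typically on the order of a hundred
                                                  # of the 2**8 = 256 possible patterns
```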
Covariate Selection and the Semiparametric Approach
We compare simple and data-driven semiparametric covariate selection options in our illustration. For the simple covariate selection option, three demographic variables (race, age, and sex) are cross-classified to form hot-deck imputation cells. For each target variable and observation with a missing value, a donor is selected at random within the hot-deck cell, and the donor's value is used to fill in the missing value of the target variable. We refer to this approach as the simple hot deck.
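A bare-bones version of the simple hot deck just described might look like the sketch below. The column names are hypothetical and the function is our own illustration, not the authors' production code.

```python
import numpy as np
import pandas as pd

def simple_hot_deck(df, target, cells=("race", "age_group", "sex"), seed=0):
    """Fill missing values of `target` by copying a random donor's value from
    the same race x age_group x sex cell (column names are illustrative)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, idx in out.groupby(list(cells)).groups.items():
        block = out.loc[idx, target]
        donors = block.dropna().to_numpy()
        if len(donors) == 0:
            continue                      # no donor in this cell; leave missing
        miss = block.index[block.isna()]
        out.loc[miss, target] = rng.choice(donors, size=len(miss))
    return out
```

In the illustration, a function like this would be run once per target variable, always with the same three cell variables, which is precisely why relationships not captured by race, age, and sex tend to be attenuated.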
Figure 3. Distribution of expected income at age 30, between $10,000 and $100,000. Due to disclosure control, values lower than $10,000 and greater than $100,000 are suppressed, as well as values between spikes with small percentages. (Vertical axis: percent.)

Figure 4. Month of first occurrence of sexual intercourse. Respondents having had intercourse reported values 1 (January) to 12 (December). Respondents not having had intercourse have a value of -1 recorded. The height of all bars adds to 100.0. (Vertical axis: percent.)
The second, more sophisticated, option includes a more extensive search for covariates. This is the data-driven, semiparametric approach.

The impetus behind the semiparametric approach was a friendly competition in 1993 between teams of design-based and Bayesian statisticians. The competition was the brainchild of Meena Khare and Trena Ezzati-Rice, both then at the National Center for Health Statistics. The goal was imputation of data for the National Health and Nutrition Examination Survey (NHANES) III. The Bayesian team's driving force was Joe Schafer, who had developed imputation software based on Gibbs sampling. He was backed by Rod Little and Don Rubin. On the design-based team were David Judkins and Mansour Fahimi, backed up by Joe Waksberg and Katie Hubbell. Judkins had worked with others on specialized iterative semiparametric procedures in the early 1990s, but these were not yet ready for general use. Instead, the design-based team used more traditional methods—one nonparametric approach and one that was semiparametric but did not involve iteration. The Bayesians were declared the winners of that competition at JSM in San Francisco that summer. Design-based statisticians such as Ralph Folsom, the session discussant, were impressed. Judkins also admired the ability of the new Bayesian approach to preserve multivariate structure. However, semiparametric techniques are known to perform better at preserving unusual marginal distributions of single variables. Since that time, Judkins and others at Westat have
been working to develop imputation software that preserves both complex multivariate structures and marginal distributions with unusual shapes.

A solution for imputing ordinal and interval-valued variables is to use cyclic n-partition hot decks, where the partition for any variable is formed by coarsening the predictions from a parametric model for the variable in terms of reported and currently imputed data. The general approach is defined by several features. First, a simple methodology is used to create a first version of a complete data set. Second, each variable is sequentially re-imputed using a partition optimized for it in terms of the other variables. After every variable has been re-imputed once, the process is repeated. This goes on until some measure of convergence is satisfied. Our current procedure uses unguided step-wise regression and stratification on predicted values to form the partitions, but other methods—such as hand-crafted models—are possible. Recent simulation studies have confirmed that this method can preserve pair-wise relationships that are nonlinear but monotonic, as well as highly unusual marginal shapes.

The solution we developed for the imputation of nominal variables is more complex and involves the following steps (a rough code sketch appears at the end of this discussion):

• Create a vector of indicator variables for the levels of a nominal variable
• Create a separate parametric model (stepwise) for each indicator variable (in terms of other variables in the data set)
• Using the estimated models, estimate the indicator-variable propensities (the probabilities that an indicator has a value of 1) for each indicator for each observation
• Run a k-means clustering algorithm on the propensity vectors
• Randomly match donors and recipients within the resulting clusters
• Copy the reported value of the target variable from the donor to the recipient

As a fillip, the clustering algorithm is actually run several times with different numbers of clusters. Then, if a small cluster happens to have a very low ratio of reported values to missing values, the search for a donor automatically retreats to a cluster from a coarser partition. This avoidance of the overuse of a small number of reported values tends to improve variances on reported marginal distributions at the cost of some decrease in the preservation of covariance structure. A simulation study by David Judkins, Andrea Piesse, Tom Krenzke, Zizhong Fan, and Wen-Chau Haung that was published in the 2007 JSM Proceedings also demonstrated that this procedure can preserve odd relationships between nominal variables.

For example, when imputing religion, a series of separate models are fit for the probabilities of being Catholic, Lutheran, Jewish, Buddhist, atheist, and so on. The same covariates are not required to be used in each model. A particular cluster may contain people who have a high probability of belonging to one particular religion, or it may contain people who have a high probability of belonging to one of a small number of religions, depending on the number of clusters in the partition and the strength of the covariates.
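As a rough sketch of the propensity-clustering step for a single nominal variable, one might write something like the following. This is our own simplification—plain logistic regressions instead of stepwise models, one fixed number of clusters, no fallback to coarser partitions, and no cyclic re-imputation—so it should be read as an illustration of the idea rather than as the Westat implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

def cluster_hot_deck(df, target, predictors, n_clusters=10, seed=0):
    """Propensity-clustering hot deck for one nominal variable (simplified sketch).
    Assumes `predictors` are numeric and already complete; column names are hypothetical."""
    rng = np.random.default_rng(seed)
    X = df[predictors].to_numpy(dtype=float)
    obs = df[target].notna().to_numpy()

    # One propensity model per level of the nominal target, fit on reported cases.
    levels = df.loc[obs, target].unique()
    props = np.column_stack([
        LogisticRegression(max_iter=1000)
        .fit(X[obs], (df.loc[obs, target] == lev).astype(int))
        .predict_proba(X)[:, 1]
        for lev in levels
    ])

    # Cluster every case on its vector of estimated propensities, then copy a
    # randomly chosen reported value within the recipient's cluster.
    cluster = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(props)
    out = df[target].copy()
    for c in np.unique(cluster):
        in_c = cluster == c
        donors = df.loc[in_c & obs, target].to_numpy()
        recips = df.index[in_c & ~obs]
        if len(donors) and len(recips):
            out.loc[recips] = rng.choice(donors, size=len(recips))
    return out
```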
Table 1—Data Dictionary for Illustration Variables
(Columns: Variable | Variable Values | Variable Type, where O = Ordinal and N = Nominal | Item Nonresponse Rate)

Target Variables
Total family income from all sources 1991 | 15 categories: none to $200,000 or more | O | 18%
Expected income at age 30 | $0 to $1,000,000 or more | O | 14%
Reported previous sexual intercourse occurrence | Yes, No | O | 5%
Month of first intercourse occurrence | Not applicable, January–December | N | 13%
Age of first intercourse occurrence | Not applicable, 1–23 years | O | 13%
Respondent's religion | Jewish, Mormon, Roman Catholic/Orthodox, Other Christian, Other, None | N | 12%
Expected occupation at age 30 | Blue collar, Clerical, Manager, Owner, Professional, Protection, Teacher, Other, Not working | N | 10%

Hot Deck Cell Variables
Age categories | Less than 19.25, between 19.25 and 20.25, more than 20.25 | O | 0%
Race/Ethnicity | Hispanic, Black, Other | N | 0%
Sex | Male, Female | O | 0%

Key Predictors
Highest level of education expected | 9 categories: Some high school to college/PhD and college/professional degree | O | 2%
Ever dropped out flag | Never dropped out of high school, dropped out of high school at least once | O | 0%
Current marital status | 5 categories: Single never married, Married, Divorced/Separated, Widowed, Marriage-like relationship | N | <1%
Number of biological children | 6 categories: No children to Six | O | <1%
Voted in past elections | Had voted, Had not voted | O | <1%
Census region of student's school | Northeast, Midwest, South, West | N | 5%
Health problems limit type of work | Yes, No | O | <1%
Time spent on religious activities (one or more times per week) | Yes, No | O | <1%
Number of hours watch TV weekdays | 10 categories: No TV on weekdays to 8 hours or more | O | <1%
Volunteered at church or church-related activities (not including worship services) in past 12 months? | Yes, No | O | <1%
Age of respondent at time of interview | 14–24 years | O | 0%
Illustration Setup
We identify three major types of variables: target variables to be imputed, hot deck cell variables, and key predictors. The target variables are chosen because they are self-reported and have relatively high item nonresponse rates. As seen in Table 1, the variable with the highest item nonresponse rate is family income (18% missing) and the one with the lowest rate is the question about whether the respondent had a prior sexual intercourse occurrence (5%).
Figure 5. R2 values for multivariate regressions for target variables on all variables other than hot deck cell variables. The horizontal axis is R2 values for the observed data. The vertical axis is R2 values for the imputed data using two methods.
The other target items are age at first occurrence of intercourse, month of first occurrence of intercourse, expected income at age 30, expected occupation at age 30, and religious affiliation. The hot deck cell variables are race (three categories), age (three categories), and sex. Race and age are pre-imputed such that each is nonmissing for this illustration, and sex is nonmissing in NELS. There are 11 key predictors selected. The variables are chosen based on their theory-driven relationship to the key imputed variables and potential interest to the reader.

Besides the semicontinuous and nonuniform distributions shown in Figures 3 and 4, respectively, a skip pattern is given by the prior sexual occurrence variable, which acts as a skip controller for the age at first occurrence of intercourse (13% missing) and month of occurrence (13% missing). In addition, as shown in the variable type column in Table 1, there is a mix of nominal (N) and ordinal (O) variables. Lastly, a Swiss cheese pattern of missingness exists in cross-classifications among the target variables.

The data file was fully completed twice: once by the simple covariate option via the simple hot deck with a limited fixed set of predictors, and once by the more extensive covariate search via the more sophisticated semiparametric approach. Pair-wise correlations and R2 values from multiple regression models were computed based on the two versions of imputed data and on the observed data as well. The pair-wise correlations were produced among and across the three variable types (target variables, hot deck cell variables, and key predictors). While the nominal variables were imputed in their original form, they were recoded as indicator variables for computing correlations and regression model processing.

One interesting, but not surprising, result is that both imputation approaches retain the pair-wise correlations that
involve both a hot deck cell variable and a key imputed item. However, among correlations that do not involve the hot deck cell variables (e.g., the correlation between family income and the highest level of education expected), the correlation from the simple hot deck is attenuated for many items. The more extensive semiparametric approach does a better job of retaining the pair-wise correlations.

Similarly, two sets of multiple regression models were processed for each key imputed item. The first set of models included only the hot deck cell variables as independent variables. The R2 values for multivariate regressions of target variables on hot deck cell variables are fairly consistent with R2 values from the observed data, regardless of the imputation approach. The second set of regression models included all non–hot deck cell variables as the independent variables. In Figure 5, the horizontal axis shows R2 values for the observed data and the vertical axis shows R2 values for the imputed data under the two methods. As shown in the figure, the semiparametric approach tends to arrive at the same R2 values as computed among the reported data, whereas the hot deck approach generally results in lower R2 values. This verifies that the simplified hot deck approach blurs multivariate structural features more strongly than the semiparametric approach does. Similar plots that include hot deck cell variables as independent variables show less of a difference.

We also highlight examples of the blurring that occurred under the simplified approach. First, a couple of results from the hot deck approach are glaringly incorrect. We would hope to retain obvious relationships in the data, such as atheists having little to do with religious activities or Mormons residing in the West. As shown in Table 2, such distributions deviate from the obvious under the hot deck approach.
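The comparisons behind Figure 5 and the correlation checks can be scripted directly; the helper below (hypothetical data frames and variable names) contrasts pair-wise correlations computed on the observed cases with those computed on two fully imputed versions of the file. The R2 comparison is analogous, with regression fits in place of correlations.

```python
import pandas as pd

def compare_correlations(observed, imputed_simple, imputed_semi, pairs):
    """Compare pair-wise correlations on complete cases of the observed data
    with those on two fully imputed versions of the same file.
    `pairs` is a list of (variable, variable) tuples; names are illustrative."""
    rows = []
    for x, y in pairs:
        rows.append({
            "pair": f"{x} ~ {y}",
            "observed": observed[[x, y]].dropna().corr().iloc[0, 1],
            "semiparametric": imputed_semi[[x, y]].corr().iloc[0, 1],
            "simple hot deck": imputed_simple[[x, y]].corr().iloc[0, 1],
        })
    return pd.DataFrame(rows)

# Hypothetical call, with invented data frame and column names:
# compare_correlations(nels_obs, nels_hotdeck, nels_semi,
#                      [("family_income", "expected_education"),
#                       ("family_income", "ever_dropped_out")])
```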
Table 2—Percentages by Subgroup and Imputation Approach for Two Questions
(Columns: Subgroup | Original Data | Completed by Semiparametric Approach | Completed by Hot Deck Approach)

Percent spending time on at least one religious activity per week among …
Jewish | 38 | 37 | 39
Mormon | 58 | 56 | 56
Roman Catholic/Orthodox | 42 | 42 | 41
Other Christian | 43 | 42 | 42
Other | 39 | 39 | 38
None | 12 | 13 | 15
Overall | 39 | 39 | 39

Percent in West among …
Jewish | 12 | 14 | 14
Mormon | 79 | 79 | 71
Roman Catholic/Orthodox | 21 | 22 | 22
Other Christian | 14 | 15 | 16
Other | 31 | 32 | 31
None | 27 | 28 | 25
Overall | 20 | 21 | 21
In the observed data, 12.2% of atheists spent time on religious activities at least once per week. A much higher percentage (15.1%) was obtained from the hot deck, while about the same level (12.8%) was retained by the semiparametric approach. A similar scenario results for the percentage of Mormons residing in the West, which was reduced from 78.9% in the observed data to 71.1% by the hot deck, while the semiparametric approach retained 79.3%. These obvious relationships are attenuated because the simple hot deck approach does not use imputed survey items as predictors for each other.

The correlations between family income and the student's expected education level at age 30 are 0.31, 0.31, and 0.27 for the observed data, the semiparametric approach, and the simple approach, respectively. As another example, the correlations between family income and whether the student ever dropped out of school are -0.28, -0.28, and -0.25, respectively. Clearly, the relationships in these examples are weakened when using
the simplified approach. When this occurs, the imputation process distorts the underlying multivariate structure and creates an unintended and different picture viewed by the data analyst. Lower correlations may lead to loss of explanatory power in regression models and reduce the contribution from interaction effects when trying to explain the variation in outcome variables.
Further Reading

Ezzati-Rice, T.M.; Fahimi, M.; Judkins, D.; and Khare, M. (1993). "Serial Imputation of NHANES III With Mixed Regression and Hot-Deck Techniques." In JSM Proceedings, Section on Survey Research Methods. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Folsom, R. (1993). "Discussion." In ASA Proceedings, Section on Survey Research Methods, 309–311. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Judkins, D.R. (1997). "Imputing for Swiss Cheese Patterns of Missing Data." Proceedings of Statistics Canada Symposium 97, New Directions in Surveys and Censuses, 143–148.

Judkins, D.; Krenzke, T.; Piesse, A.; Fan, Z.; and Haung, W.C. (2007). "Preservation of Skip Patterns and Covariance Structure Through Semiparametric Whole-Questionnaire Imputation." In JSM Proceedings, Section on Survey Research Methods. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Khare, M.; Ezzati-Rice, T.M.; Rubin, D.B.; Little, R.J.A.; and Schafer, J.L. (1993b). "A Comparison of Imputation Techniques in the Third National Health and Nutrition Examination Survey." In JSM Proceedings, Section on Survey Research Methods, 303–308. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Khare, M.; Little, R.J.A.; Rubin, D.B.; and Schafer, J.L. (1993a). "Multiple Imputation of NHANES III." In ASA Proceedings, Section on Survey Research Methods, 297–302. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Little, R.J.A.; Yosef, M.; Cain, K.C.; Nan, B.; and Harlow, S.D. (2008). "A Hot-Deck Multiple Imputation Procedure for Gaps in Longitudinal Data on Recurrent Events." Statistics in Medicine, 27:103–120.

Marker, D.A.; Judkins, D.R.; and Winglee, M. (2001). "Large-Scale Imputation for Complex Surveys." In Survey Nonresponse, Eds. R.M. Groves, D.A. Dillman, E.L. Eltinge, and R.J.A. Little. New York: Wiley.

Piesse, A.; Judkins, D.; and Fan, Z. (2005). "Item Imputation Made Easy." In ASA Proceedings, Section on Survey Research Methods, 3476–3479. Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/Proceedings

Siddique, J. and Belin, T.R. (2008). "Multiple Imputation Using an Iterative Hot-Deck With Distance-Based Donor Selection." Statistics in Medicine, 27:83–102.
Healthy for Life: Accounting for Transcription Errors Using Multiple Imputation
Application to a study of childhood obesity
Michael R. Elliott
Applied statisticians working in an academic environment frequently have the opportunity to collaborate with scientists working on interesting and important problems and to use their creativity to both help their collaborators and advance the field of statistics. Unfortunately, these endeavors too often are divorced from each other. A clinician may have straightforward design questions or analytic needs. Or, a statistician might have an idea to extend a method, but lack an application to illustrate it with real data.
However, the University of Pennsylvania and Children's Hospital of Philadelphia Healthy for Life project provides a case study of how collaboration with a clinical scientist simultaneously advanced public health and developed new statistical methods. The primary goal of the project was important. Few population-based studies had been done in low-income populations and certain minority groups (e.g., non-Mexican Hispanic Americans or Chinese Americans) in medically underserved areas. Our efforts showed that these groups were suffering from pediatric obesity at a much higher rate than the general population. This was an important public health finding because it could affect the allocation of scarce resources to address health disparities.

But, after the data were collected, it became clear there were two problems. First, because the data were extracted from clinical charts in which height was not systematically measured, a substantial fraction of the height data was missing. Second, some of the fully observed data had outliers, which are implausibly high or low weights or heights for the age of the child. A few of the outliers were easy to exclude as transcription/typographical errors (e.g., the 300-pound 2-year-old). Others might have been either transcription errors or accurate, but one could not say for sure. How realistic is it that a 4-year-old weighs 80 pounds?

Methods exist for dealing with both missing data and outliers, but a chance to combine them in a single method that would deal with both simultaneously suggested itself. Other issues arose along the way: How do you design an efficient sample for the estimation of overweight prevalence? How do you deal with "arms-length" data collection? How do you recover when data problems arise post-collection?
Childhood Obesity in Community Health Centers

Pediatric obesity has increased several-fold in the United States during the past 30 years. Some have suggested obesity in the 21st century will begin to offset the tremendous gains in life expectancy during the 20th century. Although the prevalence of pediatric obesity has increased in all demographic groups, some health disparities persist. It is known that children from some disadvantaged ethnic/racial groups, whose parents are of lower socioeconomic status or who live in certain geographic regions, are at higher risk of developing obesity. The risk of obesity among non-Mexican, American Latino children, however, is not well characterized. This occurs because the primary source for these data, the National Health and Nutrition Examination Survey (NHANES), oversamples Mexican-American children, and thus has a limited sample of Hispanic children who are not Mexican-American. Chinese-American children are a similarly understudied group with respect to obesity prevalence.

Community health centers, funded in part by the U.S. government through the Health Resources and Services Administration (HRSA), deliver health care to the medically underserved, primarily in inner city and rural areas. Children served by these clinics are more likely to belong to ethnic/racial minorities and/or low-income groups and live in areas that are not only medically underserved, but where structures that support a healthy lifestyle—such as supermarkets or sport clubs—are also lacking. As a result, they are more likely to be overweight. But, at the same time, these children are in a promising setting for interventions to reduce obesity risk. Robert M. Politzer and his coauthors documented the success these clinics have had in an article titled "Inequality in America: The Contribution of Health Centers in Reducing and Eliminating Disparities in Access to Care," published in Medical Care Research and Review. The clinics have been effective in counseling patients about unhealthy lifestyle choices, improving rates of cancer screening examinations such as Pap smears and mammograms, helping patients control hypertension, and generally reducing health care disparities associated with an unfavorable health environment.

Thus, a team including Steven Auerbach of HRSA; Nicolas Stettler of The Children's Hospital of Philadelphia;
and Shiriki Kumanyika, Michael Kallan, and Michael Elliott of the Center for Clinical Epidemiology and Biostatistics at the University of Pennsylvania devised a study to ascertain the prevalence of overweight by collecting age, height, and weight data from children who were clients of these clinics. Data came from clinics in the eastern United States and Puerto Rico (HRSA Regions II and III) during 2001. This work was preliminary to the design of interventions to reduce pediatric obesity rates in this population.
Designing the Sample and Collecting the Data

To obtain estimates of overweight prevalence about which statistical inference could be made, we obtained a probability sample of children from the clinics. A probability sample simply means each child in the population has a known, nonzero probability of selection, and that selection takes place using some type of random mechanism. One of the simplest probability sampling mechanisms is a random sample—conceptually "drawing names out of a hat." This would have required obtaining some type of unique identifier from all children who sought care at all 141 HRSA Region II and III clinics in 2001 and choosing the sample at random with equal probability. Collecting these identifiers (the "sampling frame") would have been very costly and time consuming. Also, many clinics would not have a sufficient sample size to make an accurate estimate of overweight prevalence in their clinics, removing one of the motivating factors for clinics to participate.

Instead, a two-stage, clustered sample design was implemented. The first-stage sampling frame consisted of all the clinics in the HRSA II and III regions, and the second-stage frame consisted of the children who sought care in the sampled clinics. Sampling approximately 100 children from each sampled clinic provided stable prevalence estimates for the clinic—confidence intervals of 10 percentage points or fewer in length. We determined the project could afford to abstract approximately several thousand records, that is, several tens of clinics. At this point, a typical sampling strategy that allows all children to have the same chance of being selected would be to first sample the clinics with probability proportional to size; that is, with probability $n_c N_s / N$, where $n_c$ is the number of sampled clinics, $N_s$ is the number of children in the $s$th clinic, and $N$ is the total number of children in all the clinics. By sampling a fixed number, $n$, of children from each clinic, the probability of selection for each child is equal:
$$ \frac{n_c N_s}{N} \times \frac{n}{N_s} = n_c \, \frac{n}{N}. $$

Of course, as with any large study, there were complications. We wanted accurate overweight prevalence estimates within two age groups—2–5 years old and 6–11 years old—to compare prevalence to nationally representative data. Population data available for use in specifying the sampling plan, however, were only available for the 1–4-year-old and 5–12-year-old age groups. We also wanted accurate overweight prevalence estimates within each of six geographic regions (urban, suburban, and rural United States; urban and rural Puerto Rico; and Chinatown in New York City), as well as by race (Asian, Hispanic, African-American, and non-Hispanic Caucasian).
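The equal-probability property of this two-stage design is easy to verify numerically; the sketch below uses made-up clinic sizes together with the sample sizes quoted in the article.

```python
import numpy as np

rng = np.random.default_rng(3)
Ns = rng.integers(200, 3000, size=141).astype(float)   # made-up clinic sizes
N = Ns.sum()
nc, n = 30, 100                                         # clinics sampled, children per clinic

p_clinic = nc * Ns / N          # probability-proportional-to-size selection of each clinic
p_child_given_clinic = n / Ns   # equal-probability selection of children within a clinic
p_child = p_clinic * p_child_given_clinic

print(np.allclose(p_child, nc * n / N))   # True: every child has probability nc*n/N
```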
Because there was only one center in Chinatown and one in urban Puerto Rico, we automatically included these centers in the study. In other words, they were "sampled with certainty." The number of clinics sampled in the remaining strata was allocated in such a fashion as to be as close to the proportion of the population as possible while also maintaining a minimum number of sampled clinics within each stratum. In the study, 30 centers, or 3,000 children, were sampled: 10 urban, seven suburban, and seven rural centers in the United States; one urban and four rural centers in Puerto Rico; and one center in Chinatown. Based on preliminary calculations, we could say these 30 centers should provide sufficiently large samples to determine how prevalent overweight children were with 95% confidence interval widths of two percentage points or fewer in each age group, as well as 10 percentage points or fewer in each age-by-geographic stratum.

Since exact population data were unavailable, we used something similar to a probability-proportional-to-size design. Clinics in a stratum were divided into substrata that contained approximately 1/C of the stratum population, where C is the number of clinics to be sampled in a geographic stratum. Thus, with 10 centers to be sampled in the U.S. urban stratum, all U.S. urban clinics were ordered by size and divided into 10 substrata, each containing approximately 1/10th of the total number of children 1–12 served by U.S. urban clinics in 2001.

Once a clinic was selected, enough 1–4-year-olds and 5–12-year-olds were randomly sampled to provide an expected 42 2–4-year-olds and 58 5–11-year-olds, where the number of 1-year-olds and 12-year-olds to be dropped was estimated by linear interpolation. This would leave samples of approximately 50 in each target age group per center, since $1/7 \times 58 \approx 8$ of the 5–11-year-olds will be exactly 5 years old. That is, there should be approximately $58 - 8 = 50$ 6–11-year-olds and $42 + 8 = 50$ 2–5-year-olds in the sample.

Clinics were provided lists of random integers drawn from one to $N_{as}$, where $N_{as}$ is the number of children seen during 2001 in the $a$th age group and $s$th clinic. The clinic then listed children in whatever fashion it preferred, picking the children associated with the random integers and abstracting height, weight, age, and race/ethnicity data from the child's medical chart's last visit in 2001. In principle, height and weight data only needed to be abstracted for children aged 2 through 11, but, to minimize abstraction errors, we requested that they be abstracted for all sampled children.
Estimating Overweight Prevalence

To account for taller children requiring a greater weight to be healthy than shorter children, body mass index (BMI), given by the weight in kilograms divided by the square of the height in meters, was used to determine overweight status, rather than weight itself. Furthermore, unlike adults, the BMI of children changes physiologically during growth. Thus, BMI cutoff points for overweight are specific to age and gender. Because BMI is not typically normally distributed within an age and gender group, to improve the approximation to normality, a Box-Cox–type transformation of BMI was used. Box-Cox is a mathematical transformation of continuous data that reduces skewness (imbalance in the tails of the distribution) and kurtosis ("heaviness" in the tails of the distribution) relative to the normal distribution. The transformation used in this application is given by the following formula:
$$ Z = \frac{(Y/M)^L - 1}{L \times S}. $$
In the formula, Y is the BMI measure, and Z is the resulting “z-score” transformation, which should have an approximate standard normal distribution (mean zero and variance one). The parameters M, L, and S are a function of age and gender and taken from tables estimated from a representative U.S. population measured before the obesity epidemic. That is, they standardize the BMI z-scores across ages and genders. Similar transformations are available for the height and weight data used to compute BMI. Figure 1(a) shows a histogram of the raw BMI data. Figure 1(b) shows a histogram of the z-score–transformed BMI data. The transformed data removed the right skewness. Yet, there are small (negative) z-scores trailing away from the center of the graph; I’ll say more about that in a moment. Children whose z-score was above the 95th percentile for the reference population are considered overweight.
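Once the age- and gender-specific M, L, and S values are in hand, the transformation itself is a one-liner; the reference parameters in the example below are invented for illustration, not taken from the growth-chart tables.

```python
import numpy as np

def bmi_z(bmi, L, M, S):
    """LMS-type z-score: Z = ((BMI/M)**L - 1) / (L*S).
    (If L were zero, the limiting form log(BMI/M)/S would be used instead.)"""
    return ((bmi / M) ** L - 1.0) / (L * S)

# Made-up reference parameters for a single age-gender group, purely for illustration:
print(bmi_z(bmi=22.0, L=-1.6, M=16.0, S=0.11))   # roughly 2.3, a heavy-for-age BMI
```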
Figure 1. Histograms of (a) raw BMI data and (b) BMI data after age-gender specific Box-Cox transformations
If a simple random sample design had been used, overweight prevalence could have been estimated by $\hat{p} = x/n$, where $x$ is the number of overweight children in the sample or relevant subsample (e.g., children aged 2–5) and $n$ is the total sample or subsample size. However, the sample design employed needs to be taken into account for the analysis. Children did not have an equal probability of selection. If, for example, children with a lower probability of selection were more likely to be overweight, overweight prevalence would tend to be underestimated. Thus, sampling weights equal to the reciprocal of the probability of selection were constructed, so the under-represented children are "weighted up" (counted as more than one sampled child) and the over-represented children are "weighted down" (counted as fewer than one sampled child) in the analysis. The weighted estimate is $\hat{p}_w = x_w / n_w$, where $x_w$ is the sum of the sample weights associated with the overweight children and $n_w$ is the sum of the sample weights.

The sample weights, themselves, are constructed by determining the probability of selection for all children in the $a$th age group in the $s$th center as
$$ \pi_{sa} = \frac{1}{C_s} \times \frac{n_{sa}}{N_{sa}}, $$
where $C_s$ is the number of centers in the substratum from which the $s$th center was drawn, $n_{sa}$ is the number of children sampled in the $a$th age group from the $s$th center, and $N_{sa}$ is the total number of children in the $a$th age group from the $s$th center. The actual sample weights are then given by $w_i = f_{ra} / \pi_{sa}$, where $f_{ra}$ is a "post-stratification" factor that adjusts the selection weights so the sum of the sample weights equals the known age totals within each of the six geographic regions. The average sampling weight was 171.2, with a range from 28.8 to 659.1.

In a simple random sample, the variance of $\hat{p}$ could have been estimated by $\hat{p}(1-\hat{p})/n$, and 95% confidence intervals given by $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$. The use of the sample weights to compute prevalence estimates, along with the clustering of the children at the first stage of selection by center, means the variance estimate given by replacing $\hat{p}$ with $\hat{p}_w$ in the above will typically underestimate the true variance of $\hat{p}_w$. Instead, "linearization estimators" can be used to estimate the variance of $\hat{p}_w$. These estimators approximate $x_w/n_w$ as a linear combination of $x_w$ and $n_w$ and use estimators of the variance of $x_w$ and $n_w$ and the covariance of $x_w$ and $n_w$ that account for sampling weights and clustering to obtain an estimate of the variance of this linear approximation.
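Putting the pieces together, the weighted estimator is simple to compute once each child carries a weight; a minimal sketch with hypothetical column names follows (the design-based linearization variance would come from survey software and is omitted here).

```python
import pandas as pd

def weighted_prevalence(df):
    """Weighted overweight prevalence: p_w = sum(w_i * overweight_i) / sum(w_i).
    Expects columns `weight` (sampling weight) and `overweight` (0/1); names are illustrative."""
    xw = (df["weight"] * df["overweight"]).sum()
    nw = df["weight"].sum()
    return xw / nw

# Toy example: one overweight child carries a large weight, two others carry small weights.
toy = pd.DataFrame({"overweight": [1, 0, 0], "weight": [600.0, 100.0, 100.0]})
print(weighted_prevalence(toy))   # 0.75, versus an unweighted estimate of 1/3
```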
Accounting for Missing Data
Because we did not have a direct affiliation with the health centers, we needed to work through HRSA to collect the data. Maintaining privacy is critical in health research, and we needed to avoid direct contact with individually identifiable health information such as names and addresses, specific dates of birth, or other unique identifiers of the sampled children. This also allowed the institutional review boards of both Children's Hospital of Philadelphia and the University of Pennsylvania to exempt the study from oversight. Because of this "arms-length" relationship with data collection, we received the de-identified data from 3,579 children aged 1 through 12 in a single shipment after collection had been completed.
Table 1—Prevalence of Overweight by Age and Gender Using 2001 Data from HRSA and General Population Data from NHANES, 1999–2002
(Columns: Gender | Age (years) | HRSA (%) | NHANES (%))

Both | 2–5 | 21.8 | 10.3
Both | 6–11 | 23.8 | 15.8
Males | 2–5 | 22.9 | 9.9
Males | 6–11 | 23.3 | 16.9
Females | 2–5 | 20.4 | 10.7
Females | 6–11 | 24.3 | 14.7

All differences between HRSA and NHANES are statistically significant. HRSA estimates were obtained using the standard multiple imputation procedure from Stettler and coauthors' "High Prevalence of Pediatric Overweight in Medically Underserved Areas."

Figure 2. Proportion by clinic with recorded height data among children aged 2–5 years (o) and 6–11 years (x) versus mean BMI z-score among children with recorded height data. (Horizontal axis: response rate; vertical axis: mean BMI z-score.)
As expected, we found 720 children aged 1 or 12 in the sample who were dropped from the analysis. In addition, we had 369 children missing age or gender information (11% of the total) who also were dropped from the analysis. An additional 16 cases were removed because they were missing weight data or because they had an implausible BMI—a BMI that corresponded to a z-score of 10 or greater, expected to occur with a probability of $10^{-24}$ in standard normal distributions. This left 2,474 cases available for analysis.

A more serious missing data problem arose from our neglecting to consider that height data, in contrast to weight data, are not routinely collected as part of a child's visit unless enough time has elapsed since the previous visit for the child to have grown detectably. Nearly one-fourth of our cases (606) were missing height data. We could have proceeded by simply deleting these cases as well—a so-called "complete case" analysis—but this would have obviously reduced efficiency in the analysis and potentially introduced bias. For example, some centers were careful to collect height data at each visit (only 2% missing), whereas others did so very sporadically (72% missing). If centers with more missing height information also had systematically higher or lower levels of overweight, prevalence of overweight in the population of children would be under- or over-estimated. Furthermore, children with chronic conditions associated with obesity, such as asthma, typically visit their health providers more frequently and, therefore, are potentially more likely to have missing data for height, since height measurement is less likely to happen if a child has been recently seen.

Roderick Little and Donald Rubin, in Statistical Analysis with Missing Data, describe three types of missing data "mechanisms": missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). MCAR, as the name suggests, assumes missingness is unrelated to either the observed or unobserved elements of the data. MCAR is implicitly assumed by a complete case analysis. MAR is a
weaker assumption that can be met using the observed data to model ("fill in") the missing data elements, and modeling under MAR could reduce bias if there are associations between the probability that a child's height data are missing and the child's BMI. NMAR allows for the possibility that overweight children might have been more or less likely to have their height data recorded, even after accounting for observed data such as age, gender, and weight, as well as the clinic in which they received care. Here, it would appear the MAR assumption would be reasonable, since we wouldn't expect the height of a child to be strongly associated with whether their height was measured after accounting for the other data we have available, particularly age. As children grow more slowly when they get older, it may be that older age, thus taller, children are less likely to have their height measured. But within an age group, we wouldn't expect to find that taller children were less likely to have their height measured than shorter children.

Figure 2 displays the proportion of children at a clinic with recorded height data compared to the mean BMI z-score for children with recorded height data at that clinic by age group (2–5 and 6–11). There is a slight tendency for clinics with lower completeness to have lower BMI among the children with recorded height data. The tendency for children 6–11 to have lower response rates is stronger than the tendency for children 2–5. Neither trend, however, is statistically significant.

Hence, we decided to use multiple imputation to account for the missing height data. As Rubin describes in Multiple Imputation for Nonresponse in Surveys, multiple imputation uses observed data—in this case, age, gender, and weight—to estimate or "impute" a value for the height of each subject. Multiple imputation can reduce the size of the confidence intervals, and possibly reduce bias, if subjects in certain clinics were either more or less likely to be overweight and more or less likely to be missing height data.
To implement the multiple imputation procedure, we first needed to obtain 20 reasonable values of height data for the children who had only age and weight data. Fewer or more values could be obtained; typically, 20 provide sufficient precision for the estimates of the variance, as described below. To do this, we followed an algorithm that begins by estimating the parameters in a linear regression model of height z-score on weight z-score for the complete data, with separate mean parameters for each age group within each clinic to allow for differing overweight rates across ages and clinics. We then used the results of this model to estimate the expected height for all children missing height data. The second step in the algorithm is to re-estimate the parameters in the linear regression of height z-score on weight z-score, this time using both the observed height data and the imputed height data from the first step. Random error terms are then added to the parameter estimates to account for their being estimates, not true values. The third step of the algorithm repeats the first step, this time using the parameters obtained in the second step and adding an error term drawn from a normal distribution with mean zero and variance equal to the error or "residual" variance estimated from the linear regression of height z-score on weight z-score. We repeated the second and third steps of this algorithm thousands of times, taking the results of 20 height imputations spaced enough iterations of the algorithm apart that the values are reasonably independent of each other.

Based on the 20 height imputations for each missing value, we formed 20 "completed," or "imputed," data sets composed of the real observed data and one set of imputations. We then obtained multiply imputed estimates of overweight prevalence by running separate analyses on each imputed data set, treating the imputed data as real. As in Rubin's book, we combined the resulting prevalence estimates into a single estimate by averaging the results from each imputed data set. We obtained the variance of the resulting estimate by first averaging the variance estimates from each of the imputed data sets and then adding the between-imputation variance of the individual point estimates, with an adjustment for the finite number of imputations used. The second factor in the variance computation accounts for the true values of the imputed data not being known. If the model suggests the predicted values resulting from the observed data are highly accurate, then the between-imputation variance will be very small and the recovery of the missing information nearly complete. On the other hand, if the model is essentially producing estimates of "noise," the resulting between-imputation variance will essentially undo the reduction in variance resulting from treating the imputed data as known, and the resulting overall variance would be similar to that obtained from a complete-case analysis.
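A stripped-down version of this imputation loop and of Rubin's combining rules might look like the sketch below. It is our simplification of the description above—one overall height-on-weight regression rather than separate clinic-by-age means, and a fixed thinning interval rather than a formal independence check—so it illustrates the structure of the algorithm rather than reproducing the study's code.

```python
import numpy as np

def impute_heights(weight_z, height_z, n_imputations=20, thin=50, seed=0):
    """Chained draws for missing height z-scores given (complete) weight z-scores.
    Both inputs are float arrays; missing heights are np.nan."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(height_z)
    h = height_z.copy()
    h[miss] = np.nanmean(height_z)                 # crude starting values
    completed = []
    for it in range(n_imputations * thin):
        # Re-estimate the regression of height on weight from the current completed data.
        X = np.column_stack([np.ones_like(weight_z), weight_z])
        beta, *_ = np.linalg.lstsq(X, h, rcond=None)
        resid = h - X @ beta
        sigma2 = resid @ resid / (len(h) - 2)
        # Perturb the estimates to reflect their uncertainty, then redraw the missing values.
        cov = sigma2 * np.linalg.inv(X.T @ X)
        beta_draw = rng.multivariate_normal(beta, cov)
        h[miss] = X[miss] @ beta_draw + rng.normal(0.0, np.sqrt(sigma2), size=miss.sum())
        if (it + 1) % thin == 0:
            completed.append(h.copy())             # keep every `thin`-th completed data set
    return completed

def combine(estimates, variances):
    """Rubin's rules: pooled estimate and total (within + between) variance."""
    q = np.mean(estimates)
    u_bar = np.mean(variances)
    b = np.var(estimates, ddof=1)
    m = len(estimates)
    return q, u_bar + (1 + 1 / m) * b
```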
Figure 3. Q-Q plots for weight and height z-scores versus a normal distribution using observed data. A random sample from a standard normal distribution would align along the 45° line.
increased only slightly (from 22% to 24%) among children in the HRSA population.
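The combining of estimates across imputed data sets described above follows Rubin’s rules. A minimal sketch (with hypothetical numbers, not the study’s estimates) of pooling the point estimates and variances from 20 imputed data sets:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool per-imputation estimates and variances using Rubin's combining rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()               # average of the point estimates
    w_bar = variances.mean()               # average within-imputation variance
    b = estimates.var(ddof=1)              # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b    # adjustment for the finite number of imputations
    return q_bar, total_var

# Hypothetical overweight-prevalence estimates from 20 imputed data sets.
rng = np.random.default_rng(1)
prevalence_hats = 0.23 + rng.normal(0, 0.005, size=20)
variance_hats = np.full(20, 0.012 ** 2)

q, v = rubin_combine(prevalence_hats, variance_hats)
print(f"pooled prevalence = {q:.3f}, standard error = {np.sqrt(v):.4f}")
```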
Using a Mixture Model

The imputation model makes the assumption that the height and weight data are normally distributed after transforming them via the Box-Cox z-score method. To determine if the z-scores follow a standard normal distribution, a “q-q plot” can be used. Such a plot compares the ordered data, often called “order statistics,” against the average values of the order statistics from a sample of identical size from a standard normal distribution. If the data follow a standard normal distribution, then the resulting plot should approximate a line with a 45° angle. Figure 3 shows q-q plots for the observed height and weight z-scores. There are many more extreme values than we would expect. The extremely small values (i.e., large negative values) are the more obvious in the plot, but a number of extremely large z-scores are also present. With a sample size of 2,474 from a standard normal distribution, the average smallest z-score is -3.54 and the average largest z-score is 3.54. In our
Figure 4. (a) Two normal distributions: mean zero and variance one (solid) and mean one and variance two (dashed); (b) Mixture of normal distributions in (a), with 40% belonging to the mean zero and variance one class and 60% to the mean one and variance two class
sample, 22 of the weight z-scores are below this value and 27 are above. Similarly, 33 of the observed height z-scores are below and 23 are above the range of the expected minimum and maximum values for a sample of 1,868 from a standard normal distribution. In some cases, these extreme values are very likely transcription errors: a 4-year-old who weighs 250 pounds or an 8-year-old who is four feet tall but weighs only 35 pounds. Others could be either plausible but unusual values, or transcription errors: a 12-year-old who weighs 24 pounds, but who is only 2 feet 8 inches tall. Identifying outliers in a multivariate setting can be problematic. “Masking” can occur when outliers inflate a covariance matrix, while “swamping” can occur when outliers pull a covariance matrix away from non-outlying observations. Thus, we considered using a latent class or mixture model to try to determine which values were likely to be outliers due to transcription errors, as opposed to outlying values of extreme heights and weights. In a latent class model, we assume each subject belongs to one, and only one, unobserved or “latent” group defined by a known distribution whose parameters depend on the class membership. So, we might have a two-class mixture of normal distributions, one with mean zero and variance one and the other with mean one and variance two. If 40% of the population belongs to the first class and 60% to the second, the probability density function (pdf) for this distribution would appear as in Figure 4. If the variances were smaller and the means in the latent groups more separated, there would likely be a bimodal appearance in the pdf. But, we don’t know if a particular subject belongs to a given class with certainty; we only have the probability that the subject belongs to one class or the other as a function of the subject’s value. Thus, a subject whose value is zero has a higher probability of belonging to the first class than the second (.60 versus .40); whereas, a subject whose value is three has a higher probability of belonging to the second class than the first (.98 versus .02). In the latent class model for this application, we assume each child’s height and weight belong to one of a small number of unknown classes of normal distributions. All the distributions have common means, but the variances are assumed to
differ. The class with the largest variance is assumed to be a “transcription error” class. All the subjects in the “transcription error” class are to be dropped from the analysis as erroneous values. This model has several advantages. First, the class membership can be treated as missing data, to be imputed as part of a multiple imputation procedure. This, in turn, allows uncertainty about whether an observation is a true transcription error to be incorporated into the analysis. For some imputed data sets, an observation will be included, while, in others, it will be excluded in proportion to the probability that it is estimated to be a transcription error outlier. By excluding transcription outliers, we prevent potential overestimation of the prevalence of overweight children by reducing the number of potentially extreme values of weight that might be included in the imputed data sets. Finally, by allowing for multiple variance classes, the over-dispersion in child height and weight data identified in Figure 3 can be accounted for in the imputation model. This presumably will reduce underestimation of the prevalence of overweight children by allowing more extreme values to be imputed than would be the case under a simple normal model. We can compare the results of this mixture model with the simple multiple imputation approach to determine to what degree the countervailing model failures of the simpler model might have failed to balance. To allow for the differing height and weight patterns across clinics, the mean height and weight were allowed to vary by clinic, as in the original multiple imputation approach. The correlation between height and weight was assumed to be equal within all the variance classes except the transcription error (largest) variance class. The mixture model assumes the number of classes is known. To choose between models with differing numbers of classes, we considered both the Akaike Information Criterion and the Schwarz Criterion. As we increase the number of latent classes in the model, the fit of the model improves, even if the true number of classes is exceeded. When the number of classes is too large, the model essentially will fit the residual “noise” into additional classes. To avoid this, both criteria introduce penalty factors that are functions of the number
Figure 5. Plot of height versus weight z-score for fully observed data (top row) and imputed height data (bottom row); panels show the transcription error, large variance, and small variance classes. Results from a single imputation of height and latent variance class.
of parameters and, in the case of the Schwarz Criterion, the sample size. Evaluating these criteria at the posterior mode of the model parameters suggested the three-class model was the best to use. Figure 5 shows the result of a single imputation of the missing height data and the indicators of the latent variance class. An estimated 91% of the population (95% posterior predictive interval or PPI of 87%–94%) belong to the small variance class. This class has an estimated weight variance of 1.43 (95% PPI 1.35–1.55) and a height variance of 1.14 (95% PPI 1.04–1.24). These variances are not too dissimilar from one, the variance of a standard normal that would have been expected if the z-scores truly were from a standard normal distribution. We use the term “posterior predictive interval” rather than confidence interval because of the Bayesian formulation of the model. The model parameters are assumed to have probability distributions, rather than fixed, if unknown, values, and the reported 95% PPI is an interval defined by the 2.5th percentile and 97.5th percentile of their posterior predictive probability distribution. An additional 7% of the population (95% PPI 5%–11%) was estimated to belong to a large variance class (weight variance 3.88 [95% PPI 2.40–6.07] and height variance 12.34 [95% PPI 7.01–18.83]), while 2% of the population
(95% PPI 1%–3%) was estimated to have been coded as some type of age, height, or weight transcription error. As Figure 5 shows, some of the transcription errors are assumed to appear in the central part of the distribution. This is consistent with transcription errors occasionally resulting in little change from the true values, or, more rarely, extreme true values being mistakenly converted to more typical values. Allowing for separate mean height and weight z-scores within each sampled age group helps to accommodate the sample design, since, within each clinic, a simple random sample of 2–4-year-olds and 5–11-year-olds was drawn. Thus, even if clinics with larger or smaller probabilities of selection were more or less likely to have heavier or taller children, the imputation was centered on the results of the local clinic, not an over- or under-represented set of clinics in the sample. However, if the larger variance classes were associated in some fashion with the probability of selection, biased estimates of their distribution in the population would be obtained. We found no statistically significant correlation between the most likely latent class membership for each subject and the inverse of the case sampling weight for each subject, indicating the sample design had been adequately accounted for in the modeling procedure. Nonetheless, for the actual analysis of the multiply imputed data sets, we implemented a fully design-based approach that accounts for the unequal probability of selection of the children and their “clustering” into centers as part of the sampling procedure. This highlights one of the advantages of the multiple imputation approach, which is that the analysis of the multiply imputed data can make weaker assumptions than the imputation model, reducing any damage caused by the failure of the imputation model assumptions.
Key Ingredients for a Successful Collaboration with a Biostatistician: A Clinical Scientist Perspective Nicolas Stettler, The Children’s Hospital of Philadelphia and University of Pennsylvania
In “Accounting for Transcription Errors Using Multiple Imputation: Application to a Study of Childhood Obesity,” Michael Elliott, an applied statistician, describes a study for which his collaboration with a clinical scientist (me) was a successful and rewarding experience. I also found our cooperation rewarding. Mike’s article motivated me to think about the key ingredients for a successful collaboration between a clinical scientist and a biostatistician. Too often, clinical scientists only consider statisticians as service providers, rather than true collaborators. This misunderstanding in the nature of the relationship often results in frustration on both sides. Clinical scientists can feel they don’t get the answers they hoped for and that, after their data are processed through the “black box” of statistics, they are left with only long and obscure computer printouts. Additionally, statisticians and clinical scientists speak different languages and approach research from different perspectives, further increasing the risk for misunderstandings and frustrations. Terminology can be different. Clinical scientists can focus more on clinical outcomes or specific cases than on statistical concerns, such as design and replication. Sometimes, however, the collaboration is enriching and productive. One key ingredient is for each collaborator to take the time to try to understand the scientific field of the other and make an effort to teach some of his or her field’s methods and issues to the other. In such collaborations, each scientist not only contributes his or her expertise to the project, but also learns from and educates the other. For the “Healthy for Life” project, Mike learned about child growth, obesity, and pediatric primary care; I learned about complex sampling, multiple imputation methods, and handling of outliers. Neither of us became an expert in the other’s scientific field, but we understood enough of the other’s perspective to find a common language, to trust the other’s expertise, and to understand the strengths and limitations of the other’s scientific
methods. For example, for a clinical scientist, it is particularly comforting to work with a statistician who understands the limitations of clinical research and that it is not always possible to have perfect data quality and large sample sizes. In our project, the recognition that transcription errors were just part of the reality of this study was precisely what created the need for a novel statistical method. On the other hand, after differentiating transcription errors from outliers, we could be more confident about the validity and public health significance of our conclusions. Many (but not all) scientists find it rewarding and enjoyable to venture into new scientific territory and to contribute to the advancement of another branch of science. So, when clinical scientists and applied statisticians make the effort to learn enough about the scientific methods of the other to build a true collaboration and chemistry, the payoff can be great and work is more fun!
Nicolas Stettler is a pediatrician trained in nutrition, tropical medicine, and epidemiology and assistant professor of pediatrics and epidemiology at The Children’s Hospital of Philadelphia and the University of Pennsylvania School of Medicine. His research focuses on the epidemiology of obesity and related cardiovascular risk factors, with particular emphasis on the causes and consequences of obesity over the life course. He also leads a research project for the prevention of obesity in the pediatric primary care setting and another study to examine the consequences of obesity and obesity treatment on the bone health of adolescents.
Results of Using the Mixture Model Approach
Twenty imputed data sets were obtained in the fashion described previously. This time, imputations included an imputation of the latent “variance class” for all children and an imputation of height for children missing height data based on the mixture model described above. Before obtaining the estimates of the prevalence of overweight children, children assigned to the transcription error class were dropped from
the analysis. Figure 6 compares the results of a complete case analysis (intervals labeled with “a” in the figure), a standard multiple imputation analysis as implemented in the paper by Stettler and his coauthors (intervals labeled “b”), and the mixture multiple imputation model outlined above (intervals labeled “c”). Results are presented for the two age groups and each of the six geographic regions of interest. The standard multiple imputation analysis suggested children aged 2–5 missing height data appeared to be somewhat more likely to be overweight in the suburban and rural U.S. regions and less likely to be overweight in the Puerto Rico centers. Children 6–11 missing height data appeared to be less likely to be overweight in the rural U.S. centers and, again, in the Puerto Rico centers. Accounting for the outliers indicated the overweight rate for younger children might be biased upward under a standard multiple imputation analysis. This is consistent with transcription errors in which older children are incorrectly noted as younger: The resulting weight z-score would be extremely large, likely incorrectly classifying a child as obese. The effect of an age transcription error would be to deflate overweight rates in older children, although to a lesser degree, since overweight is less common than non-overweight. We see a hint of this in the results in Figure 6. The overall results suggest the effect of over-dispersion of the height and weight data relative to the normality assumption and transcription errors on the original multiple imputation procedure were nontrivial for some of the subpopulations of interest, though modest in degree. The work on this project has provided important evidence concerning the prevalence of overweight children belonging to certain groups in medically underserved areas. It also has produced a useful novel statistical model to deal with overdispersion and transcription errors in physical measurements data. Results in both areas have the potential to lead to further developments. It would be of interest to design an intervention at the HRSA community center level to try to reduce the overweight prevalence among these populations of children. It would also be of interest to examine factors such as the correlation between height and weight across centers and methods for reducing measurement error.
Figure 6. Overweight prevalence and associated 95% confidence intervals for a complete case analysis (“a”), a standard multiple imputation analysis (“b”), and the mixture multiple imputation model (“c”), by age group and each of the six geographic regions of interest
Further Reading
Elliott, M.R., and Stettler, N. (2007). “Using a Mixture Model for Multiple Imputation in the Presence of Outliers: The Child Obesity Project.” Journal of the Royal Statistical Society C: Applied Statistics, 56:63–78.

Ghosh-Dastidar, M., and Schafer, J.L. (2003). “Multiple Edit/Multiple Imputation for Multivariate Continuous Data.” Journal of the American Statistical Association, 98:807–817.

Little, R.J.A., and Rubin, D.B. (2001). Statistical Analysis with Missing Data, 2nd ed. New York, NY: Wiley.

Politzer, R.M., Yoon, J., Shi, L., Hughes, R.G., Regan, J., and Gaston, M.H. (2001). “Inequality in America: The Contribution of Health Centers in Reducing and Eliminating Disparities in Access to Care.” Medical Care Research and Review, 58:234–248.

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley.

Stettler, N., Elliott, M.R., Kallan, M., Auerbach, S.B., and Kumanyika, S.K. (2005). “High Prevalence of Pediatric Overweight in Medically Underserved Areas.” Pediatrics, 116:381–388.
A Statistical Look at Roger Clemens’ Pitching Career Eric T. Bradlow, Shane T. Jensen, Justin Wolfers, and Abraham J. Wyner
Baseball is America’s pastime, and with attendance and interest at an all-time high, it is clear baseball is a big business. Furthermore, many of the sport’s hallowed records (the yearly home run record, the total home run record, the 500 home run club, etc.) are being assailed and passed at a pace never before seen. Yet, due to the admitted use of performance-enhancing substances (PESs) and the accusations documented in the “Mitchell Report,” the ‘shadow’ over these accomplishments is receiving as much press as, if not more than, the breaking of the records. A particularly salient example comes from a recently released report by Hendricks Sports Management, LP, which led to widespread national coverage. Using well-established baseball statistics, including ERA (number of earned runs allowed per nine innings pitched) and K-rate (strikeout rate per nine innings pitched), the report compares Roger Clemens’ career to those of other great power pitchers of his era (i.e., Randy Johnson, Nolan Ryan, and Curt Schilling) and proclaims that Roger Clemens’ career trajectory on these measures is not atypical. Based on this finding, the report suggests the pitching data are not an indictment (nor do they provide proof) of Clemens’ guilt; in fact, they suggest the opposite. While we concur with the Hendricks report that a statistical analysis of Clemens’ career can provide prima facie ‘evidence’ (and a valuable lens with which to look at the issue), our approach provides a new look at his career pitching trajectory using a broader set of measures, as well as a broader comparison set of pitchers. This is important, as there has been a lot of recent research into which measures of pitching performance are the most reliable and stable. Our attempt is to be inclusive in this regard. Even more important, one pitfall of all analyses of extraordinary events (such as the immense success of Clemens as a pitcher) is “right-tail self-selection.” If one compares extraordinary players only to other extraordinary players, and selects that set of comparison players based on their behavior on that extraordinary dimension, then one does not obtain a representative (appropriate) comparison set. By focusing on only pitchers who pitched effectively into their mid-40s, the Hendricks report minimized the possibility that Clemens would look atypical. Here, we use more reasonable criteria for pitchers that are based on their longevity and the number of innings pitched in their career to form the comparison set, rather than performance at any specific point. Thus, the focus of this paper is an analysis of Clemens’ career using a more sophisticated and
comprehensive database and, based on that, what one can say about Clemens’ career.
A Closer Look

Before we begin our full analysis and discussion, we first take a closer look at Clemens’ career. To be sure, this unavoidable act of data ‘snooping’ was part of our research method, and it is instructive to unfold our insights in the order in which they actually occurred. For the average fan, the most salient measures of success are winning percentage and ERA, which are a good place to start. Of course, for each game, there is a winning pitcher and a losing pitcher (hence 0.5 is the average winning percentage), and an average ERA varies between 4.00 and 5.00, which has been fairly stable over the last 30 years or so. In this light, one can see in Figure 1 how extraordinary Clemens has been in various stages of his career. In particular, what this figure shows is that Clemens quickly established himself as a star and, in the early 1990s, he lost his ‘relative’ luster. His final four years with the Red Sox were certifiably mediocre (compared to his history), so much so that the future Hall of Famer was considered to be in the “twilight of his career.” However, as our graph clearly demonstrates, Clemens recovered and climbed to new heights at the comparatively old age of 35. His last few years showed a second period of decline. Now, any well-read student of baseball understands that winning percentage and ERA are fairly noisy measures of quality. Both are readily affected by factors outside a pitcher’s ability, such as fielding and the order in which batting events occur. Additionally, winning percentage critically depends on run support. Analysts who specialize in pitching evaluation use measures of component events instead, such as rates of strike outs (K) and walks (BB). We graph the career trajectory of K rate and BB rate for Clemens (Figure 2) and note his career average values
of these statistics are roughly .23 and .078 for K rate and BB rate, respectively. Again, we see Clemens’ strong start, a gradual decline (a rising BB rate) as he entered the “first twilight” of his career, followed by a marked improvement. His strikeout rate is more erratic, but, roughly, he improved in his early career, then declined, and then rose again, peaking at the age of 35 in 1998, in his second year with Toronto. To put these career trajectories into an appropriate context, we require a comparison group. Our first effort was a handful of star-level contemporaries, including Greg Maddux, Johnson, and Schilling. We also include here Ryan, as he was compared to Clemens in the Hendricks report. Their career trajectories for K rate and BB rate are graphed in Figures 3–6. The career trajectories for Clemens’ star contemporaries are nicely fit with quadratic curves. In terms of performance, the curves clearly show steady improvement as the players entered their primes, followed by a marked decline in their strikeout rate (except Ryan, whose K rate trajectory is fairly steady) and a leveling off in their walk rates. The contrast with Clemens’ career trajectory is stark. The second act for Clemens is unusual when compared to these other greats because his later success follows an unprecedented period of relative decline. This leaves us with the following question: How unusual is it for a durable pitcher to have suffered a mid-career decline only to recover in his mid- and late 30s?
Houston Astros pitcher Roger Clemens throws a pitch against the St. Louis Cardinals during the fifth inning of their Major League game September 24, 2006, in Houston. (AP Photo/David J. Phillip)
Figure 1. Clemens’ winning percentage and ERA throughout time
Figure 2. Clemens’ BB rate and K rate throughout time
Figure 3. Randy Johnson BB rate and K rate throughout time
Figure 4. Greg Maddux BB rate and K rate throughout time
Figure 5. Curt Schilling BB rate and K rate throughout time
Figure 6. Nolan Ryan BB rate and K rate throughout time
Database Construction

To perform our statistical analyses, we first obtained data from the Lahman Database, Version 5.5 (www.baseball1.com), on all Major League Baseball (MLB) pitchers whose careers were contained in the years 1969–2007. (The starting year of 1969 was selected because of the change in the height of the pitchers’ mound, which launched the ‘modern era’ in baseball.) From that set of pitchers, we constructed a comparison set of all durable starting pitchers by looking at all pitchers who played at least 15 full seasons as a starter (with 10 or more games started per year) and had at least 3,000 innings pitched in those seasons (we note that sensitivity analyses run with minor perturbations in these criteria indicated the results are quite stable). There were 31 pitchers other than Clemens who fit these criteria. All of these starting pitchers, therefore, had comparably long careers (in years) and innings pitched similar to Clemens. Hence, they were a relevant comparison set, although others could certainly be chosen. See www.amstat.org/publications/chance (Appendix 1) for the names and a set of descriptive statistics for the 31 players and Clemens. For each pitcher, we looked at the following well-established pitching statistics for each of the years in which they pitched:

WHIP = walks + hits per inning pitched
BAA = batting average for hitters when facing the given pitcher
ERA = earned run average per nine innings pitched
BB Rate = walk rate
K Rate = batter strike-out rate per plate appearance (not including walks)
Together, these statistics provide a fairly complete picture of the career trajectory for a starting pitcher.
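As a rough sketch of how such a comparison set and the per-season statistics might be assembled in Python from a Lahman-style pitching table (the file name and column names such as IPouts and BFP are assumptions about that table, not the authors' code; BAA is omitted because it requires opponent at-bat data):

```python
import pandas as pd

# Assumed layout: one row per pitcher-stint-season with columns
# playerID, yearID, GS, IPouts (outs recorded), H, BB, SO, ER, BFP.
pitching = pd.read_csv("Pitching.csv")
pitching = pitching[(pitching.yearID >= 1969) & (pitching.yearID <= 2007)]

# Collapse stints into pitcher-seasons and compute the per-season statistics.
seasons = pitching.groupby(["playerID", "yearID"], as_index=False).sum(numeric_only=True)
seasons["IP"] = seasons.IPouts / 3.0
seasons["WHIP"] = (seasons.BB + seasons.H) / seasons.IP
seasons["ERA"] = 9 * seasons.ER / seasons.IP
seasons["BB_rate"] = seasons.BB / seasons.BFP
seasons["K_rate"] = seasons.SO / (seasons.BFP - seasons.BB)   # per plate appearance, walks excluded

# Durable starters: at least 15 seasons with 10+ starts and 3,000+ innings in those seasons.
starter_seasons = seasons[seasons.GS >= 10]
career = starter_seasons.groupby("playerID").agg(n_seasons=("yearID", "nunique"),
                                                 total_ip=("IP", "sum"))
durable_ids = career[(career.n_seasons >= 15) & (career.total_ip >= 3000)].index
comparison = starter_seasons[starter_seasons.playerID.isin(durable_ids)]
print(len(durable_ids), "durable starting pitchers in the comparison set")
```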
Trajectory Analyses

To understand and summarize the trajectory that each of the five (j = 1,…,5) aforementioned statistics takes, we fit a quadratic function to each of the 32 (i = 1,…,32) focal pitchers’ (including Clemens) data at year t, as follows:
\[ S_{ijt} = \beta_{0ij} + \beta_{1ij}\,\mathrm{Age}_{it} + \beta_{2ij}\,\mathrm{Age}_{it}^{2} + \varepsilon_{ijt} \qquad [1] \]
where Sijt = value of statistic j for pitcher i in their t-th season; Ageit = age of pitcher i in their t-th major league season; β0ij, β1ij, and β2ij are an intercept and coefficients describing how Age and Age² influence the prediction of the statistics; and εijt is a normally distributed random error term. As none of the measures studied was near the boundary of their respective ranges, taking transformations (as is standard) had no substantive impact. We also acknowledge that a quadratic curve may not be the best model for every pitcher’s career, including Clemens’. However, the quadratic curve is a simple model with interpretable coefficients that provide a common basis of comparison for all pitchers in our study. The quadratic curve is an appropriate model for the usual trajectory of performance, which expects improvement as a pitcher hits his prime and then decline as he ages (graphically, his performance climbs over and down the proverbial ‘hill’). Our goal is not to model the specific trajectory for every player, but to detect those patterns that stick out as highly unusual with respect to a quadratic reference. So, it would
Figures 7a–b. ERA career trajectories and subset of ERA career trajectories, plotted against age
not be appropriate, for example, to consider a cubic fit to each player for our purpose. It is possible that we may not be able to identify interesting patterns for an individual player’s trajectory by using only quadratic curves. This is not a concern, however, as we are only interested in determining how often the typical quadratic trajectory occurs among the pool of comparison players. Our primary interest centers on the coefficient β2ij that describes whether the pitcher’s trajectory for that statistic is purely linear (β2ij = 0), “hump-shaped” (β2ij < 0), or “U-shaped” (β2ij > 0) as he ages. To provide context, one might predict the following patterns, corresponding a priori to a pitcher hitting a mid-career ‘prime’ and then falling off near the end of his career:

WHIP (β2ij > 0)
BAA (β2ij > 0)
ERA (β2ij > 0)
BB Rate (β2ij > 0)
K Rate (β2ij < 0)

Note the sign change of β2ij for K rate, as more strikeouts are better, while a lower value for the other statistics is better. Figures 7a and 7b contain a more detailed analysis of the data from the Hendricks report, using ERA. We first present in Figure 7a the ERA curves for the 32 relevant players (31 pitchers + Clemens). Each pitcher’s trajectory is depicted with a thin, gray curve, except for Clemens’, which is depicted with a thick, black curve. Also given is a dotted curve, which is the quadratic trajectory fit to the data for all players except Clemens. Figure 7b contains the players with curves that have
Figures 8a–b. WHIP career trajectories and subset of WHIP career trajectories, plotted against age
quadratic terms that are ‘atypical’ (β2ij ≤ 0) compared to the prior hypothesis of a mid-career prime. Six players, including Clemens, have these atypical curves, and, in fact, Clemens’ curve looks atypical even within this subset of six players. Figures 8a and 8b contain career trajectories of WHIP for the same 32 players. Clemens is again within a small subset of seven pitchers who show atypical career paths. Further inspection of his WHIP curve suggests he was the only pitcher to get worse as his career went on and then improve at the end of his career. Two additional analyses we performed using ERA and WHIP were to compute the same figures as Figures 7 and 8, but instead using ERA margin and WHIP margin, defined as the difference between the individual ERA and the league average. In the graphs at www.amstat.org/publications/chance (Appendix 2), we show the ERA margin and WHIP margin curves for Clemens and for the average over the 31 other pitchers. We see little difference between the raw curves in Figures 7 and 8 and the margin curves. Figures 9a and 9b contain career trajectories of BB rate (walks per batter faced) for the same 32 players. For BB rate, we note there are 10 pitchers who have “inverted-U” fits to their data, with Clemens being one of them. Furthermore, the ‘steepness’ of his improvement is particularly noticeable in the later years, even among this set of 10. There are several pitching measures for which Clemens’ career trajectory does not look atypical, which is the central assertion of the Hendricks report. In Figures 10a and 10b, we give the strikeout rate (K per non-BB batters faced) for each of the 32 durable starting pitchers. Clemens does have an overall higher K rate than most pitchers in this set, but his career
Figures 9a–b. BB rate career trajectories and subset of BB rate career trajectories, plotted against age

Figures 10a–b. K rate career trajectories and subset of K rate career trajectories, plotted against age
Figures 11a–b. BAA career trajectories and subset of BAA career trajectories, plotted against age
trajectory follows a similar shape (β2ij < 0) to 24 of the other 31 players, at least with respect to the quadratic fit. In Figures 11a and 11b, we examine BAA (batting average against) for each of the 32 pitchers. Similar to K rate, we again observe that Clemens’ career trajectory has a shape typical of most (24 out of 31) of the other starting pitchers, albeit his curve is somewhat flatter. Through the use of simple exploratory curve fitting applied to a number of pitching statistics, and for a well-defined set of long-career pitchers, we assessed whether Clemens’ pitching trajectories were atypical. Our evidence is suggestive that while most long-term pitchers peaked mid-career and declined thereafter, Clemens (for some key statistics) worsened mid-career and improved thereafter. There are many other ways to approach this question, and we expect other researchers will try different techniques. We warn these brave souls that baseball statistics are extraordinarily variable. For example, it is generally assumed that the league average ERA for the National League is lower than that for the American League. This is true—the average gap is 0.25 runs (in favor of the National League). But, in some years, that gap is huge (0.75 runs in 1996), and, in other years, the gap is negligible (nearly 0 in 2001 and 2007). So, while it is tempting to ‘control’ for patterns such as these, you may just be adding noise to your data by subtracting a random quantity (leaguewide statistics) from another. So, what can we conclude? We can conclude that Clemens’ pitching career was atypical for long-term pitchers in terms of WHIP, BB rate, and ERA. In particular, Clemens shows a mid-career decline followed by an end-of-career improvement that is rarely seen. This is a trajectory not seen at all among the comparison group of pitchers identified by Hendricks Sports Management. We emphasize that our analysis is entirely exploratory—we do not believe there exists a reasonable
probability model to test relevant hypotheses by calculating significance levels. The data does not exonerate (nor does it indict) Clemens, as an exploratory statistical analysis of this type never proves innocence or guilt. After analyzing this data set, there are at least as many questions remaining as before.
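As a minimal sketch of the exploratory curve fitting in equation [1] (illustrative only; it assumes a per-season table like the one sketched earlier, plus an age column, and is not the authors' code):

```python
import numpy as np
import pandas as pd

def quadratic_curvature(seasons: pd.DataFrame, stat: str) -> pd.Series:
    """Fit stat ~ age + age^2 separately for each pitcher and return the beta_2 coefficients."""
    curvatures = {}
    for player_id, group in seasons.groupby("playerID"):
        age = group["age"].to_numpy(dtype=float)
        y = group[stat].to_numpy(dtype=float)
        beta2, beta1, beta0 = np.polyfit(age, y, 2)   # np.polyfit returns the highest power first
        curvatures[player_id] = beta2
    return pd.Series(curvatures, name=f"beta2_{stat}")

# Hypothetical usage with the comparison table sketched earlier:
# curv = quadratic_curvature(comparison, "ERA")
# atypical = curv[curv <= 0]   # hump-shaped or linear ERA trajectories, counter to the usual U shape
# print(atypical.sort_values())
```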
Further Reading

http://en.wikipedia.org/wiki/Baseball_statistics#Pitching_statistics

Albert, J. (2006). “Pitching Statistics, Talent and Luck, and the Best Strikeout Seasons of All-Time.” Journal of Quantitative Analysis in Sports, 2(1): Article 2.

Bradlow, E., Jensen, S.T., Wolfers, J., and Wyner, A. (2008). “Report Backing Clemens Chooses Its Facts Carefully.” The New York Times, February 10, 2008.

Fainaru-Wada, M., and Williams, L. (2007). Game of Shadows: Barry Bonds, BALCO, and the Steroids Scandal That Rocked Professional Sports. New York, NY: Gotham Books.

Hendricks, R.A., Mann, S., and Larson-Hendricks, B.R. (2008). “An Analysis of the Career of Roger Clemens.” www.rogerclemensreport.com, January 18.

Mitchell, G.J. (2007). “Report to the Commissioner of Baseball of an Independent Investigation Into the Illegal Use of Steroids and Other Performance Enhancing Substances by Players in Major League Baseball.” DLA Piper US LLP, December 13. (Accessed at http://mlb.mlb.com/mlb/news/mitchell/index.jsp.)

Silverman, M. (1996). “Baseball End of an Era: No Return Fire from Sox Brass Tried to Keep Ace.” Boston Herald, December 14, Page 40.

Vecsey, G. (1968). “Baseball Rules Committee Makes 3 Decisions to Produce More Hits and Runs.” The New York Times, December 4, Page 57.
Astrostatistics: The Final Frontier Peter Freeman, Joseph Richards, Chad Schafer, and Ann Lee
How did the universe form? What is it made of, and how do its constituents evolve? How old is it? And, will it continue to expand? These are questions cosmologists have long sought to answer by comparing data from myriad astronomical objects to theories of the universe’s formation and evolution. But, until recently, cosmology was a data-starved science. For instance, before the 1990 launch of the Hubble Space Telescope, the Hubble constant—a number representing the current expansion rate of the universe—could only be inferred to within a factor of two, and cosmologists had to make do performing simple statistical analyses. Since that time, technological advances have led to a flood of new data, ushering in the era of precision cosmology. (The Sloan Digital Sky Survey alone has collected basic data for more than 200 million objects.) To help make sense of all this data, cosmologists have increasingly turned to statisticians, and a new interdisciplinary field has arisen: astrostatistics. Work in astrostatistics uses a wide range of statistical methods, but there are a few particular
statistical issues that are prevalent. Here, we focus on two broad challenges: parameter estimation using complex models and analysis of noisy, nonstandard data types.
Parameter Estimation Using Type Ia Supernovae

A supernova is a violent explosion of a star whose brightness, for a short time, rivals that of the galaxy in which it occurs. There are several classes of supernovae (or SNe). Those dubbed Type Ia result from runaway thermonuclear reactions that unbind white dwarfs, Earth-sized remnants of stars such as our sun. In the 1990s, astronomers used observations of Type Ia SNe to infer the presence of dark energy, a still unknown source of negative pressure that acts to accelerate the expansion of the universe, rather than slowing the expansion, as does normal baryonic matter (e.g., protons and neutrons). There are many interesting analyses we can carry out with Type Ia SNe data. For instance, we can construct procedures to test different models of the evolution of dark energy properties as a function of time. Or, we may adopt a model for dark energy and see how SNe data, in concert with that model, constrain basic cosmological parameters. We demonstrate the latter analysis here. One may be surprised to learn that theories regarding the formation and evolution of the universe are sufficiently developed that some of the biggest questions in cosmology are, in fact, parameter estimation problems. For instance, a current standard theory, the so-called Λ (Lambda) Cold Dark Matter, or ΛCDM, theory can
Figure 1. Distance modulus (µ) versus redshift (z) for Type Ia supernovae. Ωm is the fraction of critical energy density contributed by baryonic matter and dark matter, while H0 is the Hubble constant. Curves represent theoretical predictions for three parameter values.
Figure 2. Ninety-five percent joint confidence region for Ωm and H0 using SNe data. The larger region is based on performing a χ² test at each possible parameter combination. The smaller region is created using a procedure that attempts to optimize the precision of the estimate. Values of (Ωm, H0) outside the white area are considered implausible based on other observations.
adequately describe the universe with as few as six parameters. It is often the case that parameters in statistical models are abstract and lack intrinsic significance, such as the slope and intercept parameters (b0 and b1) in simple linear regression (as it is usually the estimate of the line that is of interest). But this is not the case with cosmological model parameters; the goal is to precisely estimate each of these fundamental constants. Different types of cosmological data help constrain different sets of cosmological parameters. We can use the data of Type Ia SNe in particular to make inferences about two of them: Ωm and H0. The former (pronounced “omega m”) is the fraction of the critical energy density (i.e., the amount necessary to make the universe spatially flat) contributed by baryonic matter and dark matter whose presence is inferred by its influence upon baryons, but whose make-up is still unknown. The latter (“h naught”) is the aforementioned Hubble constant (and not an indication of a null hypothesis), the current value of the Hubble parameter H(t) that describes the rate at which the universe expands as a function of time. What truly makes Type Ia SNe special from a data analysis standpoint is that theory holds them to be standard candles: Two SNe at the same distance from us will appear equally bright, so that distance and brightness are monotonically related. To estimate the distance to a supernova, astronomers use its redshift, z, which is a directly observable quantity representing the relative amount by which the universe has expanded since the explosion occurred (which can be up to billions of years ago). Redshifts are measured by examining the distance between peaks and troughs in light waves, which expand
as the universe does. We observe photons emitted from a supernova with wavelength λemit to have a longer wavelength λobs = (1+z)λemit. This motivates the term “redshift”: visible light emitted from SNe of progressively higher redshift appears progressively redder to us, as red light is at the long-wavelength end of the spectrum of visible light. Another measure of astronomical distance is the distance modulus, µ, a logarithmic measure of the difference between a supernova’s observed brightness and its intrinsic luminosity, which uses the fact that objects farther away will appear fainter. In Figure 1, we show measurements of z and µ for a sample of Type Ia SNe. Measurement errors in z are small (about 1% or less) and are not shown, while estimates of measurement error in µ are given by the vertical bars. Note the deviation from linearity in the trend in z versus µ is due to the accelerated expansion of the universe, which is attributed to dark energy. Assuming a particular cosmological model, the observables µ and z are linked via a function of the cosmological parameters (Ωm, H0):
\[ \mu(z \mid \Omega_m, H_0) = 5 \log_{10}\!\left[\frac{c(1+z)}{H_0}\int_0^z \frac{du}{\sqrt{\Omega_m(1+u)^3 + (1-\Omega_m)}}\right] + 25 \qquad (1) \]
Other theoretical assumptions about the structure of the universe would suggest the use of other functions. The particular function in equation (1) represents a spatially flat universe, where the actual energy density of the universe exactly equals the critical density, separating open universes that expand forever from closed universes that first expand, then contract back to a point. It also contains a contribution from dark energy in the form of a so-called cosmological
Figure 3. Flux versus wavelength for a typical Sloan Digital Sky Survey (SDSS) galaxy spectrum
Figure 4. Flux versus wavelength for a typical Sloan Digital Sky Survey (SDSS) quasar spectrum
constant (i.e., the dark energy has a particular form that does not evolve with time); its fractional contribution to the critical energy density is ΩΛ ≡ 1 − Ωm. The cosmological constant was first used by Albert Einstein to make the universe spatially flat and unchanging within the context of his theory of general relativity. Later, after Hubble demonstrated the universe was expanding, Einstein disowned the constant, referring to it as his “biggest blunder.” According to the model, the observed pairs (zi, Yi) are realizations of Yi = µ(zi | Ωm, H0) + σiεi, where the εi are independent and identically distributed standard normal. The standard deviations σi are estimated from properties of the observing instrument, but are generally taken as known. Thus, there is a simple way of constructing a joint confidence region for Ωm and H0 by performing a χ² test for each possible pair (Ωm, H0): using the fact that
\[ \sum_{i=1}^{182} \left( \frac{Y_i - \mu(z_i \mid \Omega_m, H_0)}{\sigma_i} \right)^{2} \]
has the χ2 distribution with 182 degrees of freedom when the parameters are correctly specified. Figure 2 shows the 95% confidence region that results from this procedure. Cosmologists are fond of χ2-type statistics because of their intuitive appeal, but the resulting confidence regions are needlessly large (imprecise). Ongoing work focuses on tightening the bounds on parameters by improving the statistical procedures. Figure 2 also depicts a confidence region created using a Monte Carlo–based technique developed by Chad Schafer and Philip Stark that approximates optimally precise confidence regions. These parameters also appear in stochastic models describing other cosmological data sets, and joint analyses of these data sets will lead to yet more precise estimates of the unknown parameters.
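A minimal sketch of the χ²-grid construction described above, with synthetic supernova measurements standing in for the real sample of 182 (this is an illustration of the idea, not the authors' code or data):

```python
import numpy as np
from scipy import integrate, stats

C_KM_S = 299_792.458  # speed of light in km/s

def mu_model(z, omega_m, h0):
    """Distance modulus of equation (1) for a flat universe with a cosmological constant."""
    z = np.atleast_1d(z)
    integral = np.array([integrate.quad(
        lambda u: 1.0 / np.sqrt(omega_m * (1 + u) ** 3 + (1 - omega_m)), 0.0, zi)[0] for zi in z])
    lum_dist_mpc = C_KM_S * (1 + z) / h0 * integral
    return 5 * np.log10(lum_dist_mpc) + 25

# Synthetic "observed" sample standing in for the 182 real supernovae.
rng = np.random.default_rng(2)
z_obs = np.sort(rng.uniform(0.05, 1.5, size=182))
sigma = np.full(182, 0.2)
y_obs = mu_model(z_obs, 0.3, 70.0) + sigma * rng.normal(size=182)

# Chi-square test at each (Omega_m, H0) grid point; keep the parameter pairs that are not rejected.
cutoff = stats.chi2.ppf(0.95, df=182)
region = []
for omega_m in np.linspace(0.1, 0.6, 26):
    for h0 in np.linspace(60.0, 80.0, 21):
        chi2_stat = np.sum(((y_obs - mu_model(z_obs, omega_m, h0)) / sigma) ** 2)
        if chi2_stat <= cutoff:
            region.append((omega_m, h0))
print(f"{len(region)} grid points fall inside the 95% joint confidence region")
```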
Mining Spectra for Cosmological Information

Another fundamental challenge in astrostatistics is the extraction of useful information from a large amount of complex data. One example of such data is spectra, measures of photon emission as a function of time, energy, wavelength, etc. They may consist of thousands of measurements, such as the examples of galaxy and quasar data collected by the Sloan Digital Sky Survey (SDSS) that we show in Figures 3 and 4, respectively. Typical galaxies (e.g., the Milky Way) are agglomerations of billions of stars; whereas, quasars are galaxies going through an evolutionary phase in which a supermassive black hole at the center is actively gobbling up gas and emitting so much light that it effectively drowns out the rest of the galaxy. To probe the physical conditions of galaxies and quasars, we might analyze the global, (nearly) smooth continuum emission and/or the narrow and broad spikes that rise above the continuum (emission lines) or dip below the continuum (absorption lines). Working with entire spectra can be computationally tedious, and working with large groups of spectra even more so. SDSS has so far measured spectra for approximately 800,000 galaxies and 100,000 quasars. It may be that we can construct relatively simple statistics that convey nearly as much information as each spectrum itself. For instance, to perform galaxy classification, astronomers have conventionally used measurements of ratios of photon emission over particular (small) ranges of wavelengths. We have begun work on the challenge of finding a low-dimensional representation of spectra that retains most of the useful information they encode. The first idea most would think of is to perform a principal components analysis (PCA) of the spectra. The basic idea is that, ideally, each spectrum could be written as the weighted combination of m basis functions, where m is much smaller than p, the length of each spectrum. A spectrum is then represented by the projections onto these m basis vectors; the union of the projections of several spectra form a lower-dimensional embedding of the
Figure 5. An example of a one-dimensional manifold embedded in two dimensions. The path from A to B is representative of the diffusion distance between A and B, and is a better representation of dissimilarity between them than the Euclidean distance.
data often referred to as a principal component (PC) map. PCA was applied to a family of 1,200 SDSS spectra, with 600 from galaxies and 600 from quasars. PCA, however, can do a poor job for astronomical data. There are two main reasons for this. First, the method is only able to pick out global features in the spectra. It ignores that emission line features spanning a small range of wavelengths can be crucial in, for example, determining whether a spectrum belongs to a galaxy or a quasar. Second, the method is linear. If the data points (the spectra) lie on a linear hyperplane, the method works well, but if the variations in the spectra are more complex, one would be better off with nonlinear dimension reduction methods. The first issue—local features—has been addressed by astrostatisticians to some extent. Techniques such as wavelets are useful to describe inhomogeneous features in spectra, and these methods are gaining popularity. The second issue—nonlinearity—has largely been ignored in the field of astronomy, however. In “Exploiting Low-Dimensional Structure in Astronomical Spectra,” a paper submitted to The Astrophysical Journal, the authors applied a nonlinear approach for dimensionality reduction, embedding the observed spectra within a diffusion map. The basic idea is captured in Figure 5. Ideally, we would like to find a distance metric that measures the distance between points A and B along the spiral direction, and then construct a coordinate system that captures the underlying geometry of the data. Diffusion distances and diffusion maps do exactly this by a clever definition of connectivity and the use of Markov chains. Imagine a random walk starting at point A that is only allowed to take steps to immediately adjacent points. Start a similar walk from point B. For a fixed scale t, the points A and B are said to be close if the conditional distributions after t steps in the random walk are similar. The diffusion
Figure 6. Embedding of a sample of 2,796 SDSS galaxy spectra using the first three diffusion map coordinates, with color representing galaxy redshifts.
distance between these two points is defined as the difference between these two conditional distributions. The distance will be small if A and B are connected by many short paths through the data. This construction of a distance measure is also robust to noise and outliers because it simultaneously accounts for all paths between the data points. The path from A to B depicted in Figure 5 is representative of the diffusion distance between A and B and is a better description of the dissimilarity between A and B than, for example, the Euclidean distance from A to B. In applying this technique for dimensionality reduction, the data set attribute we wish to preserve is the diffusion distance between all points. For the example in Figure 5, a diffusion map onto one dimension (m=1) approximately recovers the arc length parameter of the spiral. A one-dimensional PC map, on the other hand, simply projects all the data onto a straight line through the origin. Therefore, the diffusion map technique for dimensionality reduction will be better suited to the analysis of astronomical data, which is often complex, nonlinear, and noisy. Figure 6 shows the three-dimensional diffusion map constructed from 2,796 SDSS galaxy spectra. Each of these spectra is of length 3500 (i.e., each lies in a 3500-dimensional space) and possesses the sort of noisy, irregular structure seen in Figure 3. Despite the high dimension and noisy data, the presence of a low-dimensional, nonlinear structure is clear. More significantly, this structure can be related to important properties of the galaxies. The colors of the points in the map give the redshifts of the galaxies. It is evident that using these coordinates, one could predict redshift. Results of this type hold great promise for future exploration of complex, high-dimensional data.
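A minimal sketch of the diffusion-map construction just described, applied to a noisy spiral (the one-dimensional manifold of Figure 5) rather than to real spectra; the kernel scale and step count are arbitrary choices, not values from the paper:

```python
import numpy as np

def diffusion_map(points, eps, t=1, n_coords=2):
    """Gaussian kernel -> row-stochastic random-walk matrix -> spectral diffusion coordinates."""
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dists / eps)                       # connectivity between nearby points
    walk = kernel / kernel.sum(axis=1, keepdims=True)      # one-step transition probabilities
    eigvals, eigvecs = np.linalg.eig(walk)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    # Drop the trivial constant eigenvector; scale by eigenvalue^t to get diffusion coordinates.
    return eigvecs[:, 1:n_coords + 1] * eigvals[1:n_coords + 1] ** t

# Noisy spiral: a one-dimensional manifold embedded in two dimensions, as in Figure 5.
rng = np.random.default_rng(3)
theta = np.sort(rng.uniform(0, 4 * np.pi, 400))
spiral = np.c_[theta * np.cos(theta), theta * np.sin(theta)] + rng.normal(0, 0.05, (400, 2))

coords = diffusion_map(spiral, eps=2.0, t=1, n_coords=1)
# The leading diffusion coordinate should track position along the spiral (correlation near 1 in
# magnitude), whereas a one-dimensional PCA projection would not.
print(round(abs(np.corrcoef(coords[:, 0], theta)[0, 1]), 3))
```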
The Future

The SDSS has provided the astronomical community with a flood of data, but this flood is minuscule compared to that which will be created by the Large Synoptic Survey Telescope, or LSST. Scheduled to begin observations in 2015, it will repeatedly observe half the sky from its perch on Cerro Pachon in Chile, producing wide-field (3° diameter) snapshots. (SDSS, in contrast, has scanned approximately 20% of the sky without repeated observations.) Repetition will allow LSST to not only probe the nature of dark energy and create unprecedented maps of both the solar system and Milky Way, but also to track how astronomical objects change with time. Just how much data will LSST collect? Recent projections indicate it will gather 30 terabytes (30 × 1000^4 bytes) of data per evening of viewing, which is roughly equal to the amount of data collected by SDSS over its entire lifetime. The total amount of LSST data is expected to exceed 10 petabytes (10 × 1000^5 bytes), and will include observations of 5 billion galaxies and 10 billion stars. Needless to say, analyses of these data pose interesting challenges that will require the help of statisticians. Two of these challenges are obvious. The first is how to efficiently process the data in a way so as to not discard their important features. The second is how to test ever more sophisticated theories whose development will be motivated by LSST’s high-resolution data. But, there is a third challenge—not necessarily obvious to astronomers—that is likely to arise. As stated, the LSST data will increase the already growing pressure on theoreticians to refine their models. But, could astronomers be getting too much of a good thing? Introductory statistics courses often stress the notion of “practical versus statistical significance” when testing a null hypothesis. We are well aware that with enough data, any hypothesis can be rejected at any reasonable significance level. This will likely occur with the massive influx of cosmological data. The theories, which inevitably contained some level of approximation, will appear to fit the data poorly when subjected to formal statistical testing. Thus, we envision a growing need for formal model-testing and model-selection tools. We also see an increasing interest in nonparametric and semiparametric approaches, which are useful for finding relationships in the data when we lack a physically motivated, fully parametric model or when such a model is complex and a computationally simpler approach may have similar inferential power. For example, there is no physically motivated model encapsulating how dark energy properties evolve with time. In a paper submitted to Annals of Applied Statistics, “Inference for the Dark Energy Equation of State Using Type Ia Supernova Data,” the authors assess different nonparametric models of dark energy, fitting them to the Type Ia SNe data described in Figure 1. Importantly, they show how functional properties such as concavity and monotonicity can help sharpen statistical inference. Currently, the number of SNe is insufficient to rule out all but the most extreme dark energy models. These authors show how a factor of 10 increase in the number of SNe, something readily achievable with LSST, will allow us to determine the veracity of many hypotheses, including whether a dark energy model based on Einstein’s cosmological constant is consistent with the data.
The Large Synoptic Survey Telescope (LSST), from its perch on Cerro Pachon, Chile, can create unprecedented maps of both the solar system and Milky Way. Design of LSST Telescope Courtesy of LSST Corporation
This is typical of the outstanding challenges facing statisticians working on inference problems in cosmology and astronomy. Novel methods of data analysis that fully use available computing resources are needed if key questions are to be answered using the soon-to-arrive massive amount of data. Editor's Note: The authors' work in this area is supported by NSF Grant #0707059 and ONR Grant N00014-08-1-0673.
Further Reading

The Center for Astrostatistics, http://astrostatistics.psu.edu

Freedman, W. (1992). “The Expansion Rate and Size of the Universe.” Scientific American, 267:54.

Genovese, C.R., Freeman, P., Wasserman, L., Nichol, R.C., and Miller, C. (2008). “Inference for the Dark Energy Equation of State Using Type Ia Supernova Data.” Submitted to Annals of Applied Statistics, http://arxiv.org/abs/0805.4136.

The International Computational Astrostatistics (InCA) Group, www.incagroup.org

The Large Synoptic Survey Telescope (LSST), www.lsst.org

Richards, J., Freeman, P., Lee, A., and Schafer, C. (2008). “Exploiting Low-Dimensional Structure in Astronomical Spectra.” Carnegie Mellon University Department of Statistics Technical Report #863. Submitted to The Astrophysical Journal, http://arxiv.org/abs/0807.2900.

Schafer, C., and Stark, P. (2007). “Constructing Confidence Regions of Optimal Expected Size.” Carnegie Mellon University Department of Statistics Technical Report #836. Submitted to the Journal of the American Statistical Association.

The Sloan Digital Sky Survey (SDSS), www.sdss.org
Stochastic Stamps: A Philatelic Introduction to Chance Simo Puntanen and George P. H. Styan
The impact of probability and statistics on society as reflected in postage stamps
William L. Schaaf, in his 1978 book, Mathematics and Science: An Adventure in Postage Stamps, found the rich and fascinating world of postage stamps to be “a mirror of civilization”; he wrote that “multitudes of stamps reflect the impact of mathematics and science on society.” We agree, and, in this article, we look at what we will call “stochastic stamps”—postage stamps related in some way to chance (i.e., that have a connection with probability and/or statistics). We identify 26 distinguished people related in some way to chance who have been honored with a postage stamp. Images of 25 stamps from 15 countries illustrate the article.
Christiaan Huygens (1629–1695) Pierre-Simon, Marquis de Laplace (1749–1827) Prasanta Chandra Mahalanobis (1893–1972) Florence Nightingale (1820–1910) Blaise Pascal (1623–1662) Lambert Adolphe Jacques Quetelet (1796–1874)
Bernoulli is depicted (anonymously) with a formula and graph for the law of large numbers in the stamp, which was issued in celebration of the 1994 International Congress of Mathematicians, held in Zürich, Switzerland. This stamp seems to be the only one issued for Bernoulli.
We believe no stamps have been issued (at least we have not identified any) to honor the other 45. Courtesy of Jeff Miller
W
illiam L. Schaaf in his 1978
Ten Distinguished Scholars
Jacob (Jacques) Bernoulli (1654–1705) Roger Joseph Boscovich (1711–1787) Pafnuty Lvovich Chebyshev (1821–1894) Carl Friedrich Gauss (1777–1855) 36
VOL. 21, NO. 3, 2008
Roger Joseph Boscovich: Croatia 1943 Courtesy of Jeff Miller
As a starting point, we identify 10 major suspects, considered distinguished scholars by the editors of both Statisticians of the Centuries and Leading Personalities in Statistical Sciences. Between them, these books include biographical sketches of 165 people, with an intersection of 55 people. Of these 55, we identified stamps in honor of 10, listed as follows:
Jacob (Jacques) Bernoulli: Switzerland 1994
The eldest of four brothers, Bernoulli lived mainly in Switzerland. He was one of eight prominent mathematicians in the Bernoulli family and is famous for his Ars Conjectandi, a groundbreaking work on probability theory (published posthumously in 1713), and his research concerning the law of large numbers.
Shown on this stamp is a portrait of Boscovich, who was a physicist, astronomer, mathematician, philosopher, diplomat, poet, and Jesuit priest from Ragusa, then an independent state, today Dubrovnik in Croatia. “In 1760,” according to Wikipedia, “he developed a simple geometrical method of fitting a straight line to a set of observations on two variables using (constrained) least sum of absolute deviations.”
Pafnuty Lvovich Chebyshev: USSR 1946
Pierre-Simon, Marquis de Laplace: Mozambique 2001
Chebyshev is probably best known for his inequality, also known as the Bienaymé–Chebyshev Inequality:

Prob(|X − µ| ≥ kσ) ≤ 1/k²,   (1)

where X has expected value µ and variance σ² and k is any positive real number. We have not identified a postage stamp that honors statistician Irenée-Jules Bienaymé (1796–1878), who proved the inequality some years before Chebyshev. The stamp for Chebyshev is one of a pair issued by the USSR in celebration of the 125th anniversary of his birth. It seems no other stamps were issued for Chebyshev.
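As a quick aside (not part of the philatelic record), a few lines of Python confirm inequality (1) empirically for an arbitrary, skewed distribution; the exponential choice below is purely illustrative.

```python
# Empirical check of the Bienaymé–Chebyshev inequality: the observed tail
# frequency never exceeds 1/k^2, whatever the underlying distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # any distribution will do
mu, sigma = x.mean(), x.std()

for k in (1.5, 2, 3):
    tail = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k = {k}: P(|X - mu| >= k sigma) ~ {tail:.4f}  <=  1/k^2 = {1 / k**2:.4f}")
```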
Christiaan Huygens: Comoro Islands 1977
“Chance was born on April 27, 1657, that is to say ‘chance’ as the published scientific study of chance. On that day in The Hague, Christiaan Huygens ...” With these words, Stephen Stigler begins his article, “Chance Is 350 Years Old,” in
the 20th anniversary issue of CHANCE. Huygens was a Dutch mathematician, astronomer, and physicist. The authors of Statisticians of the Centuries write that he "conceived a calculus of expectations, which considerably influenced the following generation of probabilists." The stamp from the Comoro Islands also shows the unmanned interplanetary spacecraft Voyager II, launched on August 20, 1977. According to Wikipedia, "Voyager II is perhaps the most productive space probe yet deployed, visiting four planets and their moons … at a fraction of the money later spent on specialized probes such as … the Cassini–Huygens probe. Cassini–Huygens is a joint NASA/ESA/ASI robotic spacecraft mission currently studying the planet Saturn and its moons."

In addition to the five stamps we have identified for Huygens, there is a full portrait of him on a souvenir sheet from Dominica, issued for the International Stamp Exhibition, Amsterdam (Amphilex), 2002. The souvenir sheet honors Nobel Prize winners of The Netherlands, with (clockwise, from top left) Willem Einthoven (Medicine 1924), Peter J. W. Debye (Chemistry 1936), Simon van der Meer (Physics 1984), Jan Tinbergen (Economics 1969), and Frits (Frederik) Zernike (Physics 1953). Also depicted is the medal for the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel, which was first awarded to Ragnar Frisch and Jan Tinbergen, both members of the International Statistical Institute, for having developed and applied dynamic models for the analysis of economic processes. We identified two more stamps for Tinbergen (1903–1994), but none for Frisch (1895–1973).
The normal distribution is attributed to Gauss, who did research in astronomy, physics, and geodesy. His principal contributions to statistics are in the theory of least squares estimation, where major work also was done by mathematician Pierre-Simon, Marquis de Laplace (1749–1827). The stamp for Gauss was issued by Germany to commemorate the centenary of his death. We have identified four stamps in honor of Gauss, but only two for Laplace; images of all six of these stamps are given by George Styan and Götz Trenkler (IMAGE 38, 2007).
Carl Friedrich Gauss: Germany 1955
Souvenir sheet with Christiaan Huygens and Nobel Prize Winners of the Netherlands: Dominica 2002
Prasanta Chandra Mahalanobis: India 1993
Nightingale, who came to be known as “The Lady of the Lamp,” was a pioneer of modern nursing, a writer, and a noted statistician. Known as a passionate statistician, Nightingale, according to the authors of Statisticians of the Centuries, “was driven by the scandalous legacy of the Crimean War to use statistical weapons in her fight for hospital reform,” and “was also anxious for the [British] army medical officers to begin keeping proper medical statistics.” Nightingale introduced data plots, now known as Nightingale rose plots. This stamp, showing a portrait of Nightingale, was issued in the International Human Rights Year 1968 in a set that also honored Cecil Edgar Allan Rawle (1891–1938), a Dominican crusader for human rights; John Fitzgerald Kennedy (1917–1963), the 35th president of the United States; Pope John XXIII (1881–1963), the 261st pope of the Roman Catholic church and sovereign of Vatican City; and Albert Schweitzer (1875–1965), the Alsatian theologian, musician, philosopher, and physician. We have identified more than 20 stamps for Nightingale; for an annotated list, see A Philatelic Tour of the American Civil War, by Larry Dodson.
A father figure in Indian statistics, Mahalanobis did pioneering work on anthropometric variation in India and founded the Indian Statistical Institute. The main building in Kolkata (Calcutta) is shown in the background of this stamp issued by India in 1993, the centenary year of his birth. This stamp is the only one we have identified to honor Mahalanobis, who is well known for his D² statistic, or Mahalanobis distance: D² = d′S⁻¹d, where d = x̄₁ − x̄₂ and S = (n₁S₁ + n₂S₂)/n, with n = n₁ + n₂, and where x̄ᵢ and Sᵢ, i = 1, 2, are the p × 1 vectors of sample means and the p × p sample covariance matrices with nᵢ = Nᵢ − 1 degrees of freedom, computed from samples of size Nᵢ > p drawn from a p-variate normal population.
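For the statistically inclined reader, here is a minimal Python sketch of the computation just described; the function name and the simulated two-variable samples are ours, chosen only for illustration.

```python
# Mahalanobis D^2 between two p-variate samples, using the pooled covariance
# S = (n1*S1 + n2*S2) / (n1 + n2) with n_i = N_i - 1, as described in the text.
import numpy as np

def mahalanobis_D2(X1, X2):
    """X1, X2: arrays of shape (N1, p) and (N2, p)."""
    d = X1.mean(axis=0) - X2.mean(axis=0)        # difference of sample mean vectors
    n1, n2 = len(X1) - 1, len(X2) - 1            # degrees of freedom N_i - 1
    S1 = np.cov(X1, rowvar=False)                # p x p sample covariance matrices
    S2 = np.cov(X2, rowvar=False)
    S = (n1 * S1 + n2 * S2) / (n1 + n2)          # pooled covariance
    return float(d @ np.linalg.solve(S, d))      # d' S^{-1} d

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=50)
X2 = rng.multivariate_normal([1, 1], np.eye(2), size=60)
print(mahalanobis_D2(X1, X2))
```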
Florence Nightingale: Dominica 1968
Blaise Pascal: Central African Republic 2000
Pascal was a French mathematician, physicist, and religious philosopher. Robin Wilson, author of Stamping Through Mathematics, noted that Pascal "corresponded with Pierre de Fermat (1601–1665) on probability theory, strongly influencing the development of modern economics and social science." Moreover, the authors of Statisticians of the Centuries wrote that he "introduced the concept of mathematical expectation and used it recursively to obtain a solution to the Problem of Points, which was a catalyst that enabled probability theory to develop beyond mere combinatorial enumeration." This stamp, with an anonymous portrait of Pascal, was issued in celebration of the 125th anniversary (in 1999) of the Universal Postal Union. Also shown on the stamp are the 1957 spacecraft Sputnik and an anonymous astronaut. Sputnik was the first artificial satellite to be put into outer space. We have identified five stamps and one postcard to honor Pascal.
Adolphe Quetelet: Belgium 1974
The last of the 10 scholars is Quetelet, a Flemish astronomer, mathematician, statistician, and sociologist who founded and directed the Brussels Observatory and was influential in introducing statistical methods to the social sciences. Ed Stephan noted in Topical Time that he also "gathered extensive demographic data to study the average man and promoted standardization in gathering and reporting demographic data." Chapter 2 of Stigler's Statistics on the Table is titled "The Average Man Is 168 Years Old" and begins with "The introduction of statistical methods into social science in the 19th century was closely tied to the work of ... Quetelet." This stamp, which was issued in 1974, the centenary year of Quetelet's death, shows a portrait of Quetelet, apparently by the Flemish portrait painter Joseph-Denis Odevaere (1775 or 1778–1830). We have identified no other stamp for Quetelet.
Fermi, Dirac, Maxwell–Boltzmann, and Bose–Einstein Statistics

In the main entry by Campbell Read in the Encyclopedia of Statistical Sciences (2nd ed.), there are stamps associated with each of the following six scholars:

Enrico Fermi (1901–1954)
Paul Adrien Maurice Dirac (1902–1984)
James Clerk Maxwell (1831–1879)
Ludwig Eduard Boltzmann (1844–1906)
Satyendra Nath Bose (1894–1974)
Albert Einstein (1879–1955)

This group is probably disjoint from the earlier group of 10 because each has distinguished himself primarily in physics (Dirac, Einstein, and Fermi won Nobel Prizes in physics). Norman Johnson and Samuel Kotz, however, consider both Maxwell and Boltzmann to be leading personalities in the statistical sciences and give full biographies in their book, Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present.

Fermi–Dirac

Statistical mechanics is the application of probability theory to the motion of particles or objects when subjected to a force. In statistical mechanics, Fermi–Dirac statistics determine the statistical distribution of fermions over the energy states for a system in thermal equilibrium. In particle physics, fermions are particles with half-integer spin, such as protons and electrons.
Paul Adrien Maurice Dirac: Guyana 1995
Dirac was a British theoretical physicist and one of the founders of quantum mechanics. Dirac shared the 1933 Nobel Prize in physics with Erwin Rudolf Josef Alexander Schrödinger (1887–1961) "for the discovery of new productive forms of atomic theory." The stamp with a portrait of Fermi was issued in 2001, the year of his birth centenary. The stamp with a portrait of Dirac was issued by Guyana in 1995 in celebration of the 100th anniversary of the first Nobel Prize. We have identified eight stamps for Fermi and two for Dirac.
Fermi was an Italian physicist most noted for his work on the development of the first nuclear reactor and his contributions to statistical mechanics. He was awarded the Nobel Prize in physics in 1938 for his work on induced radioactivity.
Maxwell–Boltzmann

The special air-mail stamp features portraits of Maxwell on the right and Hertz on the left. The stamp was issued by Mexico for the Second International Telecommunications Planning Conference, October 30–November 15, 1967. Maxwell was a Scottish mathematician and theoretical physicist whose most significant achievement was aggregating a set of equations in electricity, magnetism, and inductance, eponymously named Maxwell's equations. He also developed the Maxwell distribution, a statistical means to describe aspects of the kinetic theory of gases.
Enrico Fermi: Guinea 2001
Heinrich Rudolf Hertz, on the left, and on the right, James Clerk Maxwell: Mexico 1967
Ludwig Boltzmann: Austria 1981
Boltzmann was an Austrian physicist famous for his founding contributions in the fields of statistical mechanics and statistical thermodynamics. He was one of the most important advocates for atomic theory when that scientific model was still highly controversial. In statistical mechanics, Maxwell–Boltzmann statistics describe the statistical distribution of material particles over various energy states in thermal equilibrium. Moreover, Wikipedia states, "Although Boltzmann first linked entropy and probability in 1877, it seems the relation was never expressed with a specific constant until Max Karl Ernst Ludwig Planck (1858–1947) … gave an accurate value for it, in his derivation of the law of black body radiation in December 1900." The stamp from Austria was issued in recognition of the 75th anniversary of the death of Ludwig Boltzmann in 1981; a first-day cover with this stamp is shown in Edgar Heilbronner and Foil Miller's book, A Philatelic Ramble Through Chemistry. On the same page, there is a first-day cover of a 1991 stamp for Maxwell from San Marino. There seem to be only two stamps for Maxwell and one for Boltzmann.
Satyendra Nath Bose: India 1994
Bose–Einstein

Bose was an Indian physicist who specialized in mathematical physics. He is probably best known for his work on quantum mechanics in the early 1920s, which provides the foundation for what are now known as Bose–Einstein statistics. In statistical mechanics, Bose–Einstein statistics are used to determine the statistical distribution of identical indistinguishable bosons over the energy states in thermal equilibrium; in particle physics, bosons are force carrier particles, such as the photon. They are distinguished from fermions (matter particles) by their integer spin. Bose's stamp is the only one we have identified with his portrait; it was issued by India in 1994 to celebrate the centenary of Bose's birth year.
Albert Einstein: Monaco 2005
Einstein is probably best known for his theory of relativity and mass-energy equivalence, E = mc². He received the 1921 Nobel Prize in physics for his "services to theoretical physics, and especially for his discovery of the law of the photoelectric effect." Einstein's many contributions include solving classical problems of statistical mechanics and providing an explanation of the Brownian motion of molecules and atomic transition probabilities. In 1999, Einstein was named "person of the century" by Time Magazine, and a poll of prominent physicists named him the greatest physicist of all time. This stamp from Monaco was issued in 2005 to celebrate the centenary of the publication of Einstein's Special Theory of Relativity. At least 267 stamps have been issued for Einstein: the ATA Einstein checklist, www.amstat.org/publications/chance (courtesy of Larry Dodson), lists 204, with a further 63 listed by Dodson. Joachim Reinhardt's online open-access web archive has images of 143 stamps for Einstein, and, in The Jewish World in Stamps, Ronald Eisenberg displays images of more than 30 stamps for Einstein.
Ready Resources

Styan, George P. H. and Trenkler, Götz (2007). "A Philatelic Excursion With Jeff Hunter in Probability and Matrix Theory." Journal of Applied Mathematics and Decision Sciences.

For a checklist of almost 5,000 mathematical stamps, contact Monty Strauss at [email protected]

The Jewish World in Stamps, by Ronald L. Eisenberg

A Philatelic Ramble Through Chemistry, by Edgar Heilbronner and Foil A. Miller

Mathematics and Science: An Adventure in Postage Stamps, by William L. Schaaf
Die Mathematik und ihre Geschichte im Spiegel der Philatelie, by Peter Schreiber

Stamping Through Mathematics, by Robin J. Wilson

PHILAMATH: A Journal of Mathematical Philately

The Mathematical Intelligencer

For demography on stamps, see "Demography on Stamps," by Ed Stephan in the Journal of Thematic Philately (includes an image of the Quetelet stamp)

For a philatelic introduction to magic squares and Latin squares, see "A Philatelic Introduction to Magic Squares and Latin Squares for Euler's 300th Birthyear," by George Styan in the Proceedings of the Annual Meeting of the Canadian Society for History and Philosophy of Mathematics, Montréal (Québec), Canada, 2007 (includes images of six stamps in honor of Euler)

For computers on stamps, see Computers on Stamps and Stationery, Volumes 1 and 2, by Larry Dodson (ATA 1998, 2004) and A Philatelic Tour of the American Civil War, by Larry Dodson (ATA 2006).
Holy Roman Emperor Joseph II, Leonhard Euler, and Wolfgang Amadeus Mozart

We have identified at least one stamp to honor the following three well-known people, two of whom one might not expect to have a connection with probability or statistics:
Holy Roman Emperor Joseph II (1741–1790)
Leonhard Euler (1707–1783)
Wolfgang Amadeus Mozart (1756–1791)
Emperor Joseph II on the left, and, on the right, his mother Empress Maria Theresa: Austria 1996
It was suggested by S. Clifford Pearce in the IMS Bulletin that Emperor Joseph II might be the emperor who wanted to do the following (which we now know to be impossible): arrange 36 officers in a square, one of each rank from each regiment, so that whichever row or column the emperor walked along, he would meet one officer of each of the six ranks and one from each of the six regiments. In 1782, the prolific mathematician Euler showed the impossibility of such an arrangement, known as a 6 x 6 Graeco-Latin square, "using an argument that is fallacious in method, but correct in its conclusion," according to Pearce. The first rigorous proof was given more than 100 years later in 1900 by the French mathematician Gaston Tarry (1843–1913), for whom we believe no stamp has been issued.
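As an aside for readers who enjoy the combinatorics, the sketch below (our own illustration, not part of the philatelic story) builds a 5 x 5 Graeco-Latin square by modular arithmetic and checks the emperor's condition; by Tarry's result, no analogous 6 x 6 construction can succeed.

```python
# A 5 x 5 Graeco-Latin square: cell (i, j) holds the pair (rank, regiment).
# Because gcd(1, 5) = gcd(2, 5) = 1, the two component squares are Latin and
# orthogonal, so every (rank, regiment) pair appears exactly once.
n = 5
square = [[((i + j) % n, (i + 2 * j) % n) for j in range(n)] for i in range(n)]

# every (rank, regiment) pair appears exactly once ...
assert len({pair for row in square for pair in row}) == n * n
# ... and each rank and each regiment appears once in every row and column
for k in range(n):
    for pick in (0, 1):
        assert len({square[k][j][pick] for j in range(n)}) == n   # row k
        assert len({square[i][k][pick] for i in range(n)}) == n   # column k

for row in square:
    print(row)
```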
Wolfgang Amadeus Mozart: Germany 2006
A musical dice game, Musikalisches Würfelspiel, is often attributed to the famous composer and musician Mozart. This game consists of a set of musical phrases in the style of a minuet, together with instructions for determining the order in which they are to be played by rolling dice. Music for a "Mozart Minuet from the Melody Dicer!" is available open-access at www.carousel-music.com. Martin Gardner, in Time Travel and Other Mathematical Bewilderments, says, "The most popular work explaining how a pair of dice can be used 'to compose without the least knowledge of music' as many German waltzes as one pleases was first published in Amsterdam and in Berlin in 1792, a year after Mozart's death." In Biometrika, M. G. Kendall gives the year of publication as 1793 and says, "This work is attributed to Mozart in the Kochel–Einstein index. It reappeared in 1806 in England (C. Wheatstone, London) as Mozart's Musical Game, fitted in an elegant box, showing an easy system to compose an unlimited number of Waltzes, Rondos, Hornpipes, and Reels. I have not been able to trace the Wheatstone edition or to find out what was in the elegant box, but the German edition [is] in the British Museum." We also have not been able to trace this 1806 Wheatstone edition.

Mozart was chamber musician and court composer to Emperor Joseph II, and, in the movie "Amadeus" (directed by Miloš Forman and based on Peter Shaffer's stage play), there is a scene in which Emperor Joseph II (Jeffrey Jones) is playing the piano with Mozart (Tom Hulce) while musician Antonio Salieri (F. Murray Abraham) watches intently. Mozart's well-known comic opera, "Così fan tutte," was commissioned by Emperor Joseph II and was well received at its première in Vienna on January 26, 1790. (Its run was cut short when the emperor died on February 20, 1790; Mozart died at the end of the following year, on December 5.)
Leonhard Euler: Switzerland 1957
The stamp for Emperor Joseph II was part of a souvenir sheet issued in celebration of Austria's 1,000th anniversary in 1996. Shown on the stamp with the emperor is his mother, Holy Roman Empress Maria Theresa (1717–1780), for whom we have identified eight other stamps. For Mozart, we conjecture that there are now more than 250 stamps. The ATA checklist (courtesy of Jo Bleakley) listed 163 as of June 1997, and the Mozart Forum, www.mozartforum.com, has images of 48 stamps issued to celebrate Mozart's 250th birth anniversary in 2006. For acknowledgements and a full list of references, visit www.amstat.org/CHANCE.
Probability, Statistics, Evolution, and Intelligent Design
Peter Olofsson
In the last decades, arguments against Darwinian evolution have become increasingly sophisticated, replacing Creationism by Intelligent Design (ID) and the book of Genesis by biochemistry and mathematics. As arguments claiming to be based in probability and statistics are being used to justify the anti-evolution stance, it may be of interest to readers of CHANCE to investigate methods and claims of ID theorists.
Probability, Statistics, and Evolution

The theory of evolution states in part that traits of organisms are passed on to successive generations through genetic material and that modifications in genetic material cause changes in appearance, ability, function, and survival of organisms. Genetic changes that are advantageous to successful reproduction over time dominate and new species evolve. Charles Darwin (1809–1882) is famously credited with originating and popularizing the idea of speciation through gradual change after observing animals on the Galapagos Islands. Today, the theory of evolution is the scientific consensus concerning the development of species, but is nevertheless routinely challenged by its detractors. The National Academy of Sciences and Institute of Medicine (NAS/IM) recently issued a revised and updated document, titled "Science, Evolution, and Creationism," that describes the theory of evolution and investigates the relation between science and religion. Although the latter topic is of interest in its own right, in fairness to ID proponents, it should be pointed out that many of them do not employ religious arguments against evolution and this article does not deal with issues of faith and religion.

How do probability and statistics enter the scene? In statistics, hypotheses are evaluated with data collected in a way that introduces as little bias as possible and with as much precision as possible. A hypothesis suggests what we would expect to observe or measure, if the hypothesis were true. If such predictions do not agree with the observed data, the hypothesis is rejected and more plausible hypotheses are suggested and evaluated. There are many statistical techniques and methods that may be used, and they are all firmly rooted in the theory of probability, the "mathematics of chance."
An ID Hypothesis Testing Challenge to Evolution

In his book The Design Inference, William Dembski introduces the "explanatory filter" as a device to rule out chance explanations and infer design of observed phenomena. The filter also appears in his book No Free Lunch, where the description differs slightly. In essence, the filter is a variation on statistical hypothesis testing with the main difference being that it aims at ruling out chance altogether, rather than just a specified null hypothesis. Once all chance explanations have been ruled out, 'design' is inferred. Thus, in this context, design is merely viewed as the complement of chance.

To illustrate the filter, Dembski uses the example of Nicholas Caputo, a New Jersey Democrat who was in charge of putting together the ballots in his county. Names were to be listed in random order, and, supposedly, there is an advantage in having the top line of the ballot. As Caputo managed to place a Democrat on the top line in 40 out of 41 elections, he was suspected of cheating. In Dembski's terminology, cheating now plays the role of design, which is inferred by ruling out chance.

Let us first look at how a statistician might approach the Caputo case. The way in which Caputo was supposed to draw names gives rise to a null hypothesis H0: p = 1/2 and an alternative hypothesis HA: p > 1/2, where p is the probability of drawing a Democrat. A standard binomial test of p = 1/2 based on the observed relative frequency p̂ = 40/41 ≈ 0.98 gives a solid rejection of H0 in favor of HA with a p-value of less than 1 in 50 billion, assuming independent drawings. A statistician could also consider the possibility of different values of p in different drawings, or dependence between listings for different races.

What then would a 'design theorist' do differently? To apply Dembski's filter and infer design, we need to rule out all chance explanations; that is, we need to rule out both H0 and HA. There is no way to do so with certainty, and, to continue, we need to use methods other than probability calculations. Dembski's solution is to take Caputo's word that he did not use a flawed randomization device and conclude that the only relevant chance hypothesis is H0. It might sound questionable to trust a man who is charged with cheating, but as it hardly makes a difference to the case whether Caputo cheated by "intelligent design" or by "intelligent chance," let us not quibble, but generously accept that the explanatory filter reaches the same conclusion as the test: Caputo cheated. The shortcomings of the filter are nevertheless obvious, even in such a simple example.

In No Free Lunch, Dembski attempts to apply the filter to a real biological problem: the evolution of the bacterial flagellum, the little whip-like motility device some bacteria
such as E. coli possess. Dembski discusses the number and types of proteins needed to form the different parts of the flagellum and computes the probability that a random configuration will produce the flagellum (using the analogy of shopping randomly for cake ingredients). He concludes it is so extremely improbable to get anything useful that design must be inferred.

A comparison of Dembski's treatments of the Caputo case and the flagellum is highly illustrative, focusing on two aspects. First, in each case, Dembski only considers one chance hypothesis—the uniform distribution over possible sequences and protein configurations, respectively. He presents no argument as to why rejecting the uniform distribution rules out every other chance hypothesis. Instead, he shifts the burden of proof to the "design skeptic," who, according to Dembski, "needs to explicitly propose a new chance explanation and argue for its relevance." In the Caputo case, it may be warranted to test only one chance hypothesis, as there is only one such hypothesis that equates to fairness, but the situation is radically different for the flagellum, where nonuniformity in no way contradicts an evolutionary process of mutation and natural selection. Dembski routinely uses the uniform distribution as a synonym for lack of knowledge, a dubious practice that has been gainfully exposed by probabilist Olle Häggström.

Second, the one specific sequence of Democrats and Republicans that Caputo produced must be put together with other comparable sequences to obtain the rejection region. More specifically, we need to consider the set of 42 sequences that have at least 40 Democrats and compute its probability. Dembski does this correctly in the Caputo case, but when it comes to the flagellum, he does not consider the rejection region; he simply computes the probability of the outcome. Dembski's way around this problem is to use his own term, "specification," a vague concept that does not have a strict mathematical definition, but is intended to be a generalization of rejection region. In an essay titled "Specification: The Pattern That Signifies Intelligence," it is said that "Specification denotes the type of pattern that highly improbable events must exhibit before one is entitled to attribute them to intelligence." In No Free Lunch, the index entry "Specification, definition of" leads to a page where specification is used as a synonym for rejection region. The filter requires us at some point to compute a probability, so whatever "specification" is, it must be possible to convert it into the mathematical object of a set. In the Caputo case, the two descriptions are easily integrated, as cheating can be described as patterns of the type "more Ds than Rs," which also correspond to sets of sequences. However, when it comes to biological applications such as the flagellum, Dembski merely claims specification "always refers to function" and develops it no further. As opposed to the simple Caputo example, it is now very unclear how a relevant rejection region would be formed. The biological function under consideration is motility, and one should not just consider the exact structure of the flagellum and the proteins it comprises. Rather, one must form the set of all possible proteins and combinations thereof that could have led to some motility device through mutation and natural selection, which is, to say the least, a daunting task.
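Returning briefly to the Caputo case, the numbers quoted above are easy to verify. The following few lines of Python are our own check (not Dembski's calculation): they reproduce the one-sided binomial p-value and the probability of the 42-sequence rejection region under fair, independent draws.

```python
# One-sided binomial test of H0: p = 1/2 for 40 Democrats in 41 draws, and the
# probability of the rejection region "at least 40 of 41 Democrats."
from scipy import stats

n = 41
p_value = stats.binom.sf(39, n, 0.5)   # P(X >= 40) under H0
print(p_value)                         # about 1.9e-11, i.e., less than 1 in 50 billion
print(42 / 2**n)                       # same number: 42 favorable sequences out of 2^41
```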
A general point of criticism against ID is that it does not offer any scientific explanations of natural phenomena, but merely attempts to discredit Darwinian evolution, aiming at inferring 'design' by default. Dembski's filter is streamlined to this approach; by trying to rule out all chance hypotheses, it attempts to infer design without stating any competing design hypotheses. Above, it was demonstrated how the filter runs into trouble, even when it is viewed entirely within Dembski's chosen paradigm of "purely eliminative" hypothesis testing. Others have criticized the eliminative nature of the filter, claiming that useful design inference must be comparative.

In a chapter titled "Design by Elimination vs. Design by Comparison" in his book The Design Revolution, Dembski counters this type of criticism. He starts by doing a 'reality check' to conclude that "the sciences look to Ronald Fisher and not Thomas Bayes for their statistical methodology," referring to the divide in the statistical community (to the extent that such a divide really exists) between the frequentist approach—in which unknown parameters are viewed as constants and are subject to hypothesis testing—and the Bayesian approach—in which unknown parameters are viewed as random variables described by their probability distributions. However, the type of pure elimination he devises is not how statistical hypothesis testing is done in the sciences. A null hypothesis H0 is not merely rejected; it is rejected in favor of an alternative hypothesis HA. Moreover, one can compute the likelihood of the data for various parameter choices specified by HA to conclude the evidence is, indeed, in favor of HA (so-called power calculations). Hence, the statistical methodology of the sciences is eliminative and comparative.

One reason for Dembski to try to align with the frequentist camp is that there are indisputable problems with "Bayesian design inference." For example, to apply Bayesian methods, one would have to assign a prior probability distribution over various chance and design hypotheses, which is obviously a more or less hopeless task. Dembski is not satisfied with such limited countercriticism, but decides to take on Bayesian inference altogether. In doing so, he claims Bayesian inference is "parasitic on the Fisherian approach," as a Bayesian analysis must also use rejection regions! He even claims Bayesians do so "routinely," but does not offer any examples. As the entire Bayesian approach is completely incompatible with the concept of hypothesis testing in general and rejection regions in particular, any such example would surely rock the world of statistics. To illustrate his point, Dembski instead revisits the Caputo example. In his notation, the event E is the observed sequence of 40 Democrats and one Republican in some fixed order,
and the event E* is the set of the 42 sequences with at least 40 Democrats. Thus, E* is the rejection region from the hypothesis test above and Dembski's claim is that a Bayesian analysis must also use E*, rather than E.

Here is a typical Bayesian analysis of the Caputo example: Let p, now viewed as a random variable, denote the probability of selecting a Democrat; let f denote the prior density of p, and assume independent trials. The posterior density of p conditioned on the observed sequence E then satisfies the proportionality relation f(p | E) ∝ p⁴⁰(1 − p)f(p), where the factor p⁴⁰(1 − p) is the probability of E if the true parameter value is p. For example, if we choose a uniform prior distribution for p, the posterior distribution turns out to be a so-called Beta distribution with mean 41/43. In this posterior distribution, the probability that p is not above 1/2 turns out to be only about 10⁻¹¹, which gives clear evidence against fair drawing. The Bayesian analysis does not involve the set E* or any other rejection regions. To do Bayesian design inference, one would need to augment the parameter space to allow for various design hypotheses and compute their respective likelihoods. Regardless of how this would be done practically, no rejection regions would ever be formed.
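The Bayesian numbers quoted above can be reproduced in a couple of lines; the sketch below is our own check, using the Beta(41, 2) posterior that results from a uniform prior and the observed sequence of 40 Democrats and one Republican.

```python
# Posterior for p under a uniform prior: Beta(1 + 40, 1 + 1) = Beta(41, 2).
from scipy import stats

posterior = stats.beta(41, 2)
print(posterior.mean())     # 41/43, as stated in the text
print(posterior.cdf(0.5))   # about 1e-11: strong evidence against fair drawing
```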
An ID Probability Challenge to Evolution

Michael Behe has presented his criticism of evolutionary biology in two books: Darwin's Black Box, published in 1996, and The Edge of Evolution, the 2007 follow-up. The former does not contain much mathematics, but, in The Edge of Evolution, Behe has a chapter titled "The Mathematical Limits of Darwinism," where he attempts to use probability and statistics to argue the case for ID. Behe's central argument against human evolution hinges on how the malaria parasite P. falciparum has become resistant to chloroquine. The reason for invoking the malaria parasite is an estimate from the literature that the set of mutations
necessary for chloroquine resistance has a probability of about 1 in 10²⁰ of occurring spontaneously. Any statistician is bound to wonder how such an estimate is obtained, and, needless to say, it is very crude. Obviously, nobody has performed huge numbers of controlled binomial trials, counting the numbers of parasites and successful mutation events. Rather, the estimate is obtained by considering the number of times chloroquine resistance has not only occurred, but taken over local populations—an approach that obviously leads to an underestimate of unknown magnitude of the actual mutation rate, according to Nicholas Matzke's review in Trends in Ecology & Evolution. Behe wishes to make the valid point that microbial populations are so large that even highly improbable events are likely to occur without the need for any supernatural explanations, but his fixation on such an uncertain estimate and its elevation to paradigmatic status seems like an odd practice for a scientist.

Behe states a definition that incorporates the 1-in-10²⁰ figure: "Let's dub mutation clusters of that degree of complexity—1 in 10²⁰—'chloroquine-complexity clusters,' or CCCs." He then goes on to claim that, in the human population of the last 10 million years, where there have only been about 10¹² individuals, the odds are solidly against such an unlikely event occurring even once. In Behe's own words and italics:

On average, for humans to achieve a mutation like this by chance, we would need to wait a 100 million times 10 million years. Since that is many times the age of the universe, it's reasonable to conclude the following: No mutation that is of the same complexity as chloroquine resistance in malaria arose by Darwinian evolution in the line leading to humans in the past 10 million years.

On the surface, his argument may sound convincing. We humans are tremendously complex, and the malaria parasite consists of only one cell. Clearly, it would be absurd to claim we have evolved without experiencing even one mutation as complex as the little bug demonstrably has done. But one does not have to scratch deeply below the surface to recognize problems with Behe's statements.
First, he leaves the concept "complexity" undefined—a practice that is clearly anathema in any mathematical analysis. Thus, when he defines a CCC as something that has a certain "degree of complexity," we do not know of what we are measuring the degree. Lack of a clear definition is a fundamental problem when asserting something is proved, but let us nevertheless look further at Behe's claims. As stated, his conclusion about humans is, of course, flat out wrong, as he claims no mutation event (as opposed to some specific mutation event) of probability 1 in 10²⁰ can occur in a population of 10¹² individuals (an error similar to claiming that most likely nobody will win the lottery because each individual is highly unlikely to win). Obviously, Behe intends to consider mutations that are not just very rare, but also useful, as can be concluded from his statement, "So, a CCC isn't just the odds of a particular protein getting the right mutations; it's the probability of an effective cluster of mutations arising in an entire organism." Note that Behe now claims CCC is a probability; whereas, it was previously defined as a mutation cluster, another confusion arising from Behe's failure to give a precise definition of his key concept.

A problem Behe faces is that "rarity" can be defined and ordered in terms of probabilities; whereas, he suggests no separate definition of "effectiveness." For an interesting example, also covered by Behe, consider another malaria drug, atovaquone, to which the parasite has developed resistance. The estimated probability is here about 1 in 10¹², thus a much easier task than chloroquine resistance. Should we then conclude atovaquone resistance is a 100 million times worse, less useful, and less effective than chloroquine resistance? According to Behe's logic, we should.

Behe makes a point of his probability of 1 in 10²⁰ being estimated from data, rather than calculated from theoretical assumptions. This approach leads to a catch-22 situation if we consider the human population with its 10¹² members. Behe's claim is that there has not been a single CCC in the human population, and thus Darwinian evolution is impossible. But, if a CCC is an observed relative frequency, how could there possibly have been one in the human population? As soon as a mutation has been observed, regardless of how useful it is to us, it gets an observed relative frequency of at least 1 in 10¹² and is thus very far from acquiring the magic CCC status. Think about it. Not even a Neanderthal mutated into a rocket scientist would be good enough; the poor sod would still decisively lose out to the malaria bug and its CCC, as would almost any mutation in almost any population. In the above sense, Behe's claim is vacuously true.

On the other hand, Behe has now painted himself into a corner, where he cannot obtain any empirical evidence for design because, as soon as a mutation has been observed, its existence is attributable to Darwinian evolution by population number arguments alone. Does there exist any population of any species where some individuals carry a useful mutation and others do not, such that this mutation can be explained by Darwinian evolution? Behe has already told us that one such example is chloroquine resistance in malaria. Does there exist any population of any species where some individuals carry a useful mutation and others do not, such that this mutation cannot be explained by Darwinian evolution? No. If one of n individuals experiences a mutation, the estimated mutation
probability is 1/n. Regardless of how small this number is, the mutation is easily attributed to chance because there are n individuals to try. Any argument for design based on estimated mutation probabilities must therefore be purely speculative.

Arguments against the theory of evolution come in many forms, but most share the notion of improbability, perhaps most famously expressed in British astronomer Fred Hoyle's assertion that the random emergence of a cell is as likely as a Boeing 747 being created by a tornado sweeping through a junkyard. Probability and statistics are well developed disciplines with wide applicability to many branches of science, and it is not surprising that elaborate probabilistic arguments against evolution have been attempted. Careful evaluation of these arguments, however, reveals their inadequacies.
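A toy lottery calculation, with numbers invented purely for this illustration, makes the point about population size concrete: every individual ticket is a hopeless long shot, yet it is quite likely that somebody wins.

```python
# The "lottery fallacy" in two lines: improbable for each, likely for someone.
p_win = 1e-7            # hypothetical probability that a single ticket wins
tickets = 10**7         # hypothetical number of tickets sold
print(1 - (1 - p_win) ** tickets)   # about 0.63: some ticket probably wins
```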
Further Reading

Elsberry, W. and Shallit, J. (2004). "Playing Games with Probability: Dembski's Complex Specified Information." In Why Intelligent Design Fails: A Scientific Critique of the New Creationism. Rutgers University Press: Piscataway, New Jersey.

Häggström, O. (2007). "Intelligent Design and the NFL Theorems." Biology and Philosophy, 22:217–230.

Matzke, N. (2007). "Book Review: The Edge of Creationism." Trends in Ecology & Evolution, 22:566–567.

National Academy of Sciences and Institute of Medicine (2008). Science, Evolution, and Creationism. Washington, DC: The National Academies Press. www.nap.edu/catalog/11876.html

Olofsson, P. (2008). "Intelligent Design and Mathematical Statistics: A Troubled Alliance." Biology and Philosophy (in press).

Perakh, M. (2003). Unintelligent Design. Prometheus Books: Amherst, New York.

Shallit, J. (2002). "Book Review: No Free Lunch." BioSystems, 66:93–99.

Sober, E. (2002). "Intelligent Design and Probability Reasoning." International Journal for the Philosophy of Religion, 52:65–80.
A Statistician Reads the Sports Pages
Phil Everson, Column Editor
Good Offense vs. Poor Defense: The 2007 Women's World Cup
Soccer's 2007 Women's World Cup (WWC) began in China September 10 with defending champion Germany defeating Argentina 11–0. The next day, the U.S. team played North Korea (Korea DPR) to a 2–2 draw. Is Germany a much better offensive team than either the United States or Korea? Did Argentina play especially poor defense? Or, might all the teams be of equal ability and the results due to chance alone? It is impossible to answer these questions based on results from only two matches. With a sufficiently large number of match outcomes, however, it should be possible to fit a probability model to the data that accounts for the varying abilities of teams to score and to prevent goals.

Germany went on to win the 2007 WWC without giving up a single goal, beating Brazil 2–0 in the final. Brazil made it to the final match by beating the United States 4–0. That outcome was heavily scrutinized because the U.S. coach had made the decision to play 1999 World Cup and 2004 Olympic champion goalkeeper Briana Scurry instead of Hope Solo, who had been the starter since 2005 and was coming off a shutout against England in the Quarterfinals (the United States won that match 3–0). Solo made some unfortunate comments to the press following the loss, which also drew media attention.

In this article, I describe the problem of modeling outcomes for soccer matches. A particular question of interest is how the United States might have fared against Brazil in the 2007 WWC had Solo played, rather than Scurry. Solo did face a similar Brazil team in the gold medal match of the 2008 Beijing Olympic Games this past summer. She did not allow a goal in 120 minutes of play, and helped the United States win 1–0. The analyses here use only the 2007 WWC data. A reasonable student project would be to try out some of the techniques described here to, for example, compare Scurry's play in the 2004 Athens Olympics to Solo's play in Beijing.
United States women's national soccer team goalkeepers Briana Scurry, left, and Hope Solo practice grabbing shots during a training session September 26, 2007, in Hangzhou, China. Scurry started in place of Solo, who started the first four matches, when the United States played Brazil in a semifinal match in the FIFA 2007 Women's World Cup soccer tournament. (AP Photo/Julie Jacobson)

Terminology

The FIFA (Federation Internationale de Football Association) Women's World Cup is a tournament involving teams from
16 countries and consists of 32 total matches (including the third-place match). Participating countries are divided into four groups of four. The four teams in a group hold matches between every pair of teams in the group; that is six pairs. Thus, there are 24 pool matches. Pool matches end after two 45-minute periods and may result in draws. Three points are awarded for each win and one point for each draw in deciding which teams advance beyond pool play. The top two teams from each group advance to the knockout stage. This is an ordinary eight-team tournament. Knockout matches must end with a decision. If the score is tied after regulation time, the teams play two additional 15-minute periods. If the score is still tied, the match is
decided by a shootout. A convenient feature of the 2007 WWC data is that no extra time periods were needed during the eight knockout matches. Every team plays three pool matches and zero, one, or three knockout matches. In the first round of the knockout stage, there are four matches. Half of these teams are eliminated and play no more. The four winners play two more matches. The two teams who win in the second knockout stage play each other for the WWC championship. The two losing teams in the second knockout stage play each other to determine third and fourth place. The four teams playing in the third round of the knockout stage (i.e., the semifinalists) play a total of six matches during the tournament. The number of goals scored in a match determines the winner, but it also can be informative to consider the numbers of shots or shots on goal. A “shot” is a subjective designation made by the official designated as the scorer. A “shot on goal” is a shot preceding a goal or a save. That is, the shot would have resulted in a goal had the goalkeeper not stopped it. I focus primarily on shots on goal. Any player credited with a goal also is credited with a shot and a shot on goal. But, not all goals result from a shot on goal. In the 2006 World Cup, the U.S. men’s team tied Italy 1−1, but had 0 shots on goal. When
a player knocks the ball into his or her own goal, it is classified as an "own goal" and no shot on goal is recorded (but the goal counts in the match outcome).

2007 World Cup Data

Using match reports from the BBC web site, I recorded the total numbers of shots, shots on goal, and goals for each team in each match. Because no extra periods were played during the knockout stage, all 32 matches consisted of two 45-minute periods. There were three own goals in the 2007 WWC, and none of these were decisive, although one occurred against Scurry in Brazil's 4–0 victory. I treat this as a 3–0 win for Brazil and use adjusted goal totals that omit own goals for the computations in the following sections. That way, every goal has a corresponding shot on goal, making it appropriate to talk about binomial models and other conditional probability formulations. (For the interested reader, both the adjusted and actual goal totals are available in the data set at www.swarthmore.edu/NatSci/peverso1.)
Figure 1. Boxplots of the shots on goal for and against each of the four 2007 World Cup semifinalists. Each team played six matches, and the six individual game totals are marked on the graph, with solid dots indicating matches against other semifinalists. The narrower, offset boxplots represent the distributions of the numbers of goals scored and allowed by each team.
Figure 2. Goals per shot on goal. The bar heights correspond to the goals per shot on goal taken (gray bars) and faced (black bars) by each of the four 2007 World Cup semifinalists. The widths of the bars are proportional to the number of shots on goal in each case, scaled relative to the maximum value (54 shots taken by Brazil).
In the 32 matches of the 2007 WWC, there were a total of 962 shots, 442 shots on goal, and 64 goals scored (excluding the three own goals). So, overall, about 11% of shots and 24% of shots on goal resulted in scores (and 46% of shots were shots on goal). Also, there were exactly two goals per match, on average.

Figure 1 shows boxplots of shots on goal attempted by and allowed by Brazil, Germany, Norway, and the United States, the four teams that played in the semifinal matches. Notice that, as well as being alphabetical, this is the ordering for the average values of shots on goal (indicated by the gray horizontal lines). Germany's average shots on goal and average goals are inflated because they had 23 shots on goal and scored 11 goals (!) against Argentina. The next-largest values are 17 shots on goal and seven goals scored by Norway against Ghana.
Figure 3. Simulated maximum differences in median shots on goal for the four 2007 World Cup semifinalists. For each of 10,000 simulations, the 193 shots on goal were randomly reassigned to one of the six matches played by each of the four teams. The results were tallied for each team and the maximum difference in medians recorded. About 30% of simulated values were as large as or larger than the observed difference of 3.5 shots on goal per match (shaded bars).
Germany has the smallest median shots on goal of the four teams, but the second-largest average. There are six matches per team, so each boxplot marks the lowest, second-lowest, average of the middle two values (i.e., the median), the second-largest, and the largest observation for a team. The solid dots mark values that correspond to matches played against the other three semifinalists, presumed to be the strongest opponents. The narrower, offset boxplots represent goals scored and allowed by each team.

Figure 2 shows side-by-side barplots of the proportions of shots on goal for and against each team that were successfully converted into goals. The gray bars represent the proportions of shots by each team that resulted in goals scored, and the black bars represent the proportions of shots faced by each team that resulted in goals allowed. The widths of the bars represent the number of shots on goal taken or faced, relative to the largest value (54 shots on goal taken by Brazil). There is some evidence for team differences, but the small sample
sizes make it difficult to judge by sight as to whether the differences are due to more than chance variation.
Simple Models and Estimates

Before fitting more complicated models, it is helpful to try a few simple procedures, even if the simplified assumptions may not seem entirely appropriate. For example, we could compare the four semifinalists to see if there is evidence of differences in median shots on goal per match, or in goal-scoring probabilities. With such small sample sizes, traditional t or z tests may not be appropriate, but a case can be made for permutation tests.

Shots on Goal

The distributions of shots on goal are skewed, so the median seems like a more appropriate summary than the mean. The medians range from five shots on goal for Germany to 8.5 for Brazil, and this range of 3.5 provides one metric for assessing the variability among teams. A permutation test answers the following question: Ignoring the possibly varying strengths of the opponents, how unlikely is it that the four medians would differ by as much as 3.5 shots on goal per match if the teams were in fact equivalent?
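A minimal Python sketch of this permutation test appears below; it is our own implementation of the reassignment scheme spelled out later in the article. Conveniently, only the total of 193 shots on goal and the 4-team by 6-match structure are needed to simulate the null distribution.

```python
# Permutation test: reassign all 193 shots on goal at random to one of the
# 4 semifinalists x 6 matches, recompute each team's median, and record the
# largest difference in medians each time.
import numpy as np

rng = np.random.default_rng(2007)
n_shots, n_teams, n_matches = 193, 4, 6
observed_max_diff = 3.5                  # 8.5 (Brazil) minus 5 (Germany)

max_diffs = np.empty(10_000)
for s in range(max_diffs.size):
    teams = rng.integers(0, n_teams, n_shots)
    matches = rng.integers(0, n_matches, n_shots)
    counts = np.zeros((n_teams, n_matches))
    np.add.at(counts, (teams, matches), 1)       # tally shots per team-match
    medians = np.median(counts, axis=1)
    max_diffs[s] = medians.max() - medians.min()

print(np.mean(max_diffs >= observed_max_diff))   # roughly 0.3, as in Figure 3
```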
Table 1 — Counts (and Proportions) of Goals Scored, Saves Produced, and Shots on Goal for the 2007 World Cup Semifinalists

Team       Goals       Saves        Shots on Goal
Germany    20 (40%)    30 (60%)     50
Brazil     16 (30%)    38 (70%)     54
USA        12 (27%)    32 (73%)     44
Norway     12 (27%)    33 (73%)     45
Total      60 (31%)    133 (69%)    193
Table 2 — Counts (and Proportions) of Goals Allowed, Saves Made, and Shots on Goal Allowed for the 2007 World Cup Semifinalists

Team       Goals       Saves        Shots on Goal
Germany    0 (0%)      31 (100%)    31
Brazil     4 (24%)     13 (76%)     17
USA        6 (21%)     22 (79%)     28
Norway     10 (27%)    29 (73%)     39
Total      20 (17%)    95 (83%)     115
There were 193 shots on goal taken in the 24 matches played by the four semifinalists. We can define "equivalent" to mean that, conditional on four teams having taken a total of 193 shots on goal, each shot is equally likely to have been taken by any of the four teams in any of their six matches. For example, this is achieved if we assume shot on goal counts occur in each match according to independent Poisson processes with a common mean number of shots on goal per match for the four teams. The total count is a sufficient statistic for the mean of a Poisson distribution, so the conditional distribution of the individual counts does not depend on the mean. Because all the matches were of the same length, assuming equal means implies the probabilities that each particular shot on goal was taken by a particular team in a particular match are all equal.

Assuming a null hypothesis that the four teams are "equivalent," we can simulate hypothetical shot totals by assigning each of the 193 shots on goal to a randomly chosen team and match and tallying the results for the four teams. First, label the four teams 1, …, 4 and their six match appearances 1, …, 6. For each of the 193 shots on goal, generate a random integer from 1, …, 4 to represent a team and another random integer from 1, …, 6 to represent the appearance for that team. If, for example, the first numbers selected were 1 and 6, the shot would be assigned to team 1 in their last match. If the same pair of numbers was selected again, another of the 193 shots would be counted as part of team 1's shot total in their sixth match. After all shots have been reassigned, each team has six new shot totals and a new median value. Comparing these yields a new largest difference in medians, randomly generated from the sampling distribution implied by the null hypothesis. For example, in one random partition, the median shots on goal were 2.5 for Brazil, 4 for Norway, and 4.5 for both Germany and the United States. So, the largest median difference for this partition is 4.5 − 2.5 = 2, compared to the observed value of 3.5.

Goal-Scoring Percentages

We can use a similar permutation test to compare goal-scoring frequencies for shots on goal. Tables 1 and 2 list the counts of goals scored, saves produced, and shots on goal for and against each of the four semifinalist teams. Assuming the outcomes of shots on goal are independent of one another, the null hypothesis of equal scoring probabilities implies an exchangeability similar to what we had with shots on goal. The team differences are more pronounced for the opponents' scoring frequencies, and this defensive statistic is more relevant for the comparison of Solo and Scurry. So, we focus on the data in Table 2.

Given the number of shots on goal faced by each team, and given that there were 20 goals scored, we can simulate the outcomes for the 115 individual shots on goal faced by randomly selecting a shot (without replacement) to go with each of the 20 goals. For each simulated reassignment of goals, I recorded the percentage of goals allowed by each team, and the maximum difference in goals-allowed percentages. Figure 4 shows the approximate sampling distribution of the maximum difference in goal percentages under the null hypothesis that the four teams had equal scoring probabilities. The sampling distribution conditions on 20 goals and the distribution of shots for all teams. The shaded region
a 2x2 table, the probabilities can be worked out exactly. If X represents the number of goals against Solo, then
P(X = k) = C(6, k) C(22, 18 − k) / C(28, 18),   k = 0, 1, …, 6,

where C(n, r) denotes the binomial coefficient "n choose r."
Figure 4. Simulated maximum differences in goals-against percentages for the 2007 World Cup semifinalists. For each of 10,000 simulations, the 20 goals scored against these teams were randomly assigned to one of the 115 shots on goal. The results were tallied for each team and the maximum difference in goals-against percentage recorded. About 10% of simulated values were as large as or larger than the observed difference of 0.27 (shaded bars).
Comparing Solo and Scurry

Table 3 gives the counts of goals allowed and shots on goal allowed for the U.S. matches involving Solo and Scurry.

Table 3 — Counts (and Proportions) of Goals Allowed, Saves Made, and Shots on Goal Allowed for U.S. Matches Involving Solo and Scurry in Goal

Keeper    Goals      Saves      SOG
Solo      3 (17%)    15 (83%)   18
Scurry    3 (30%)    7 (70%)    10
Total     6 (21%)    22 (79%)   28

With a 2×2 table, we can perform the Fisher Exact Test to compare the differences in goals-allowed percentages. Making an argument similar to that in the Goal-Scoring Percentages section, we can imagine each of the six goals being equally likely to be associated with any of the 28 shots on goal faced by the two goalkeepers. The probability that Solo's percentage would be as much lower than Scurry's as it was is the probability of three or fewer of the goals going against Solo. We could do a simulation to estimate this probability, but with a 2×2 table, the probabilities can be worked out exactly. If X represents the number of goals against Solo, then

P(X = k) = C(6, k) C(22, 18 − k) / C(28, 18),   k = 0, 1, ..., 6,

where C(n, k) denotes the binomial coefficient "n choose k." The numerator represents the number of ways to assign k of the six goals and 18 − k of the 22 saves to the 18 shots on goal for Solo. The denominator is the total number of ways to assign goals and saves to the 18 shots. The probability that Solo would outperform Scurry by as much as or more than she did is

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0.0006 + 0.0120 + 0.0853 + 0.2600 = 0.3578.

Notice that, even if Solo had allowed only two of the six goals (and Scurry had allowed four), the difference would still not be large enough to meet the standard 0.05 significance level. Considering the other tail (the possibility that Scurry would have a percentage much lower than Solo's) makes the difference even less significant.

One shortcoming of these comparisons is that shots and goal percentages are considered separately. Germany typically had fewer shots on goal than the other three semifinalist teams, but compensated for it by having the highest goal conversion rate (along with a perfect defensive performance). And the comparisons of goal percentages treat the shots on goal as fixed. A better analysis would incorporate both shots on goal and goal totals, as well as the strengths of opponents. A way to approach this is to fit a model with team parameters that govern propensities to create and allow shots on goal, and to convert shots and allow shots to be converted into goals.
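The hypergeometric probabilities in the calculation above can be reproduced in a few lines. The sketch below uses SciPy, which is simply a convenient choice here and is not something the article relies on.

```python
from scipy.stats import hypergeom

# X = number of the 6 goals charged to Solo when the 6 goals are spread
# at random over the 28 shots on goal, 18 of which Solo faced (Table 3).
X = hypergeom(28, 6, 18)   # population size, number of goals, shots faced by Solo

print([round(X.pmf(k), 4) for k in range(4)])  # 0.0006, 0.0120, 0.0853, 0.2600
print(round(X.cdf(3), 4))                      # P(X <= 3) = 0.3578
```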
Probability Models for Goals Scored and Allowed
Permutation tests did not demonstrate a difference between teams, but such tests typically do not have much power. Often, we are more interested in finding good estimates of individual team abilities. For ranking teams or predicting future outcomes, a probability model for the observed data makes the problem explicit. In this example, we have a two-stage model: first, the two teams in a match produce some number of shots on goal, and then each shot is converted into either a goal or a save. If we assume past outcomes do not influence future events (e.g., no "hot feet"), then it is not necessary to evaluate the outcomes sequentially.

After fitting parameters to determine the distributions of shots on goal and goal probabilities in each match, new outcomes for the tournament may be simulated and compared to the actual results. If the actual data stand out as strikingly different from the simulated data, that would be evidence that the model is not accurately representing the "random" process that produced the data. If the data seem like a "typical" data set from the simulation, then there is some reason to think the fitted parameters of this model may reflect something about the nature of the real process.
Figure 5. Histogram of shots on goal. The solid dots represent the predicted counts for a simple Poisson model with a constant mean for all 64 shot totals. The open dots represent the predicted counts based on a model that allows mean shot totals to differ depending on the offensive and defensive teams involved.
As a first attempt at modeling shots on goal, consider a Poisson process with rate μ shots on goal per match. The overall average for the 32 matches in the 2007 World Cup (64 shot totals) is 6.9, which serves as the estimate of μ. Figure 5 displays a histogram of the observed frequencies of the various shots on goal totals. The solid dots represent the predicted frequencies based on this simple Poisson model. It is evident from Figure 5 that the constant-mean model tends to underpredict the very small and very large shot totals.

To allow for more variability, we can allow different Poisson means for different team-by-match combinations, depending on the offensive and defensive teams involved. The two largest shot totals (23 by Germany and 18 by England) were both against Argentina (Argentina also allowed 11 shots on goal to Japan), suggesting it is important to consider defense as well as offense. The open dots in Figure 5 represent predicted shots on goal totals based on a model that makes these offensive and defensive adjustments. Details of the model and fitting procedures are available at www.amstat.org/chance. The expanded model fits better at the low and high ends of the distribution.

Figure 6 compares the 64 observed shots on goal totals to the mean values estimated by the model. I first generate random values from the conditional (posterior) distribution, given the observed outcomes, of the parameters of a probability model that describes shots on goal. Then, for each combination of offensive and defensive teams, I compute the mean shots on goal by adding the simulated offensive parameter for the offensive team and the simulated defensive parameter for the defending team. The teams that take the most shots on goal have large offensive and small defensive parameter values, so they contribute a lot to their own mean and little to their opponent's mean. The overall estimate of the mean for each of the 64 combinations is the average of 1,600 means generated in this way. These fitted values are plotted on the horizontal axis in Figure 6, with the observed
values on the vertical axis. With the constant-mean model, all predictions would be along the horizontal line, marking the overall average (6.9) shots on goal per match. For each simulated mean in Figure 6, I also generated a random Poisson value with that mean to represent a new observed shots on goal total. The dashed lines are the 0.05 and 0.95 sample quantiles of 1,600 simulated shots on goal for each of the 64 team-by-match combinations. The jaggedness is due to the randomness inherent in the simulation. Note that a 90% prediction interval for a Poisson(6.9) distribution (the constant-mean model) includes between two and 11 shots on goal.

Similarly, I allow the probability of converting a goal to vary depending on the team taking the shot and the team defending the shot. For each combination, the probability of scoring a goal on a given shot on goal is computed as the probability of a standard normal variable not exceeding the sum of the offensive and defensive parameters for the teams involved. This is related to the probit regression model. The outcomes of different shots in a particular match are assumed independent, so, given the number of shots on goal, the number of goals is modeled as binomial. Germany converted 40% of their shots on goal (the largest value for the 16 teams) and allowed 0%, so they would be considered strong in both categories, with a large offensive and a small defensive parameter. Figure 7 displays a comparison similar to that in Figure 6 for observed goal conversion percentages vs. fitted probabilities for each combination of offensive and defensive teams. The gray points are observed percentages based on four or fewer shots on goal (22 of the 64 combinations).
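To make the additive offense/defense idea concrete, here is a small illustrative sketch. The parameter values are invented, and the parameters are treated as fixed numbers rather than posterior draws; the article's actual fitting procedure is Bayesian, with details in the supplement at www.amstat.org/chance.

```python
import numpy as np

# Additive offense/defense means for shots on goal, with a Poisson
# prediction interval in the spirit of the dashed lines in Figure 6.
# All parameter values below are hypothetical.
rng = np.random.default_rng(1)

offense = {"Germany": 5.5, "Brazil": 5.0, "USA": 4.5, "Norway": 3.5}
defense = {"Germany": 1.5, "Brazil": 2.0, "USA": 2.5, "Norway": 3.0}

def sog_mean(attacker, defender):
    # Mean shots on goal = attacker's offensive contribution plus the
    # shots the defender tends to allow.
    return offense[attacker] + defense[defender]

mu = sog_mean("Germany", "Norway")
draws = rng.poisson(mu, size=1_600)
print(mu, np.quantile(draws, [0.05, 0.95]))    # fitted mean and a 90% band
```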
Figure 6. Observed shots on goal totals for the 64 team-by-match combinations are plotted against fitted values based on a Poisson model with different additive mean contributions depending on the offensive and defensive teams involved. Each of the 32 matches in the 2007 World Cup was simulated 1,600 times, with shots on goal totals recorded for each team in each match. The fitted values are the averages of the means used for these simulations. The dashed lines mark the fifth and 95th percentiles of the simulated shots on goal totals for each combination of offensive and defensive teams.

Figure 7. Fitted goal conversion probabilities. Observed goals per shot on goal values for the 64 team-by-match combinations are plotted against fitted values based on a probit model with different mean contributions depending on the offensive and defensive teams involved. The fitted values are the averages of the probabilities used for the 1,600 simulations. The dashed lines mark the fifth and 95th percentiles of the simulated goal proportions for each combination of offensive and defensive teams.

A Two-Level Model

If we generate shots on goal from a Poisson distribution with mean μ, and then generate goals scored as binomial with probability π, this would be equivalent to generating two independent Poisson variables with means πμ and (1 − π)μ to represent goals and saves, respectively. Ultimately, we care about the propensity for teams to score goals, not to take shots on goal. The independence result suggests that, assuming a Poisson model and treating each combination separately, we would learn nothing more by considering shots on goal along with goal totals. With level-two probability models relating the parameters of the various teams, however, the marginal distributions of the goal totals are no longer Poisson, and shots on goal do provide additional information. The data inform us about the distributions of shots on goal and of scoring percentages in WWC matches, and this combines to yield more information about what to expect in a particular match than we would have from the goal totals alone.
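The splitting (thinning) property invoked at the start of this section is easy to verify numerically. Here is a quick Monte Carlo check with arbitrary values of the mean and conversion probability; it is a sanity check, not part of the article's analysis.

```python
import numpy as np

# If N ~ Poisson(mu) and goals | N ~ Binomial(N, pi), then goals and saves
# are independent Poisson variables with means pi*mu and (1 - pi)*mu.
rng = np.random.default_rng(0)
mu, pi = 6.9, 0.2                        # arbitrary illustrative values

n = rng.poisson(mu, size=200_000)
goals = rng.binomial(n, pi)
saves = n - goals

print(goals.mean(), pi * mu)             # both close to 1.38
print(saves.mean(), (1 - pi) * mu)       # both close to 5.52
print(np.corrcoef(goals, saves)[0, 1])   # close to 0, consistent with independence
```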
The parameter estimates for each team average the team's own statistics with an estimate based on the ensemble, in a way similar to the James-Stein estimate. The team estimates are then combined to represent specific offensive and defensive team match-ups. Figure 8 compares actual goal totals to the fitted mean values and gives 95% confidence and prediction intervals for the 64 match-ups. Overall, the 16 teams scored an average of 1.7 goals in the 32 matches. Note that a 90% prediction interval for a Poisson(1.7) distribution (the constant-mean model for goals scored) ranges from zero to three goals.

Figure 8. Observed goal totals for the 64 team-by-match combinations are plotted against fitted values based on a hierarchical Poisson model for shots on goal combined with a hierarchical probit model for goal-conversion probabilities. The fitted values are the averages of the mean goal totals for 1,600 simulations. The solid lines mark 95% confidence intervals for the respective mean values, and the dashed lines mark the fifth and 95th percentiles of simulated goal totals. The horizontal line represents the overall average goal total (1.7 goals per match).

Figure 9. Fitted goal distributions. Assuming Brazil takes eight shots on goal, these are the estimated probabilities for 0, 1, ..., 8 goals being scored against the United States with either Solo or Scurry in goal. The means of these distributions are 8(0.28) = 2.24 for Solo and 8(0.36) = 2.88 for Scurry. The narrow bars mark probabilities for a binomial distribution with eight trials and these means.

Figure 10. The graphs plot simulated Brazil and U.S. goals. The areas of the circles are proportional to the numbers of replicates for each combination of goal totals. Points along the diagonal line represent matches that are tied after 90 minutes. Points above the line are Brazil wins; points below the line are U.S. wins. The top graph assumes Solo is the U.S. goalie; the bottom graph assumes the U.S. goalie is Scurry.
USA vs. Brazil

In the U.S. match against Brazil, Scurry faced eight shots on goal and saved five (there was also an own goal scored by the United States to give Brazil its fourth goal). Figure 9 shows the estimated distribution of goals allowed (0, 1, ..., 8) by Scurry and Solo, each facing eight shots on goal by Brazil. To model the keepers separately, I allowed the two U.S. matches in which Scurry played to have defensive parameter values that differed from those in the four matches with Solo. The overall estimated probability of Brazil scoring on a particular shot on goal is 0.28 for Solo and 0.36 for Scurry. The narrow gray bars in Figure 9 show the corresponding binomial distributions with these probabilities. My fitted distributions show greater variability because they reflect the uncertainty in the underlying scoring probabilities, as well as binomial variation. The estimated probability that Scurry would allow three or more goals is 0.55, compared to 0.40 for Solo. The estimated probability that Solo would have produced a shutout and forced overtime is 0.15.

The simulated parameter values also allow one to play out hypothetical matches between the United States and Brazil with either Solo or Scurry in goal. The results for 1,600 simulated matches with each keeper are displayed in Figure 10. As in the figure, points above the diagonal line correspond to wins for Brazil, dots on the line are tied after regulation, and points below the line correspond to matches won by the United States. Solo allowed about 2.0 goals on average, compared to 2.6 for Scurry, and the United States won after 90 minutes in 30% of matches with Solo and in 23% of matches with Scurry. After playing out extra time for tied matches (with Poisson rates one-third of those for a 90-minute match) and assuming a 0.5 probability of winning a shootout, the United States won about 38% of simulated matches with Solo, compared to 29% with Scurry. To win the 2007 World Cup, the United States would also have had to beat Germany. They did this in only 17% of simulated matches with Solo and 15% of matches with Scurry. All these estimates have large associated uncertainties due to the small sample sizes.

At www.amstat.org/chance, I consider the posterior distributions of the parameters that describe Solo's and Scurry's defensive abilities. These fitted distributions reflect the fact that Solo outperformed Scurry in one realization of the 2007 WWC. Solo's estimated mean goals allowed against Brazil was lower than Scurry's in 65% of the posterior draws. But, that means there is a posterior probability of about 0.35 (with a simulation margin of error, for 95% confidence, of about 1/√1600 = 0.025) that Scurry was the better goalkeeper. Solo's success against Brazil in the 2008 Olympic gold medal match (zero goals allowed in 120 minutes of play) adds to the impression that she was the better goalkeeper in 2007. It would be interesting to repeat these analyses with an expanded data set that included matches from the 2004 Athens Olympic Games, when Scurry helped the United States to a gold medal, and the 2008 Olympic Games with Solo.
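A rough sketch of how such hypothetical matches can be played out, including extra time at one-third the 90-minute rates and a coin-flip shootout as described above. The scoring rates below are illustrative stand-ins, not the article's posterior draws, so the win percentages will only loosely resemble those reported.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_matches(rate_us, rate_brazil, n=1_600):
    # Play n hypothetical U.S.-Brazil matches and return the U.S. win rate.
    wins = 0
    for _ in range(n):
        us, brazil = rng.poisson(rate_us), rng.poisson(rate_brazil)
        if us == brazil:                       # extra time at one-third the rates
            us += rng.poisson(rate_us / 3)
            brazil += rng.poisson(rate_brazil / 3)
        if us > brazil or (us == brazil and rng.random() < 0.5):
            wins += 1                          # ties decided by a fair-coin shootout
    return wins / n

print(simulate_matches(1.5, 2.0))   # a Solo-like defense (hypothetical rates)
print(simulate_matches(1.5, 2.6))   # a Scurry-like defense (hypothetical rates)
```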
[email protected]) if you are interested in submitting a column.
Here’s to Your Health
Mark Glickman,
Column Editor
Misreporting, Missing Data, and Multiple Imputation: Improving Accuracy of Cancer Registry Databases
Yulei He, Recai Yucel, and Alan M. Zaslavsky
Cancer registries collect information on type of cancer; histological characteristics; stage at diagnosis; patient demographics; initial course of treatment, including surgery, radiotherapy, and chemotherapy; and patient survival. Such information can be valuable for studying the patterns of cancer epidemiology, diagnosis, treatment, and outcome. However, misreporting of registry information is unavoidable; therefore, studies based solely on registry data would lead to invalid results. Past literature has documented the inaccuracy of registry records on adjuvant, or supplemental, chemotherapy and radiotherapy. The Quality of Cancer Care (QOCC) project used data from the California Cancer Registry—the largest geographically contiguous, population-based cancer registry in the world—to study the patterns of receiving and reporting adjuvant therapies for stage II/III colorectal cancer patients.
The study surveyed the treating physicians for a subsample of the patients in the registry to obtain more accurate reports of whether they received adjuvant therapies. This study confirmed the inaccuracy of the registry data, in the direction of under-reporting. Table 1 (line 2 vs. line 1), which is based on this study, implies substantial under-reporting of 20% and 13% in chemotherapy and radiotherapy rates, respectively.

Given that the registry is a valuable data source in health services research, how can we improve the quality of inferences using the comprehensive but inaccurate registry database? Consider, for example, that our goal is to obtain accurate estimates of treatment rates from the misreported records in the registry. A simple approach is to use only the validation sample (i.e., the physician survey data collected in the QOCC project). However, for logistical reasons, the survey sample (< 2,000 patients) was much smaller than the registry sample (> 12,000 patients) used in the study; hence, analyzing the validation sample alone would greatly reduce precision, especially for complex estimands such as regression estimates. Another approach, the errors-in-variables method, would analyze the registry data while adjusting for reporting error. This approach typically involves modeling the relationship between the correct values and the misreported ones, represented here by the validation sample and the corresponding registry data.
Table 1 — Adjuvant Therapy Rates, % (SE)

Sample                            Chemotherapy    Radiotherapy
Survey                            73.3 (1.16)     25.4 (1.14)
Registry (in the survey region)   57.9 (0.79)     22.2 (0.67)
Registry (statewide)              51.4 (0.45)     19.6
Imputed Registry (statewide)      61.2 (0.77)     23.1 (0.61)
Figure 1. An illustration of using imputation to correct for under-reporting. X is a matrix of covariate variable values with one row for each person in the registry. Y(R) is the matrix of reported treatment status for various treatments. Y(O) is the true treatment status; Y(O) is observed in the survey. An observed value of 1 is assumed to be true, but a value of 0 might be incorrect.

Figure 2. Imputation scheme for a simplified case: After initial values of P(Y(O) = 1) and P(Y(R) = 0 | Y(O) = 1) are chosen, P(Y(O) = 1 | Y(R) = 0) is computed. The algorithm iterates between imputing missing values of Y(O) on the left and drawing new probabilities on the right. After the iterations converge to the target posterior distribution, a final draw of Y(O) produces a set of imputations. The process is repeated multiple times.
Using information from both sources, the errors-in-variables method should yield valid results with increased precision. However, the statistical sophistication of the error-adjustment procedures might be challenging for analysts who do not possess the statistical expertise to implement such methods.

A more appealing strategy might be multiple imputation. In a typical nonresponse problem, this method first "fills in" (imputes) missing variables several times to create multiple completed data sets. Analysis can then be conducted for each set using complete-data procedures. The results obtained from the separate sets of completed data are combined into a single inference using simple rules. In the problem of misreporting, the essence of applying this strategy is to impute the uncollected correct treatment variables in the remainder of the registry and then to perform analysis on the completed/corrected data. Figure 1 illustrates this strategy. The corrected registry data can then be used by practitioners without additional modeling effort. As with the errors-in-variables approach, the imputation model characterizes the measurement error process and makes the adjustment. The imputer also may incorporate additional information that may not generally be available to analysts, such as information from other administrative databases, into the imputation model to further improve the analyses.
An Imputation Approach

Consider a single therapy variable in the QOCC data (e.g., the adjuvant chemotherapy). Let Y(O) and Y(R) denote the true and reported status of the treatment in the registry sample, respectively. Both are binary variables, with 1 = Yes and 0 = No.
In addition, we assume that only under-reporting takes place in the registry; that is, Y(R) = 0 if Y(O) = 0, and Y(R) could be either 1 or 0 if Y(O) = 1. Such an assumption is almost entirely accurate for the QOCC data. Variable Y(R) is complete for the registry. Variable Y(O) is observed in the validation sample, but missing for the remainder of the registry. We first consider a simple case with no predictors. By Bayes' theorem, we can compute the probability that Y(O) is 1 given that Y(R) is 0:
P(Y(O) = 1 | Y(R) = 0) = P(Y(R) = 0 | Y(O) = 1) P(Y(O) = 1) / P(Y(R) = 0).   (1)
Because of the incomplete Y(O), P(Y(O) = 1) (the true treatment rate) and P(Y(R) = 0 | Y(O) = 1) (the rate of under-reporting) are unknown and need to be estimated. If the values of Y(O) had been completely observed in the registry and a uniform prior distribution for the two probabilities is assumed, these rates have Beta distributions whose parameters are determined by the corresponding numbers of 1s and 0s in the sample. The Beta distribution for estimating P(Y(O) = 1) is Beta(#(Y(O) = 1) + 1, #(Y(O) = 0) + 1). The Beta distribution for estimating P(Y(R) = 0 | Y(O) = 1) is Beta(#(Y(R) = 0, Y(O) = 1) + 1, #(Y(R) = 1, Y(O) = 1) + 1). One can easily generate values for these probabilities from their distributions, given complete data. Given values for the probabilities, one can compute P(Y(O) = 1 | Y(R) = 0) and draw a value of Y(O) from a Bernoulli random variable with this probability.

The imputation procedure first initializes the rates with reasonable prior guesses and then iterates between the following two steps until the rate estimates achieve convergence. Figure 2 illustrates this procedure.

Step 1. Impute the missing values of Y(O) as Bernoulli draws with probability of success given by (1).

Step 2. Based on the completed data, update P(Y(O) = 1) and P(Y(R) = 0 | Y(O) = 1), and then recalculate P(Y(O) = 1 | Y(R) = 0).

After convergence of the algorithm, one more draw of all missing Y(O) values is taken, yielding one full set of imputed values. Multiple imputations of Y(O) are created by repeating the procedure several times with different starting values for the initial (unknown) probabilities.
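A minimal sketch of this two-step scheme, with simulated data standing in for the registry; the sample sizes and rates below only loosely mimic the QOCC setting and are not the real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "registry": y_r is the (complete) reported status; y_o is the
# true status, observed only in the validation subsample.
n, n_valid = 12_000, 1_500
true_rate, underreport_rate = 0.6, 0.2
y_o_true = rng.binomial(1, true_rate, n)
y_r = np.where(y_o_true == 1, rng.binomial(1, 1 - underreport_rate, n), 0)
y_o = y_o_true.astype(float)
y_o[n_valid:] = np.nan                      # true status unknown outside the validation sample

def impute_once(y_r, y_o, n_iter=200):
    y = y_o.copy()
    missing = np.isnan(y)
    p_treat, p_under = 0.5, 0.5             # initial guesses for the two rates
    for _ in range(n_iter):
        # Step 1: impute missing true statuses as Bernoulli draws, using
        # equation (1); note P(Y(R)=0) = p_under*p_treat + (1 - p_treat)
        # under the under-reporting-only assumption.
        p_cond = p_under * p_treat / (p_under * p_treat + (1 - p_treat))
        idx = missing & (y_r == 0)
        y[idx] = rng.binomial(1, p_cond, idx.sum())
        y[missing & (y_r == 1)] = 1         # a report of 1 is taken as true
        # Step 2: update the two rates from the completed data (uniform priors).
        p_treat = rng.beta((y == 1).sum() + 1, (y == 0).sum() + 1)
        p_under = rng.beta(((y == 1) & (y_r == 0)).sum() + 1,
                           ((y == 1) & (y_r == 1)).sum() + 1)
    return y

imputations = [impute_once(y_r, y_o) for _ in range(5)]
print([round(imp.mean(), 3) for imp in imputations])  # imputed true treatment rates
```

Each completed data set would then be analyzed with complete-data methods, and the results combined across imputations in the usual way.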
We now describe the imputation model of Y(O) for the real data. Clinical and/or demographic information recorded in the cancer registry, such as age, gender, and comorbidity, are good candidates as predictors of both the receipt and reporting of adjuvant therapies. For example, younger patients are more likely to receive chemotherapy than older patients. Furthermore, both treatment and reporting rates might vary across providers (hospitals). Incorporating patient-level covariates and clustering effects across hospitals in the imputation model helps improve the model predictions.

Let X denote the covariate information and assume it is fully observed and recorded without error in the registry. The joint distribution of Y(O) and Y(R) given X can be decomposed into two parts, the outcome model P(Y(O) | X, θ(O)) and the reporting model P(Y(R) | Y(O), X, θ(R)); i.e.,

P(Y(O), Y(R) | X, θ) = P(Y(O) | X, θ(O)) P(Y(R) | Y(O), X, θ(R)).

The former corresponds to clinical processes relating receipt of the treatment to patient and hospital characteristics, and the latter characterizes the ways in which misreporting occurs in the registry. They can be viewed as generalizations of the true treatment and under-reporting rates from (1). For each part, we can apply a logistic or probit regression model with hospital random effects, using θ(O) and θ(R) to denote the corresponding model parameters.

We assume the missingness of Y(O) in the QOCC data is at random (i.e., dependent only on the observed covariate variables denoted by X). Because the validation sample included all patients from certain regions within a defined period, this can be considered planned missingness, or missingness due to study design. Consequently, by capturing these factors in the imputation model, missing at random is a plausible assumption.

Similar to the simple case without covariates, application of Bayes' theorem underlies the calculation of the probability used in imputation in the presence of covariates X:

P(Y(O) = 1 | Y(R) = 0, X, θ) = P(Y(R) = 0 | Y(O) = 1, X, θ(R)) P(Y(O) = 1 | X, θ(O)) / P(Y(R) = 0 | X, θ(R)).   (2)

As shown in Figure 3, the imputation algorithm iterates between estimating the outcome and reporting model probabilities and imputing Y(O) until the estimates for θ(O) and θ(R) achieve convergence. That is, the sampling continues until it is judged that the algorithm is sampling from the target posterior distributions. Then, one final draw of all missing values of Y(O) produces a completed data set. The process is repeated multiple times with different starting probability values in order to produce multiple imputed data sets.

Figure 3. Imputation scheme for the full data: After initial values of the parameters in the models P(Y(O) = 1 | X, θ(O)) and P(Y(R) = 0 | X, Y(O) = 1, θ(R)) are chosen, the probabilities P(Y(O) = 1 | X, Y(R) = 0, θ) are computed. The algorithm iterates between imputing missing values of Y(O) on the left and drawing parameters of the probability models on the right. After the iterations converge to the target posterior distribution, a final draw of Y(O) produces a set of imputations. The process is repeated multiple times.
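As an illustration of how the probability in (2) might be evaluated inside such an algorithm when probit links are used, here is a hedged sketch. The covariate vector and coefficient vectors are placeholders standing in for one draw of θ(O) and θ(R), and the hospital random effects are omitted.

```python
import numpy as np
from scipy.stats import norm

def imputation_prob(x, beta_outcome, beta_report):
    # Probit outcome and reporting models evaluated at one parameter draw.
    p_treat = norm.cdf(x @ beta_outcome)    # P(Y(O) = 1 | X)
    p_under = norm.cdf(x @ beta_report)     # P(Y(R) = 0 | Y(O) = 1, X)
    # Under the under-reporting-only assumption, Y(R) = 0 whenever Y(O) = 0,
    # so P(Y(R) = 0 | X) = p_under * p_treat + (1 - p_treat).
    return p_under * p_treat / (p_under * p_treat + (1 - p_treat))

x = np.array([1.0, 6.5, 1.0])               # e.g., intercept, age/10, a stage indicator (made up)
print(imputation_prob(x, np.array([0.3, -0.2, 0.5]),
                      np.array([-1.0, 0.1, 0.0])))
```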
Application

Study Sample

From the 10 regional cancer registries in California, the QOCC project selected all (n = 12,594) patients aged 18 or older who were newly diagnosed with stage III colon cancer or stage II or III rectal cancer and underwent surgery during the years 1994 to 1997. This registry sample included patients from 433 hospitals. From records of those patients diagnosed and treated in 1996 and 1997 in registry regions 1, 3, and 8—representing the San Francisco/Oakland, San Jose, and Sacramento areas in Northern California, respectively—the patients' treating physicians were identified and mailed a written survey asking whether their patients received adjuvant chemotherapy or radiotherapy, based on their medical records. The survey cohort included 1,956 patients. Physician responses, or direct abstracts from medical records by registry staff, were obtained for 1,450 (74%) of these patients, treated at 98 hospitals.

In the combined data set, patient-level covariates include age, gender, cancer stage at diagnosis, race, marital status, hospital transfer (whether the patient was transferred between diagnosis and treatment), comorbidity scores, and the median income of the patient's census block group. Hospital characteristics include hospital volume, presence of a tumor registry accredited by the American College of Surgeons (ACOS) Commission on Cancer, teaching status, and location (urban versus non-urban).

Imputation Model Fitting

We imputed the uncollected true adjuvant chemotherapy and radiotherapy status separately for each patient in the registry sample, creating 30 imputed data sets. The adequacy of the imputation models was checked by comparing observed data to the predictions under the model for the same cases. Indeed, the models generated predictions that were fairly similar to the observed data.

Several variables were predictive of receiving treatments. Chemotherapy was received more often by younger, married, or stage III rectal cancer patients and less often by those with lower income, stage II rectal cancer, or more comorbidities. Patients transferred before surgery, typically to hospitals with more specialized facilities for cancer care, were also more likely to receive the treatment. Patients at ACOS hospitals were more likely to receive chemotherapy, whereas those at teaching hospitals appeared to receive it less often. Patients in the region in which the survey was conducted were also more likely to receive the treatment, as were patients who were treated in 1996–1997 (as opposed to 1994–1995). The effects of patients' age, marital status, comorbidity, hospital transfer, ACOS hospital status, and year of treatment on the receipt of radiotherapy were similar to those on the receipt of chemotherapy. Male patients were more likely to receive radiotherapy. Patients with stage II or III rectal cancer were much more likely to receive radiotherapy than those with stage III colon cancer.

Factors predicting completeness of reporting are also of interest, to inform efforts to validate or improve the quality
of registry data. Chemotherapy for older or married patients was more often under-reported. High-volume hospitals reported chemotherapy more completely than others, as did urban hospitals. This might reflect greater investments in data management in these institutions. Radiotherapy also was more completely reported in the high-volume hospitals and for patients with stage II or III rectal cancer. There were fewer significant predictors for the reporting of radiotherapy than of chemotherapy, suggesting a better and more consistent pattern of reporting for the former. The random-effect estimates showed there was moderate variation among hospitals in the provision and reporting of chemotherapy, less for radiotherapy, and little correlation between the random effects of the treatment-receipt and reporting processes.

Registry Data Analyses

Table 1 lists the estimated rates of adjuvant therapies using the survey, the uncorrected registry, and the multiply imputed registry data. Rates calculated from the imputed data are significantly larger than those from the registry alone (line 4 vs. line 3), reflecting the previous findings of under-reporting. They are also substantially lower than those from the survey alone (line 4 vs. line 1) because the rates of therapies were estimated to be lower in the part of the state outside the survey area.

Adjuvant therapy variables can be important predictors of clinical outcomes. Using the multiply imputed data sets, we fitted a logistic regression model for patients' two-year survival with predictors including the receipt of adjuvant chemotherapy and other covariates. Receiving chemotherapy was a strong prognostic factor for survival (odds ratio = 1.26, standard error = 0.09), consistent with results from the clinical trial literature. Although variability is introduced by the different imputations, the multiple imputation analysis still increases precision compared to analyzing the survey data alone, because the survey sample is much smaller than the registry sample. For example, the standard error of the coefficient of receiving chemotherapy in the latter approach is about 70% larger than in the former.
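The "simple rules" used to combine results across the 30 imputed data sets are Rubin's combining rules (see Rubin 1987 in Further Reading). A compact version for a scalar estimate, with invented numbers, looks like this:

```python
import numpy as np

def combine(estimates, variances):
    # Rubin's rules: pool point estimates and standard errors across imputations.
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                 # combined point estimate
    u_bar = variances.mean()                 # average within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return q_bar, np.sqrt(total_var)

est = [0.611, 0.613, 0.609, 0.615, 0.612]    # e.g., imputed chemotherapy rates (made up)
var = [0.006 ** 2] * 5                       # within-imputation variances (made up)
print(combine(est, var))
```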
A Promising Tool

To correct the under-reporting of adjuvant therapies in a cancer registry, we multiply imputed the accurate treatment information using a model based on a validation sample. This illustrated how inferential tools using multiple imputation can be adapted to deal with a single inaccurately reported therapy variable. There are several extensions of substantive and statistical importance that could be pursued. A first extension pertains to incorporating the dependency and correlation between the two therapies in either the receipt or the reporting process. Another extension would be to build the imputation model based on information from multiple sources, such as survey data, medical records, and claims data. This added information could improve predictions and statistical inference.

In addition to the adjuvant therapy variables, patients' birthplace and cancer stage also suffer from misreporting in the registry. Furthermore, misreporting often occurs for important quality indicators or indexes in other administrative systems, such as claims databases. The multiple imputation strategy constitutes a promising tool for tackling this general problem in using public databases for health services research.
Further Reading

Ayanian, J.Z.; Zaslavsky, A.M.; Fuchs, C.S.; Guadagnoli, E.; Creech, C.M.; Cress, R.D.; O'Connor, L.C.; West, D.W.; Allen, M.E.; Wolf, R.E.; and Wright, W.E. (2003) "Use of Adjuvant Chemotherapy and Radiation Therapy for Colorectal Cancer in a Population-Based Cohort." Journal of Clinical Oncology, 21:1293–1300.

Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; and Crainiceanu, C.M. (2006) Measurement Error in Nonlinear Models: A Modern Perspective (2nd ed.). CRC Press: New York, NY.

Hewitt, M. and Simone, J.V. (1999) Ensuring Quality Cancer Care. National Academy Press: Washington, DC.

Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data (2nd ed.). Wiley: New York, NY.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley: New York, NY.

Yucel, R.M. and Zaslavsky, A.M. (2005) "Imputation of Binary Treatment Variables With Measurement Error in Administrative Data." Journal of the American Statistical Association, 100:1123–1132.

Zheng, H.; Yucel, R.M.; Ayanian, J.Z.; and Zaslavsky, A.M. (2006) "Profiling Providers on Use of Adjuvant Chemotherapy by Combining Cancer Registry and Medical Record Data." Medical Care, 44(1):7–10.

"Here's to Your Health" prints columns about medical and health-related topics. Please contact Mark Glickman (
[email protected]) if you are interested in submitting an article.
Visual Revelations
Howard Wainer,
Column Editor
Giving the Finger to Dating Services
Grace Lee, Paul Velleman, and Howard Wainer
Statistics, the science of uncertainty, provides us with tools to help understand and thence navigate a complex world. Sometimes, when we hear perplexing ideas, statistics and a bit of empirical energy can help reduce confusion. This essay is about two such puzzles. The first, dealing with data usage by a computer dating service, was not completely unraveled, but at least one plausible hypothesis was dismissed as unworkable and thus increased the likelihood of a competing one. The second puzzle, which dealt with an apparent quirk in assortative mating among college students, was cleared up completely. The solution to both puzzles involved gathering a modest amount of data and making a scatterplot.
Why a Finger? A report on computer dating services in the local media discussed the odd bits and pieces of information requested, presumably, to aid in finding you a better match. However, the reporter was flummoxed by the request for the length of your index finger. What possible value could such information have? Surely a puzzle. Two plausible explanations suggested themselves to us. The first was as a proxy for height. Tall women often complain men they are matched with have lied about their height. One explained that many men who reported they were six feet tall or more shrank as CHANCE
59
Height (cm.)
121 students at the University of Pennsylvania
Index Finger Length (cm.)
Figure 1. The bivariate relationship between height and the length of index finger among a convenience sample of 121 students at the University of Pennsylvania in 2008
Height (cm.)
Heights and Finger Lengths for 68 Male and 53 Female Students at the the University of Pennsylvania
Index Finger Length (cm.)
Figure 2. The bivariate relationship between height and the length of index finger among a convenience sample of 68 male students and 53 female students at the University of Pennsylvania in 2008
60
VOL. 21, NO. 3, 2008
much as six inches as soon as they slid off their bar stool at the first face-to-face meeting. (Why someone would lie about something so easily discovered is yet another mystery, but one we will not examine in this account.) But, if finger length is an accurate predictor of height, the dating service could use it as a check for the accuracy of reported height and act accordingly. A second possibility is that it has no bearing on anything at all and is merely a McGuffin—meant to distract attention from other, more meaningful questions.

Testing the first hypothesis seemed easy. We gathered heights and index finger lengths from 121 students at the University of Pennsylvania and found (see Figure 1) there is a modest correlation (0.57) between finger length and height. The mean height is 172 cm, with a standard deviation of 9.7 cm. But, if we know the length of a person's index finger, we can predict their height with greater accuracy; the standard deviation will only be 8.0 cm (3.15 inches). This means 98% of people will be within six inches of the prediction (plus or minus). This hardly seems likely to aid in correcting fraudulent heights.

Figure 1. The bivariate relationship between height and the length of index finger among a convenience sample of 121 students at the University of Pennsylvania in 2008

But wait, the sample of data contained both men and women. Perhaps the variation in height (and finger length) associated with sex is adversely affecting our results. In Figure 2, we divided the sample by sex and found a large portion of the bivariate relationship between height and finger length was due to the differences in the heights between men and women. So now, if we focus on just the men (the ones most likely to exaggerate their height on dating service questionnaires), we find the unconditional standard deviation among men is 6.3 cm. Using finger length as a covariate, with a correlation of 0.34, the standard deviation of the residual is 5.9 cm (2.3 inches). Thus, knowing the length of a man's index finger allows us to predict his height only to within about ±4.6 inches. Though better than the unconditional standard deviation, it is still not likely to reduce the surprise when he slips down from that bar stool. We must conclude that the inclusion of finger length on the dating service questionnaire is not likely to be of any help in predicting height (and, by extension, the length of any more distant body parts). This leaves us with only the second hypothesis—that it was included to distract our attention from the other questions.

Figure 2. The bivariate relationship between height and the length of index finger among a convenience sample of 68 male students and 53 female students at the University of Pennsylvania in 2008
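The residual standard deviations quoted above are what the usual simple-regression identity gives, residual SD = SD × sqrt(1 − r²); a two-line check:

```python
import math

def residual_sd(sd, r):
    # Standard deviation of the regression residuals from the marginal SD
    # and the correlation with the predictor.
    return sd * math.sqrt(1 - r ** 2)

print(residual_sd(9.7, 0.57))   # about 8.0 cm, full sample
print(residual_sd(6.3, 0.34))   # about 5.9 cm, men only
```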
A lesson implicit in this analysis is how important it is to look more deeply at the data, beyond their marginal distributions. In this case, when we looked at the conditional distributions—conditional on sex—we found the modest relationship between finger length and height diminished still further. Note that the conditional standard deviation was reduced, but this was primarily due to knowing the sex of the person, not the length of their index finger. Sometimes the effect of conditioning on another variable is more profound, as in our second example.

What's the Opposite of Assortative Mating?

We asked 147 Cornell University students to report their heights and the ideal height for their ideal spouse/partner. We found the estimated regression line was the following:

Ideal partner's height = 89.0 − 0.30 × (my height)

That is, the taller the respondent was, the shorter they felt their ideal partner ought to be; the two variables were negatively correlated (r = −0.3). This result flew in the face of how we thought assortative mating ought to work. Anthropologists tell us humans tend to mate with others similar to themselves. Hence, we expected a positive correlation—tall people wanting a tall mate, short people wanting a short mate. Height might not turn out to be a key factor in mating, but we were surprised to find the relationship was negative. What is strange about Cornell undergraduates?

When we plotted the data (see Figure 3), the mystery was resolved. The correlation was strongly positive within each sex (r = 0.6 among both women and men). It was the difference between the means of the two distributions that gave rise to the apparent negative relationship. Once again, Simpson's Paradox shows itself, only being unmasked through a plot that identifies the sex of the respondents. Thus, conditioning on sex helped us understand what was going on in this situation even more dramatically than it did in the first example. Note the slopes of the two sex-specific regression lines are the same, boding well for the success of assortative mating on this characteristic.

Figure 3. A display of the height and sex of 147 Cornell students, along with their estimates of the ideal height of their future spouse/partner. Shown are the separate regression lines for each sex and the overall regression. The changing sign of the slope indicates the existence of Simpson's Paradox.

A Note on Data Gathering

The first data set was gathered by a University of Pennsylvania undergraduate student in Statistics 112, Spring 2008. As part of a homework assignment, each student gathered 10 other students' heights, finger lengths, and sex. Fingers were measured directly, but heights were often self-reported by the subjects. The second data set was gathered from 147 Cornell undergraduates who were students in an introductory statistics course. The data were gathered using SurveyMonkey. Response was voluntary, but yielded an 85% response rate. Both each individual's height and the preferred height for their partner were self-reported.

Inference

These were but two small examples of how naked empiricism and clear graphics combined with rudimentary statistics allow us to examine some of the claims (both implicit and explicit) that bombard us daily and help us navigate in this uncertain world. They also illustrate how difficult it is to draw inferences from intact groups; for, in both cases, had we not partitioned the data by sex, we would have reached incorrect conclusions. In the first case, we would have believed the relationship between finger length and height was stronger than it is. In the second, we would have gotten the direction of the relationship wrong. Without random assignment, the danger of a critically important missing third variable is always lurking in the background.

Column Editor: Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; hwainer@nbme.org
Goodness of Wit Test
Jonathan Berkowitz, Column Editor
“If nothing ever changed, there’d be no butterflies.” — Author Unknown
How do you make a change in CHANCE? Simple, replace the fifth letter. Sorry about that. This issue does, indeed, mark a change in CHANCE. As you read in the previous issue, Thomas Jabine stepped down after a dozen years as puzzle editor. He produced nearly 50 excellent statistics-themed puzzles—cryptics, anagrams, spirals, and others. On behalf of all of us, I offer thanks to him for the hours of pleasure his creations have provided. I am delighted to have the opportunity to try to fill his shoes, or, more accurately, his grids.

I actually have an early connection to CHANCE. I created a statistical acrostic that was published in 1989 (Vol. 2, No. 2). You may have it if you were a charter subscriber and haven't sold your collection on eBay. Over the years, I have constructed puzzles for various audiences. I am also an active member of the National Puzzlers' League, which I shall tell you more about in a future issue. I look forward to entertaining and challenging CHANCE readers.

Woodrow Wilson once said, "If you want to make enemies, try to change something." My first change is to the title of this column; I hope it made you smile. Although the puzzle for this issue is a standard cryptic, I plan to introduce other puzzle types—including variety cryptics—over the coming issues. To encourage more reader participation, I will be introducing other innovations, as well, beginning with a clue-writing challenge.

Past Winners

Tim Green graduated from The Johns Hopkins University and earned his PhD in statistics at Stanford University. In 1989, he joined the Centers for Disease Control and Prevention in Atlanta, Georgia, where he is chief of the Quantitative Sciences and Data Management Branch in the Division of HIV/AIDS Prevention. He rates cryptic puzzles second only to Hanjie (paint-by-number)—a pleasant way to waste time without the use of computers.

Kelly Marie Sullivan is a high-school teacher at Schutz American School in Alexandria, Egypt. She earned her BA in mathematics at California State University, Fresno, and her MS at the University of Nevada, Las Vegas. She also has taught at high schools in Spain, Brazil, and California. Some of Sullivan's interests outside of mathematics and working with young people include family, travel, books, films, games, and puzzles.
My first puzzle is a standard cryptic. If you have solved cryptics before, you know each clue consists of two parts: a definition and a wordplay section. The challenge is in identifying which part of the clue is which. Wordplay involves anagrams, hidden words, charades, homophones, double definitions, and other devices. If you are a new solver, consult "How to Solve Cryptic Crosswords" on Page 64. For tips on solving cryptic crosswords, write to me or send a stamped return envelope to Cryptic Solving Guide, GAMES, P.O. Box 184, Fort Washington, PA 19034. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by the column editor by November 1, 2008. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Column Editor, 4160 Staulo Crescent, Vancouver BC Canada V6N 3S2, or send him a list of the answers by email at
[email protected]. Please note that winners to the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.
Write Your Own Cryptic Clue Contest

One way to better understand how cryptic clues work is to try to write one yourself. You are invited to create a cryptic clue for a statistical term of your choice and send it to me. As encouragement, I will print some of the entries, and, where possible, try to incorporate (with acknowledgment) the best entries into future puzzles.
Goodness of Wit Puzzle #1
ACROSS
1. Lines up to hear hints (6)
4. Lacking scent old roses misplaced (8)
10. Arm author and famous uncle at 99 (9)
11. More hazardous without Rhode Island slope expert? (5)
12. Measure first instances of technical instant message error (4)
13. Ms. Anderson developed lack of purpose (10)
15. Recent cold in New York period of dormancy (7)
16. Backbone has large piecewise function (6)
19. Beethoven's piece's heartless sexy writing (6)
21. Enlarge complex rank (7)
23. Potential natural in game code breaker (10, 2 words)
25. Makes correspondence returning unwanted mail (4)
27. Welcome from child always in verse (5)
28. Express to the audience what the infatuated conductor had for the percussionist? (9)
29. Program she clued poorly (8)
30. Dolly's crazy insurer from London (6)

DOWN
1. Value on either side of box edges away from rectangle and Scrabble piece (8)
2. Strange East Timor statistic (9)
3. Listen to medical examiner giving award (4)
5. Concludes selenium cut into pieces from the rear (7)
6. Again testing maples arranged in circle (10)
7. Innocuous Internet super achievers? (5)
8. Emphasize short hair (6)
9. Perhaps stayed fixed (6)
14. Mistakenly enter fluid not blocked by software (10)
17. Benin or Mali tyrant giving up what analysts usually assume (9)
18. Groupings from temperature scale with incorrect results (8)
20. On-line service provider gets sore over suspension (7)
21. Unfinished gel bad omen for chromosome set (6)
22. Closed meeting in mountains lacking a trace of snow (6)
24. Male physician pens letter (5)
26. Head question (4)
Past Solution

This puzzle appeared in CHANCE, Vol. 21, No. 1, p. 64. The answers to the clues were:

Clockwise: 1. TERESA [anagram: teaser] 7. RELIABLE [rebus: re + liable] 15. NAP [reversal: Pan] 18. MILLINER [rebus: mil + liner] 26. UNISON [container: uni(s)on] 32. ARP [anagram: par] 35. OSCINE [anagram: cosine] 41. GUENEVERE [anagram: Rev. Eugene] 50. POOL [double definition: collect, type of cue] 54. BEDIM [rebus: be + dim] 59. ORBIT [double definition: trajectory, environment] 64. TONE [rebus: to + N.E.] 68. MODAL [anagram: a mold] 73. LIT [double definition: illuminated, writings] 76. NOMAD [reversal: Damon] 81. EGG [container: b(egg)ar] 84. ANSWER [hidden word: RomANS WERe] 90. DECIMAL [rebus with container: de(C + I'm)al] 97. LEAP [anagram: plea]

Counterclockwise: 100. PAELLA [rebus: pa + Ella] 94. MICE [rebus: M + ice] 90. DREW [double definition: Barrymore, took card] 86. SNAGGED [anagram: gangs Ed] 79. AMONTILLADO [anagram: Laotian mold] 68. MENOTTI [anagram of time with container: Me(not)ti] 61. BROMIDE [anagram: bid more] 54. BLOOPER [anagram: lob rope] 47. EVEN [rebus: event – t] 43. EUGENIC [anagram: cu + genie] 36. SOPRANOS [anagram: so parson] 28. INURE [container: i(NU)re] 23. NILL [hidden word: BostoN I'LL] 19. IMPANEL [anagram: lean imp] 12. BAIL [hidden word: samBA I'Ll] 8. ERASE [hidden word: VERA'S Envelope] 3. RET [homophone: Rhett]
How to Solve Cryptic Crosswords

Clues consist of a definition and a wordplay section. The apparent meaning (or surface sense) of a clue is irrelevant. For the most part, you should ignore punctuation and capitalization; their function is to make the surface sense seem more reasonable and to confuse you! Words in clues are often used in misleading ways. Verbs can be disguised as nouns, or common words can be used in extremely obscure senses. Words in clues are also often replaced by well-known abbreviations in answers. Some words in the clues are indicators of the type of wordplay involved. Here are examples of all clue types:

Anagrams – answer is a transposal of another word or phrase. Example: Examiners confused streets. Answer: testers (anagram of "streets"). Indicators: mixed up, confused, moving, upset, bananas, in flux, otherwise

Reversals – read a word or phrase backward. Example: Clever trams run in reverse. Answer: smart (trams backward). Indicators: back, returns, in retrospect, leftward/west for across words and rising/northward for down words

Hidden Words – answer word is concealed within a longer word or phrase, without rearrangement. Example: Dame Diane reveals average. Answer: median (hidden in "da-me dian-e"). Indicators: revealing, displays, masks, carrying, in part, essentially, at heart

Charades – answer is built up from parts given by the cluing. Example: Monarch, after victory, giving a sly sign. Answer: winking (win + king). Indicators: before, after, with; indicators may be minimal or absent

Containers – a word or fragment is inserted into another word to give the answer. Example: Huge loss in fuel. Answer: colossal ("loss" inside "coal"). Indicators: swallows, surrounding, about (for containing words); within, inhabits, wearing, invades (for contained word)

Homophones – answer has the same pronunciation as another word or phrase. Example: Counted frozen chicken out loud. Answer: numbered ("numb bird"). Indicators: echoed, noisily, in recital, on tape, anything to do with speaking or listening

Deletions – one or more letters are removed. Example: Fiery bird without a tail. Answer: flaming (drop last letter from "flamingo"). Indicators: dropping, flees, without, loses, after the debut (first letter deleted), unfinished (last letter deleted), without limits (first and last deleted)

Double Definitions – instead of wordplay, two unrelated meanings of answer are clued. Example: Scooter was blue. Answer: moped. Indicators: generally absent; clues can be very short

And Literally So ("&lit") – the entire clue can be read as either definition or wordplay. Example: Dog in wild! Answer: dingo (anagram of "dog in" and literally so). Indicators: clues end in an exclamation point