Biostatistics (2001), 2, 2, pp. 163–171 Printed in Great Britain
Large tables

G. R. LAW∗
Leukaemia Research Fund Centre for Clinical Epidemiology, University of Leeds, Leeds, UK
Email: [email protected]

D. R. COX
Nuffield College, University of Oxford, Oxford, UK

N. E. S. MACONOCHIE
London School of Hygiene and Tropical Medicine, London, UK

J. SIMPSON, E. ROMAN
Leukaemia Research Fund Centre for Clinical Epidemiology, University of Leeds, Leeds, UK

L. M. CARPENTER
Nuffield College, University of Oxford, Oxford, UK

SUMMARY

The traditional exploration of large contingency tables leads to multiple comparisons with the inherent generation of chance associations. To allow for this, a simple empirical Bayesian approach is used here to derive estimates of association ‘shrunk’ towards a global mean. Estimates are displayed on ordered normal plots, to allow visual detection of outliers, with the addition of ‘guide rails’, derived from simulation, to facilitate their detection. The methods, and the interpretation of results, are illustrated using a large table of occupations for cancer registrations in England and Wales for 1971–90.

Keywords: Empirical Bayesian; Proportional registration ratio; Ordered normal plot; Occupational cancer.
∗ To whom correspondence should be addressed.
© Oxford University Press (2001)

1. INTRODUCTION

We consider a method for the exploratory analysis of large contingency tables, especially those which arise in biomedical datasets involving large numbers of events or exposure categories. Among the many possible applications, specific examples might include studies of geographical distributions of disease, or of genetic polymorphisms and health outcome.

As an illustration we consider a large routinely collected set of cancer registrations in England and Wales (Law et al., 1999; Simpson et al., 1999). The data consist of cancer diagnosed at one of 39 possible body sites with occupation at registration coded to one of 212 jobs. The primary data thus form a 39 × 212 contingency table with 8268 cells. Even with approaching $10^6$ registrations some of the cells have quite small frequencies, with a mean of 108 registrations per cell.

There are a number of features of such data that dictate the style of analysis appropriate. First, for about one-third of the registrations occupational information is missing, and while the pattern of missingness may be largely haphazard, the possibility of bias cannot be ruled out. Further, even when an occupation
is recorded it may not be the main lifetime occupation of the individual. Also, only one cancer site is given for each individual. Finally, there are no reliable denominators available, that is, the number of person-years at risk in the different occupations is not known.

We take these limitations to imply that the method of analysis should be simple and easily explained and that the results of any analysis should be regarded as suggestions for further investigation. That is, they generate hypotheses rather than firmly establish conclusions. For that reason the calculation of statistical significance of individual effects is not an objective, although some assessment of the effects of random errors of estimation is, of course, desirable.

In a previous paper (Carpenter et al., 1997) we outlined, with illustrations, an empirical Bayes approach to these issues. The object of this paper is to expand on the detailed interpretation of the method, to discuss aspects of applying the method in more depth and to extend the visual detection of associations.

There is a long history of the use of routine data for the study of work-related disease, in particular reports published in the UK as Decennial Supplements (Drever, 1995). Because of the absence of denominators our method is an adaptation of the proportional registration ratio (PRR). In this, the number of registrations in each cell is compared with the number fitted under simple proportionality. A high observed-to-fitted ratio does not necessarily mean that the cancer rate is high, but rather that the proportion at that site is relatively high compared with that at other sites.

2. EMPIRICAL BAYES METHODS

The notion of applying empirical Bayes methods to contingency tables of occupational mortality has a long history (Laird, 1978); for a recent application to adverse reactions, see DuMouchel (1999).
These authors considered quite small tables and used fairly elaborate methods based on the formally efficient fitting of an assumed parametric model. We use simpler methods which assume less and may be thought more intuitive.

The general idea of the empirical Bayes approach is that there are two levels of statistical variation involved. One is the Poisson variation of the observed frequencies around notional means determined by a ‘true’ rate for each occupation–site combination. The other is the frequency distribution of these underlying true rates corresponding to the 8268 cells.

Let $O_{is}$ denote the number of individuals in occupation $i$ reported with cancer at site $s$. We write $O_{i\bullet}$ and $O_{\bullet s}$ for the occupation and site marginal totals, and $O_{\bullet\bullet}$ for the overall total. The fitted number of cases, $E_{is}$, is derived under an assumption of an overall model of proportionality as

\[ E_{is} = \frac{O_{i\bullet}\, O_{\bullet s}}{O_{\bullet\bullet}}. \tag{2.1} \]
Then the log of the unadjusted PRR is

\[ R_{is} = \log\{(O_{is} + \tfrac{1}{2})/(E_{is} + \tfrac{1}{2})\}. \tag{2.2} \]
Except for the $\tfrac{1}{2}$ terms, which are inserted to avoid problems with the very occasional zero cell, these are exactly the log PRRs as conventionally defined. It is entirely appropriate to derive the fitted number of events using denominators, where these are available, and additionally with some form of adjustment for confounding factors. The estimation error connected with such a value is specified by the sampling variance of $R_{is}$ arising from the Poisson distribution. This is to a close approximation

\[ \nu_{is} = 1/(O_{is} + \tfrac{1}{2}). \tag{2.3} \]
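As a concrete reading of equations (2.1) to (2.3), the following minimal Python sketch computes the fitted counts, the log PRRs and their approximate variances for a whole table at once. The function name `log_prr` and the small table `toy` are hypothetical and invented for illustration; they are not taken from the registration data.

```python
import numpy as np

def log_prr(observed):
    """Return fitted counts E, log PRRs R and approximate variances nu.

    `observed` is a 2-D array of counts O_is
    (rows: occupations i, columns: sites s).
    """
    O = np.asarray(observed, dtype=float)
    row = O.sum(axis=1, keepdims=True)   # occupation totals O_i.
    col = O.sum(axis=0, keepdims=True)   # site totals O_.s
    total = O.sum()                      # overall total O_..
    E = row * col / total                # equation (2.1)
    R = np.log((O + 0.5) / (E + 0.5))    # equation (2.2)
    nu = 1.0 / (O + 0.5)                 # equation (2.3)
    return E, R, nu

# Hypothetical 3 x 2 table of counts, for illustration only.
toy = np.array([[10, 30], [20, 20], [0, 40]])
E, R, nu = log_prr(toy)
```

The zero cell in the last row shows why the half is added: without it the log ratio would be undefined, whereas here it is finite but carries the largest variance in the table.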
Note that values obtained from very small counts have high variance. The general idea is that of smoothing those Ris based on small counts much more heavily than those based on substantial counts. For this we suppose that for each occupation i there are underlying ‘true’ R’s
having a distribution with mean zero and variance $\sigma_i^2$. We shall estimate this by $\tilde\sigma_i^2$, using a method to be explained later.

There are now a number of possibilities. First, it may happen that the estimate $\tilde\sigma_i^2$ is negative, i.e. essentially zero. This implies that for occupation $i$ the variation across sites is consistent with purely random departures from the marginal distribution across all occupations. That is, there is no basis from the data for seeing any of the cancer sites as anomalous as compared with the marginal distribution across all occupations.

Next, the estimate $\tilde\sigma_i^2$ may be positive. We then construct smoothed or shrunk estimates which are a weighted average of zero and the unadjusted estimate $R_{is}$, shrinking the relatively imprecise cell estimates more than the relatively precise ones, with the formula

\[ R^*_{is} = \frac{R_{is}/\nu_{is}}{1/\tilde\sigma_i^2 + 1/\nu_{is}} = R_{is}\,\frac{\tilde\sigma_i^2}{\tilde\sigma_i^2 + \nu_{is}}. \tag{2.4} \]
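The shrinkage step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: $\tilde\sigma_i^2$ is obtained with the moment estimate described in Section 4 (mean square of the $R_{is}$ minus the average of the $\nu_{is}$), the mean square is taken about the assumed mean of zero, and a negative estimate is treated as zero. The function names and the numbers are hypothetical.

```python
import numpy as np

def total_index_squared(R, nu):
    """Moment estimate of sigma_i^2 for one stratum (may be negative).

    Assumption: the mean square is taken about zero, the assumed mean
    of the underlying log PRRs.
    """
    R = np.asarray(R, dtype=float)
    nu = np.asarray(nu, dtype=float)
    return float(np.mean(R ** 2) - np.mean(nu))

def shrink(R, nu, sigma2):
    """Shrunk log PRRs, equation (2.4).

    A non-positive sigma2 is treated as zero, so every cell is shrunk
    fully to zero.
    """
    R = np.asarray(R, dtype=float)
    nu = np.asarray(nu, dtype=float)
    s2 = max(float(sigma2), 0.0)
    return R * s2 / (s2 + nu)

# Hypothetical stratum: two precise cells and two imprecise cells.
R = [0.3, -0.3, 0.1, -0.1]
nu = [0.01, 0.01, 0.04, 0.04]
s2 = total_index_squared(R, nu)
R_star = shrink(R, nu, s2)
```

Each shrunk value keeps the sign of the raw value and is smaller in magnitude, and the cells with larger $\nu_{is}$ are pulled proportionally further towards zero.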
If none of the $R^*_{is}$ is an outlier, i.e. if the frequency distribution of the $R^*_{is}$ is reasonably smooth, we conclude that there is a general departure of the pattern across sites from that obtained in the marginal distribution. We may call the standard deviation $\tilde\sigma_i$ an index of sensitivity of the occupation in question. Its meaning is discussed in more detail below.

The third possibility is that there may be a small number of outliers corresponding to either atypically high or atypically low values of $R^*_{is}$. These correspond to isolated sites which it may be of special interest to study further.

To distinguish between a smooth curve and a plot with anomalous points, and to aid the detection of these associations, a visual approach is very helpful. We plot the ordered $R^*_{is}$ against the expected order statistics for a random sample from a standard normal distribution. If the values plotted were a random sample from a normal distribution a straight line plot is to be expected. The present situation is more complicated, for example in that the points plotted are not of equal precision, but more importantly there is no special interest in normality of distribution as such. The plotting device is largely a tool for ensuring that if the frequency distribution is well behaved a smooth curve will be produced.
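The coordinates of such an ordered normal plot can be sketched as follows. The text does not say how the expected order statistics are computed, so Blom's approximation, $\Phi^{-1}\{(k - 0.375)/(n + 0.25)\}$, is used here as an assumption; the function names and the input values are hypothetical.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected standard normal order statistics.

    Assumption: Blom's approximation stands in for the exact
    expected order statistics.
    """
    nd = NormalDist()
    return [nd.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def ordered_normal_points(r_star):
    """Pair each sorted shrunk estimate with its normal score (x, y)."""
    ordered = sorted(r_star)
    return list(zip(normal_scores(len(ordered)), ordered))

# Hypothetical shrunk estimates for one stratum.
points = ordered_normal_points([0.12, -0.05, 0.40, 0.01, -0.20])
```

Plotting the second coordinate against the first with any plotting tool gives the display described above; a well-behaved stratum traces a smooth curve and an isolated high or low value stands away from it.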
3. INTERPRETATION OF KEY QUANTITIES

Central to our discussion are the characterization of a site by $\tilde\sigma_i$ and by the set of the $R^*_{is}$. As an illustration of the meaning of $\tilde\sigma_i$ suppose that it takes the value 0.1. Then, assuming very approximate normality, about two-thirds of the underlying ratios are estimated to be between −0.1 and 0.1 and only about one in twenty to be larger than 0.2 in absolute value. That is, most of the proportions are within 10% of those specified by the marginal distribution across all occupations.

Similarly, $R^*_{is}$ is in a sense the best estimate on a log scale of how much the number of registrations in site $s$ at occupation $i$ deviates from that to be expected were the proportion the same as in the marginal distribution. That is, $\exp(R^*_{is}) - 1$ gives the proportional deviation. It is formed by downweighting those observed ratios that are based on small frequencies and hence of low precision.
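The arithmetic behind this interpretation can be checked with a short Python sketch, in which the helper name `proportional_deviation` is hypothetical. For $\tilde\sigma_i = 0.1$, one standard deviation on the log scale corresponds to proportional deviations of roughly −9.5% to +10.5%, and two standard deviations to roughly −18% to +22%.

```python
import math

def proportional_deviation(r_star):
    """Convert a shrunk log PRR into a proportional deviation, exp(R*) - 1."""
    return math.exp(r_star) - 1.0

sigma = 0.1  # the illustrative value used in the text
for k in (1, 2):
    low = proportional_deviation(-k * sigma)
    high = proportional_deviation(k * sigma)
    print(f"within {k} sd: {low:+.1%} to {high:+.1%}")
```

The asymmetry of the two limits is a reminder that the shrinkage operates on the log scale, so equal distances above and below zero do not correspond to equal percentage changes.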
4. SOME DETAILS ABOUT ESTIMATION

There are two broad approaches to estimation in the context we are considering. One is to complete the mathematical specification by imposing a parametric assumption on the underlying distribution of the log PRRs, for example that it is of log normal or log gamma form. Formally efficient, although numerically quite complicated, methods of estimation are then possible. An extended version of the model adds an unknown probability that each cell is an outlier from the main distribution and a probability distribution for the magnitude of an outlier.
We have not followed such an approach for a number of reasons. Very specific and somewhat arbitrary assumptions of distributional shape are required. Formidable computation is involved. More importantly, though, we consider that for the type of data with which we are concerned a simpler, more transparent and informal analysis is much more suitable.

An alternative approach estimates $\tilde\sigma_i^2$ from the mean square of $R_{is}$ minus the average of $\nu_{is}$ within each stratum $i$. All cells within the stratum are used for the estimation of $\tilde\sigma_i^2$, and this leads us to refer to this value as a ‘total index of sensitivity’. Following this, the $R^*_{is}$, with their standard errors $\nu_{is}$, can be calculated, and then plotted as explained above.

The definition of an outlier is not clear-cut, or more explicitly can be made objective only by making strong and unrealistic assumptions about the distributional form of the underlying PRRs. In practice, the precise definition is not critical in that we recommend in each case looking at the three or four sites with highest and lowest $R^*_{is}$. Their interpretation will depend somewhat on whether some or all of the values fall naturally on the smooth curve formed by extrapolation from the central part of the distribution.

While there is no sense in which we want to report a site as having a statistically significant outlier, it is helpful to put on the plot ‘guide rails’ indicating limits within which the plotted points would be expected to fall, were the plotted points a sample from a normal distribution. These guide rails have been found by a mixture of simulation and simple theory. There are two reasons why they should not be overinterpreted. One is that they depend on a provisional assumption of normality which we do not wish to make. Secondly, they assume that the plotted points are of equal precision. A value above a guide rail but based on a very small observed frequency should be given less credence than one based on a large count.
While it would be possible to amend the procedure to take account of this we have, for reasons of simplicity, chosen not to do so. The general effect of varying precision in the plotted points is that even if the underlying distribution were normal there would be a sigmoid shape in the plots.

4.1 A simulation approach
Using methods similar to those of Olguín and Fearn (1997), simulations provide a series of limits, which can be plotted and which we refer to as ‘guide rails’. We randomly generate $Y_i$, for $i = 1, \ldots, n$, from a standard normal distribution, rank the values $Y_i$ and plot them against normal order statistics. We estimate the slope, $\hat\sigma$, from the central part of the distribution of $Y_i$, by fitting a straight line using least squares to the central 80% of values and taking $\hat\sigma$ to be the slope. This part of the analysis is best done purely numerically. Note that it is unaffected by a small number of outliers and therefore does not have to be recalculated should some outliers be detected. We refer to this as a ‘general index of sensitivity’. Dividing by the general index of sensitivity gives the number of standard deviations that the most extreme point ($i = n$) lies from zero. This is defined as

\[ w_n = \frac{Y_n}{\hat\sigma}. \tag{4.1} \]
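The simulation for the most extreme point, together with the repetition that the text goes on to describe, can be sketched in Python as below. Several details are assumptions rather than the authors' choices: Blom's approximation is used for the normal order statistics, the central 80% is taken as ranks from floor(0.1n) + 1 to ceil(0.9n), the default number of repetitions is 10 000 rather than 100 000, and only the largest point is handled. The name `simulate_w` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def simulate_w(n, alpha=0.05, reps=10_000, seed=0):
    """Simulated critical value for the largest of n ordered points."""
    rng = np.random.default_rng(seed)
    nd = NormalDist()
    # Assumed plotting positions: Blom's approximation.
    scores = np.array([nd.inv_cdf((k - 0.375) / (n + 0.25))
                       for k in range(1, n + 1)])
    lo, hi = int(np.floor(0.1 * n)), int(np.ceil(0.9 * n))  # central 80%
    x = scores[lo:hi] - scores[lo:hi].mean()
    # Each row is one simulated sample, sorted into order statistics.
    y = np.sort(rng.standard_normal((reps, n)), axis=1)
    yc = y[:, lo:hi]
    # Least squares slope of the central points against the normal scores.
    slope = (yc - yc.mean(axis=1, keepdims=True)) @ x / (x @ x)
    w = y[:, -1] / slope                    # equation (4.1), one per sample
    return float(np.quantile(w, 1.0 - alpha))

w_05 = simulate_w(20, alpha=0.05, reps=2_000, seed=1)
w_01 = simulate_w(20, alpha=0.01, reps=2_000, seed=1)
```

Because of the assumptions listed, the results should be of the same general size as the simulated entries of Table 1 but need not reproduce them exactly.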
Repeat this a suitable number of times, for example 100 000. We then derive $w_{\alpha n}$, the value of $w_i$, for $i = n$, which bounds the highest $100\alpha\%$ of the cumulative simulated data (Table 1). To simulate the next points of the guide-line, repeat the above step but only simulate $(n - 1)$ points and choose point $n - 1$. The simulations continue until the central 80% of the distribution are reached, at which point the simulations terminate.

Table 1. Critical values, $w_{\alpha n}$, for the greatest five estimates, for various n

                             n
Method       α     Point     20     39     40     50     70    100    150    200    212    300
Calculated¹  0.05  n       3.02   3.22   3.22   3.29   3.38   3.48   3.58   3.66   4.07   4.15
Simulation²  0.05  n       3.36   3.47   3.47   3.48   3.55   3.60   3.69   3.71   3.73   3.82
                   n−1     2.97   3.25   3.28   3.36   3.44   3.52   3.64   3.71   3.73   3.80
                   n−2       –    3.10   3.09   3.25   3.35   3.51   3.58   3.66   3.70   3.76
                   n−3       –    2.89   2.87   3.05   3.26   3.41   3.56   3.63   3.64   3.77
                   n−4       –      –      –    2.87   3.18   3.34   3.55   3.61   3.66   3.73
             0.01  n       4.14   4.00   4.00   4.00   4.09   4.13   4.12   4.12   4.13   4.22
                   n−1     3.67   3.81   3.85   3.94   3.96   3.94   4.10   4.12   4.13   4.20
                   n−2       –    3.59   3.67   3.82   3.84   3.94   3.97   4.07   4.11   4.14
                   n−3       –    3.33   3.33   3.53   3.68   3.88   3.96   4.11   4.04   4.20
                   n−4       –      –      –    3.35   3.64   3.80   3.96   4.05   3.99   4.12

¹ From Section 4.2. ² From Section 4.1. A dash marks a cell for which no value is given.

4.2 A simple theoretical method

Although we recommend simulation for finding the guidelines, it may sometimes not be feasible to run special simulations for every new size of table encountered. A rather crude approximation to the guidelines for the most extreme observation can be obtained as follows. For $n$ independent points from the normal distribution of zero mean and unit variance, $W_n$, the largest value, is such that

\[ \mathrm{pr}(W_n \le w) = \{\Phi(w)\}^n, \tag{4.2} \]

where $\Phi$ is the standard normal distribution function. It follows that $w_{\alpha n}$, the value of $W_n$ exceeded with probability $\alpha$, is such that

\[ 1 - \alpha = \{\Phi(w_{\alpha n})\}^n, \tag{4.3} \]

\[ w_{\alpha n} = \Phi^{-1}\{(1 - \alpha)^{1/n}\}. \tag{4.4} \]
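Equation (4.4) is simple enough to evaluate directly, and a minimal Python sketch is given below; the function name `calculated_w` is hypothetical. Evaluated as written, the expression gives about 2.80 for n = 20 and α = 0.05. The ‘Calculated’ entry of 3.02 in Table 1 is what the same expression gives with α/2 in place of α, which may indicate a two-sided convention that the text does not state.

```python
from statistics import NormalDist

def calculated_w(n, alpha=0.05):
    """Equation (4.4): w = Phi^{-1}{(1 - alpha)^(1/n)} for the largest of n points."""
    return NormalDist().inv_cdf((1.0 - alpha) ** (1.0 / n))

one_sided = calculated_w(20, alpha=0.05)    # expression exactly as written
two_sided = calculated_w(20, alpha=0.025)   # alpha halved, for comparison
```

Whichever convention is adopted, the limit grows only slowly with n, so a value computed for a nearby table size is unlikely to mislead an informal reading of the plot.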
Equation (4.4) suggests putting a guide-line at $w_{\alpha n}\tilde\sigma$, ignoring errors of estimating $\sigma$. Table 1 compares the calculated values with the simulated values. The values agree to within about 10%.

5. ILLUSTRATIVE EXAMPLES

Cancer registration data, collated for England and Wales by the Office for National Statistics from information supplied by regional cancer registries (Law et al., 1999), are used here for illustrative purposes. This large data set comprised over three million anonymized individual registrations diagnosed between 1971 and 1990, each one of which was coded to one of 39 sites and, for those who were employed at the time of diagnosis, one of 212 possible occupations. The following examples are based on the 900 000 men whose occupation and other essential information was fully coded at the time of cancer registration (Law et al., 1999). For each combination of cancer site and occupation, the fitted (expected) number of cases was calculated using a proportional model with no adjustment for potential confounders.

The exact process by which a large table would be analysed is, in part, dependent upon the characteristics of the data, such as previously suspected associations. The approach taken for these data may not be appropriate for other applications and is not meant as a prescription.

Illustration of the method is given by examination of both individual sites and individual occupations of interest. The total index of sensitivity, $\tilde\sigma_i$, was calculated separately for each of the 39 cancer sites and 212 occupations. The estimate of $\tilde\sigma_i^2$ for the site ‘Thymus and mediastinum’ was less than zero, indicating that occupation appears to have had no effect on the incidence of this particular cancer. For all other sites, however, variation in excess of a Poisson distribution was observed: $\tilde\sigma_i$ ranged from 0.03 to 0.68, with, for example, 36 of 39 sites (92%) in the range 0.0–0.6. The distribution of cancer within 19 of the 212 occupations did not vary in excess of a Poisson distribution; these included ‘Foresters and woodmen’ and ‘Personnel managers’. The mean value for the occupations was 0.33, with a maximum of 0.73 (‘Spinners, doublers and twisters’). In comparison with the site-specific estimates, there were 189 of 212 (89%) occupations with a total index of sensitivity between 0.0 and 0.6.

Examples showing plots of empirical Bayesian ratio estimates ($R^*_{is}$) of sites with low and high $\tilde\sigma_i$ are shown in Figures 1 and 2 respectively. The diagonal solid line on the plots indicates a line with a slope of 1, to allow a visual comparison of $\tilde\sigma_i$ between plots, with guide rails at α = 0.05 and α = 0.01 from simulation. Figure 1 offers little support for the suggestion that colon cancer has an occupational aetiology: the points form a smooth curve, with no apparent anomalous occupations, either high or low, and all points remain substantially within the guide rails.

Fig. 1. Ordered normal plot of the log of the empirical Bayesian observed–expected ratio for ‘cancer of the colon’ in males ($R^*_{is}$), with guide-rails (set at α = 0.05 and α = 0.01). [Axes: $R^*_{is}$ from −0.6 to 0.6 against ordered normal deviate from −2.76 to 2.76.]

By contrast, the well-known relation between occupational exposure to asbestos and pleural cancer (Hutchings et al., 1995) (Figure 2) is supported by the relatively high value of $\tilde\sigma_i$ (0.62), which gives a plot considerably steeper than that for colon cancer (0.03). Further, for pleural cancer the highest empirical Bayesian estimate of the observed-to-expected ratio, 6.20 for ‘Metal plate workers and riveters’, lies well above the guide rails and calculated limit. This may not have been so apparent from the plot without the assistance of the guide-lines. The suggested association between pleural cancer and ‘Metal plate workers and riveters’, highlighted in Figure 2, is supported by inspection of the corresponding plot for this occupation (Figure 3).
The total index of sensitivity for ‘Metal plate workers and riveters’ was 0.32, and the clearly anomalous point at the top of the graph relates to the pleural cancer estimate. The empirical Bayesian estimate for the observed-to-expected ratio was 6.17, which in this particular case is similar to the site-specific estimate.
Fig. 2. Ordered normal plot of the log of the observed–expected ratio for ‘pleural cancer’ in males ($R^*_{is}$), with guide-rails (set at α = 0.05 and α = 0.01). [Axes: $R^*_{is}$ from −2 to 2 against ordered normal deviate from −2.76 to 2.76.]

Fig. 3. Ordered normal plot of the log of the observed–expected ratio for ‘Metal plate workers and riveters’ in males ($R^*_{is}$), with guide-rails (set at α = 0.05 and α = 0.01). [Axes: $R^*_{is}$ from −1.85 to 1.85 against ordered normal deviate from −2.15 to 2.15.]
Fig. 4. Ordered normal plot of the log of the observed–expected ratio for ‘Plumbers, fitters and heating engineers not elsewhere classified’ in males ($R^*_{is}$), with guide-rails (set at α = 0.05 and α = 0.01). [Axes: $R^*_{is}$ from −1.2 to 1.2 against ordered normal deviate from −2.15 to 2.15.]
The importance of examining all the disease and exposure plots side by side is further illustrated in Figure 4, which shows the occupational plot for ‘Plumbers, fitters and heating engineers not elsewhere classified’. The anomalous point at the top of the plot is for pleural cancer. The observed-to-expected ratio for this job group was second highest on the pleural cancer line, but did not exceed the guide-lines. The anomalous estimate on the occupational plot (Figure 4), taken together with the magnitude of $\tilde\sigma_i$ for pleural cancer (Figure 2), is clearly indicative of a strong occupational effect.

6. CONCLUSION

The examination of large two-way tables, such as those arising in biomedical studies, is complicated by the number of potential associations, with abnormally high or low (statistically significant) values occurring simply by chance. Our approach has been to explore the data holistically, rather than to formally test specific hypotheses, with the aim of reducing the number of spurious associations. The illustrative examples were taken from a large table of data which, for reasons detailed earlier, does not warrant over-complicated methods of exploration. These tools give the analyst a certain amount of freedom to draw conclusions from data. The methods have wide applications in the fields of epidemiology and genetics and potentially in other fields of science.

REFERENCES

CARPENTER, L. M., MACONOCHIE, N. E. S., ROMAN, E. AND COX, D. R. (1997). Examining associations between occupation and health by using routinely collected data. Journal of the Royal Statistical Society, Series A 160, 507–521.
DREVER, F. (1995). Occupational Health. Decennial Supplement for England and Wales, (Series DS no. 10). London: HMSO.

DUMOUCHEL, W. (1999). Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. American Statistician 53, 177–190.

HUTCHINGS, S., JONES, J. AND HODGSON, J. (1995). Asbestos related diseases. Occupational Health. Decennial Supplement for England and Wales, (Series DS no. 10). London: HMSO, pp. 127–148.

LAIRD, N. M. (1978). Non-parametric maximum-likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805–811.

LAW, G. R., ROMAN, E. AND SIMPSON, J. (1999). Occupational cancer: the role of routine cancer registration. Health Statistics Quarterly 1, 16–20.

OLGUÍN, J. AND FEARN, T. (1997). A new look at half-normal plots for assessing the significance of contrasts for unreplicated factorials. Applied Statistics 46, 449–462.

SIMPSON, J., ROMAN, E., LAW, G. AND PANNETT, B. (1999). Women's occupation and cancer: preliminary analysis of cancer registrations in England and Wales, 1971–90. American Journal of Industrial Medicine 36, 172–185.

[Received April 3, 2000; revised July 31, 2000; accepted for publication August 1, 2000]