Title summarize — Summary statistics
Syntax

    summarize [varlist] [weight] [if exp] [in range] [, { detail | meanonly } format ]

by ... : may be used with summarize; see [R] by.
aweights and fweights are allowed.
The varlist following summarize may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description
summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified, then summary statistics are calculated for all the variables in the dataset. Also see [R] ci for calculating the standard error and confidence intervals of the mean.
Options

detail produces additional statistics, including skewness, kurtosis, the four smallest and largest values, and various percentiles.

meanonly, which is allowed only when detail is not specified, suppresses the display of results and calculation of the variance. Ado-file writers will find this useful for fast calls.

format requests that the summary statistics be displayed using the display formats associated with the variables, rather than the default g display format; see [U] 15.5 Formats: Controlling how data are displayed.
Remarks

summarize can produce two different sets of summary statistics. Without the detail option, the number of nonmissing observations, the mean and standard deviation, and the minimum and maximum values are presented. With detail, the same information is presented along with the variance, skewness, and kurtosis; the four smallest and four largest values; and the 1st, 5th, 10th, 25th, 50th (median), 75th, 90th, 95th, and 99th percentiles.
> Example

You have data containing information on various automobiles, among which is the variable mpg, the mileage rating. We can obtain a quick summary of the mpg variable by typing

. summarize mpg

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      74     21.2973   5.785503         12         41
We see that we have 74 observations. The mean of mpg is 21.3 miles per gallon, and the standard deviation is 5.79. The minimum is 12 and the maximum is 41. If we had not specified the variable (or variables) we wanted to summarize, we would have obtained summary statistics on all the variables in the dataset:

. summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        make |       0
       price |      74    6165.257   2949.496       3291      15906
         mpg |      74     21.2973   5.785503         12         41
       rep78 |      69    3.405797   .9899323          1          5
      weight |      74    3019.459   777.1936       1760       4840
     foreign |      74    .2972973   .4601885          0          1
Notice that there are only 69 observations on rep78, so some of the observations are missing. There are no observations on make since it is a string variable.
> Example

The detail option provides all the information of a normal summarize and more. The format of the output also differs:

. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. Dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005
As in the previous example, we see that the mean of mpg is 21.3 miles per gallon and that the standard deviation is 5.79. We also see the various percentiles. The median of mpg (the 50th percentile) is 20 miles per gallon. The 25th percentile is 18 and the 75th percentile is 25. When we performed summarize, we learned that the minimum and maximum were 12 and 41, respectively. We now see that the four smallest values in our dataset are 12, 12, 14, and 14. The four largest values are 34, 35, 35, and 41. The skewness of the distribution is 0.95, and the kurtosis is 3.98. (A normal distribution would have a skewness of 0 and a kurtosis of 3.)

Skewness is a measure of the lack of symmetry of a distribution. If the coefficient of skewness is 0, the distribution is symmetric. If the coefficient is negative, the median is usually greater than the mean and the distribution is said to be skewed left. If the coefficient is positive, the median is usually less than the mean and the distribution is said to be skewed right. Kurtosis (from the Greek kyrtosis, meaning curvature) is a measure of the peakedness of a distribution. The smaller the coefficient of kurtosis, the flatter the distribution. The normal distribution has a coefficient of kurtosis of 3 and provides a convenient benchmark.

On a historical note, see Plackett (1958) for a history of the concept of the mean.
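The skewness and kurtosis coefficients are simple moment ratios, and the benchmark values above are easy to check numerically. The following Python sketch (our own function, not Stata code) computes both from raw data:

```python
def skewness_kurtosis(x):
    """Coefficients of skewness (m3 * m2^-1.5) and kurtosis (m4 * m2^-2)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n   # 2nd moment about the mean
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A symmetric sample has skewness 0; kurtosis below 3 indicates a
# distribution flatter than the normal benchmark.
print(skewness_kurtosis([1, 2, 3, 4, 5]))
```

A right-skewed sample such as [1, 1, 1, 10] gives a positive skewness coefficient, matching the sign convention described above.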
> Example

summarize can usefully be combined with the by varlist: prefix. In our dataset we have a variable foreign that distinguishes foreign and domestic cars. We can obtain summaries of mpg and weight within each subgroup by typing

. by foreign: summarize mpg weight

-> foreign = Domestic

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      52    19.82692   4.743297         12         34
      weight |      52    3317.115   695.3637       1800       4840

-> foreign = Foreign

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      22    24.77273   6.611187         14         41
      weight |      22    2315.909   433.0035       1760       3420
Domestic cars in our dataset average 19.8 miles per gallon, whereas foreign cars average 24.8. Since by varlist: can be combined with summarize, it can also be combined with summarize, detail:

. by foreign: summarize mpg, detail
-> foreign = Domestic

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  52
25%         16.5             14       Sum of Wgt.          52

50%           19                      Mean           19.82692
                        Largest       Std. Dev.      4.743297
75%           22             28
90%           26             29       Variance       22.49887
95%           29             30       Skewness       .7712432
99%           34             34       Kurtosis       3.441459

-> foreign = Foreign

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           14             14
 5%           17             17
10%           17             17       Obs                  22
25%           21             18       Sum of Wgt.          22

50%         24.5                      Mean           24.77273
                        Largest       Std. Dev.      6.611187
75%           28             31
90%           35             35       Variance       43.70779
95%           35             35       Skewness        .657329
99%           41             41       Kurtosis        3.10734
Technical Note

summarize respects display formats if you specify the format option. When we type summarize price weight, we obtain

. summarize price weight

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       price |      74    6165.257   2949.496       3291      15906
      weight |      74    3019.459   777.1936       1760       4840

The display is accurate but is not as aesthetically pleasing as you may wish, particularly if you plan to use the output directly in published work. By placing formats on the variables, you can control how the table appears:

. format price weight %9.2fc
. summarize price weight, format

    Variable |     Obs        Mean   Std. Dev.        Min         Max
-------------+-------------------------------------------------------
       price |      74    6,165.26   2,949.50   3,291.00   15,906.00
      weight |      74    3,019.46     777.19   1,760.00    4,840.00
Q If you specify a weight (see [U] 14.1.6 weight), each observation is multiplied by the value of the weighting expression before the summary statistics are calculated, so that the weighting expression is interpreted as the discrete density of each observation.
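For integer frequency weights, this multiplication is the same as replicating each observation w_i times before averaging. A short Python illustration (the function name is ours):

```python
def weighted_mean(x, w):
    """Mean with each observation multiplied by its weight."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)

x = [12.0, 20.0, 41.0]
w = [2, 3, 1]                    # integer frequency weights

# The weighted mean equals the plain mean of the expanded data.
expanded = [xi for xi, wi in zip(x, w) for _ in range(wi)]
assert weighted_mean(x, w) == sum(expanded) / len(expanded)
print(weighted_mean(x, w))
```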
> Example

You have 1980 Census data on each of the 50 states. Included in your variables is medage, the median age of the population of each state. If you type summarize medage, you obtain unweighted statistics:

. summarize medage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      medage |      50       29.54   1.693445       24.2       34.7
Also among your variables is pop, the population in each state. Typing summarize medage [w=pop] produces population-weighted statistics:

. summarize medage [w=pop]
(analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------------------
      medage |      50   225907472    30.11047    1.66933       24.2       34.7
The number listed under Weight is the sum of the weighting variable, pop. It indicates that there are roughly 226 million people in the U.S. The pop-weighted mean of medage is 30.11 (as compared with 29.54 for the unweighted statistic), and the weighted standard deviation is 1.67 (as compared with 1.69).
> Example

You can obtain detailed summaries of weighted data as well. When you do this, all the statistics are weighted, including the percentiles.

. summarize medage [w=pop], detail
(analytic weights assumed)

                          Median age
-------------------------------------------------------------
      Percentiles      Smallest
 1%         27.1           24.2
 5%         27.7           26.1
10%         28.2           27.1       Obs                  50
25%         29.2           27.4       Sum of Wgt.   225907472

50%         29.9                      Mean           30.11047
                        Largest       Std. Dev.       1.66933
75%         30.9             32
90%         32.1           32.1       Variance       2.786661
95%         32.2           32.2       Skewness       .5281972
99%         34.7           34.7       Kurtosis       4.494223
Technical Note

You are writing a program and need to access the mean of a variable. The meanonly option provides for fast calls. For example, suppose your program reads as follows:

program define mean
        summarize `1', meanonly
        display " mean = " r(mean)
end

The result of executing this is

. mean price
 mean = 6165.2568
Saved Results

summarize saves in r():

Scalars
    r(N)          number of observations
    r(mean)       mean
    r(skewness)   skewness (detail only)
    r(min)        minimum
    r(max)        maximum
    r(sum_w)      sum of the weights
    r(p1)         1st percentile (detail only)
    r(p5)         5th percentile (detail only)
    r(p10)        10th percentile (detail only)
    r(p25)        25th percentile (detail only)
    r(p50)        50th percentile (detail only)
    r(p75)        75th percentile (detail only)
    r(p90)        90th percentile (detail only)
    r(p95)        95th percentile (detail only)
    r(p99)        99th percentile (detail only)
    r(Var)        variance
    r(kurtosis)   kurtosis (detail only)
    r(sum)        sum of variable
    r(sd)         standard deviation
Methods and Formulas

Let x denote the variable on which we want to calculate summary statistics, and let x_i, i = 1, ..., n, denote an individual observation on x. Let v_i be the weight, and if no weight is specified, define v_i = 1 for all i. Define V as the sum of the weights:

        V = sum_{i=1}^{n} v_i

Define w_i to be v_i normalized to sum to n, w_i = v_i (n/V). The mean, x̄, is defined as

        x̄ = (1/n) sum_{i=1}^{n} w_i x_i

The variance, s^2, is defined as

        s^2 = 1/(n-1) sum_{i=1}^{n} w_i (x_i - x̄)^2

The standard deviation, s, is defined as sqrt(s^2).

Define m_r as the rth moment about the mean x̄:

        m_r = (1/n) sum_{i=1}^{n} w_i (x_i - x̄)^r

The coefficient of skewness is then defined as m_3 m_2^{-3/2}, and the coefficient of kurtosis is defined as m_4 m_2^{-2}.

Let x_(i) refer to the x in ascending order, and let w_(i) refer to the corresponding weights of x_(i). The four smallest values are x_(1), x_(2), x_(3), and x_(4). The four largest values are x_(n), x_(n-1), x_(n-2), and x_(n-3).

To obtain the pth percentile, which we will denote as x_[p], let P = np/100. Let

        W_i = sum_{j=1}^{i} w_(j)

Find the first index i such that W_i > P. The pth percentile is then

        x_[p] = (x_(i-1) + x_(i)) / 2    if W_(i-1) = P
                x_(i)                    otherwise
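The formulas above translate directly into code. The following Python sketch (our own naming, not Stata's source) applies the normalization w_i = v_i(n/V), the n-1 divisor for the variance, and the cumulative-weight rule for percentiles:

```python
import math

def summarize(x, v, p=50):
    """Weighted mean, variance, and pth percentile per the formulas above."""
    n = len(x)
    V = sum(v)
    w = [vi * n / V for vi in v]              # normalize weights to sum to n
    mean = sum(wi * xi for wi, xi in zip(w, x)) / n
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / (n - 1)
    # Percentile: sort, accumulate weights, find the first i with W_i > P.
    pairs = sorted(zip(x, w))
    P = n * p / 100
    W = 0.0
    for i, (xi, wi) in enumerate(pairs):
        W_prev, W = W, W + wi
        if W > P:
            if math.isclose(W_prev, P):       # landed exactly on P: average
                return mean, var, (pairs[i - 1][0] + xi) / 2
            return mean, var, xi
    return mean, var, pairs[-1][0]

# Unweighted check: v_i = 1 recovers the ordinary mean, variance, and median.
print(summarize([12, 14, 20, 25, 41], [1, 1, 1, 1, 1]))
```

With unit weights and an even number of observations, the W_(i-1) = P branch reproduces the familiar average-of-the-two-middle-values median.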
References

Gleason, J. R. 1997. sg67: Univariate summaries with boxplots. Stata Technical Bulletin 36: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 179-183.
——. 1999. sg67.1: Update to univar. Stata Technical Bulletin 51: 27-28. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 159-161.
Hamilton, L. C. 1996. Data Analysis for Social Scientists. Pacific Grove, CA: Brooks/Cole Publishing Company.
Plackett, R. L. 1958. The principle of the arithmetic mean. Biometrika 45: 130-135.
Stuart, A. and J. K. Ord. 1994. Kendall's Advanced Theory of Statistics, Vol. 1. 6th ed. London: Edward Arnold.
Weisberg, H. F. 1992. Central Tendency and Variability. Newbury Park, CA: Sage Publications.
Also See

Related:    [R] centile, [R] cf, [R] ci, [R] codebook, [R] compare, [R] describe, [R] egen, [R] inspect, [R] lv, [R] means, [R] pctile, [R] st stsum, [R] svymean, [R] table, [R] tabstat, [R] tabsum, [R] xtsum
Title sureg — Zellner's seemingly unrelated regression

Syntax

Basic syntax

    sureg (depvar_1 varlist_1) (depvar_2 varlist_2) ... (depvar_N varlist_N) [weight] [if exp] [in range]

Full syntax

    sureg ([eqname_1:] depvar_1a [depvar_1b ... =] varlist_1 [, noconstant])
          ([eqname_2:] depvar_2a [depvar_2b ... =] varlist_2 [, noconstant])
          ...
          ([eqname_N:] depvar_Na [depvar_Nb ... =] varlist_N [, noconstant])
          [weight] [if exp] [in range]
          [, corr constraints(numlist) isure dfk dfk2 small noheader notable level(#) maximize_options]

by ... : may be used with sureg; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
The depvars and the varlists may contain time-series operators; see [U] 14.4.3 Time-series varlists.
sureg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Explicit equation naming (eqname:) cannot be combined with multiple dependent variables in an equation specification.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, equation(eqno[,eqno]) xb stdp difference stddp residuals]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

sureg estimates seemingly unrelated regression models (Zellner 1962, Zellner and Huang 1962, Zellner 1963). The acronyms SURE and SUR are often used for the estimator.
Options

noconstant omits the constant term (intercept) from the equation on which the option is specified.

corr displays the correlation matrix of the residuals between equations and performs a Breusch-Pagan test for independent equations; i.e., that the disturbance covariance matrix is diagonal.

constraints(numlist) specifies by number the linear constraint(s) to be applied to the system. By default, sureg estimates an unconstrained system. See [R] reg3 for an example using constraints with a system estimator.

isure specifies that sureg should iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge. Under seemingly unrelated regression, this iteration converges to the maximum likelihood results. If this option is not specified, sureg produces two-step estimates.

dfk specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. As an asymptotically justified estimator, sureg by default uses the number of sample observations (n) as a divisor. When the dfk option is set, a small-sample adjustment is made and the divisor is taken to be sqrt((n - k_i)(n - k_j)), where k_i and k_j are the numbers of parameters in equations i and j, respectively.

dfk2 specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. When the dfk2 option is set, the divisor is taken to be the mean of the residual degrees of freedom from the individual equations; this was the default divisor for sureg before version 6.0.

small specifies that small-sample statistics are to be computed. It shifts the test statistics from chi-squared and z statistics to F statistics and t statistics. While the standard errors from each equation are computed using the degrees of freedom for the equation, the degrees of freedom for the t statistics are all taken to be those for the first equation. Before version 6.0, sureg reported small-sample statistics.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
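To make the two-step estimator and the dfk divisor concrete, here is an illustrative numpy sketch of feasible GLS for a system of equations (our own naming, a simplification for exposition, not sureg's internals):

```python
import numpy as np

def sureg_two_step(ys, Xs, dfk=False):
    """Two-step FGLS for seemingly unrelated regressions.

    ys: list of M response vectors (length n each)
    Xs: list of M design matrices (n x k_i, constant included)
    dfk: divide Sigma by sqrt((n-k_i)(n-k_j)) instead of n
    """
    n, M = len(ys[0]), len(ys)
    # Step 1: equation-by-equation OLS residuals
    resid = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                             for y, X in zip(ys, Xs)])
    ks = [X.shape[1] for X in Xs]
    dof = [n - k for k in ks]
    div = np.sqrt(np.outer(dof, dof)) if dfk else np.full((M, M), float(n))
    Sigma = (resid.T @ resid) / div           # estimated disturbance covariance
    # Step 2: GLS on the stacked system, Omega = Sigma kron I_n
    X = np.zeros((M * n, sum(ks)))
    col = 0
    for i, Xi in enumerate(Xs):
        X[i * n:(i + 1) * n, col:col + ks[i]] = Xi
        col += ks[i]
    y = np.concatenate(ys)
    Oinv = np.kron(np.linalg.inv(Sigma), np.eye(n))
    b = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)
    return b, Sigma
```

With identical regressors in every equation, this FGLS step reduces algebraically to equation-by-equation OLS, which is the property exploited in the first example below.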
Options for predict

equation(eqno[,eqno]) specifies the equation(s) to which you are referring. You can refer to the equations by their names; equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), the results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.
xb, the default, calculates the fitted values, the prediction of x_j b for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

difference calculates the difference between the linear predictions of two equations in the system. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_1j b - x_2j b) between equations 1 and 2 is calculated.

residuals calculates the residuals.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
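The quantity behind stddp is a quadratic form d'Vd, where d stacks the first equation's covariate row against the negated second one and V is the joint covariance matrix of the estimators. A numpy sketch (illustrative; the function name is ours):

```python
import numpy as np

def std_diff_pred(x1, x2, V):
    """SE of the difference in linear predictions x1'b1 - x2'b2.

    x1, x2: covariate rows for the two equations, in the same order as
    the stacked coefficient vector; V: full cross-equation covariance
    matrix of the estimators.
    """
    d = np.concatenate([x1, -x2])        # contrast vector for the difference
    return float(np.sqrt(d @ V @ d))

# Toy case: one coefficient per equation, uncorrelated estimators, so the
# variance of the difference is just the sum of the variances.
V = np.diag([4.0, 9.0])
print(std_diff_pred(np.array([1.0]), np.array([1.0]), V))
```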
Remarks

Seemingly unrelated regression models are so called because they appear to be joint estimates of several regression models, each with its own error term. The regressions are related because the (contemporaneous) errors associated with the dependent variables may be correlated.
> Example

When you estimate models with the same set of right-hand-side variables, the seemingly unrelated regression results (in terms of coefficients and standard errors) are the same as estimating the models separately (using, say, regress). The same is true when the models are nested. Even in such cases, sureg is useful when you want to perform joint tests. For instance, let us assume that you think

    price  = β0 + β1 foreign + β2 length + u1
    weight = γ0 + γ1 foreign + γ2 length + u2

Since the models have the same set of explanatory variables, you could estimate the two equations separately. Yet, you might still choose to estimate them with sureg because you want to perform the joint test β1 = γ1 = 0. We use the small and dfk options to obtain small-sample statistics comparable to regress or mvreg.

. sureg (price foreign length) (weight foreign length), small dfk

Seemingly unrelated regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"      F-Stat        P
----------------------------------------------------------------------
price              74      2    2474.593    0.3154    16.35382   0.0000
weight             74      2    250.2515    0.8992    316.5447   0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |   2801.143    766.117     3.66    0.000     1286.674    4315.611
      length |   90.21239   15.83368     5.70    0.000     58.91219    121.5126
       _cons |  -11621.35   3124.436    -3.72    0.000    -17797.77    -5444.93
-------------+----------------------------------------------------------------
weight       |
     foreign |  -133.6775   77.47615    -1.73    0.087    -286.8332     19.4782
      length |   31.44455   1.601234    19.64    0.000     28.27921    34.60989
       _cons |   -2880.25   318.9691    -9.03    0.000    -3474.861   -2225.639
------------------------------------------------------------------------------
These two equations have a common set of regressors, and we could have used a shorthand syntax to specify the equations:

. sureg (price weight = foreign length), small dfk
In this case, the results presented by sureg are the same as if we had estimated the equations separately:

. regress price foreign length
 (output omitted)
. regress weight foreign length
 (output omitted)
There is, however, a difference. We have allowed u1 and u2 to be correlated and have estimated the full variance-covariance matrix of the coefficients. sureg has estimated the correlations, but it does not report them unless we specify the corr option. We did not remember to specify corr when we estimated the model, but we can redisplay the results:

. sureg, notable noheader corr

Correlation matrix of residuals:
            price   weight
  price    1.0000
  weight   0.5840   1.0000

Breusch-Pagan test of independence: chi2(1) = 25.237, Pr = 0.0000
The notable and noheader options prevented sureg from redisplaying the header and coefficient tables. We find that, for the same cars, the correlation of the residuals in the price and weight equations is .5840 and that we can reject the hypothesis that this correlation is zero. We can perform a test that the coefficients on foreign are jointly zero in both equations, as we set out to do, by typing test foreign; see [R] test. When we type a variable without specifying the equation, that variable is tested for zero in all equations in which it appears:

. test foreign

 ( 1)  [price]foreign = 0.0
 ( 2)  [weight]foreign = 0.0

       F(  2,    71) =   17.99
            Prob > F =    0.0000
> Example

When the models do not have the same set of explanatory variables and are not nested, sureg may lead to more efficient estimates than running the models separately as well as allowing joint tests. This time, let us assume you believe

    price  = β0 + β1 foreign + β2 mpg + β3 displ + u1
    weight = γ0 + γ1 foreign + γ2 length + u2

To estimate this model, you type

. sureg (price foreign mpg displ) (weight foreign length), corr

Seemingly unrelated regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"       chi2        P
----------------------------------------------------------------------
price              74      3    2165.321    0.4537    49.6383   0.0000
weight             74      2    245.2916    0.8990   661.8418   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.       z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |    3058.25   685.7357     4.46    0.000     1714.233    4402.267
         mpg |  -104.9591   58.47209    -1.80    0.073    -219.5623    9.644042
displacement |   18.18098   4.286372     4.24    0.000     9.779842    26.58211
       _cons |   3904.336   1966.521     1.99    0.047      50.0263    7758.645
-------------+----------------------------------------------------------------
weight       |
     foreign |  -147.3481   75.44314    -1.95    0.051    -295.2139     .517755
      length |   30.94905   1.539895    20.10    0.000     27.93091    33.96718
       _cons |  -2753.064   303.9336    -9.06    0.000    -3348.763   -2157.365
------------------------------------------------------------------------------

Correlation matrix of residuals:
            price   weight
  price    1.0000
  weight   0.3285   1.0000

Breusch-Pagan test of independence: chi2(1) = 7.984, Pr = 0.0047
By way of comparison, had we estimated the price model separately:

. regress price foreign mpg displ

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   20.13
       Model |   294104790     3  98034929.9           Prob > F      =  0.0000
    Residual |   340960606    70  4870865.81           R-squared     =  0.4631
-------------+------------------------------           Adj R-squared =  0.4401
       Total |   635065396    73  8699525.97           Root MSE      =  2207.0

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   3545.484   712.7763     4.97    0.000     2123.897    4967.072
         mpg |  -98.88559   63.17063    -1.57    0.122    -224.8754    27.10426
displacement |   22.40416   4.634239     4.83    0.000     13.16146    31.64686
       _cons |    2796.91   2137.873     1.31    0.195    -1466.943    7060.763
------------------------------------------------------------------------------

The coefficients are slightly different but the standard errors are uniformly larger. This would still be true if we specified the dfk option to make a small-sample adjustment to the estimated covariance of the disturbances.
Technical Note

Constraints can be applied to SURE models using Stata's standard syntax for constraints. For a general discussion of constraints, see [R] constraint; for examples similar to seemingly unrelated regression models, see [R] reg3.
Saved Results

sureg saves in e():

Scalars
    e(N)           number of observations
    e(k)           number of parameters in system
    e(k_eq)        number of equations
    e(mss_#)       model sum of squares for equation #
    e(df_m#)       model degrees of freedom for equation #
    e(ll)          log likelihood

Macros
    e(cmd)         sureg
    e(depvar)      name(s) of dependent variable(s)
    e(exog)        names of exogenous variables
    e(eqnames)     names of equations
    e(corr)        correlation structure
    e(wtype)       weight type
    e(wexp)        weight expression
    e(method)      requested estimation method
    e(small)       small
    e(predict)     program used to implement predict
    e(dfk)         alternate divisor (dfk or dfk2 only)

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(Sigma)       Σ matrix

Functions
    e(sample)      marks estimation sample
Methods and Formulas

sureg is implemented as an ado-file. It uses the asymptotically efficient, feasible generalized least-squares algorithm described in Greene (2000, 614-636). The computing formulas are given on page 675.

The R-squared reported is the percent of variance explained by the predictors. It may be used for descriptive purposes, but R-squared is not a well-defined concept when GLS is used.

sureg will refuse to compute the estimators if (1) the same equation is named more than once, or (2) the covariance matrix of the residuals is singular.

The Breusch and Pagan (1980) chi-squared statistic, a Lagrange multiplier statistic, is given by

        λ = T sum_{m=1}^{M} sum_{n=1}^{m-1} r_mn^2

where r_mn is the estimated correlation between the residuals of equations m and n, and T is the number of observations. It is distributed as chi-squared with M(M-1)/2 degrees of freedom.
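The λ statistic can be computed directly from the residual correlation matrix. A Python sketch (illustrative, not Stata's implementation):

```python
import numpy as np

def breusch_pagan_lm(resid):
    """T times the sum of squared below-diagonal residual correlations.

    resid: T x M array of residuals from the M equations. The result is
    asymptotically chi-squared with M(M-1)/2 degrees of freedom.
    """
    T, M = resid.shape
    r = np.corrcoef(resid, rowvar=False)     # M x M correlation matrix
    lam = sum(r[m, n] ** 2 for m in range(1, M) for n in range(m))
    return T * lam

# With two equations, lambda = T * r12^2; e.g., T = 74 and r12 = 0.5840
# give about 25.24, consistent with the test statistic reported earlier.
```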
References

Breusch, T. and A. Pagan. 1980. The LM test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-254.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Weesie, J. 1999. sg121: Seemingly unrelated estimation and the cluster-adjusted sandwich estimator. Stata Technical Bulletin 52: 34-47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 231-248.
Zellner, A. 1962. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57: 348-368.
——. 1963. Estimators for seemingly unrelated regression equations: Some exact finite sample results. Journal of the American Statistical Association 58: 977-992.
Zellner, A. and D. S. Huang. 1962. Further properties of efficient estimators for seemingly unrelated regression equations. International Economic Review 3: 300-313.
Also See Complementary:
[R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce
Related:
[R] ivreg, [R] mvreg, [R] reg3, [R] regress
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands
Title svy — Introduction to survey commands
Description

The prefix svy refers to survey data. All the commands for analyzing these data begin with svy. The svy commands are

    svyreg       [R] svy estimators   Linear regression for survey data
    svyivreg     [R] svy estimators   Instrumental variables regression for survey data
    svyintreg    [R] svy estimators   Censored and interval regression for survey data
    svylogit     [R] svy estimators   Logistic regression for survey data
    svyprobit    [R] svy estimators   Probit models for survey data
    svymlogit    [R] svy estimators   Multinomial logistic regression for survey data
    svyologit    [R] svy estimators   Ordered logistic regression for survey data
    svyoprobit   [R] svy estimators   Ordered probit models for survey data
    svypois      [R] svy estimators   Poisson regression for survey data

    svydes       [R] svydes           Describe strata and PSUs of survey data
    svylc        [R] svylc            Estimate linear combinations of parameters
                                      (e.g., differences of means, regression coefficients)

    svymean      [R] svymean          Estimation of population and subpopulation means
    svyprop      [R] svymean          Estimation of population and subpopulation proportions
    svyratio     [R] svymean          Estimation of population and subpopulation ratios
    svytotal     [R] svymean          Estimation of population and subpopulation totals

    svyset       [R] svyset           Set variables for survey data
    svytab       [R] svytab           Two-way tables for survey data
    svytest      [R] svytest          Hypotheses tests for survey data
Remarks

Data from sample surveys generally have three important characteristics:

1. sampling weights, also called probability weights (pweights in Stata's syntax);
2. cluster sampling;
3. stratification.

The svy commands can be used when the sample has any or all of these features. For example, if the data are weighted, but the design did not involve clustering or stratification, these data can still be analyzed with the survey commands. The svy commands are also suitable for multistage sampling designs. For a general discussion of various aspects of survey designs and how to account for them in your analysis, see [U] 30 Overview of survey estimation.
Several other estimation commands in Stata have features that make them suitable for certain limited designs. For example, regress handles sampling weights properly when pweights are specified, and regress also has a cluster() option. However, the svy commands have capabilities that these other estimation commands do not have. Persons with bona fide survey data who care about getting all the details right should use the svy commands.

The svy commands compute the design effects deff and deft. The svy commands calculate adjusted Wald tests for the model F test. Using svytest after a svy estimation command, one can compute adjusted Wald tests and Bonferroni tests for other hypotheses. For survey data, svytest is used in place of test.

svymean is the sole command in Stata (other than setting up the estimation as a regression) that handles sampling weights properly for the estimation of means. Testing the difference of two means (the equivalent of a two-sample t test with sampling weights) can be done by running svymean with a by() option and then running the command svylc. svylc computes estimates of linear combinations of parameters, whether means, totals, ratios, proportions, or regression coefficients, after a svy estimation command. Used after svylogit, it can compute odds ratios for any covariate group relative to another. For survey data, svylc is used in place of lincom (see [R] lincom).

Persons wishing to use the svy commands should first glance at the svyset command described in [R] svyset. This allows you to set your pweight, strata, PSU (cluster), and FPC (finite population correction) variables at the outset rather than specifying them when you issue a command.

Programmers may want to use the _robust command to compute robust variance estimates for their own estimators; see [P] _robust. svyreg and the other svy estimation commands described in [R] svy estimators call this command to produce their variance estimates.
For more detailed introductions to complex survey data analysis, see Scheaffer et al. (1996), Stuart (1984), and Williams (1978), and, in particular, see Levy and Lemeshow (1999) who use Stata commands throughout their discussion. Advanced treatments and discussion of important special topics are given by Cochran (1977), Korn and Graubard (1999), Sarndal et al. (1992), Skinner et al. (1989), Thompson (1992), and Wolter (1985).
Acknowledgment The svy commands were developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University. We thank him for his invaluable assistance.
References

Cochran, W. G. 1977. Sampling Techniques. 3d ed. New York: John Wiley & Sons.
Eltinge, J. L. and W. M. Sribney. 1996. svy1: Some basic concepts for design-based analysis of complex survey data. Stata Technical Bulletin 31: 3-6. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 208-213.
Korn, E. L. and B. I. Graubard. 1999. Analysis of Health Surveys. New York: John Wiley & Sons.
Kott, P. S. 1991. A model-based look at linear regression with survey data. American Statistician 45: 107-112.
Levy, P. and S. Lemeshow. 1999. Sampling of Populations. 3d ed. New York: John Wiley & Sons.
Sarndal, C.-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag.
Scheaffer, R. L., W. Mendenhall, and L. Ott. 1996. Elementary Survey Sampling. 5th ed. Boston: Duxbury Press.
Skinner, C. J., D. Holt, and T. M. F. Smith, eds. 1989. Analysis of Complex Surveys. New York: John Wiley & Sons.
Stuart, A. 1984. The Ideas of Sampling. 3d ed. New York: Macmillan.
Thompson, S. K. 1992. Sampling. New York: John Wiley & Sons.
Williams, B. 1978. A Sampler on Sampling. New York: John Wiley & Sons.
Wolter, K. M. 1985. Introduction to Variance Estimation. New York: Springer-Verlag.
Also See

Complementary:  [R] svy estimators, [R] svydes, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest
Related:        [P] _robust
Background:     [U] 30 Overview of survey estimation
Title svy estimators — Estimation commands for complex survey data
Syntax

    svyreg varlist [weight] [if exp] [in range] [, common_options]

    svyivreg depvar [varlist1] (varlist2 = varlist_iv) [weight] [if exp] [in range] [, common_options]

    svyintreg depvar1 depvar2 [indepvars] [weight] [if exp] [in range]
        [, constraints(numlist) offset(varname) maximize_options common_options]

    svylogit varlist [weight] [if exp] [in range]
        [, or offset(varname) asis maximize_options common_options]

    svyprobit varlist [weight] [if exp] [in range]
        [, offset(varname) asis maximize_options common_options]

    svymlogit varlist [weight] [if exp] [in range]
        [, rrr basecategory(#) constraints(numlist) maximize_options common_options]

    svyologit varlist [weight] [if exp] [in range]
        [, offset(varname) maximize_options common_options]

    svyoprobit varlist [weight] [if exp] [in range]
        [, offset(varname) maximize_options common_options]

    svypois varlist [weight] [if exp] [in range]
        [, irr exposure(varname) offset(varname) constraints(numlist) maximize_options common_options]

The common_options are

    noconstant strata(varname) psu(varname) fpc(varname) subpop(varname) srssubpop
    noadjust eform level(#) prob ci deff deft meff meft
The commands typed without arguments redisplay previous results. The following options can be given when redisplaying results:

    or rrr irr eform level(#) prob ci deff deft
These commands allow pweights and iweights; see [U] 14.1.6 weight. These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Warning: Use of if or in restrictions will not produce correct variance estimates for subpopulations in many cases. To compute estimates for a subpopulation, use the subpop() option.
Syntax for predict

After svyreg and svyivreg, the syntax for predict is

    predict [type] newvarname [if exp] [in range] [, xb stdp residuals ]

After svyintreg, the syntax for predict is

    predict [type] newvarname [if exp] [in range] [, xb stdp pr(a,b) ]

After svylogit and svyprobit, the syntax for predict is

    predict [type] newvarname [if exp] [in range]
        [, { p | xb | stdp } rules asif nooffset ]

After svymlogit, svyologit, and svyoprobit, the syntax for predict is

    predict [type] newvarname(s) [if exp] [in range]
        [, { p | xb | stdp } outcome(outcome) nooffset ]

With the options xb and stdp, one new variable is specified. For svymlogit, the options xb and stdp require outcome() to be specified. With p (which is the default), if outcome() is specified, then only one new variable is specified; if outcome() is not specified, then k new variables must be specified (where k is the total number of outcomes).

After svypois, the syntax for predict is

    predict [type] newvarname [if exp] [in range] [, { n | ir | xb | stdp } nooffset ]

For all of the svy estimation commands, the statistics computed by predict are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. For subpopulation estimates, further restriction to the subpopulation may be warranted.
Description

These commands estimate regression models for complex survey data. Each is used in the same manner as the nonsurvey version of the command, so users should first familiarize themselves with the nonsurvey version.

svyreg estimates linear regression; see [R] regress.
svyivreg estimates linear regression models with instrumental variables; see [R] ivreg.
svyintreg estimates linear regression models with censored or interval data; see intreg in [R] tobit.
svylogit estimates logistic regression; see [R] logit.
svyprobit estimates probit models; see [R] probit.
svymlogit estimates multinomial logistic regression; see [R] mlogit.
svyologit estimates ordered logit models; see [R] ologit.
svyoprobit estimates ordered probit models; see [R] oprobit.
svypois estimates Poisson regression; see [R] poisson.

All of these estimators except svyreg and svyivreg are estimated using pseudo-maximum-likelihood methods. That is, the point estimates are those from a weighted "ordinary" maximum-likelihood estimator. For complex survey data, however, this weighted "likelihood" is not the distribution function for the sample; hence, it is not a true likelihood. Thus, it is termed a pseudo-likelihood. One of the consequences of this is that standard likelihood-ratio tests are no longer valid. Since the pseudo-likelihood is not suitable for statistical inference in the standard manner, it is not displayed by the svy estimators. See Skinner (1989, Section 3.4.4) for a discussion of pseudo-MLEs.

These svy estimation commands allow any or all of the following: probability sampling weights, stratification, and clustering. Associated variance estimates and design effects (deff and deft) are computed. The subpop() option will give estimates for a single subpopulation. For a general discussion of various aspects of survey designs, including multistage designs, see [U] 30 Overview of survey estimation.
Many of the options here are the same as those for the svymean, svytotal, and svyratio commands. See [R] svymean for a more thorough description of these shared svy command options. To describe the strata and PSUs of your data and to handle the error message "stratum with only one PSU detected", see [R] svydes. To estimate linear combinations of coefficients for any covariate group relative to another, see [R] svylc. svylc can also produce odds ratios after svylogit, relative risk ratios after svymlogit, and incidence rate ratios after svypois. To perform hypothesis tests, see [R] svytest. See [P] _robust for a programmer's command that can compute variance estimates for survey data for a user-defined program.
Options

The common_options are

noconstant estimates a model without the constant term (intercept).
strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers. strata() can also be specified with the svyset command; see the following examples and [R] svyset.

psu(varname) specifies the name of a variable (numeric or string) that contains identifiers for the primary sampling unit (i.e., the cluster). psu() can also be specified with the svyset command.
fpc(varname) requests a finite population correction for the variance estimates. If the variable specified has values less than or equal to 1, it is interpreted as a stratum sampling rate f_h = n_h/N_h, where n_h = number of PSUs sampled from stratum h and N_h = total number of PSUs in the population belonging to stratum h. If the variable specified has values greater than or equal to n_h, it is interpreted as containing N_h. fpc() can also be specified with the svyset command.

subpop(varname) specifies that estimates be computed for the single subpopulation defined by the observations for which varname ≠ 0. Typically, varname = 1 defines the subpopulation, and varname = 0 indicates observations not belonging to the subpopulation. For observations whose subpopulation status is uncertain, varname should be set to missing ('.').

srssubpop can only be specified if subpop() is specified. srssubpop requests that deff and deft be computed using an estimate of simple-random-sampling variance for sampling within a subpopulation. If srssubpop is not specified, deff and deft are computed using an estimate of simple-random-sampling variance for sampling from the entire population. Typically, srssubpop would be given when computing subpopulation estimates by strata or by groups of strata.

noadjust specifies that the model Wald test be carried out as W/k ~ F(k, d), where W is the Wald test statistic, k is the number of terms in the model excluding the constant term, d = total number of sampled PSUs minus the total number of strata, and F(k, d) is an F distribution with k numerator degrees of freedom and d denominator degrees of freedom. By default, an adjusted Wald test is conducted: (d - k + 1)W/(kd) ~ F(k, d - k + 1). See Korn and Graubard (1990) for a discussion of the Wald test and the adjustments thereof.

meff requests that the meff measure of misspecification effects be displayed. See [R] svymean for details.
meft requests that the meft measure of misspecification effects be displayed. See [R] svymean for details.
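The fpc() coding convention (values at or below 1 are sampling rates; values at or above n_h are population PSU counts) can be made concrete with a small Python sketch. This is our own illustration of the rule stated above, not Stata code, and the function name is hypothetical:

```python
def resolve_fpc(value, n_h):
    """Interpret one stratum's fpc() value, following the convention
    described above: values <= 1 are sampling rates f_h = n_h/N_h,
    and values >= n_h are the population PSU count N_h itself.
    Returns N_h."""
    if value <= 0:
        raise ValueError("fpc value must be positive")
    if value <= 1:
        return n_h / value      # rate f_h = n_h/N_h, so N_h = n_h/f_h
    if value >= n_h:
        return float(value)     # the value already is N_h
    raise ValueError("fpc value strictly between 1 and n_h is ambiguous")
```

For example, with 2 sampled PSUs, a recorded rate of 0.5 and a recorded count of 10 resolve to population PSU counts of 4 and 10, respectively.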
The following options can be specified initially or when redisplaying results:
or (svylogit), rrr (svymlogit), irr (svypois), and eform (after any of the commands) report the estimated coefficients transformed to, respectively, odds ratios, relative risk ratios, incidence rate ratios, and generic exponentiated coefficients; i.e., exp(b) is displayed rather than b. Standard errors and confidence intervals are similarly transformed.

level(#) specifies the confidence level (i.e., nominal coverage rate), in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

prob requests that the t statistic and p-value be displayed. The degrees of freedom for the t statistic are d = total number of sampled PSUs minus the total number of strata (regardless of the number of terms in the model). If no display options are specified, then, by default, the t statistic and p-value are displayed.

ci requests that confidence intervals be displayed. If no display options are specified, then, by default, confidence intervals are displayed.

deff requests that the design-effect measure deff be displayed; see [R] svymean for details.

deft requests that the design-effect measure deft be displayed; see [R] svymean for details.
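Numerically, exponentiated reporting is just a transform of the coefficient table: the point estimate becomes exp(b), the standard error follows by the delta method, and the confidence limits are the exponentiated endpoints. A Python sketch (our illustration, not Stata's code), using the height coefficient from the svylogit example later in this entry and a t critical value of roughly 2.0395 for 31 design degrees of freedom:

```python
import math

def eform(b, se_b, t_crit):
    """Exponentiated-form report for one coefficient: point estimate
    exp(b), delta-method standard error exp(b)*se_b, and confidence
    limits exp(b - t_crit*se_b) and exp(b + t_crit*se_b)."""
    point = math.exp(b)
    return point, point * se_b, math.exp(b - t_crit * se_b), math.exp(b + t_crit * se_b)

# height in the svylogit example: b = -.0325996, se = .0058727;
# design df = 62 PSUs - 31 strata = 31, so t_crit is about 2.0395
point, se, lo, hi = eform(-0.0325996, 0.0058727, 2.0395)
```

This reproduces the .967926 odds ratio and .0056843 standard error shown for height in the `svylogit, or` output below.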
The following options apply to only one or some of the commands:

offset(varname) specifies that varname is to be included in the model with coefficient constrained to be 1.

exposure(varname) (svypois only) is equivalent to specifying offset(ln(varname)).

asis; see [R] probit.

basecategory(#) (svymlogit only) specifies the value of the dependent variable that is to be treated as the base category. The default is to choose the most frequent category.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

maximize_options control the maximization process; see [R] maximize. You may want to specify the log option when estimating models on large datasets to view the progress of the maximum-likelihood estimation steps. You should never have to specify the other maximize_options.
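The Wald-test adjustment described under noadjust is simple arithmetic on the test statistic and degrees of freedom. A Python sketch of the two conventions (our illustration, not Stata code; the W value below is an arbitrary stand-in):

```python
def adjusted_wald(W, k, n_psu, n_strata):
    """Default svy test: (d - k + 1)W/(kd) ~ F(k, d - k + 1),
    where d = total sampled PSUs minus total strata."""
    d = n_psu - n_strata
    return (d - k + 1) * W / (k * d), k, d - k + 1

def unadjusted_wald(W, k, n_psu, n_strata):
    """With the noadjust option: W/k ~ F(k, d)."""
    d = n_psu - n_strata
    return W / k, k, d

# 9 model terms, 62 PSUs, 31 strata, as in the svyreg example
F, dfn, dfd = adjusted_wald(W=100.0, k=9, n_psu=62, n_strata=31)
```

With 31 strata and 62 PSUs, a 9-term model gives d = 31 and an F(9, 23) reference distribution, which matches the F(9, 23) header reported in the svyreg example in the Remarks section.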
Options for predict

See the entry for the nonsurvey version of the command for a description of the corresponding options for predict. Note that some of the options for predict after the nonsurvey version may not be available after the svy version.
Remarks

We illustrate a few of the svy estimation commands with some examples.

Example

We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. First, we set the strata, psu, and pweight variables.

. svyset pweight leadwt
. svyset strata stratid
. svyset psu psuid
Once the strata, psu, and pweight variables are set, we can use svyreg just as we would use regress with nonsurvey data. See [R] svyset for details on setting, unsetting, and displaying these variables.
. svyreg loglead age female black orace region2-region4 smsa1 smsa2

Survey linear regression

pweight:  leadwt                          Number of obs    =      4948
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.129e+08
                                          F(   9,     23)  =    134.62
                                          Prob > F         =    0.0000
                                          R-squared        =    0.2443

 loglead |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
     age |  .0028425   .0004282     6.64   0.000      .0019691    .0037159
  female | -.3641964   .0112612   -32.34   0.000     -.3871637   -.3412291
   black |  .1462126   .0277811     5.26   0.000      .0895527    .2028725
   orace | -.0754489   .0370151    -2.04   0.050     -.1509418    .0000439
 region2 | -.0206953   .0456639    -0.45   0.654     -.1138274    .0724369
 region3 | -.1272898   .0528061    -2.41   0.022     -.2349586   -.0195611
 region4 | -.0374591   .0422001    -0.89   0.382     -.1235268    .0486085
   smsa1 |  .1038586   .0432539     2.40   0.023      .0156417    .1920755
   smsa2 |  .0995961   .0365985     2.72   0.011      .0249129    .1741993
   _cons |  2.623901   .0421096    62.31   0.000      2.538018    2.709784
If we wish to test joint hypotheses after the regression, we can use the svytest command; see [R] svytest.

Running logistic regressions with svylogit is as simple as running the logit command. Note that, just like logit, the dependent variable should be a 0/1 variable (or, more precisely, a zero/nonzero variable).
Example

. svylogit highbp height weight age age2 female black

Survey logistic regression

pweight:  finalwgt                        Number of obs    =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08
                                          F(   6,     26)  =     87.70
                                          Prob > F         =    0.0000

  highbp |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
  height | -.0325996   .0058727    -5.55   0.000     -.0445771   -.0206222
  weight |   .049074   .0031966    15.35   0.000      .0425545    .0555936
     age |  .1541151   .0208709     7.38   0.000      .1115486    .1966815
    age2 | -.0010746   .0002025    -5.31   0.000     -.0014877   -.0006616
  female |  -.356497   .0885354    -4.03   0.000      -.537066    -.175928
   black |  .3429301   .1409005     2.43   0.021      .0555615    .6302986
   _cons |  -4.89574   1.159135    -4.22   0.000     -7.259813   -2.531668

We can redisplay the results expressed as odds ratios.
. svylogit, or

Survey logistic regression

pweight:  finalwgt                        Number of obs    =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08
                                          F(   6,     26)  =     87.70
                                          Prob > F         =    0.0000

  highbp | Odds Ratio   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+-----------------------------------------------------------------
  height |    .967926   .0056843    -5.55   0.000      .9564019     .979589
  weight |   1.050298   .0033574    15.35   0.000      1.043473    1.057168
     age |   1.166625   .0243485     7.38   0.000      1.118008    1.217356
    age2 |    .998926   .0005023    -5.31   0.000      .9985135    .9993386
  female |   .7001246   .0619858    -4.03   0.000      .5844605    .8386784
   black |    1.40907   .1985388     2.43   0.021      1.057134    1.878171
svylc can be used to estimate the sum of the coefficients for female and black.

. svylc female + black

 ( 1)  female + black = 0.0

  highbp |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+-----------------------------------------------------------------
     (1) |  -.0135669   .1653936    -0.08   0.935     -.3508894    .3237555

This result is more easily interpreted as an odds ratio.

. svylc female + black, or

 ( 1)  female + black = 0.0

  highbp | Odds Ratio   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+-----------------------------------------------------------------
     (1) |   .9865247   .1631648    -0.08   0.935      .7040616    1.382309
The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for our reference category of nonblack males (controlling for height, weight, and age). You now know enough to use svylc for odds ratios; see [R] svylc for its other uses; see [R] lincom for more examples of odds ratios.
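What svylc computes here can be reproduced by hand from the coefficient table and covariance matrix. A Python sketch (our illustration, not Stata code; the off-diagonal covariance below is backed out from the reported standard errors so the example is self-consistent, it is not a value printed in this entry):

```python
import math

def lincom_sum(b1, b2, v11, v22, v12):
    """Point estimate and standard error of b1 + b2:
    Var(b1 + b2) = v11 + v22 + 2*v12."""
    est = b1 + b2
    return est, math.sqrt(v11 + v22 + 2.0 * v12)

# female and black coefficients and standard errors from the svylogit
# example; v12 is an inferred covariance, not a reported number
est, se = lincom_sum(-0.356497, 0.3429301,
                     0.0885354**2, 0.1409005**2, -0.0001682)
odds_ratio = math.exp(est)        # roughly .9865247, as reported
or_se = odds_ratio * se           # delta-method standard error
```

The exponentiated estimate and its delta-method standard error match the `svylc female + black, or` output above.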
Example

To estimate a model for a subpopulation, the subpop() option is used:

. svylogit highlead age female, subpop(black) or
Note: 1 stratum omitted because it contains no subpopulation members

Survey logistic regression

pweight:  finalwgt                        Number of obs    =      4784
Strata:   stratid                         Number of strata =        30
PSU:      psuid                           Number of PSUs   =        60
                                          Population size  =  54538576
Subpopulation no. of obs =     506        F(   2,     29)  =     13.78
Subpopulation size       = 5199006        Prob > F         =    0.0001

highlead | Odds Ratio   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+-----------------------------------------------------------------
     age |   1.014808   .0079836     1.87   0.071      .9986331    1.031244
  female |   .0269989   .0195109    -5.00   0.000      .0061714     .118115
Note that this time we specified the or option when we first issued the command.
The subpop(varname) option takes a 0/1 variable; the subpopulation of interest is defined by varname = 1. All other members of the sample not in the subpopulation are indicated by varname = 0. If a person's subpopulation status is unknown, then varname should be set to missing ('.'), and those observations will be omitted from the analysis, as they should be. For instance, in the preceding example, if a person's race is unknown, race should be coded as missing rather than as nonblack (race = 0).

Note that using 'if black==1' to model the subpopulation would not give the same result. All the discussion about the use of if and in in [R] svymean applies to the variance estimates for svyreg, svylogit, and svyprobit as well.
Technical Note

Actually, the subpop(varname) option takes a zero/nonzero variable; the subpopulation of interest is defined by varname ≠ 0 and not missing. All other members of the sample not in the subpopulation are indicated by varname = 0. But 0, 1, and missing are typically the only values used for the subpop() variable.
Example

In the NHANES II dataset, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. As predictors, we use basic demographic variables: female (1 if female, 0 if male), black (1 if black, 0 otherwise), age, and age2 (= age^2):
Saved Results

The svy estimation commands save in e():

Scalars
    e(N)          number of observations m
    e(N_strata)   number of strata L
    e(N_psu)      number of sampled PSUs n
    e(N_pop)      estimate of population size M
    e(N_subpop)   estimate of subpopulation size
    e(N_sub)      subpopulation number of observations
    e(df_m)       model degrees of freedom
    e(df_r)       variance degrees of freedom = n - L
    e(F)          model F statistic
    e(k_cat)      number of categories (svymlogit, svyologit, svyoprobit)
    e(basecat)    base category value of dependent variable (svymlogit)
    e(ibasecat)   base category number (svymlogit)

Macros
    e(cmd)        command name (e.g., svyreg)
    e(depvar)     dependent variable name
    e(wtype)      weight type
    e(wexp)       weight variable or expression
    e(strata)     strata() variable
    e(psu)        psu() variable
    e(fpc)        fpc() variable
    e(offset)     offset() variable
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers (svyintreg, svymlogit, svypois)

Matrices
    e(b)          vector of estimates
    e(V)          design-based (co)variance estimates
    e(V_srs)      simple-random-sampling-without-replacement (co)variance
    e(V_srswr)    simple-random-sampling-with-replacement (co)variance
                  (only created when the fpc() option is specified)
    e(deff)       vector of deff estimates
    e(deft)       vector of deft estimates
    e(cat)        vector of category values (svymlogit, svyologit, svyoprobit)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

All of the svy estimators are implemented as ado-files that call _robust; see [P] _robust.

These commands use a variant on the basic weighted-point-estimation methods used by svytotal. They use "linearization"-based variance estimators that are natural extensions of the variance estimator used in svytotal. For general methodological background on regression and generalized-linear-model analyses of complex survey data, see, for example, Binder (1983), Cochran (1977), Fuller (1975), Godambe (1991), Kish and Frankel (1974), Sarndal et al. (1992), Shao (1996), and Skinner (1989).

The notation and development presented below is adapted from Binder (1983). We use here the same notation as in the Methods and Formulas section of [R] svymean; that section should be read first.
Linear regression

We let (h,i,j) index the elements in the population, where h = 1, ..., L are the strata, i = 1, ..., N_h are the PSUs in stratum h, and j = 1, ..., M_hi are the elements in PSU (h,i). The regression coefficients β = (β_0, β_1, ..., β_k) are viewed as fixed finite-population parameters that we wish to estimate. These parameters are defined with respect to an outcome variable Y_hij and a (k+1)-dimensional row vector of explanatory variables X_hij = (X_hij0, ..., X_hijk). As in nonsurvey work, we often have X_hij0 identically equal to unity, so that β_0 is an intercept coefficient. Within a finite-population context, we can formally define the regression coefficient vector β as the solution to the vector estimating equation
    G(\beta) = X'Y - X'X\beta = 0        (1)
where Y is the vector of outcomes for the full population and X is the matrix of explanatory variables for the full population. Assuming (X'X)^{-1} exists, the solution to (1) is β = (X'X)^{-1}X'Y.

Given observations (y_hij, x_hij) collected through a complex sample design, we need to estimate β in a way that accounts for the sample design. To do this, note that the matrix factors X'X and X'Y can be viewed as matrix population totals; for example, X'Y = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\sum_{j=1}^{M_{hi}} X'_{hij} Y_{hij}. Thus, we estimate X'X and X'Y with the weighted estimators

    \widehat{X'X} = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, x'_{hij} x_{hij}

and

    \widehat{X'Y} = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, x'_{hij} y_{hij}

where X_s is the matrix of explanatory variables for the sample, Y_s is the outcome vector for the sample, and W = diag(w_hij) is a diagonal matrix containing the sampling weights w_hij. The corresponding coefficient estimator is

    \hat\beta = (\widehat{X'X})^{-1}\widehat{X'Y} = (X_s'WX_s)^{-1}X_s'WY_s        (2)
Note that equation (2) is what the regress command with aweights or iweights computes for point estimates.

The coefficient estimator β̂ can also be defined as the solution to the weighted sample estimating equation

    \hat{G}(\beta) = \widehat{X'Y} - \widehat{X'X}\beta = X_s'WY_s - X_s'WX_s\beta = 0

We can write Ĝ(β) as

    \hat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij} d_{hij}        (3)

where d_hij = x'_hij e_hij and e_hij = y_hij - x_hij β is the regression residual associated with sample unit (h,i,j). Thus, Ĝ(β) can be viewed as a special case of a total estimator.

Our variance estimator for β̂ is based on the following "linearization" argument. A first-order Taylor expansion shows that
    \hat\beta - \beta \approx (\widehat{X'X})^{-1}\hat{G}(\beta)

Thus, our variance estimator for β̂ is

    \hat{V}(\hat\beta) = (\widehat{X'X})^{-1}\,\hat{V}\{\hat{G}(\beta)\}\big|_{\beta=\hat\beta}\,(\widehat{X'X})^{-1}
Viewing Ĝ(β) as a total estimator according to equation (3), the variance estimator V̂{Ĝ(β)}|_{β=β̂} can be computed using equation (3) from [R] svymean with y_hij replaced by d_hij and with β̂ used to estimate e_hij.

Pseudo-maximum-likelihood estimators
To develop notation for our pseudo-maximum-likelihood estimators, suppose that we observed (Y_hij, X_hij) for the entire population and that (Y_hij, X_hij) arose from a certain likelihood model (e.g., a logistic distribution). Let l(β; Y_hij, X_hij) be the associated "log likelihood" under this model. Then, for our finite population, we define the parameter β by the vector estimating equation
    G(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\sum_{j=1}^{M_{hi}} S(\beta;\, Y_{hij}, X_{hij}) = 0

where S = ∂l/∂β is the score vector; i.e., the first derivative of l(β; Y_hij, X_hij) with respect to β. Then, the "pseudo-maximum-likelihood" estimator β̂ is the solution to the weighted sample estimating equation
    \hat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, S(\beta;\, y_{hij}, x_{hij}) = 0        (4)

Note that the solution β̂ of equation (4) is what the nonsurvey version of the command with iweights produces for point estimates. Again, we use a first-order matrix Taylor-series expansion to produce the variance estimator for β̂:
    \hat{V}(\hat\beta) = \hat{H}^{-1}\,\hat{V}\{\hat{G}(\beta)\}\big|_{\beta=\hat\beta}\,\hat{H}^{-1}

where Ĥ is the Hessian of the weighted sample log likelihood. We can write Ĝ(β) as

    \hat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij} d_{hij}

where d_hij = s_hij x_hij and s_hij is the score index for element (h,i,j). The term s_hij is computed by rewriting the sample log likelihood l(β; y_hij, x_hij) as a function of x_hij β.
Thus, again, Ĝ(β) can be viewed as a special case of a total estimator, and the variance estimator V̂{Ĝ(β)}|_{β=β̂} is computed using equation (3) from [R] svymean with y_hij replaced by d_hij and with β̂ used to estimate s_hij.
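For the linear-regression case, the point-estimation and linearization steps above can be sketched in a few lines of NumPy. This is our illustrative translation of equations (2) and (3) for a one-stage design with PSUs treated as sampled with replacement within strata and no fpc; it is not Stata's implementation:

```python
import numpy as np

def svy_linearized_reg(y, X, w, strata, psu):
    """Sketch of the svyreg formulas: weighted point estimates
    (equation (2)) and the linearization (sandwich) variance, with
    the score-total variance taken between PSUs within strata."""
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))   # equation (2)
    e = y - X @ beta                              # residuals e_hij
    d = X * (w * e)[:, None]                      # rows are w_hij * d_hij
    k = X.shape[1]
    Vg = np.zeros((k, k))                         # estimate of V{G(beta)}
    for h in np.unique(strata):
        in_h = strata == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        # per-PSU totals of the weighted score contributions
        t = np.array([d[in_h & (psu == p)].sum(axis=0) for p in psus])
        dev = t - t.mean(axis=0)
        Vg += n_h / (n_h - 1.0) * dev.T @ dev
    A = np.linalg.inv(XtWX)
    return beta, A @ Vg @ A                       # sandwich variance
```

With all weights equal to 1, the point estimates reduce to ordinary least squares, while the variance still reflects the stratified, clustered design.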
Acknowledgments

The svyreg, svylogit, and svyprobit commands were developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University. We thank him for his invaluable assistance.

We thank Wayne Johnson of the National Center for Health Statistics for providing the NHANES II dataset.
References
Binder, D. A. 1983. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51: 279-292.
Cochran, W. G. 1977. Sampling Techniques. 3d ed. New York: John Wiley & Sons.
Eltinge, J. L. and W. M. Sribney. 1996. svy4: Linear, logistic, and probit regressions for survey data. Stata Technical Bulletin 31: 26-31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 239-245.
Fuller, W. A. 1975. Regression analysis for sample survey. Sankhya, Series C 37: 117-132.
Godambe, V. P., ed. 1991. Estimating Functions. Oxford: Clarendon Press.
Gonzalez, J. F., Jr., N. Krauss, and C. Scott. 1992. Estimation in the 1988 National Maternal and Infant Health Survey. In Proceedings of the Section on Statistics Education, American Statistical Association, 343-348.
Johnson, W. 1995. Variance estimation for the NMIHS. Technical document. National Center for Health Statistics, Hyattsville, MD.
Kish, L. and M. R. Frankel. 1974. Inference from complex samples. Journal of the Royal Statistical Society B 36: 1-37.
Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270-276.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Sarndal, C.-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag.
Shao, J. 1996. Resampling methods for sample surveys (with discussion). Statistics 27: 203-254.
Skinner, C. J. 1989. Introduction to Part A. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and T. M. F. Smith, 23-58. New York: John Wiley & Sons.
Also See
Complementary:  [R] adjust, [R] constraint, [R] mfx, [R] svydes, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest, [R] testnl
Related:        [P] _robust
Background:     [U] 30 Overview of survey estimation, [R] svy
Title svydes — Describe survey data
Syntax

svydes [varlist] [weight] [if exp] [in range] [, strata(varname) psu(varname) fpc(varname) bypsu ]

pweights and iweights are allowed; see [U] 14.1.6 weight.
Description svydes displays a table that describes the strata and the primary sampling units for sample survey data.
Options

strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers. strata() can also be specified with the svyset command; see [R] svyset.

psu(varname) specifies the name of a variable (numeric or string) that contains identifiers for the primary sampling unit (i.e., the cluster). psu() can also be specified with the svyset command.

fpc(varname) can be set here or with the svyset command. If an fpc variable has been specified, svydes checks the fpc variable for missing values. Other than this, svydes does not use the fpc variable. See [R] svymean for details on fpc.

bypsu specifies that results be displayed for each PSU in the dataset; that is, a separate line of output is produced for every PSU. This option can only be used when a PSU variable has been specified using the psu() option or set with svyset.

Note: Weights are checked for missing values, but are not otherwise used by svydes.
Remarks

Sample-survey data are typically stratified. Within each stratum, there are primary sampling units (PSUs), which may be either clusters of observations or individual observations.

svydes displays a table that describes the strata and PSUs in the dataset. By default, one row of the table is produced for each stratum. Displayed for each stratum are the number of PSUs, the range and mean of the number of observations per PSU, and the total number of observations. If the bypsu option is specified, svydes will display the number of observations in each PSU for every PSU in the dataset.

If a varlist is specified, svydes will report the number of PSUs that contain at least one observation with complete data (i.e., no missing values) for all variables in the varlist. These are precisely the PSUs that would be used to compute estimates for the variables in varlist using the svy estimation commands: svymean, svytotal, svyratio, svyprop, svytab, or any of the commands described in [R] svy estimators.
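The per-stratum summary that svydes prints is a simple tabulation. A Python sketch of the core bookkeeping (our illustration, not Stata code; the function name is ours):

```python
from collections import Counter, defaultdict

def describe_strata(strata, psu):
    """For each stratum, count PSUs and summarize observations per
    PSU, mirroring svydes's default table:
    stratum -> (#PSUs, #Obs, min, mean, max)."""
    per_psu = Counter(zip(strata, psu))      # (stratum, psu) -> #obs
    by_stratum = defaultdict(list)
    for (h, _p), n in per_psu.items():
        by_stratum[h].append(n)
    return {h: (len(ns), sum(ns), min(ns), sum(ns) / len(ns), max(ns))
            for h, ns in by_stratum.items()}

table = describe_strata([1, 1, 1, 2, 2, 2, 2],
                        [1, 1, 2, 1, 1, 2, 2])
```

Here stratum 1 has 2 PSUs holding 2 and 1 observations, so its row is (2, 3, 1, 1.5, 2), exactly the kind of line svydes displays per stratum.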
The variance estimation formulas for the svy estimation commands require at least two PSUs per stratum. If there are some strata with only a single PSU, an error message is displayed:

. svymean x
stratum with only one PSU detected
r(499);

. svydes x
The stratum (or strata) with only one PSU can be located from the table produced by svydes x. After locating this stratum, it can be "collapsed" into an adjacent stratum, and then variance estimates can be computed. See the following examples for an illustration of the procedure. For details on the svy estimation commands, see [R] svymean and [R] svy estimators.
Example
We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. First, we set the strata, psu, and pweight variables.

. svyset strata stratid
. svyset psu psuid
. svyset pweight finalwgt
Typing svydes will show us the strata and PSU arrangement of the dataset.

. svydes

pweight:  finalwgt
Strata:   stratid
PSU:      psuid

                                   #Obs per PSU
  Strata                ------------------------------
 stratid    #PSUs    #Obs      min      mean      max
 -------    -----   ------    -----    ------   -----
       1        2      380      165     190.0     215
       2        2      185       67      92.5     118
       3        2      348      149     174.0     199
       4        2      460      229     230.0     231
       5        2      252      105     126.0     147
       6        2      298      131     149.0     167
       7        2      476      206     238.0     270
       8        2      338      158     169.0     180
       9        2      244      100     122.0     144
      10        2      262      119     131.0     143
      11        2      275      120     137.5     155
      12        2      314      144     157.0     170
      13        2      342      154     171.0     188
      14        2      405      200     202.5     205
      15        2      380      189     190.0     191
      16        2      336      159     168.0     177
      17        2      393      180     196.5     213
      18        2      359      144     179.5     215
      20        2      285      125     142.5     160
      21        2      214      102     107.0     112
      22        2      301      128     150.5     173
      23        2      341      159     170.5     182
      24        2      438      205     219.0     233
      25        2      256      116     128.0     140
      26        2      261      129     130.5     132
      27        2      283      139     141.5     144
      28        2      299      136     149.5     163
      29        2      503      215     251.5     288
      30        2      365      166     182.5     199
      31        2      308      143     154.0     165
      32        2      450      211     225.0     239
 -------    -----   ------    -----    ------   -----
      31       62    10351       67     167.0     288
Our NHANES II dataset has 31 strata (stratum 19 is missing) and 2 PSUs per stratum.

The variable hdresult contains serum levels of high-density lipoproteins (HDL). If we try to estimate the mean of hdresult, we get an error.

. svymean hdresult
stratum with only one PSU detected
r(460);
Running svydes with hdresult as its varlist will show us which stratum or strata have only one PSU.

. svydes hdresult

pweight:  finalwgt
Strata:   stratid
PSU:      psuid

                               #Obs with   #Obs with       #Obs per included PSU
  Strata     #PSUs    #PSUs    complete    missing      ----------------------------
 stratid  included  omitted      data        data        min      mean        max
 -------  --------  -------    --------    --------    -----    -------     -----
       1        1*        1         114         266      114      114.0       114
       2        1*        1          98          87       98       98.0        98
       3        2         0         277          71      116      138.5       161
       4        2         0         340         120      160      170.0       180
       5        2         0         173          79       81       86.5        92
       6        2         0         255          43      116      127.5       139
       7        2         0         409          67      191      204.5       218
       8        2         0         299          39      129      149.5       170
       9        2         0         218          26       85      109.0       133
      10        2         0         233          29      103      116.5       130
      11        2         0         238          37       97      119.0       141
      12        2         0         275          39      121      137.5       154
      13        2         0         297          45      123      148.5       174
      14        2         0         355          50      167      177.5       188
      15        2         0         329          51      151      164.5       178
      16        2         0         280          56      134      140.0       146
      17        2         0         352          41      155      176.0       197
      18        2         0         335          24      135      167.5       200
      20        2         0         240          45       95      120.0       145
      21        2         0         198          16       91       99.0       107
      22        2         0         263          38      116      131.5       147
      23        2         0         304          37      143      152.0       161
      24        2         0         388          50      182      194.0       206
      25        2         0         239          17      106      119.5       133
      26        2         0         240          21      119      120.0       121
      27        2         0         259          24      127      129.5       132
      28        2         0         284          15      131      142.0       153
      29        2         0         440          63      193      220.0       247
      30        2         0         326          39      147      163.0       179
      31        2         0         279          29      121      139.5       158
      32        2         0         383          67      180      191.5       203
 -------  --------  -------    --------    --------    -----    -------     -----
      31        60        2        8720        1631       81      145.3       247
                               --------------------
                                       10351
Both of stratid = 1 and stratid = 2 have only one PSU with nonmissing values of hdresult. Since this dataset has only 62 PSUs, the bypsu option will give a manageable amount of output:
. svydes hdresult, bypsu

pweight:  finalwgt
Strata:   stratid
PSU:      psuid

                       #Obs with   #Obs with
  Strata      PSU      complete    missing
 stratid    psuid        data        data
 -------    -----      --------    --------
       1        1             0         215
       1        2           114          51
       2        1            98          20
       2        2             0          67
       3        1           161          38
       3        2           116          33
  (output omitted)
      32        1           180          59
      32        2           203           8
 -------    -----      --------    --------
      31       62          8720        1631
                       --------------------
                               10351
It is rather striking that there are two PSUs without any values for hdresult. All other PSUs have only a moderate number of missing values. Obviously, in a case such as this, a data analyst should first try to ascertain the reason why these data are missing. The answer here (Johnson 1995) is that HDL measurements could not be collected until the third survey location. Thus, there are no hdresult data for the first two locations: stratid = 1, psuid = 1 and stratid = 2, psuid = 2.

Assuming that we wish to go ahead and analyze the hdresult data, we must "collapse" strata; that is, we must merge strata together so that every stratum has at least two PSUs with some nonmissing values. We can accomplish this by collapsing stratid = 1 into stratid = 2.

To perform the stratum collapse, we create a new strata identifier newstr and a new PSU identifier newpsu. This is easy to do using basic commands in Stata.

. gen newstr = stratid
. gen newpsu = psuid
. replace newpsu = psuid + 2 if stratid==1
(380 real changes made)
. replace newstr = 2 if stratid==1
(380 real changes made)
We set the new strata and PSU variables.

. svyset strata newstr
. svyset psu newpsu
We use svydes to check what we have done.
. svydes hdresult, bypsu

pweight:  finalwgt
Strata:   newstr
PSU:      newpsu

                                  #Obs with   #Obs with
            Strata       PSU       complete    missing
            newstr    newpsu          data        data
          ----------------------------------------------
                 2         1            98          20
                 2         2             0          67
                 2         3             0         215
                 2         4           114          51
                 3         1           161          38
                 3         2           116          33
             (output omitted)
                32         1           180          59
                32         2           203           8
          ----------------------------------------------
                30        62          8720        1631

                                     10351
The new stratum, newstr = 2, has 4 PSUs, 2 of which contain some nonmissing values of hdresult. This is sufficient to allow us to estimate the mean of hdresult.

. svymean hdresult

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      8720
Strata:   newstr                          Number of strata =        30
PSU:      newpsu                          Number of PSUs   =        60
                                          Population size  =  98725345

Mean        Estimate    Std. Err.    [95% Conf. Interval]        Deff
hdresult    49.67141     .3830147     48.88919   50.45364     6.257131
Methods and Formulas

svydes is implemented as an ado-file.
References

Eltinge, J. L. and W. M. Sribney. 1996. svy3: Describing survey data: sampling design and missing data. Stata Technical Bulletin 31: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 235-239.

Johnson, C. L. 1995. Personal communication.

McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Also See

Complementary:  [R] svy estimators, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest
Background:     [U] 30 Overview of survey estimation, [R] svy
Title

svylc — Estimate linear combinations after survey estimation

Syntax

    svylc [exp] [, show or irr rrr eform level(#) deff deft meff meft ]
Description

svylc produces estimates for linear combinations of parameters after a svy estimation command; i.e., any of the commands svymean, svytotal, svyratio, or the commands described in [R] svy estimators. Estimating differences of subpopulation means, for example, can be done by running svymean with a by() option, and then running svylc. The svylc command computes estimates of linear combinations of parameters, whether means, totals, ratios, proportions, or regression coefficients. Used after svylogit, it will compute odds ratios for any covariate group relative to another. After svymlogit, it can compute relative risk ratios, and after svypois, it can compute incidence rate ratios. svylc is the equivalent of lincom for survey data. See [R] lincom for a thorough coverage of odds ratios.
Options

show requests that the labeling syntax for the previous svy estimates be displayed. This is useful when the svy estimation command produced estimates for subpopulations using the by() option. When show is specified, no expression exp is specified.

or, irr, rrr, and eform all do the same thing; they all report coefficient estimates as exp(b) rather than b. Standard errors and confidence intervals are similarly transformed. The only difference is how the output is labeled: or gives "Odds Ratio" (appropriate for svylogit), rrr gives "Relative Rate Ratio" (appropriate for svymlogit), irr gives "Incidence Rate Ratio" (appropriate for svypois), and eform gives a generic label.

level(#) specifies the confidence level (i.e., nominal coverage rate), in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

deff requests that the design-effect measure deff be displayed.

deft requests that the design-effect measure deft be displayed. See [R] svymean for a discussion of deff and deft.

meff requests that the meff measure of misspecification effects be displayed.

meft requests that the meft measure of misspecification effects be displayed. See [R] svymean for a discussion of meff and meft.
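The exp(b) transformation these options apply can be illustrated with a short Python sketch. This is not Stata's code, and the coefficient, standard error, and t critical value below are made up: the point estimate and confidence limits are exponentiated, and the reported standard error is the delta-method value exp(b)*se.

```python
import math

def eform(b, se, t_crit):
    """Report a coefficient b on the exp(b) scale, as or/irr/rrr/eform do."""
    est = math.exp(b)
    return {
        "estimate": est,
        "std_err": est * se,                # delta-method standard error
        "ci": (math.exp(b - t_crit * se),   # exponentiate the CI limits
               math.exp(b + t_crit * se)),
    }

# hypothetical coefficient b = 0.35 with se = 0.14 and t critical value 2.04
res = eform(0.35, 0.14, 2.04)
```

Note that the transformed confidence interval is not symmetric about the transformed estimate, which is why the limits are exponentiated directly rather than rebuilt from the delta-method standard error.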
Remarks

Remarks are presented under the headings

    The use of svylc when there are no by() subpopulations
    Missing data: The complete and available options
    The use of svylc after svy model estimators
    Subpopulations with one by() variable
    Subpopulations with two or more by() variables
    The use of svylc after svyratio
By default, svylc computes the point estimate, standard error, t statistic, p-value, and confidence interval for the specified linear combination. Design effects (deff and deft) and misspecification effects (meff and meft) can be optionally displayed; see [R] svymean for a detailed description of these options.
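The default computation can be sketched in a few lines of Python. This is not Stata code; it assumes the point estimates and their full covariance matrix are given, and all inputs below are hypothetical.

```python
import math

def lincom(c, theta, V, t_crit):
    """Point estimate, standard error, t statistic, and confidence
    interval for the linear combination c'theta, given the covariance
    matrix V of the estimates."""
    est = sum(ci * th for ci, th in zip(c, theta))
    var = sum(c[i] * V[i][j] * c[j]            # quadratic form c'Vc
              for i in range(len(c)) for j in range(len(c)))
    se = math.sqrt(var)
    return est, se, est / se, (est - t_crit * se, est + t_crit * se)

# difference of two hypothetical means with a 2 x 2 covariance matrix
est, se, t, ci = lincom([1, -1], [5.0, 3.0],
                        [[0.04, 0.01], [0.01, 0.09]], 2.04)
```

The off-diagonal covariance matters: it is exactly the term that is unavailable when the two estimates were computed from different sets of observations, which is why available-case estimation can block this computation.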
The use of svylc when there are no by() subpopulations

> Example

We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. Suppose that we wish to estimate the difference of the means of systolic (variable bpsystol) and diastolic (variable bpdiast) blood pressures. First, we estimate the means, and then we use svylc.

. svymean bpsystol bpdiast

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Mean        Estimate    Std. Err.    [95% Conf. Interval]        Deff
bpsystol    126.9458     .603462      125.715   128.1766      8.230475
bpdiast     81.01726    .5090314     79.97909   82.05544      16.38656
. svylc bpsystol - bpdiast

 ( 1)  bpsystol - bpdiast = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        45.92852    .2988395    153.69    0.000     45.31903    46.53801
We can also specify any of the options deff, deft, meff, or meft, or change the confidence level (i.e., nominal coverage rate) of the confidence interval.
. svylc bpsystol - bpdiast, level(90) deff meff

 ( 1)  bpsystol - bpdiast = 0.0
                                          Number of obs    =     10351
                                          Number of strata =        31
                                          Number of PSUs   =        62
                                          Population size  = 1.172e+08

Mean        Estimate    Std. Err.        t     P>|t|    [90% Conf. Interval]
 (1)        45.92852    .2988395    153.69    0.000     45.42183    46.43521

Mean            Deff        Meff
 (1)        5.835532    3.087148
svylc works in the same manner after using the subpop option.

. svymean bpsystol bpdiast, subpop(female)

Survey mean estimation

pweight:  finalwgt
Strata:   strata
PSU:      psu
Subpop.:  female==1

Mean        Estimate    Std. Err.    [95% Conf. Interval]        Deff
bpsystol    124.2027    .7051858     122.7644   125.6409      5.162487
bpdiast     79.05227    .5207306     78.01023   80.09431      8.973799
. svylc bpsystol - bpdiast

 ( 1)  bpsystol - bpdiast = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        45.17039    .4040852    111.78    0.000     44.34625    45.99453
Missing data: The complete and available options

The svymean, svytotal, and svyratio commands can handle missing data in two ways; see [R] svymean. The available option (which is the default when there are missing values and two or more variables) uses every available nonmissing value for each variable separately. The complete option (which is the default when there are no missing values or only one variable) uses only those observations with nonmissing values for all variables in the varlist. Here is an example where available is the default:
> Example

. svymean tcresult tgresult

Survey mean estimation

pweight:  finalwgt                        Number of obs(*) =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Mean        Estimate    Std. Err.    [95% Conf. Interval]        Deff
tcresult    213.0977    1.127252     210.7986   215.3967      5.602499
tgresult     138.576    2.071934     134.3503   142.8018      2.356968

(*) Some variables contain missing values.
We redisplay the results using the obs option to see how many observations were used for each estimate.

. svymean, obs

Survey mean estimation

pweight:  finalwgt                        Number of obs(*) =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Mean        Estimate    Std. Err.         Obs
tcresult    213.0977    1.127252        10351
tgresult     138.576    2.071934         5050

(*) Some variables contain missing values.
Because we estimated the mean of tgresult using a different set of observations than tcresult, we could not compute the covariance between the two, and hence, we cannot estimate the variance of the difference. So if we now ask svylc to estimate this difference, we get an error message.

. svylc tcresult - tgresult
must run svy command with "complete" option before using this command
r(301);
We need not know in advance whether the default was available or complete; svylc will always tell us if we need to run the estimation command again with the complete option. With the complete option, only those observations that have nonmissing values for both tgresult and tcresult are used when we compute the means. When this option is specified, svymean computes the full covariance matrix.
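The distinction between the two rules can be sketched in Python with made-up data, with None standing in for Stata's missing value:

```python
# Four observations; each variable has one missing value (None).
data = {"tcresult": [210, None, 190, 250],
        "tgresult": [130, 150, None, 140]}
n = 4

# "available": every nonmissing value is used, variable by variable
available_n = {k: sum(x is not None for x in v) for k, v in data.items()}

# "complete": only rows nonmissing for ALL variables are used
complete_rows = [i for i in range(n)
                 if all(v[i] is not None for v in data.values())]
```

Here each variable has three usable values under available, but only two observations (the first and last) survive under complete; only the latter set supports a covariance between the two means.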
> Example

. svymean tcresult tgresult, complete

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      5050
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  =  56820832

Mean        Estimate    Std. Err.    [95% Conf. Interval]        Deff
tcresult    211.3975    1.252274     208.8435   213.9515      3.571411
tgresult     138.576    2.071934     134.3503   142.8018      2.356968
Now we can estimate the variance of the difference.

. svylc tcresult - tgresult

 ( 1)  tcresult - tgresult = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        72.82146    2.039786     35.70    0.000     68.66129    76.98163
The use of svylc after svy model estimators

svylc can be used after any of the commands described in [R] svy estimators to estimate a linear combination of the coefficients (i.e., the b's).

Using svylc after svyreg is straightforward:
> Example

. svyreg tcresult bpsystol bpdiast age age2

Survey linear regression

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08
                                          F(   4,     28)  =    307.00
                                          Prob > F         =    0.0000
                                          R-squared        =    0.1945

tcresult       Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
bpsystol    .1060743    .0346796      3.06    0.005     .0353449    .1768038
bpdiast     .2966662    .0569894      5.21    0.000     .1804969    .4128356
age          3.35711    .2099842     15.99    0.000     2.928844    3.785375
age2       -.0247207    .0020795    -11.89    0.000    -.0289619   -.0204796
_cons        83.8242    5.649261     14.84    0.000     72.30246    95.34594

. svylc bpsystol - bpdiast

 ( 1)  bpsystol - bpdiast = 0.0

tcresult       Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -.1905919    .0818056     -2.33    0.027    -.3574354   -.0237483
Note that the svy commands in [R] svy estimators always use only complete cases so that the covariance is always computed, and svylc can always be run afterward.
> Example

The variable highbp is 1 if a person has high blood pressure and 0 otherwise. We can model it using logistic regression.

. svylogit highbp height weight age age2 female black

Survey logistic regression

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08
                                          F(   6,     26)  =     87.70
                                          Prob > F         =    0.0000

highbp         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
height     -.0325996    .0058727     -5.55    0.000    -.0445771   -.0206222
weight       .049074    .0031966     15.35    0.000     .0425545    .0555936
age         .1541151    .0208709      7.38    0.000     .1115486    .1966815
age2       -.0010746    .0002025     -5.31    0.000    -.0014877   -.0006616
female      -.356497    .0885354     -4.03    0.000     -.537066    -.175928
black       .3429301    .1409005      2.43    0.021     .0555615    .6302986
_cons       -4.89574    1.159135     -4.22    0.000    -7.259813   -2.531668
We can redisplay the results expressed as odds ratios.

. svylogit, or

Survey logistic regression

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08
                                          F(   6,     26)  =     87.70
                                          Prob > F         =    0.0000

highbp    Odds Ratio    Std. Err.        t     P>|t|    [95% Conf. Interval]
height       .967926    .0056843     -5.55    0.000     .9564019     .979589
weight      1.050298    .0033574     15.35    0.000     1.043473    1.057168
age         1.166625    .0243485      7.38    0.000     1.118008    1.217356
age2         .998926    .0002023     -5.31    0.000     .9985135    .9993386
female      .7001246    .0619858     -4.03    0.000     .5844605    .8386784
black        1.40907    .1985388      2.43    0.021     1.057134    1.878171
svylc can be used to estimate the sum of the coefficients for female and black.

. svylc female + black

 ( 1)  female + black = 0.0

highbp         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -.0135669    .1653936     -0.08    0.935    -.3508894    .3237555
This result is more easily interpreted as an odds ratio.

. svylc female + black, or

 ( 1)  female + black = 0.0

highbp    Odds Ratio    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        .9865247    .1631648     -0.08    0.935     .7040616    1.382309
The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for the reference category of nonblack males (controlling for height, weight, and age). See [R] lincom for more examples of odds ratios.
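The or display is just the exponential transform of the linear combination shown previously; a two-line Python check using the numbers from the listings above (exp of the coefficient, and the delta-method standard error exp(b)*se):

```python
import math

b, se = -0.0135669, 0.1653936   # coefficient and std. err. from the svylc listing
odds_ratio = math.exp(b)        # approx .9865247, as in the or listing
or_se = odds_ratio * se         # approx .1631648
```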
Using svylc after estimators that estimate multiple-equation models is a little trickier, but still straightforward. Users merely need to refer to the coefficients using the syntax for multiple equations; see [U] 16.5 Accessing coefficients and standard errors for a description and [R] test for examples of its use.
> Example

In the NHANES II data, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. We will do so in the next example, but we will first use svymlogit since it is a good example of a multiple-equation estimator at its simplest. So, we estimate a multinomial logistic regression model:
. svymlogit health female black age age2

Survey multinomial logistic regression

pweight:  finalwgt                        Number of obs    =     10335
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.170e+08
                                          F(  16,     16)  =     36.41
                                          Prob > F         =    0.0000

health         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]

poor
female     -.1983735    .1072747     -1.85    0.074    -.4171617    .0204147
black       .8964694    .1797728      4.99    0.000     .5298203    1.263119
age         .0990246     .032111      3.08    0.004     .0335338    .1645155
age2       -.0004749    .0003209     -1.48    0.149    -.0011294    .0001796
_cons      -5.475074    .7468576     -7.33    0.000      -6.9983   -3.951848

fair
female      .1782371    .0726556      2.45    0.020      .030055    .3264193
black       .4429445     .122667      3.61    0.001     .1927635    .6931256
age         .0024576    .0172236      0.14    0.887    -.0326702    .0375853
age2        .0002875    .0001684      1.71    0.098    -.0000559     .000631
_cons      -1.819561    .4018153     -4.53    0.000    -2.639069   -1.000053

good
female     -.0458251     .074169     -0.62    0.541    -.1970938    .1054437
black      -.7532011    .1105444     -6.81    0.000    -.9786579   -.5277443
age         -.061369     .009794     -6.27    0.000     -.081344   -.0413939
age2        .0004166    .0001077      3.87    0.001      .000197    .0006363
_cons       1.815323    .1996917      9.09    0.000     1.408049    2.222597

excellent
female      -.222799    .0754205     -2.95    0.006    -.3766202   -.0689778
black       -.991647    .1238606     -8.00    0.000    -1.244303   -.7389909
age        -.0293573    .0137789     -2.13    0.041    -.0574595    -.001255
age2       -.0000674    .0001505     -0.45    0.657    -.0003744    .0002396
_cons       1.499683     .286143      5.24    0.000     .9160909    2.083276

(Outcome health==average is the comparison group)

One might want to calculate the estimate for black females for the "excellent" category:

. svylc [excellent]female + [excellent]black

 ( 1)  [excellent]female + [excellent]black = 0.0

health         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -1.214446    .1428188     -8.50    0.000    -1.505727   -.9231652
This result might be better interpreted as a relative risk ratio. Since the estimate was negative, one could reverse signs to get a relative risk ratio that is greater than one:

. svylc -[excellent]female - [excellent]black, rrr

 ( 1)  - [excellent]female - [excellent]black = 0.0

health           RRR    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        3.368427    .4810747      8.50    0.000     2.517245    4.507429
RRR = 3.37 is the ratio of relative risk for nonblack males to black females, with "relative risk" being the probability of being in the "excellent" category divided by the probability of being in the "average" base category. Hence, this relative risk ratio is

    RRR = [Pr(excellent | nonblack male) / Pr(average | nonblack male)]
        / [Pr(excellent | black female) / Pr(average | black female)]
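Numerically, reversing the sign of the linear combination simply inverts the exponentiated ratio; a quick Python check with the estimate from the listings above:

```python
import math

b = -1.214446          # [excellent]female + [excellent]black from svylc
rrr = math.exp(-b)     # approx 3.368, matching the rrr listing
```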
We now estimate the same model using svyologit:

. svyologit health female black age age2

Survey ordered logistic regression

pweight:  finalwgt                        Number of obs    =     10335
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.170e+08
                                          F(   4,     28)  =    223.27
                                          Prob > F         =    0.0000

health         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
female     -.1615219    .0523678     -3.08    0.004    -.2683266   -.0547171
black       -.986568    .0790276    -12.48    0.000    -1.147746   -.8253901
age        -.0119491    .0082974     -1.44    0.160    -.0288717    .0049736
age2       -.0003234     .000091     -3.55    0.001     -.000509   -.0001377

/cut1      -4.566229    .1632559    -27.97    0.000    -4.899192   -4.233266
/cut2      -3.057415    .1699943    -17.99    0.000    -3.404121   -2.710709
/cut3      -1.520596    .1714341     -8.87    0.000    -1.870238   -1.170954
/cut4       -.242785    .1703964     -1.42    0.164    -.5903107    .1047407
Although svyologit and svyoprobit are multiple-equation estimators, one can refer to the estimates in the first equation using single-equation syntax:

. svylc female + black

 ( 1)  [health]female + [health]black = 0.0

health         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        -1.14809    .1008367    -11.39    0.000    -1.353748    -.942432
The single-equation syntax does not work when referring to the cutpoints:

. svylc cut1 - cut2
cut1 not found
r(111);
When in doubt, always use the show option. It will show you exactly how the equations are labeled.
. svylc, show

               Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
health
female     -.1615219    .0523678     -3.08    0.004    -.2683266   -.0547171
black       -.986568    .0790276    -12.48    0.000    -1.147746   -.8253901
age        -.0119491    .0082974     -1.44    0.160    -.0288717    .0049736
age2       -.0003234     .000091     -3.55    0.001     -.000509   -.0001377
cut1
_cons      -4.566229    .1632559    -27.97    0.000    -4.899192   -4.233266
cut2
_cons      -3.057415    .1699943    -17.99    0.000    -3.404121   -2.710709
cut3
_cons      -1.520596    .1714341     -8.87    0.000    -1.870238   -1.170954
cut4
_cons       -.242785    .1703964     -1.42    0.164    -.5903107    .1047407
The output of svyologit and svyoprobit is actually quite deceptive. The first equation contains all the coefficient estimates, but then there is one equation for each cutpoint. To estimate differences of the cutpoints, use the multiple-equation syntax:

. svylc [cut2]_cons - [cut1]_cons

 ( 1)  - [cut1]_cons + [cut2]_cons = 0.0

health         Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        1.508814    .0501686     30.07    0.000     1.406495    1.611134
Subpopulations with one by() variable

The svymean, svytotal, and svyratio commands allow a by() option which produces estimates for subpopulations; see [R] svymean. Frequently, one wishes to compute estimates for differences of subpopulation estimates. It is easy to use svylc to compute estimates for differences or any other linear combination of estimates. The only thing one must know is the proper syntax for referencing the subpopulation estimates. In this and the next two sections, we illustrate the syntax with a series of examples.
> Example

Suppose that we wish to get an estimate of the difference in mean vitamin C levels (variable vitaminc) between males and females. First, we compute the means of vitaminc by sex.

. svymean vitaminc, by(sex)

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      9973
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.129e+08

Mean       Subpop.     Estimate    Std. Err.    [95% Conf. Interval]        Deff
vitaminc   Male        .9312051     .0169297     .8966768   .9657333     4.926449
           Female       1.12753     .0173704     1.092103   1.162957     5.028652
Then we use the svylc command.

. svylc [vitaminc]Male - [vitaminc]Female

 ( 1)  [vitaminc]Male - [vitaminc]Female = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -.1963252      .015981    -12.28    0.000    -.2289186   -.1637318
When svymean or svytotal is used with a by() option, the syntax for referencing the subpopulation estimates is

    [varname]subpop_label

For example, we use [vitaminc]Male to refer to the subpopulation estimates. This is the same syntax that is used with the test command when there are multiple equations; see [R] test for full details. Be sure to type the variable names and subpopulation labels exactly as they are displayed in the output. Remember that Stata is case-sensitive.

. svylc [vitaminc]male - [vitaminc]female
male not found
r(111);
If there are no subpopulation labels, simply use the numbers displayed in the output.

. svymean vitaminc, by(sex) nolabel

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      9973
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.129e+08

Mean       Subpop.    Estimate    Std. Err.    [95% Conf. Interval]        Deff
vitaminc   sex==1     .9312051     .0169297     .8966768   .9657333     4.926449
           sex==2      1.12753     .0173704     1.092103   1.162957     5.028652
. svylc [vitaminc]1 - [vitaminc]2

 ( 1)  [vitaminc]1 - [vitaminc]2 = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -.1963252      .015981    -12.28    0.000    -.2289186   -.1637318
Subpopulations with two or more by() variables

If there are two or more by() variables, you must refer to the subpopulations by numbers (1, 2, 3, ...) when using svylc.
> Example

. svymean vitaminc, by(sex race)

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      9973
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.129e+08

Mean       Subpop.           Estimate    Std. Err.    [95% Conf. Interval]        Deff
vitaminc   Male    White     .9475117    .0168982     .9130475   .9819758     4.646413
           Male    Black     .7382045    .0477521     .6408135   .8355955     2.165885
           Male    Other     1.021363    .0521427      .915017   1.127708     1.739788
           Female  White     1.151125    .0168117     1.116838   1.185413     4.032603
           Female  Black     .9222313    .0348224     .8512105    .993252     2.915009
           Female  Other       1.0804    .0412742     .9962202   1.164579      1.00135
You can see the numbering scheme by running svylc with the show option.

. svylc, show

Mean           Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
vitaminc
1           .9475117    .0168982     56.07    0.000     .9130475    .9819758
2           .7382045    .0477521     15.46    0.000     .6408135    .8355955
3           1.021363    .0521427     19.59    0.000      .915017    1.127708
4           1.151125    .0168117     68.47    0.000     1.116838    1.185413
5           .9222313    .0348224     26.48    0.000     .8512108     .993252
6             1.0804    .0412742     26.18    0.000     .9962202    1.164579
So if we want to test the hypothesis that vitamin C levels are the same in white females and black females, we need to test subpopulation 4 versus subpopulation 5.

. svylc [vitaminc]4 - [vitaminc]5

 ( 1)  [vitaminc]4 - [vitaminc]5 = 0.0

Mean        Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        .2288941    .0337949      6.77    0.000     .1599688    .2978193
The use of svylc after svyratio

Using svylc after svyratio is a little more complicated. But, again, the show option on svylc will guide you.
> Example

. svyratio y1/x1 y2/x2

Survey ratio estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Ratio       Estimate    Std. Err.    [95% Conf. Interval]        Deff
y1/x1       .9918905    .0102386     .9710087   1.012772     1.647415
y2/x2       .9962729    .0083088     .9793269   1.013219       1.0771
. svylc, show

Ratio          Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
y1
x1          .9918905    .0102386     96.88    0.000     .9710087    1.012772
y2
x2          .9962729    .0083088    119.91    0.000     .9793269    1.013219
. svylc [y1]x1 - [y2]x2

 ( 1)  [y1]x1 - [y2]x2 = 0.0

Ratio       Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)       -.0043824    .0125921     -0.35    0.730    -.0300641    .0212993
The following examples illustrate the syntax when there are by() subpopulations.
> Example

. svyratio y1/x1, by(race)

Survey ratio estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Ratio      Subpop.    Estimate    Std. Err.    [95% Conf. Interval]        Deff
y1/x1      White       .995116    .0116867     .9712807   1.018951     1.879759
           Black      .9525558    .0381059     .8748384   1.030273     2.242268
           Other      1.026876    .0447707     .9355659   1.118187     .8308877
. svylc, show

Ratio          Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
1
White        .995116    .0116867     85.15    0.000     .9712807    1.018951
Black       .9525558    .0381059     25.00    0.000     .8748384    1.030273
Other       1.026876    .0447707     22.94    0.000     .9355659    1.118187
. svylc [1]White - [1]Black

 ( 1)  [1]White - [1]Black = 0.0

Ratio       Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        .0425602    .0439945      0.97    0.341    -.0471671    .1322875
. svyratio y1/x1, by(sex race)

Survey ratio estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

Ratio      Subpop.           Estimate    Std. Err.    [95% Conf. Interval]        Deff
y1/x1      Male    White     1.000215    .0150805     .9694585   1.030972     1.460442
           Male    Black     .9726418    .0486307     .8734589   1.071825     1.426839
           Male    Other     1.000358    .0732775      .850907   1.149808     1.266913
           Female  White     .9904237    .0169396     .9558752   1.024972     2.109029
           Female  Black     .9362548    .0409748     .8526861   1.019823     1.619815
           Female  Other     1.056553     .082305     .8886906   1.224415     1.228803
. svylc, show

Ratio          Coef.    Std. Err.        t     P>|t|    [95% Conf. Interval]
1
1           1.000215    .0150805     66.33    0.000     .9694585    1.030972
2           .9726418    .0486307     20.00    0.000     .8734589    1.071825
3           1.000358    .0732775     13.65    0.000      .850907    1.149808
4           .9904237    .0169396     58.47    0.000     .9558752    1.024972
5           .9362548    .0409748     22.85    0.000     .8526861    1.019823
6           1.056553     .082305     12.84    0.000     .8886906    1.224415
. svylc [1]1 - [1]4

 ( 1)  [1]1 - [1]4 = 0.0

Ratio       Estimate    Std. Err.        t     P>|t|    [95% Conf. Interval]
 (1)        .0097916    .0221119      0.44    0.661    -.0353058     .054889
Saved Results

svylc saves in r():

Scalars
    r(est)        point estimate of linear combination
    r(se)         standard error (square root of design-based variance estimate)
    r(N_strata)   number of strata
    r(N_psu)      number of sampled PSUs
    r(deff)       deff
    r(deft)       deft
    r(meft)       meft
Methods and Formulas

svylc is implemented as an ado-file.

svylc estimates eta = C*theta, where theta is a q x 1 vector of parameters (e.g., population means or population regression coefficients), and C is any 1 x q vector of constants. The estimate of eta is etahat = C*thetahat, and the estimate of its variance is

    Vhat(etahat) = C Vhat(thetahat) C'

Similarly, the simple-random-sampling variance estimator used in the computation of deff and deft is

    Vhat_srs(etahat) = C Vhat_srs(thetahat) C'

The variance estimator used in the computation of meff and meft is

    Vhat_msp(etahat) = C Vhat_msp(thetahat) C'

See the Methods and Formulas section of [R] svymean for details on the computation of deff, deft, meff, and meft.
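These quadratic-form variance estimators are easy to compute directly. The Python sketch below is not Stata's implementation, and both covariance matrices are hypothetical; it forms deff for a linear combination as the ratio of two C V C' quadratic forms.

```python
def quad_form(c, V):
    """Evaluate the quadratic form c' V c."""
    return sum(c[i] * V[i][j] * c[j]
               for i in range(len(c)) for j in range(len(c)))

C = [1.0, -1.0]                           # difference of two estimates
V_design = [[0.36, 0.05], [0.05, 0.26]]   # design-based covariance (made up)
V_srs    = [[0.09, 0.01], [0.01, 0.07]]   # simple-random-sampling analog (made up)

deff = quad_form(C, V_design) / quad_form(C, V_srs)
```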
References

Eltinge, J. L. and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin 31: 31-42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246-259.

McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Also See

Complementary:  [R] svy estimators, [R] svydes, [R] svymean, [R] svyset, [R] svytest
Related:        [R] lincom
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 30 Overview of survey estimation,
                [R] svy
Title

svymean — Estimate means, totals, ratios, and proportions for survey data

Syntax

    svymean  varlist [weight] [if exp] [in range] [, common_options ]

    svytotal varlist [weight] [if exp] [in range] [, common_options ]

    svyratio varname [/] varname [varname [/] varname ...] [weight] [if exp] [in range]
             [, common_options ]

    svyprop  varlist [weight] [if exp] [in range] [, strata(varname) psu(varname)
             fpc(varname) by(varlist) subpop(varname) nolabel format(%fmt) ]

The common_options for svymean, svytotal, and svyratio are

    strata(varname) psu(varname) fpc(varname) by(varlist) subpop(varname) srssubpop
    nolabel { complete | available } level(#) ci deff deft meff meft obs size

svymean, svyratio, and svytotal typed without arguments redisplay previous results. Any of the following options can be used when redisplaying results:

    level(#) ci deff deft meff meft obs size

All these commands allow pweights and iweights; see [U] 14.1.6 weight.

Warning: Use of if or in restrictions will not produce correct variance estimates for subpopulations in many cases. To compute estimates for subpopulations, use the by() or subpop() options.
Description

svymean, svytotal, svyratio, and svyprop produce estimates of finite-population means, totals, ratios, and proportions. Associated variance estimates, design effects (deff and deft), and misspecification effects (meff and meft) are also computed. Estimates for multiple subpopulations can be obtained using the by() option. The subpop() option will give estimates for a single subpopulation.

To describe the strata and PSUs of your data and to handle the error message "stratum with only one PSU detected", see [R] svydes. To estimate differences (and other linear combinations) of means, totals, and ratios, see [R] svylc. To perform hypothesis tests, see [R] svytest.
Options

strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers. strata() can also be specified with the svyset command; see the following examples and [R] svyset.

psu(varname) specifies the name of a variable (numeric or string) that contains identifiers for the primary sampling unit (i.e., the cluster). psu() can also be specified with the svyset command.

fpc(varname) requests a finite population correction for the variance estimates. If the variable specified has values less than or equal to 1, it is interpreted as a stratum sampling rate f_h = n_h/N_h, where n_h = number of PSUs sampled from stratum h and N_h = total number of PSUs in the population belonging to stratum h. If the variable specified has values greater than or equal to n_h, it is interpreted as containing N_h. fpc() can also be specified with the svyset command.

by(varlist) specifies that estimates be computed for the subpopulations defined by different values of the variable(s) in the by varlist.

subpop(varname) specifies that estimates be computed for the single subpopulation defined by the observations for which varname != 0. Typically, varname = 1 defines the subpopulation, and varname = 0 indicates observations not belonging to the subpopulation. For observations whose subpopulation status is uncertain, varname should be set to missing ('.').

srssubpop can only be specified if by() or subpop() is specified. srssubpop requests that deff and deft be computed using an estimate of simple-random-sampling variance for sampling within a subpopulation. If srssubpop is not specified, deff and deft are computed using an estimate of simple-random-sampling variance for sampling from the entire population. Typically, srssubpop would be given when computing subpopulation estimates by strata or by groups of strata.
nolabel can only be specified if by() is specified. nolabel requests that numeric values rather than value labels be used to label output for subpopulations. By default, value labels are used.

{ complete | available } specifies how missing values are to be handled. complete specifies that only observations with complete data should be used; i.e., any observation that has a missing value for any of the variables in the varlist is omitted from the computation. available specifies that all available nonmissing values be used for each estimate. If neither complete nor available is specified, available is the default when there are missing values and there are two or more variables in the varlist (or four or more for svyratio). If there are missing values and two or more variables (or four or more for svyratio), complete must be specified to compute the covariance or to use svytest (for hypothesis tests) or svylc (estimates for linear combinations) after running the command; see [R] svylc and [R] svytest.

format(%fmt) (svyprop only) specifies the display format for the proportion estimates and their standard errors. The default is %9.6f.

The following options can be specified initially or when redisplaying results:

level(#) specifies the confidence level (i.e., nominal coverage rate), in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

ci requests that confidence intervals be displayed. If no display options are specified, then, by default, confidence intervals are displayed.

deff requests that the design-effect measure deff be displayed. If no display options are specified, then, by default, deff is displayed.

deft requests that the design-effect measure deft be displayed. See the following section, Some fine points about deff and deft, for a discussion of deff and deft.
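The dual interpretation of the fpc() variable described in the options above can be captured in a tiny Python sketch. The function name is hypothetical, and this is not Stata's internal implementation.

```python
def fpc_rate(value, n_h):
    """Interpret a per-stratum fpc() value: a value <= 1 is taken as the
    sampling rate f_h itself; a larger value is taken as the population
    PSU count N_h, giving f_h = n_h / N_h."""
    return value if value <= 1 else n_h / value

# either form yields the same rate for a stratum with 5 sampled PSUs
rate_a = fpc_rate(0.25, 5)   # variable holds the rate f_h
rate_b = fpc_rate(20, 5)     # variable holds N_h = 20
```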
svymean — Estimate means, totals, ratios, and proportions for survey data
The different variance estimates are due to the way if works in Stata. For all commands in Stata, using if is equivalent to deleting all observations that do not satisfy the if expression and then running the command. In contrast, the svy commands compute subpopulation variance estimates in a way that accounts for which sample units were and were not contained in the subpopulation of interest. Thus, the svy commands must also have access to those observations not in the subpopulation to compute the variance estimates. The survey literature refers to these variance estimators as "unconditional" variance estimators. See, e.g., Cochran (1977, Section 2.13) for a further discussion of these unconditional variance estimators and some alternative "conditional" variance estimators. For survey data, there are only a few circumstances that necessitate the use of if. For example, if one suspected laboratory error for a certain set of measurements, then it might be proper to use if to omit these observations from the analysis.
Options for displaying results: ci, deff, deft, meff, meft, obs, and size.

Remarks

We now use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. First, we set the strata, psu, and pweight variables.

. svyset strata stratid
. svyset psu psuid
. svyset pweight finalwgt
We will estimate the population means for total serum cholesterol (tcresult) and for serum triglycerides (tgresult).

. svymean tcresult tgresult

Survey mean estimation

pweight:  finalwgt                        Number of obs(*) =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

------------------------------------------------------------------------
Mean            Estimate    Std. Err.   [95% Conf. Interval]        Deff
------------------------------------------------------------------------
tcresult        213.0977     1.127252    210.7986    215.3967    5.602499
tgresult         138.576     2.071934    134.3503    142.8018    2.356968
------------------------------------------------------------------------
(*) Some variables contain missing values.
If we want to see how many nonmissing observations there are for each variable, we can redisplay the results specifying the obs option.

. svymean, obs

Survey mean estimation

pweight:  finalwgt                        Number of obs(*) =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

-----------------------------------------------------
Mean            Estimate    Std. Err.          Obs
-----------------------------------------------------
tcresult        213.0977     1.127252        10351
tgresult         138.576     2.071934         5050
-----------------------------------------------------
(*) Some variables contain missing values.
The svymean, svytotal, and svyratio commands allow you to display any or all of the following: ci (confidence intervals), deff, deft, meff, meft, obs, and size (estimated (sub)population size).

. svymean, ci deff deft meff meft obs size

Survey mean estimation

pweight:  finalwgt                        Number of obs(*) =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

------------------------------------------------------------------------
Mean            Estimate    Std. Err.   [95% Conf. Interval]        Deff
------------------------------------------------------------------------
tcresult        213.0977     1.127252    210.7986    215.3967    5.602499
tgresult         138.576     2.071934    134.3503    142.8018    2.356968
------------------------------------------------------------------------

------------------------------------------------------------------------
Mean               Deft        Meff        Meft       Obs     Pop. Size
------------------------------------------------------------------------
tcresult        2.36696     5.39262    2.322202     10351     1.172e+08
tgresult       1.535242    2.328208    1.525847      5050      56820832
------------------------------------------------------------------------
(*) Some variables contain missing values.
If none of these options is specified, ci and deff are the default. Note that there is no control over the order in which the options are displayed; they are always displayed in the order shown here. We can also give display options when we first issue a command.

. svymean tcresult tgresult, deff meff obs
Using svytotal and svyratio

All our examples to this point have used svymean. The svytotal command has exactly the same syntax. In our NHANES II dataset, heartatk is a variable that is 1 if a person has ever had a heart attack and 0 otherwise. We estimate the total numbers of persons who have had heart attacks by sex in the population represented by our data.

. svytotal heartatk, by(sex)

Survey total estimation

pweight:  finalwgt                        Number of obs    =     10349
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.171e+08

---------------------------------------------------------------------------
Total      Subpop.     Estimate    Std. Err.   [95% Conf. Interval]     Deff
---------------------------------------------------------------------------
heartatk   Male         2304839     200231.3     1896465    2713213  1.567611
           Female       1178437     109020.5    956088.2    1400786  .9000898
---------------------------------------------------------------------------
The syntax for the svyratio command differs from that of svymean and svytotal only in the way the varlist can be specified. All the options are the same.
In our NHANES II dataset, the variable tcresult contains total serum cholesterol and the variable hdresult contains serum levels of high-density lipoproteins (HDL). We can use svyratio to estimate the ratio of the total of hdresult to the total of tcresult.

. svyratio hdresult/tcresult

Survey ratio estimation

pweight:  finalwgt                        Number of obs    =      8720
Strata:   newstr                          Number of strata =        30
PSU:      newpsu                          Number of PSUs   =        60
                                          Population size  =  98725345

---------------------------------------------------------------------------
Ratio                  Estimate    Std. Err.   [95% Conf. Interval]     Deff
---------------------------------------------------------------------------
hdresult/tcresult      .2336173     .0024621     .228589    .2386457  7.724248
---------------------------------------------------------------------------

Out of the 10351 NHANES II subjects with a tcresult reading, only 8720 had an hdresult reading. Consequently, svyratio used only the 8720 observations that had nonmissing values for both variables. In your own datasets, if you encounter substantial missing-data rates, it is generally a good idea to look into the reasons for the missing data and to consider the potential for nonresponse bias in your analysis.

Note that the slash / in the varlist for svyratio is optional. We could have typed

. svyratio hdresult tcresult
svyratio or svymean can be used to estimate means for subpopulations. Consider the following example. In our NHANES II dataset, we have a variable female (equal to 1 if female and 0 otherwise) and a variable iron containing iron levels. Suppose that we wish to estimate the ratio of total iron levels in females to the total number of females in the population. We can do this by first creating a new variable firon that represents iron levels in females; i.e., the variable equals the iron level if the subject is female and zero if male. We can then use svyratio to estimate the ratio of the total of firon to the total of female.

. gen firon = female*iron
. svyratio firon/female

Survey ratio estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

---------------------------------------------------------------------------
Ratio               Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------------------------------------------------------------------
firon/female        97.16247     .6743344    95.78715    98.53778    2.014025
---------------------------------------------------------------------------
This estimate can be obtained more easily by using svymean with the subpop option. The computation that is carried out is identical.

. svymean iron, subpop(female)

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
Subpop.:  female==1                       Population size  = 1.172e+08

---------------------------------------------------------------------------
Mean            Estimate    Std. Err.   [95% Conf. Interval]            Deff
---------------------------------------------------------------------------
iron            97.16247     .6743344    95.78715    98.53778        2.014025
---------------------------------------------------------------------------
Estimating proportions

Proportions can be estimated using the svymean or svyprop commands. The mean of heartatk (heartatk equals 1 if a person has ever had a heart attack and 0 otherwise) is an estimate of the proportion of persons who have had heart attacks.

. svymean heartatk

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10349
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.171e+08

---------------------------------------------------------------------------
Mean            Estimate    Std. Err.   [95% Conf. Interval]            Deff
---------------------------------------------------------------------------
heartatk        .0297383     .0018484    .0259684    .0335081        1.225296
---------------------------------------------------------------------------

We could have also obtained this estimate by using svyprop.

. svyprop heartatk

Survey proportions estimation

pweight:  finalwgt                        Number of obs    =     10349
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.171e+08

   heartatk       _Obs    _EstProp     _StdErr
          0       9873    0.970262    0.001848
          1        476    0.029738    0.001848
The svymean command produces more output than svyprop; it calculates covariance, deff, meff, etc. In order to do these computations, svymean requires that you specify a separate dummy (0/1) variable for each proportion you wish to estimate. With svyprop, we do not have to create dummy variables, nor does the command create them internally. svyprop will estimate proportions for all the categories defined by the unique values of the variables in its varlist. It can handle any number of categories, but provides less output.

. svyprop race agegrp
Survey proportions estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

   race  agegrp       _Obs    _EstProp     _StdErr
  White  age20-29     1975    0.241082    0.006800
  White  age30-39     1411    0.178603    0.006491
  White  age40-49     1120    0.148661    0.004936
  White  age50-59     1129    0.147948    0.006418
  White  age60-69     2552    0.120860    0.004560
  White  age 70+       878    0.042001    0.003132
  Black  age20-29      286    0.031459    0.004918
  Black  age30-39      179    0.020485    0.002754
  Black  age40-49      124    0.015115    0.001964
  Black  age50-59      140    0.015160    0.002668
  Black  age60-69      260    0.009703    0.001380
  Black  age 70+        97    0.003583    0.000753
  Other  age20-29       59    0.007917    0.002237
  Other  age30-39       32    0.005213    0.002072
  Other  age40-49       28    0.004587    0.001669
  Other  age50-59       22    0.004052    0.002165
  Other  age60-69       48    0.002926    0.002152
  Other  age 70+        11    0.000645    0.000540
svyprop also allows the by() and subpop() options for subpopulation estimates.

. svyprop agegrp, by(race)

Survey proportions estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.172e+08

-> race=White
     agegrp       _Obs    _EstProp     _StdErr
   age20-29       1975    0.274220    0.007286
   age30-39       1411    0.203153    0.006602
   age40-49       1120    0.169096    0.004680
   age50-59       1129    0.168285    0.005908
   age60-69       2552    0.137473    0.004102
   age 70+         878    0.047774    0.003265

-> race=Black
     agegrp       _Obs    _EstProp     _StdErr
   age20-29        286    0.329388    0.019606
   age30-39        179    0.214495    0.012963
   age40-49        124    0.158266    0.013822
   age50-59        140    0.158738    0.012046
   age60-69        260    0.101601    0.007740
   age 70+          97    0.037513    0.005293

-> race=Other
     agegrp       _Obs    _EstProp     _StdErr
   age20-29         59    0.312426    0.047763
   age30-39         32    0.205729    0.027031
   age40-49         28    0.181028    0.019953
   age50-59         22    0.159901    0.024910
   age60-69         48    0.115473    0.038194
   age 70+          11    0.025442    0.011067
Changing the strata, psu, fpc, and pweight variables

The NHANES II dataset contains a special sampling weight for use with the lead variable. We can change the pweight by setting it again with svyset.

. svyset pweight leadwt
. svymean lead, by(sex race)

Survey mean estimation

pweight:  leadwt                          Number of obs    =      4948
Strata:   strata                          Number of strata =        31
PSU:      psu                             Number of PSUs   =        62
                                          Population size  = 1.129e+08

---------------------------------------------------------------------------
Mean    Subpop.          Estimate    Std. Err.   [95% Conf. Interval]   Deff
---------------------------------------------------------------------------
lead    Male   White     16.78945     .3010539    16.17544   17.40345  4.641307
        Male   Black     19.70286     .7448225    18.18379   21.22194  1.950369
        Male   Other     16.16566     .9394023    14.24973   18.08158  1.293194
        Female White     11.80468     .2447241    11.30556   12.30379  5.920213
        Female Black     12.92722     .5255033    11.85545   13.99899  3.946779
        Female Other     11.74192     .655919     10.40417   13.07968  1.492849
---------------------------------------------------------------------------
To change the pweight back to finalwgt, we type

. svyset pweight finalwgt

Remember that typing svyset alone displays the settings.

. svyset
pweight is finalwgt
strata is strata
psu is psu
Finite population correction (FPC)

A finite population correction (FPC) accounts for the reduction in variance that occurs when we sample without replacement from a finite population, as compared to sampling with replacement. The fpc() option of the svy commands computes an FPC for cases of simple random sampling or stratified random sampling; i.e., for sample designs that use simple random sampling without replacement of PSUs within each stratum with no subsampling within PSUs. The fpc() option is not intended for use with designs that involve subsampling within PSUs. Consider the following dataset:
. list

        stratid   psuid   weight   nh   Nh     x
   1.         1       1        3    5   15   2.8
   2.         1       2        3    5   15   4.1
   3.         1       3        3    5   15   6.8
   4.         1       4        3    5   15   6.8
   5.         1       5        3    5   15   9.2
   6.         2       1        4    3   12   3.7
   7.         2       2        4    3   12   6.6
   8.         2       3        4    3   12   4.2

We first set the strata, psu, and pweights.

. svyset strata stratid
. svyset psu psuid
. svyset pweight weight
In this dataset, the variable nh is the number of PSUs per stratum that were sampled, Nh is the total number of PSUs per stratum in the sampling frame (i.e., the population), and x is our survey item of interest. If we wish to use a finite population correction in our computations, we set fpc to Nh, the variable representing the total number of PSUs per stratum in the population.

. svyset fpc Nh
. svyset
pweight is weight
strata is stratid
psu is psuid
fpc is Nh

. svymean x

Survey mean estimation

pweight:  weight                          Number of obs    =         8
Strata:   stratid                         Number of strata =         2
PSU:      psuid                           Number of PSUs   =         8
FPC:      Nh                              Population size  =        27

---------------------------------------------------------------------------
Mean         Estimate    Std. Err.   [95% Conf. Interval]               Deff
---------------------------------------------------------------------------
x            5.448148     .6160407    3.940751    6.955545           .9853061
---------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling
without replacement of PSUs within each stratum with no subsampling
within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
If we want to redo the computation without an FPC, we must clear the fpc using svyset and run the estimation command again.

. svyset fpc, clear
. svymean x

Survey mean estimation

pweight:  weight                          Number of obs    =         8
Strata:   stratid                         Number of strata =         2
PSU:      psuid                           Number of PSUs   =         8
                                          Population size  =        27

---------------------------------------------------------------------------
Mean         Estimate    Std. Err.   [95% Conf. Interval]               Deff
---------------------------------------------------------------------------
x            5.448148     .7412683     3.63433    7.261966           1.003906
---------------------------------------------------------------------------
Including an FPC always reduces the variance estimate. However, when the Nh are large relative to the nh, the reduction in estimated variance due to the FPC is small. Rather than having a variable that represents the total number of PSUs per stratum in the sampling frame, we sometimes have a variable that represents a sampling rate fh = nh/Nh. If we have a variable that represents a sampling rate, we set it the same way to get an FPC. The commands are smart: if the fpc variable is less than or equal to 1, it is interpreted as a sampling rate; if it is greater than or equal to nh, it is interpreted as containing Nh.
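That decision rule can be sketched in a few lines of Python (a hypothetical helper for illustration; interpret_fpc is not part of Stata):

```python
def interpret_fpc(fpc_value, n_h):
    """Return the sampling fraction f_h implied by one value of an fpc
    variable: values <= 1 are read as sampling rates f_h = n_h/N_h;
    values >= n_h are read as stratum population sizes N_h."""
    if fpc_value <= 1:
        return fpc_value              # already a rate
    if fpc_value >= n_h:
        return n_h / fpc_value        # N_h given; convert to a rate
    raise ValueError("an fpc value must be <= 1 or >= n_h")

# Both encodings of the first stratum above (n_h = 5 PSUs sampled out of
# N_h = 15) yield the same sampling fraction f_h = 1/3:
assert interpret_fpc(15, n_h=5) == interpret_fpc(1/3, n_h=5)
```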
Some fine points about deff and deft

The ratio deff (Kish 1965) is intended to compare the variance obtained under our complex survey design with the variance that we would have obtained if we had collected our observations through simple random sampling. deff is defined as

    deff = Vhat / Vhat_srswor

where Vhat is the design-based estimate of variance (i.e., this is what the svy commands compute; the displayed standard error is the square root of Vhat) and Vhat_srswor is an estimate of what the variance would be if a similar survey were conducted using simple random sampling (srs) without replacement (wor), with the same number of sample elements as in the actual survey. In other words, Vhat_srswor is an estimate of the variance for a hypothetical simple-random-sampling design in place of the complex design that we actually used; i.e., in place of our actual weighted/stratified/clustered design.

deft is defined as (Kish 1995)

    deft = sqrt( Vhat / Vhat_srswr )

where Vhat_srswr is computed in the same way as Vhat_srswor, except that now the hypothetical simple-random-sampling design is with replacement (wr).

Computationally, Vhat_srswor = (1 - m/Mhat) Vhat_srswr, where m is the number of sampled elements (i.e., the number of observations in the dataset) and Mhat is the estimated total number of elements in the population. For many surveys, the term (1 - m/Mhat) is very close to 1, so in these cases Vhat_srswor and Vhat_srswr are almost equal. Furthermore, if the fpc() option is not specified or set with svyset, we do not compute any finite population corrections, so the estimates produced for Vhat_srswor and Vhat_srswr are exactly equal. To summarize: deft is exactly the square root of deff when fpc is not set, and it is somewhat smaller than the square root of deff when the user sets an fpc.
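A quick numeric check of the relationship just described, using made-up variance values (nothing here comes from the outputs above):

```python
import math

V_design = 0.55          # hypothetical design-based variance estimate
V_srswr  = 0.41          # hypothetical SRS-with-replacement variance
m, M_hat = 8, 27         # sampled elements; estimated population size

# The identity stated above:
V_srswor = (1 - m / M_hat) * V_srswr

deff = V_design / V_srswor
deft = math.sqrt(V_design / V_srswr)

# With an fpc set, deft is somewhat smaller than sqrt(deff):
assert deft < math.sqrt(deff)

# Without an fpc no correction is applied, V_srswor equals V_srswr,
# and deft is then exactly the square root of deff:
deff_nofpc = V_design / V_srswr
assert math.isclose(deft, math.sqrt(deff_nofpc))
```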
The srssubpop option for deff and deft with subpopulations

When there are subpopulations, the svy commands can compute design effects with respect to one of two different hypothetical simple-random-sampling designs. The first hypothetical design is one in which simple random sampling is conducted across the full population. This scheme is the default for deff and deft computed by the svy commands. The second hypothetical design is one in which the simple random sampling is conducted entirely within the subpopulation of interest. This second scheme is used for deff and deft when the srssubpop option is specified.

Deciding which scheme is preferable depends on the nature of the subpopulations. If one reasonably can imagine identifying members of the subpopulations prior to sampling them, then the second scheme is preferable. This case arises primarily when the subpopulations are strata or groups of strata. Otherwise, one may prefer to use the first scheme.

Here is an example of using the first scheme (i.e., the default) with the NHANES II data.
. svymean iron, by(sex)

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

---------------------------------------------------------------------------
Mean    Subpop.     Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------------------------------------------------------------------
iron    Male        104.7969     .557267     103.6603    105.9334    1.360971
        Female      97.16247     .6743344    95.78715    98.53778    2.014025
---------------------------------------------------------------------------
Here is the same example rerun using the second scheme; i.e., specifying the srssubpop option.

. svymean iron, by(sex) srssubpop

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

---------------------------------------------------------------------------
Mean    Subpop.     Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------------------------------------------------------------------
iron    Male        104.7969     .557267     103.6603    105.9334    1.348002
        Female      97.16247     .6743344    95.78715    98.53778    2.031321
---------------------------------------------------------------------------
Because the NHANES II did not stratify on sex, we consider it problematic to compute design effects with respect to simple random sampling of the female (or male) subpopulation. Consequently, we would prefer to use the first scheme here, although the values of deff differ little between the two schemes in this case.

For other examples (generally involving heavy oversampling or undersampling of specified subpopulations), the differences in deff for the two schemes can be much more dramatic. Consider the NMIHS data and compute the mean of birthwgt by race:

. svymean birthwgt, by(race) deff obs

Survey mean estimation

pweight:  finwgt                          Number of obs    =      9946
Strata:   stratan                         Number of strata =         6
PSU:                                      Number of PSUs   =      9946
                                          Population size  = 3895561.7

---------------------------------------------------------------------------
Mean        Subpop.      Estimate    Std. Err.        Deff        Obs
---------------------------------------------------------------------------
birthwgt    nonblack      3402.32     7.609532    1.443763       4724
            black        3127.834     6.529814    .1720408       5222
---------------------------------------------------------------------------
. svymean birthwgt, by(race) deff obs srssubpop

Survey mean estimation

pweight:  finwgt                          Number of obs    =      9946
Strata:   stratan                         Number of strata =         6
PSU:                                      Number of PSUs   =      9946
                                          Population size  = 3895561.7

---------------------------------------------------------------------------
Mean        Subpop.      Estimate    Std. Err.        Deff        Obs
---------------------------------------------------------------------------
birthwgt    nonblack      3402.32     7.609532    .6268418       4724
            black        3127.834     6.529814    .5289629       5222
---------------------------------------------------------------------------
Since the NMIHS survey was stratified on race, marital status, age, and birthweight, we consider it plausible to consider design effects computed with respect to simple random sampling within an individual race group. Consequently, in this case, we would recommend the second scheme; i.e., we would use the srssubpop option.
Misspecification effects: meff and meft

Misspecification effects are used to assess biases in variance estimators that are computed under the wrong assumptions. The survey literature (e.g., Scott and Holt 1982, p. 850; Skinner 1989) defines misspecification effects with respect to a general set of "wrong" variance estimators. The current svy commands consider only one specific form: variance estimators computed under the incorrect assumption that our observed sample was selected through simple random sampling. The resulting "misspecification effect" measure is informative primarily in cases for which an unweighted point estimator is approximately unbiased for the parameter of interest. See Eltinge and Sribney (1996a) for a detailed discussion of extensions of misspecification effects that are appropriate for biased point estimators.

Note that the definition of a misspecification effect is in contrast with the earlier definition of a design effect. For a design effect, we compare our complex-design-based variance estimate with an estimate of the true variance that we would have obtained under a hypothetical true simple random sample.

The svy commands use the following definitions for meff and meft:

    meff = Vhat / Vhat_msp        and        meft = sqrt(meff)

where Vhat is the appropriate design-based estimate of variance and Vhat_msp is the variance estimate computed with a misspecified design; namely, ignoring the sampling weights, stratification, and clustering. In other words, Vhat_msp is what the ci command used without weights would naively compute.

Here we request that the misspecification effects be displayed for the estimation of mean zinc levels using our NHANES II data.
. svymean zinc, by(sex) meff meft

Survey mean estimation

pweight:  finalwgt                        Number of obs    =      9202
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.043e+08

---------------------------------------------------------------------------
Mean    Subpop.     Estimate    Std. Err.        Meff        Meft
---------------------------------------------------------------------------
zinc    Male        90.74543     .5850741    6.282539    2.506499
        Female       83.8635     .4689532    6.326477    2.515249
---------------------------------------------------------------------------
If we run ci without weights, we get the standard errors that are (Vhat_msp)^(1/2).

. ci zinc, by(sex)

-> sex = Male

    Variable      Obs        Mean   Std. Err.    [95% Conf. Interval]
        zinc     4375    89.53143    .2334228      89.0738    89.98906

-> sex = Female

    Variable      Obs        Mean   Std. Err.    [95% Conf. Interval]
        zinc     4827    83.76652     .186444     83.40101    84.13204

. di .5850741/.2334228
2.5064994
. di (.5850741/.2334228)^2
6.2825391
. di .4689532/.186444
2.5152496
. di (.4689532/.186444)^2
6.3264806
Saved Results

svymean, svytotal, and svyratio save in e():

Scalars
    e(N)           number of observations m
    e(N_strata)    number of strata L
    e(N_psu)       number of sampled PSUs n
    e(N_pop)       estimate of population size Mhat
    e(n_by)        number of subpopulations

Macros
    e(cmd)         command name (e.g., svymean)
    e(depv)        Mean, Total, or Ratio
    e(by)          by() varlist
    e(subpop)      subpop() criterion
    e(label)       label indicator
    e(wtype)       weight type
    e(wexp)        weight variable or expression
    e(strata)      strata() variable
    e(psu)         psu() variable
    e(fpc)         fpc() variable

Matrices
    e(est)         vector of mean, total, or ratio estimates
    e(V_db)        design-based (co)variance estimates Vhat
    e(V_msp)       misspecification (co)variance Vhat_msp
    e(V_srs)       simple-random-sampling-without-replacement (co)variance
                   Vhat_srswor
    e(V_srswr)     simple-random-sampling-with-replacement (co)variance
                   Vhat_srswr (only created when fpc() option is specified)
    e(deff)        vector of deff estimates
    e(deft)        vector of deft estimates
    e(meft)        vector of meft estimates
    e(_N_subp)     vector of subpopulation size estimates
    e(_N)          vector of numbers of nonmissing observations
    e(_N_str)      vector of numbers of strata
    e(_N_psu)      vector of numbers of sampled PSUs

Functions
    e(sample)      marks estimation sample (complete option only)
Methods and Formulas

svymean, svytotal, svyratio, and svyprop are implemented as ado-files.

The current svy commands use the relatively simple variance estimators outlined below. See, for example, Cochran (1977) and Wolter (1985) for some methodological background on these variance estimators. Sometimes authors prefer to use other variance estimators that, for example, account separately for variance components at different stages of sampling, use finite population corrections with some unequal-probability and multistage designs, and include other special design features. In addition, the current svy commands use "linearization"-based variance estimators for nonlinear functions like sample ratios. Alternative variance estimators that use replication methods—for example, jackknifing or balanced repeated replication—may be included in future svy versions.
Totals

All the computations done by the svytotal, svymean, svyratio, and svyprop commands are essentially based on the formulas for totals. Let h = 1, ..., L enumerate the strata in the survey, and let (h,i) denote the ith primary sampling unit (PSU) in stratum h for i = 1, ..., N_h, where N_h is the total number of PSUs in stratum h in the population. Let M_hi be the number of elements in PSU (h,i), and let M = sum_{h=1}^{L} sum_{i=1}^{N_h} M_hi be the total number of elements in the population. Let Y_hij be a survey item for element j in PSU (h,i) in stratum h; e.g., Y_hij might be income for adult j in block i in county h. The associated population total is

    Y = sum_{h=1}^{L} sum_{i=1}^{N_h} sum_{j=1}^{M_hi} Y_hij                          (1)

Let y_hij denote the items for those elements selected in our sample; here h = 1, ..., L; i = 1, ..., n_h; and j = 1, ..., m_hi. The total number of elements in the sample (i.e., the number of observations in the dataset) is m = sum_{h=1}^{L} sum_{i=1}^{n_h} m_hi. Our estimator Yhat for the population total Y is

    Yhat = sum_{h=1}^{L} sum_{i=1}^{n_h} sum_{j=1}^{m_hi} w_hij y_hij                 (2)

where w_hij are the user-specified sampling weights (pweights or iweights). Our estimator Mhat for the total number of elements in the population is simply the sum of the weights:

    Mhat = sum_{h=1}^{L} sum_{i=1}^{n_h} sum_{j=1}^{m_hi} w_hij

Mhat is labeled "Population size" on the output of the commands.

To compute an estimate of the variance of Yhat, we first define z_yhi and zbar_yh by

    z_yhi = sum_{j=1}^{m_hi} w_hij y_hij    and    zbar_yh = (1/n_h) sum_{i=1}^{n_h} z_yhi

Our estimate for the variance of Yhat is

    Vhat(Yhat) = sum_{h=1}^{L} (1 - f_h) [n_h/(n_h - 1)] sum_{i=1}^{n_h} (z_yhi - zbar_yh)^2     (3)

The factor (1 - f_h) is the finite population correction. If the user does not set an fpc variable, f_h = 0 is used in the formula. If an fpc variable is set and is greater than or equal to n_h, the variable is assumed to contain the values of N_h, and f_h is given by f_h = n_h/N_h. If the fpc variable is less than or equal to 1, it is assumed to contain the values of f_h. As discussed earlier, nonzero values of f_h in formula (3) are intended for use only with simple random sampling or stratified random sampling with no subsampling within PSUs.
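The estimators above can be checked numerically. Below is a minimal Python sketch of formulas (2) and (3) (not Stata's implementation), applied to the eight-observation dataset from the finite-population-correction example earlier, where each PSU contains exactly one element:

```python
import math
from collections import defaultdict

def v_hat_total(data, fpc=None):
    """Formula (3) for the variance of an estimated total, assuming one
    element per PSU.  `data` holds (stratum, weight, y) tuples; `fpc`
    optionally maps stratum -> N_h."""
    z = defaultdict(list)                     # z_hi = w_hi * y_hi
    for h, w, y in data:
        z[h].append(w * y)
    V = 0.0
    for h, z_h in z.items():
        n_h = len(z_h)
        zbar = sum(z_h) / n_h
        f_h = n_h / fpc[h] if fpc else 0.0    # finite population correction
        V += (1 - f_h) * n_h / (n_h - 1) * sum((zi - zbar) ** 2 for zi in z_h)
    return V

# The example dataset: stratum 1 has 5 PSUs of weight 3, stratum 2 has
# 3 PSUs of weight 4; N_h = 15 and 12.
data = [(1, 3, 2.8), (1, 3, 4.1), (1, 3, 6.8), (1, 3, 6.8), (1, 3, 9.2),
        (2, 4, 3.7), (2, 4, 6.6), (2, 4, 4.2)]
M_hat = sum(w for _, w, _ in data)            # "Population size" = 27

# Because each PSU holds one element and weights are constant within a
# stratum here, the mean's standard error is simply sqrt(V)/M_hat in
# this dataset, reproducing the svymean output: .6160407 with the FPC
# and .7412683 without it.
se_fpc   = math.sqrt(v_hat_total(data, fpc={1: 15, 2: 12})) / M_hat
se_nofpc = math.sqrt(v_hat_total(data)) / M_hat
```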
If the varlist given to svytotal contains two or more variables and the complete option is specified or is the default, the covariance of the variables is computed. For estimated totals Yhat and Xhat (notation for X is defined similarly to that of Y), our covariance estimate is

    Cov(Yhat, Xhat) = sum_{h=1}^{L} (1 - f_h) [n_h/(n_h - 1)] sum_{i=1}^{n_h} (z_yhi - zbar_yh)(z_xhi - zbar_xh)     (4)
Ratios, means, and proportions

Let R = Y/X be a population ratio that we wish to estimate, where Y and X are population totals defined as in (1). Our estimate for R is Rhat = Yhat/Xhat. Using the delta method (i.e., a first-order Taylor expansion), the variance of the approximate distribution of Rhat is

    V(Rhat) ~ (1/X^2) [ V(Yhat) - 2 R Cov(Yhat, Xhat) + R^2 V(Xhat) ]

Direct substitution of Xhat, Rhat, and expressions (3) and (4) leads to the variance estimator

    Vhat(Rhat) = (1/Xhat^2) [ Vhat(Yhat) - 2 Rhat Cov(Yhat, Xhat) + Rhat^2 Vhat(Xhat) ]     (5)

If we define the "ratio residual"

    d_hij = (y_hij - Rhat x_hij) / Xhat                                                     (6)

and replace y_hij with d_hij in our variance formula (3), we get the right-hand side of equation (7) below. Simple algebra shows that this is identical to (5):

    Vhat(Rhat) = sum_{h=1}^{L} (1 - f_h) [n_h/(n_h - 1)] sum_{i=1}^{n_h} (z_dhi - zbar_dh)^2     (7)

where z_dhi and zbar_dh are computed from d_hij just as z_yhi and zbar_yh are computed from y_hij.

To extend our variance estimators from ratios to other parameters, note that means are simply ratios with x_hij = 1 and that proportions are simply means with y_hij equal to a 0/1 variable. Similarly, estimates for a subpopulation S are obtained by computing estimates for y_Shij = I_{(h,i,j) in S} y_hij and x_Shij = I_{(h,i,j) in S} x_hij, where the indicator I_{(h,i,j) in S} equals 1 if element (h,i,j) is a member of subpopulation S and 0 otherwise.

Weights

When computing finite population corrections (i.e., when an fpc variable is set) or when estimating totals, the svy commands assume your weights are appropriate for estimation of a population total. For example, the sum of your weights should equal an estimate of the size of the relevant population. When an fpc is not set, the commands svymean, svyratio, and svyprop are invariant to the scale of the weights; i.e., these commands give the same results no matter what the scale of the weights is.
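Returning briefly to ratios: the algebraic identity between the delta-method form (5) and the residual form (7) is easy to verify numerically. The sketch below uses made-up data with one element per PSU and f_h = 0 (illustration only, not Stata code):

```python
from collections import defaultdict

def v_hat(z):
    """Formula (3) with f_h = 0; `z` maps stratum -> list of z_hi."""
    V = 0.0
    for z_h in z.values():
        n = len(z_h)
        zbar = sum(z_h) / n
        V += n / (n - 1) * sum((zi - zbar) ** 2 for zi in z_h)
    return V

def cov_hat(zy, zx):
    """Formula (4) with f_h = 0."""
    C = 0.0
    for h in zy:
        n = len(zy[h])
        ybar = sum(zy[h]) / n
        xbar = sum(zx[h]) / n
        C += n / (n - 1) * sum((a - ybar) * (b - xbar)
                               for a, b in zip(zy[h], zx[h]))
    return C

# Toy data: (stratum, weight, y, x), one element per PSU.
data = [(1, 3, 2.8, 1.2), (1, 3, 4.1, 2.0), (1, 3, 6.8, 3.1),
        (2, 4, 3.7, 1.9), (2, 4, 6.6, 2.8), (2, 4, 4.2, 2.2)]
Y = sum(w * y for _, w, y, _ in data)
X = sum(w * x for _, w, _, x in data)
R = Y / X
zy, zx, zd = defaultdict(list), defaultdict(list), defaultdict(list)
for h, w, y, x in data:
    zy[h].append(w * y)
    zx[h].append(w * x)
    zd[h].append(w * (y - R * x) / X)      # ratio residual, formula (6)

# Delta-method form (5) and residual form (7) agree:
v5 = (v_hat(zy) - 2 * R * cov_hat(zy, zx) + R ** 2 * v_hat(zx)) / X ** 2
v7 = v_hat(zd)
```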
Confidence intervals

Let n = sum_{h=1}^{L} n_h be the total number of PSUs in the sample. The customary "degrees of freedom" attributed to our test statistic is d = n - L. Hence, under regularity conditions, an approximate 100(1 - alpha)% confidence interval for a parameter theta (e.g., theta could be a total Y or ratio R) is

    thetahat +/- t_{1-alpha/2, d} sqrt( Vhat(thetahat) )

Cochran (1977, Section 2.8) and Korn and Graubard (1990) give some theoretical justification for the use of d = n - L in computation of univariate confidence intervals and p-values. However, for some cases, inferences based on the customary n - L degrees-of-freedom calculation may be excessively liberal. For example, the resulting confidence intervals may have coverage rates substantially less than the nominal 1 - alpha. This problem generally is of greatest practical concern when the population of interest has a very skewed or heavy-tailed distribution, or is concentrated in a small number of PSUs. In some of these cases, the user may want to consider constructing confidence intervals based on alternative degrees-of-freedom terms based on the Satterthwaite (1941, 1946) approximation and modifications thereof; see, e.g., Cochran (1977, 96) and Jang and Eltinge (1995).
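For example, the confidence interval displayed in the small FPC example earlier follows from d = n - L = 6 degrees of freedom (Python sketch; the t critical value is taken from a standard t table):

```python
theta_hat = 147.1 / 27        # estimated mean from the svymean output
se        = 0.6160407         # its standard error (with the FPC set)
n, L      = 8, 2              # sampled PSUs and strata
d         = n - L             # customary degrees of freedom = 6

t_crit = 2.446912             # t_{0.975, 6}, from a t table
lo = theta_hat - t_crit * se
hi = theta_hat + t_crit * se
# Agrees with the displayed interval [3.940751, 6.955545] to about
# six decimal places.
```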
deff and deft

deff is estimated as (Kish 1965)

    deff = Vhat(thetahat) / Vhat_srswor(thetahat_srs)

where Vhat(thetahat) is the design-based estimate of variance from formula (3) for a parameter theta, and Vhat_srswor(thetahat_srs) is an estimate of the variance for an estimator thetahat_srs that would be obtained from a similar hypothetical survey conducted using simple random sampling (srs) without replacement (wor), with the same number of sample elements m as in the actual survey. If theta is a total Y, we calculate

    Vhat_srswor(Yhat_srs) = (1 - f) [Mhat/(m - 1)] sum_{h=1}^{L} sum_{i=1}^{n_h} sum_{j=1}^{m_hi} w_hij (y_hij - Ybar_hat)^2     (8)

where Ybar_hat = Yhat/Mhat. The factor (1 - f) is a finite population correction. If the user sets an fpc, we use f = m/Mhat; if the user does not specify an fpc, f = 0 is used. If theta is a ratio R, we replace y_hij in (8) with d_hij from (6). Note that sum_{h} sum_{i} sum_{j} w_hij d_hij = 0, so that Ybar_hat is replaced with zero.

deft is estimated as (Kish 1995)

    deft = sqrt( Vhat(thetahat) / Vhat_srswr(thetahat_srs) )

where Vhat_srswr(thetahat_srs) is an estimate of the variance for an estimator thetahat_srs obtained from a similar survey conducted using simple random sampling (srs) with replacement (wr). Vhat_srswr(thetahat_srs) is computed using (8) with f = 0.

When we are computing estimates for a subpopulation S and the srssubpop option is not specified (i.e., the default), formula (8) is used with y_Shij = I_{(h,i,j) in S} y_hij in place of y_hij. Note that the sums in (8) are still calculated over all elements in the sample, regardless of whether they belong to the subpopulation. This is because we assume, by default, that the simple random sampling is done across the full population.
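Continuing the small FPC example from earlier, the sketch below applies formula (3) to the ratio residuals d_hij = (x_hij - mean)/Mhat and formula (8) to the same residuals, reproducing both displayed design effects (a sketch of the formulas above, not Stata's code; one element per PSU):

```python
from collections import defaultdict

# The 8-observation example dataset again: (stratum, weight, x);
# N_h = 15 and 12 PSUs exist in strata 1 and 2.
data = [(1, 3, 2.8), (1, 3, 4.1), (1, 3, 6.8), (1, 3, 6.8), (1, 3, 9.2),
        (2, 4, 3.7), (2, 4, 6.6), (2, 4, 4.2)]
N = {1: 15, 2: 12}
m = len(data)
M = sum(w for _, w, _ in data)                     # Mhat = 27
mean = sum(w * x for _, w, x in data) / M          # 5.448148

def v_design(fpc):
    """Formula (3) applied to the residuals d = (x - mean)/M, which
    gives the design-based variance of the estimated mean."""
    z = defaultdict(list)
    for h, w, x in data:
        z[h].append(w * (x - mean) / M)
    V = 0.0
    for h, z_h in z.items():
        n_h = len(z_h)
        zbar = sum(z_h) / n_h
        f_h = n_h / N[h] if fpc else 0.0
        V += (1 - f_h) * n_h / (n_h - 1) * sum((zi - zbar) ** 2 for zi in z_h)
    return V

def v_srs(fpc):
    """Formula (8) applied to the same residuals (with the weighted
    mean of d equal to zero, as noted in the text)."""
    f = m / M if fpc else 0.0
    s = sum(w * ((x - mean) / M) ** 2 for h, w, x in data)
    return (1 - f) * M / (m - 1) * s

deff_nofpc = v_design(False) / v_srs(False)    # matches displayed 1.003906
deff_fpc   = v_design(True)  / v_srs(True)     # matches displayed .9853061
```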
When the srssubpop option is specified, we assume that the simple random sampling is carried out within subpopulation S. In this case, we use (8) with the sums restricted to those elements belonging to the subpopulation; m is replaced with m_S, the number of sample elements from the subpopulation; Mhat is replaced with Mhat_S, the sum of the weights from the subpopulation; and Ybar_hat = Yhat/Mhat is replaced with Ybar_hat_S = Yhat_S/Mhat_S, the weighted mean across the subpopulation.

meff and meft
meff and meft are estimated as

    meff = Vhat(theta-hat) / Vhat_msp(theta-hat_msp)        meft = sqrt(meff)

where Vhat(theta-hat) is the design-based estimate of variance from formula (3) for a parameter theta, and theta-hat_msp and Vhat_msp(theta-hat_msp) are the point estimator and variance estimator based on the incorrect assumption that our observations were obtained through simple random sampling with replacement; in other words, they are the estimators obtained by simply ignoring weights, stratification, and clustering. When theta is a mean Ybar, the estimator and its variance estimate are computed using the standard formulas for an unweighted mean:

    Ybar_msp = (1/m) sum_{h=1}^{L} sum_{i=1}^{n_h} sum_{j=1}^{m_hi} y_hij

    Vhat_msp(Ybar_msp) = ( 1 / (m(m-1)) ) sum_{h=1}^{L} sum_{i=1}^{n_h} sum_{j=1}^{m_hi} (y_hij - Ybar_msp)^2

When theta is a total Y, the estimator is Yhat_msp = Mhat Ybar_msp with variance estimate Vhat_msp(Yhat_msp) = Mhat^2 Vhat_msp(Ybar_msp). When theta is a ratio R = Y/X, Rhat_msp = Yhat_msp/Xhat_msp, and the variance estimator (5), with Vhat_msp(Yhat_msp), etc., is used to compute Vhat_msp(Rhat_msp).
When we compute meff and meft for a subpopulation, we simply restrict our sums to those elements belonging to the subpopulation and use m_S and Mhat_S in place of m and Mhat.
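The misspecification-effect computation for a mean can be sketched in Python. Here v_design stands in for the design-based variance Vhat(theta-hat) from formula (3), which we take as given; the function name and interface are illustrative, not Stata's:

```python
import math
import statistics

def meff_meft(y, v_design):
    """Misspecification effects for a mean, per the formulas above:
    Vmsp is the usual srswr variance of an unweighted mean, and
    meff = V(design)/Vmsp, meft = sqrt(meff). v_design would come
    from a proper design-based (linearized) variance estimator."""
    m = len(y)
    # (1/(m(m-1))) * sum((y - ybar)^2) == sample variance / m
    v_msp = statistics.variance(y) / m
    meff = v_design / v_msp
    return meff, math.sqrt(meff)
```

A design-based variance twice the naive srswr variance gives meff = 2 and meft = sqrt(2), signaling that ignoring the design would understate standard errors.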
Acknowledgments
The svymean, svytotal, svyratio, and svyprop commands were developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University. We thank him for his invaluable assistance.
We thank Wayne Johnson of the National Center for Health Statistics for providing the NMIHS and NHANES II datasets.
References
Cochran, W. G. 1977. Sampling Techniques. 3d ed. New York: John Wiley & Sons.
Eltinge, J. L. and W. M. Sribney. 1996a. Accounting for point-estimation bias in assessment of misspecification effects, confidence-set coverage rates and test sizes. Unpublished manuscript, Department of Statistics, Texas A&M University.
——. 1996b. svy2: Estimation of means, totals, ratios, and proportions for survey data. Stata Technical Bulletin 31: 6–23. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 213–235.
Gonzalez, J. F., Jr., N. Krauss, and C. Scott. 1992. Estimation in the 1988 National Maternal and Infant Health Survey. In Proceedings of the Section on Statistics Education, American Statistical Association, 343–348.
Jang, D. S. and J. L. Eltinge. 1995. Empirical assessment of the stability of variance estimators based on a two-clusters-per-stratum design. Technical Report #225, Department of Statistics, Texas A&M University. Submitted for publication.
Johnson, W. 1995. Variance estimation for the NMIHS. Technical document, National Center for Health Statistics, Hyattsville, MD.
Kish, L. 1965. Survey Sampling. New York: John Wiley & Sons.
——. 1995. Methods for design effects. Journal of Official Statistics 11: 55–77.
Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270–276.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976–1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Satterthwaite, F. E. 1941. Synthesis of variance. Psychometrika 6: 309–316.
——. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110–114.
Scott, A. J. and D. Holt. 1982. The effect of two-stage sampling on ordinary least squares methods. Journal of the American Statistical Association 77: 848–854.
Skinner, C. J. 1989. Introduction to Part A. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and T. M. F. Smith, 23–58. New York: John Wiley & Sons.
Wolter, K. M. 1985. Introduction to Variance Estimation. New York: Springer-Verlag.
Also See
Complementary: [R] svy estimators, [R] svydes, [R] svylc, [R] svyset, [R] svytab, [R] svytest
Related:       [R] ci
Background:    [U] 30 Overview of survey estimation, [R] svy
Title svyset — Set variables for survey data
Syntax
svyset [thing_to_set [varname]] [, clear]
Description
svyset associates a variable with an option or weight for the svy commands. You can set any or all of the following:

thing_to_set   Sets                     Description
strata         strata(varname) option   strata identifier variable varname
psu            psu(varname) option      PSU (cluster) identifier variable varname
fpc            fpc(varname) option      finite population correction variable varname
pweight        [pweight=varname]        sampling (probability) weight variable varname
Settings made by svyset are saved with a dataset. So if a dataset is saved after a thing_to_set has been set, it does not have to be set again. Current settings can be displayed by typing svyset without arguments. Settings can be changed simply by setting a different variable. Settings can be erased using the clear option.
Options
clear clears the current setting of thing_to_set if thing_to_set is specified. If svyset, clear is typed, all settings are cleared.
Remarks

> Example
We use the svyset command to set the stratification variable and the pweight variable.
. svyset strata stratan
. svyset pweight finwgt
Typing svyset alone shows what has been set.
. svyset
pweight is finwgt
strata is stratan
If we save the dataset, these settings will be remembered the next time we use it.
. save nmihs, replace
file nmihs.dta saved
. use nmihs, clear
. svyset
pweight is finwgt
strata is stratan
We can now use the svy commands without having to specify the pweights or the strata() option explicitly.
. svymean birthwgt
Survey mean estimation
pweight: finwgt        Number of obs    = 9946
Strata:  stratan       Number of strata = 6
PSU:                   Number of PSUs   = 9946
                       Population size  = 3895561.7

Mean     | Estimate  Std. Err.  [95% Conf. Interval]   Deff
---------+----------------------------------------------------
birthwgt | 3355.452  6.402741   3342.902    3368.003   1.142614
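The point estimate in output like this is just a probability-weighted mean, Ybar = sum(w*y)/sum(w). A minimal Python sketch (illustrative only; it reproduces the point estimate, not the design-based standard error):

```python
def weighted_mean(y, w):
    """Survey-weighted mean: Ybar = sum(w_i * y_i) / sum(w_i),
    where sum(w_i) estimates the population size Mhat. The standard
    error shown in svymean output additionally needs the linearized,
    design-based variance, which this sketch does not compute."""
    Mhat = sum(w)                        # estimated population size
    return sum(wi * yi for yi, wi in zip(y, w)) / Mhat
```

An observation with weight 3 contributes like three copies of itself, which is exactly how pweights enter the estimate.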
Alternatively, we could have set the pweights and strata when we issued the command.
. svymean birthwgt [pweight=finwgt], strata(stratan)
(output omitted)
. svyset
pweight is finwgt
strata is stratan
Specifying the pweights and the strata() option has exactly the same effect as using svyset. No matter which of these methods is used initially to set pweights and strata, the settings are remembered and do not have to be specified in subsequent use of any of the svy commands.
The settings can be changed by setting another variable.
. svyset pweight adjwgt
Typing svyset shows the settings.
. svyset
pweight is adjwgt
strata is stratan
Settings can be erased using the clear option. Either a single setting can be erased:
. svyset strata, clear
. svyset
pweight is adjwgt
or all settings can be erased:
. svyset, clear
. svyset
no variables have been set
Methods and Formulas
svyset is implemented as an ado-file.
Also See
Complementary: [R] svy estimators, [R] svydes, [R] svylc, [R] svymean, [R] svytab, [R] svytest
Background:    [U] 30 Overview of survey estimation, [R] svy
Title svytab — Tables for survey data
Syntax
svytab varname1 varname2 [weight] [if exp] [in range] [, strata(varname) psu(varname)
    fpc(varname) subpop(varname) srssubpop tab(varname) missing cell count row column
    obs se ci deff deft { proportion | percent } nolabel nomarginals format(%fmt)
    vertical level(#) pearson lr null wald llwald noadjust ]

pweights and iweights are allowed; see [U] 14.1.6 weight.
When any of se, ci, deff, or deft are specified, only one of cell, count, row, or column can be specified. If none of se, ci, deff, or deft are specified, any or all of cell, count, row, and column can be specified.
svytab is implemented as an estimation command; as such, it shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands. In particular, svytab typed without arguments redisplays previous results. Any of the options on the last three lines of the syntax diagram (cell through noadjust) can be specified when redisplaying, with the following exception: wald must be specified at run time.
Warning: Use of if or in restrictions will not produce correct statistics and variance estimates for subpopulations in many cases. To compute estimates for a subpopulation, use the subpop() option.
Description
svytab produces two-way tabulations with tests of independence for complex survey data or for other clustered data.
Despite the long list of options for svytab, it is a simple command to use. Using the svytab command is just like using tabulate to produce two-way tables for ordinary data. The main difference is that svytab will compute a test of independence that is appropriate for a complex survey design or for clustered data.
The test of independence that is displayed by default is based on the usual Pearson χ² statistic for two-way tables. To account for the survey design, the statistic is turned into an F statistic with noninteger degrees of freedom using a second-order Rao and Scott (1981, 1984) correction. Although the theory behind the Rao and Scott correction is complicated, the p-value for the corrected F statistic can be interpreted in the same way as a p-value for the Pearson χ² statistic for "ordinary" data (i.e., data that are assumed independent and identically distributed).
svytab will, in fact, compute four statistics for the test of independence with two variants of each, for a total of eight statistics. The options that give these eight statistics are pearson (the default), lr, null (a toggle for displaying variants of the pearson and lr statistics), wald, llwald, and noadjust (a toggle for displaying variants of the wald and llwald statistics). The options wald and llwald with noadjust yield the statistics developed by Koch et al. (1975), which have been implemented in the CROSSTAB procedure of the SUDAAN software (Shah et al. 1997, Release 7.5).
These eight statistics, along with other variants, have been evaluated in simulations (Sribney 1998). Based on these simulations, we advise researchers to use the default statistic (the pearson option) in all situations. We recommend that the other statistics be used only for comparative or pedagogical purposes. Sribney (1998) gives a detailed comparison of the statistics; a summary of his conclusions is provided later in this entry.
Other than the survey design options (the first row of options in the syntax diagram) and the test statistic options (the last row of options in the diagram), most of the other options of svytab simply relate to different choices for what can be displayed in the body of the table. By default, cell proportions are displayed, but in most circumstances it makes more sense to view either row or column proportions or weighted counts.
Standard errors and confidence intervals can optionally be displayed for weighted counts or cell, row, or column proportions. The confidence intervals are constructed using a logit transform so that their endpoints always lie between 0 and 1. Associated design effects (deff and deft) can be viewed for the variance estimates. The mean generalized deff (Rao and Scott 1984) is also displayed when deff or deft is requested. The mean generalized deff is essentially a design effect for the asymptotic distribution of the test statistic; see the Methods and Formulas section at the end of this entry.
Options
The survey design options strata(), psu(), fpc(), subpop(), and srssubpop are the same as those for the svymean command; see [R] svyset and [R] svymean for a description of these options.
tab(varname) specifies that counts should instead be cell totals of this variable and that proportions (or percentages) should be relative to (i.e., weighted by) this variable. For example, if this variable denotes income, then the cell "counts" are instead totals of income for each cell, and the cell proportions are proportions of income for each cell.
missing specifies that missing values of varname1 and varname2 are to be treated as another row or column category, rather than be omitted from the analysis (the default).
cell requests that cell proportions (or percentages) be displayed. This is the default if none of count, row, or column are specified.
count requests that weighted cell counts be displayed.
row or column requests that row or column proportions (or percentages) be displayed.
obs requests that the number of observations for each cell be displayed.
se requests that the standard errors of cell proportions (the default), weighted counts, or row or column proportions be displayed. When se (or ci, deff, or deft) is specified, only one of cell, count, row, or column can be selected. The standard error computed is the standard error of the one selected.
ci requests confidence intervals for cell proportions, weighted counts, or row or column proportions.
deff or deft requests that the design-effect measure deff or deft be displayed for cell proportions, counts, or row or column proportions. See [R] svymean for details. The mean generalized deff is also displayed when deff or deft is requested; see Methods and Formulas for an explanation.
proportion or percent requests that proportions (the default) or percentages be displayed.
nolabel requests that variable labels and value labels be ignored.
nomarginals requests that row and column marginals not be displayed.
format(%fmt) specifies a format for the items in the table. The default is %6.0g. See [U] 15.5 Formats: controlling how data are displayed.
vertical requests that the endpoints of confidence intervals be stacked vertically on display.
level(#) specifies the confidence level (i.e., nominal coverage rate), in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
pearson requests that the Pearson χ² statistic be computed. By default, this is the test of independence that is displayed. The Pearson χ² statistic is corrected for the survey design using the second-order correction of Rao and Scott (1984) and is converted into an F statistic. One term in the correction formula can be calculated using either observed cell proportions or proportions under the null hypothesis (i.e., the product of the marginals). By default, observed cell proportions are used. If the null option is selected, then a statistic corrected using proportions under the null is displayed as well. See the following discussion for details.
lr requests that the likelihood-ratio test statistic for proportions be computed. Note that this statistic is not defined when there are one or more zero cells in the table. The statistic is corrected for the survey design using exactly the same correction procedure that is used with the pearson statistic. Again, either observed cell proportions or proportions under the null can be used in the correction formula. By default, the former is used; specifying the null option gives both the former and the latter. Neither variant of this statistic is recommended for sparse tables. For nonsparse tables, the lr statistics are very similar to the corresponding pearson statistics.
null modifies the pearson and lr options only. If it is specified, two corrected statistics are displayed. The statistic labeled "D-B (null)" ("D-B" stands for design-based) uses proportions under the null hypothesis (i.e., the product of the marginals) in the Rao and Scott (1984) correction. The statistic labeled merely "Design-based" uses observed cell proportions. If null is not specified, only the correction that uses observed proportions is displayed. See the following discussion for details.
wald requests a Wald test of whether observed weighted counts equal the product of the marginals (Koch et al. 1975). By default, an adjusted F statistic is produced; an unadjusted statistic can be produced by specifying noadjust. The unadjusted F statistic can yield extremely anticonservative p-values (i.e., p-values that are too small) when the degrees of freedom of the variance estimates (the number of sampled PSUs minus the number of strata) are small relative to the (R - 1)(C - 1) degrees of freedom of the table (where R is the number of rows and C is the number of columns). Hence, the statistic produced by wald and noadjust should not be used for inference unless it is essentially identical to the adjusted statistic.
llwald requests a Wald test of the log-linear model of independence (Koch et al. 1975). Note that the statistic is not defined when there are one or more zero cells in the table. The adjusted statistic (the default) can produce anticonservative p-values, especially for sparse tables, when the degrees of freedom of the variance estimates are small relative to the degrees of freedom of the table. Specifying noadjust yields a statistic with more severe problems. Neither the adjusted nor the unadjusted statistic is recommended for inference; the statistics are only made available for comparative and pedagogical purposes.
noadjust modifies the wald and llwald options only. It requests that an unadjusted F statistic be displayed in addition to the adjusted statistic.
Note: svytab uses the tabdisp command (see [P] tabdisp) to produce the table. Only five items can be displayed in the table at one time. If too many items are selected, a warning will appear immediately. To view additional items, redisplay the table while specifying different options.
Remarks
Most of svytab's options deal with choices for the items that can be displayed in the body of the table. We illustrate these options in a series of examples.

> Example
We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981). The strata, psu, and pweight variables are first set using the svyset command rather than specifying them as options to svytab; see [R] svyset for details.
. svyset strata stratid
. svyset psu psuid
. svyset pweight finalwgt
. svytab race diabetes
pweight: finalwgt      Number of obs    = 10349
Strata:  stratid       Number of strata = 31
PSU:     psuid         Number of PSUs   = 62
                       Population size  = 1.171e+08

Race       |  diabetes, 1=yes, 0=no
1=white,   |
2=black,   |
3=other    |       0         1 |  Total
-----------+-------------------+-------
White      |    .851     .0281 |  .8791
Black      |   .0899     .0056 |  .0955
Other      |   .0248   5.2e-04 |  .0253
-----------+-------------------+-------
Total      |   .9658     .0342 |      1

Key: cell proportions
Pearson:
  Uncorrected   chi2(2)       = 21.3483
  Design-based  F(1.52,47.26) = 15.0056    P = 0.0000
The default table displays only cell proportions, and this makes it difficult to compare the incidence of diabetes among the white, black, and "other" racial groups. It would be better to look at row proportions. This can be done by redisplaying the results (i.e., the command is reissued without specifying any variables) with the row option.
. svytab, row
pweight: finalwgt      Number of obs    = 10349
Strata:  stratid       Number of strata = 31
PSU:     psuid         Number of PSUs   = 62
                       Population size  = 1.171e+08

Race       |  diabetes, 1=yes, 0=no
1=white,   |
2=black,   |
3=other    |       0         1 |  Total
-----------+-------------------+-------
White      |    .968      .032 |      1
Black      |    .941      .059 |      1
Other      |   .9797     .0203 |      1
-----------+-------------------+-------
Total      |   .9658     .0342 |      1

Key: row proportions
Pearson:
  Uncorrected   chi2(2)       = 21.3483
  Design-based  F(1.52,47.26) = 15.0056    P = 0.0000
This table is much easier to interpret. A larger proportion of blacks have diabetes than do whites or persons in the "other" racial category. Note that the test of independence for a two-way contingency table is equivalent to the test of homogeneity of row (or column) proportions. Hence, we can conclude that there is a highly significant difference between the incidence of diabetes among the three racial groups.
We may now wish to compute confidence intervals for the row proportions. If we try to redisplay specifying ci along with row, we get the following result:
. svytab, row ci
confidence intervals are only available for cells
to compute row confidence intervals, rerun command with row and ci options
r(111);
There are limits to what svytab can redisplay. Basically, any of the options relating to variance estimation (i.e., se, ci, deff, and deft) must be specified at run time along with the single item (i.e., count, cell, row, or column) for which one wants standard errors, confidence intervals, deff, or deft. So to get confidence intervals for row proportions, one must rerun the command. We do so below, requesting not only ci but also se.
. svytab race diabetes, row se ci format(%7.4f)
pweight: finalwgt      Number of obs    = 10349
Strata:  stratid       Number of strata = 31
PSU:     psuid         Number of PSUs   = 62
                       Population size  = 1.171e+08

Race       |        diabetes, 1=yes, 0=no
1=white,   |
2=black,   |
3=other    |               0                1 |  Total
-----------+----------------------------------+-------
White      |          0.9680           0.0320 | 1.0000
           |        (0.0020)         (0.0020) |
           | [0.9636,0.9718]  [0.0282,0.0364] |
Black      |          0.9410           0.0590 | 1.0000
           |        (0.0061)         (0.0061) |
           | [0.9271,0.9523]  [0.0477,0.0729] |
Other      |          0.9797           0.0203 | 1.0000
           |        (0.0076)         (0.0076) |
           | [0.9566,0.9906]  [0.0094,0.0434] |
-----------+----------------------------------+-------
Total      |          0.9658           0.0342 | 1.0000
           |        (0.0018)         (0.0018) |
           | [0.9619,0.9693]  [0.0307,0.0381] |

Key: row proportions
     (standard errors of row proportions)
     [95% confidence intervals for row proportions]
Pearson:
  Uncorrected   chi2(2)       = 21.3483
  Design-based  F(1.52,47.26) = 15.0056    P = 0.0000
In the above table, we specified a %7.4f format rather than using the default %6.0g format. Note that the single format applies to every item in the table. If you do not want to see the marginal totals, you can omit them by specifying nomarginals. If the above style for displaying the confidence intervals is obtrusive (and it can be in a wider table), you can use the vertical option to stack the endpoints of the confidence interval one over the other and omit the brackets (the parentheses around the standard errors are also omitted when vertical is specified). If you would rather have results expressed as percentages, like the tabulate command (see [R] tabulate), use the percent option. If you want to play around with these display options until you get a table that you are satisfied with, first try making changes to the options on redisplay (i.e., omit the cross-tabulated variables when you issue the command). This will be much faster if you have a large dataset.
Technical Note
The standard errors computed by svytab are exactly the same as those produced by the other svy commands that can compute proportions: namely, svymean, svyratio, and svyprop. Indeed, svytab calls the same driver program that svymean and svyratio use. Continuing with the previous example, we note that the estimate of the proportion of African-Americans with diabetes (the second proportion in the second row of the preceding table) is simply a ratio estimate; hence, we can also obtain the same estimates using svyratio:
. gen black = (race==2) if race~=.
. gen diablk = diabetes*black
(2 missing values generated)
. svyratio diablk/black
Survey ratio estimation
pweight: finalwgt      Number of obs    = 10349
Strata:  stratid       Number of strata = 31
PSU:     psuid         Number of PSUs   = 62
                       Population size  = 1.171e+08

Ratio        | Estimate  Std. Err.  [95% Conf. Interval]   Deff
-------------+--------------------------------------------------
diablk/black | .0590349  .0061443   .0465035    .0715662   .6718049
Although the standard errors are exactly the same (which they must be, since the same driver program computes both), the confidence intervals are slightly different. The svytab command produced the confidence interval [0.0477, 0.0729], and svyratio gave [0.0465, 0.0716]. The difference is due to the fact that svytab uses a logit transform to produce confidence intervals whose endpoints are always between 0 and 1. This transformation also shifts the confidence intervals slightly toward the null (i.e., 0.5), which is beneficial since the untransformed confidence intervals tend to be, on average, biased away from the null. See the Methods and Formulas section at the end of this entry for details.
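The logit-transform interval can be sketched as follows. This is a sketch of the type of transform described, not necessarily svytab's exact implementation: we assume the half-width on the logit scale is t*se/(p(1-p)) by the delta method, and the t multiplier (here about 2.0395, the 97.5th percentile of a t distribution with 31 degrees of freedom) is passed in rather than computed:

```python
import math

def logit_ci(p, se, t):
    """Confidence interval for a proportion via a logit transform;
    the endpoints always lie strictly between 0 and 1. Assumes the
    delta-method half-width t*se/(p*(1-p)) on the logit scale."""
    logit = math.log(p / (1.0 - p))
    half = t * se / (p * (1.0 - p))
    expit = lambda x: 1.0 / (1.0 + math.exp(-x))
    return expit(logit - half), expit(logit + half)
```

Applied to the estimates above (p = .0590349, se = .0061443, t = 2.0395), this gives approximately [0.0477, 0.0729], matching the svytab interval to rounding.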
The tab() option
The tab() option allows one to compute proportions relative to a certain variable. For example, suppose one wishes to compare the proportion of total income among different racial groups in males with that of females. We do so below with fictitious data:
. svytab gender race, tab(income) row
pweight: samwgt       Number of obs    = 5479
Strata:  samstr       Number of strata = 4
PSU:     sampsu       Number of PSUs   = 76
                      Population size  = 1986683

Gender  |            Race
        |  White   Black   Other |  Total
--------+------------------------+-------
Male    |  .8857   .0875   .0268 |      1
Female  |  .884    .094    .022  |      1
--------+------------------------+-------
Total   |  .8848   .0909   .0243 |      1

Tabulated variable: income
Key: row proportions
Pearson:
  Uncorrected   chi2(2)       = 3.6241
  Design-based  F(1.91,59.12) = 0.8626    P = 0.4227
The Rao and Scott correction
svytab can produce eight different statistics for the test of independence. By default, svytab displays the Pearson χ² statistic with the Rao and Scott (1981, 1984) second-order correction. Based on simulations (Sribney 1998), we recommend that researchers use this statistic in all situations. The statistical literature, however, contains several alternatives, along with other possibilities for the implementation of the Rao and Scott correction. Hence, for comparative or pedagogical purposes, one may want to view some of the other statistics computed by svytab. We briefly describe the differences among these statistics; for a more detailed discussion, see Sribney (1998).
Two statistics commonly used for independent, identically distributed (iid) data for the test of independence of R x C tables (R rows and C columns) are the Pearson χ² statistic

    X²_P = n sum_{r=1}^{R} sum_{c=1}^{C} (p_rc - p0_rc)² / p0_rc

and the likelihood-ratio χ² statistic

    X²_LR = 2n sum_{r=1}^{R} sum_{c=1}^{C} p_rc ln(p_rc / p0_rc)

where n is the total number of observations, p_rc is the estimated proportion for the cell in the rth row and cth column of the table, and p0_rc is the estimated proportion under the null hypothesis of independence; i.e., p0_rc = p_r. p.c, the product of the row and column marginals: p_r. = sum_{c=1}^{C} p_rc and p.c = sum_{r=1}^{R} p_rc.
For iid data, both of these statistics are distributed asymptotically as χ² with (R-1)(C-1) degrees of freedom. Note that the likelihood-ratio statistic is not defined when one or more of the cells in the table are empty. The Pearson statistic, however, can be calculated when one or more cells in the table are empty; the statistic may not have good properties in this case, but it still has a computable value.
For survey data, X²_P and X²_LR can be computed using weighted estimates of p_rc and p0_rc. However, for a complex sampling design, one can no longer claim that they are distributed as χ² with (R-1)(C-1) degrees of freedom, but one can estimate the variance of p_rc under the sampling design. For instance, in Stata, this variance can be estimated using linearization methods with svymean or svyratio.
Rao and Scott (1981, 1984) derived the asymptotic distribution of X²_P and X²_LR in terms of the variance of p_rc. Unfortunately, the result (see equation (1) in Methods and Formulas) is not computationally feasible, but it can be approximated using correction formulas. svytab uses the second-order correction developed by Rao and Scott (1984). By default, or when the pearson option is specified, svytab displays the second-order correction of the Pearson statistic. The lr option gives the second-order correction of the likelihood-ratio statistic. Because it is the default of svytab, we refer to the correction computed with p_rc as the default correction.
The Rao and Scott papers, however, left some details outstanding about the computation of the correction. One term in the correction formula can be computed using either p_rc or p0_rc. Since, under the null hypothesis, both are asymptotically equivalent, theory offers no guidance about which is best. By default, svytab uses p_rc for the corrections of the Pearson and likelihood-ratio statistics. If the null option is specified, the correction is computed using p0_rc. For nonsparse tables, these two correction methods yield almost identical results. However, in simulations of sparse tables, Sribney (1998) found that the null-corrected statistics were extremely anticonservative for 2 x 2 tables (i.e., under the null, "significance" was declared too often) and too conservative for other tables. The default correction, however, had better properties. Hence, we do not recommend use of the null option.
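The uncorrected X²_P and X²_LR statistics can be computed directly from a table of counts. A Python sketch (function name ours; no survey correction is applied, so this corresponds to the "Uncorrected" lines in svytab output only for unweighted iid data):

```python
import math

def pearson_lr_uncorrected(counts):
    """Uncorrected Pearson and likelihood-ratio chi-squared statistics
    for an R x C table of counts, following the formulas above, with
    p0_rc the product of the row and column marginals."""
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]
    prow = [sum(row) for row in p]
    pcol = [sum(col) for col in zip(*p)]
    xp = xlr = 0.0
    for r, row in enumerate(p):
        for c, prc in enumerate(row):
            p0 = prow[r] * pcol[c]
            xp += n * (prc - p0) ** 2 / p0
            if prc > 0:          # LR term is undefined for empty cells
                xlr += 2 * n * prc * math.log(prc / p0)
    return xp, xlr
```

For a perfectly independent table both statistics are zero; for nonsparse tables the two statistics are close to each other, as the text notes.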
For the computational details of the Rao and Scott corrected statistics, see the Methods and Formulas section at the end of this entry.
Wald statistics
Prior to the work by Rao and Scott (1981, 1984), Wald tests for the test of independence for two-way tables were developed by Koch et al. (1975). Two Wald statistics have been proposed. The first is similar to the Pearson statistic in that it is based on

    Y_rc = N_rc - N_r. N.c / N..

where N_rc is the estimated weighted count for the r,cth cell. The delta method can be used to approximate the variance of Y_rc, and a Wald statistic can be calculated in the usual manner. A second Wald statistic can be constructed based on a log-linear model for the table. Like the likelihood-ratio statistic, this statistic is undefined when there is a zero proportion in the table.
These Wald statistics are initially χ² statistics, but they have better properties when converted into F statistics with denominator degrees of freedom that account for the degrees of freedom of the variance estimator. They can be converted to F statistics in one of two ways.
One method is the standard manner: divide by the χ² degrees of freedom d0 = (R - 1)(C - 1) to get an F statistic with d0 numerator degrees of freedom and v = n_PSU - L denominator degrees of freedom. This is the form of the F statistic suggested by Koch et al. (1975) and implemented in the CROSSTAB procedure of the SUDAAN software (Shah et al. 1997, Release 7.5), and it is the method used by svytab when the noadjust option is specified along with wald or llwald.
Another technique is to adjust the F statistic by using

    F_adj = (v - d0 + 1) W / (v d0)    with    F_adj ~ F(d0, v - d0 + 1)

This is the default adjustment for svytab. Note that svytest and the other svy estimation commands produce adjusted F statistics by default, using exactly the same adjustment procedure. See Korn and Graubard (1990) for a justification of the procedure. The adjusted F statistic is identical to the unadjusted F statistic when d0 = 1; that is, for 2 x 2 tables.
As Thomas and Rao (1987) point out (also see Korn and Graubard 1990), the unadjusted F statistics can become extremely anticonservative as d0 increases when v is small or moderate; i.e., under the null, the statistics are "significant" far more often than they should be. Because the unadjusted statistics behave so poorly for larger tables when v is not large, their use can only be justified for small tables or when v is large. But when the table is small or when v is large, the unadjusted statistic is essentially identical to the adjusted statistic. Hence, for the purpose of statistical inference, there is no point in looking at the unadjusted statistics.
The adjusted "Pearson" Wald F statistic behaves reasonably under the null in most cases. However, even the adjusted F statistic for the log-linear Wald test tends to be moderately anticonservative when v is not large (Thomas and Rao 1987, Sribney 1998).
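Both χ²-to-F conversions can be sketched in Python (function name and interface illustrative):

```python
def wald_to_F(W, R, C, n_psu, L):
    """Convert a Wald chi-squared statistic W for an R x C table into
    the unadjusted and adjusted F statistics described above:
    d0 = (R-1)(C-1), v = n_psu - L,
    F_unadj = W/d0         ~ F(d0, v)
    F_adj   = (v-d0+1)*W/(v*d0) ~ F(d0, v-d0+1)."""
    d0 = (R - 1) * (C - 1)
    v = n_psu - L
    return W / d0, (v - d0 + 1) * W / (v * d0)
```

For a 2 x 2 table (d0 = 1) the two forms coincide, as the text states; for larger tables with modest v, the adjusted statistic is noticeably smaller than the unadjusted one.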
> Example
With the NHANES II data, we tabulate, for the male subpopulation, high blood pressure (highbp) versus a variable (sizplace) that indicates the degree of urban/ruralness. We request that all eight statistics for the test of independence be displayed.
. gen male = (sex==1) if sex~=.
. svytab highbp sizplace, subpop(male) col obs pearson lr null wald llwald noadj
pweight: finalwgt      Number of obs      = 10351
Strata:  stratid       Number of strata   = 31
PSU:     psuid         Number of PSUs     = 62
Subpop.: male==1       Population size    = 1.172e+08
                       Subpop. no. of obs = 4915
                       Subpop. size       = 56159480

1 if BP >  |
140/90, 0  |                  1=urban, ..., 8=rural
otherwise  |     1      2      3      4      5      6      7      8 |  Total
-----------+--------------------------------------------------------+-------
0          | .8489  .8929  .9213  .8509  .8413  .9242  .8707  .8674 |  .8764
           |   558    371    431    527    186    210    314   1619 |   4216
1          | .1511  .1071  .0787  .1491  .1587  .0758  .1293  .1326 |  .1236
           |    95     80     64     74     36     20     57    273 |    699
-----------+--------------------------------------------------------+-------
Total      |     1      1      1      1      1      1      1      1 |      1
           |   653    451    495    601    222    230    371   1892 |   4915

Key: column proportions
     number of observations
Pearson:
  Uncorrected   chi2(7)        = 64.4581
  D-B (null)    F(5.30,164.45) =  2.2078    P = 0.0522
  Design-based  F(5.54,171.87) =  2.6863    P = 0.0189
Likelihood ratio:
  Uncorrected   chi2(7)        = 68.2365
  D-B (null)    F(5.30,164.45) =  2.3372    P = 0.0408
  Design-based  F(5.54,171.87) =  2.8417    P = 0.0138
Wald (Pearson):
  Unadjusted    chi2(7)        = 21.2704
  Unadjusted    F(7,31)        =  3.0386    P = 0.0149
  Adjusted      F(7,25)        =  2.4505    P = 0.0465
Wald (log-linear):
  Unadjusted    chi2(7)        = 25.7644
  Unadjusted    F(7,31)        =  3.6806    P = 0.0052
  Adjusted      F(7,25)        =  2.9683    P = 0.0208
The p-values from the null-corrected Pearson and likelihood-ratio statistics (lines labeled "D-B (null)"; "D-B" stands for "design-based") are bigger than the corresponding default-corrected statistics (lines labeled "Design-based"). Simulations (Sribney 1998) show that the null-corrected statistics are overly conservative for many sparse tables (except 2 x 2 tables); this appears to be the case here, although this table is hardly sparse. The default-corrected Pearson statistic has good properties under the null for both sparse and nonsparse tables; hence, the smaller p-value for it should be considered reliable.

The default-corrected likelihood-ratio statistic is usually similar to the default-corrected Pearson statistic, except for very sparse tables, when it tends to be anticonservative. This example follows this pattern, with its p-value being slightly smaller than that of the default-corrected Pearson statistic.

For tables of these dimensions (2 x 8), the unadjusted "Pearson" Wald and log-linear Wald statistics are extremely anticonservative under the null when the variance degrees of freedom are small. Here the variance degrees of freedom are only 31 (62 PSUs minus 31 strata), so it is expected that the unadjusted Wald F statistics yield smaller p-values than the adjusted F statistics. Because of their poor behavior under the null for small variance degrees of freedom, they cannot be trusted in this case.
svytab — Tables for survey data
Methods and Formulas

We assume here that readers are familiar with the Methods and Formulas section of the [R] svymean entry.

For a table of R rows by C columns with cells indexed by r, c, let

    y_{(rc)hij} = \begin{cases} 1 & \text{if the } hij\text{th element of the data is in the } r,c\text{th cell} \\ 0 & \text{otherwise} \end{cases}

where h = 1, ..., L indexes strata; i = 1, ..., n_h indexes PSUs; and j = 1, ..., m_{hi} indexes elements in the PSU. Weighted cell counts (the count option) are

    \hat{N}_{rc} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} \, y_{(rc)hij}

where w_{hij} are the weights, if any. If a variable x_{hij} is specified with the tab() option, \hat{N}_{rc} becomes

    \hat{N}_{rc} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} \, x_{hij} \, y_{(rc)hij}
Let

    \hat{N}_{r.} = \sum_{c=1}^{C} \hat{N}_{rc}, \qquad \hat{N}_{.c} = \sum_{r=1}^{R} \hat{N}_{rc}, \qquad \text{and} \qquad \hat{N}_{..} = \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{N}_{rc}
Estimated cell proportions are \hat{p}_{rc} = \hat{N}_{rc}/\hat{N}_{..}. Estimated row proportions (row option) are \hat{p}^{row}_{rc} = \hat{N}_{rc}/\hat{N}_{r.}. Estimated column proportions (column option) are \hat{p}^{col}_{rc} = \hat{N}_{rc}/\hat{N}_{.c}. Estimated row marginals are \hat{p}_{r.} = \hat{N}_{r.}/\hat{N}_{..}. Estimated column marginals are \hat{p}_{.c} = \hat{N}_{.c}/\hat{N}_{..}.

\hat{N}_{rc} is a total, and the proportion estimators are ratios, so their variances can be estimated using linearization methods as outlined in [R] svymean. svytab computes the variance estimates using the same driver program that svymean, svyratio, and svytotal use. Hence, svytab produces exactly the same standard errors as these commands would.

Confidence intervals for proportions are calculated using a logit transform so that the endpoints lie between 0 and 1. Let \hat{p} be an estimated proportion and \hat{s} an estimate of its standard error. Let f(p) = \ln\{p/(1-p)\} be the logit transform of the proportion. In this metric, an estimate of the standard error is f'(\hat{p})\hat{s} = \hat{s}/\{\hat{p}(1-\hat{p})\}. Thus, a 100(1-\alpha)% confidence interval in this metric is \ln\{\hat{p}/(1-\hat{p})\} \pm t_{1-\alpha/2,\nu}\,\hat{s}/\{\hat{p}(1-\hat{p})\}, where t_{1-\alpha/2,\nu} is the (1-\alpha/2)th quantile of Student's t distribution with \nu degrees of freedom. The endpoints of this confidence interval are transformed back to the proportion metric using f^{-1}(y) = \exp(y)/\{1+\exp(y)\}. Hence, the displayed confidence intervals for proportions are

    f^{-1}\left[ \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) \pm \frac{t_{1-\alpha/2,\nu}\,\hat{s}}{\hat{p}(1-\hat{p})} \right]
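The logit-transform interval can be sketched in a few lines of Python (an illustration only; the proportion, standard error, and degrees of freedom below are made up):

```python
# Logit-transform confidence interval for a proportion, as described above.
import math
from scipy import stats

def logit_ci(p, s, df, alpha=0.05):
    t = stats.t.ppf(1 - alpha / 2, df)       # t quantile with design df
    center = math.log(p / (1 - p))           # f(p) = ln{p/(1-p)}
    halfwidth = t * s / (p * (1 - p))        # SE of f(p) is s/{p(1-p)}
    finv = lambda y: math.exp(y) / (1 + math.exp(y))   # back-transform
    return finv(center - halfwidth), finv(center + halfwidth)

lo, hi = logit_ci(p=0.15, s=0.02, df=31)
```

Unlike the untransformed interval p +/- t*s, these endpoints always lie strictly between 0 and 1, even for proportions near the boundary.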
Confidence intervals for weighted counts are untransformed and are identical to the intervals produced by svytotal.

The uncorrected Pearson \chi^2 statistic is

    X^2_P = n \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(\hat{p}_{rc} - \hat{p}_{0rc})^2}{\hat{p}_{0rc}}
and the uncorrected likelihood-ratio \chi^2 statistic is

    X^2_{LR} = 2n \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{p}_{rc} \, \ln\!\left(\frac{\hat{p}_{rc}}{\hat{p}_{0rc}}\right)

where n is the total number of observations, \hat{p}_{rc} is the estimated proportion for the cell in the rth row and cth column of the table as defined earlier, and \hat{p}_{0rc} is the estimated proportion under the null hypothesis of independence; i.e., \hat{p}_{0rc} = \hat{p}_{r.}\hat{p}_{.c}, the product of the row and column marginals.

Rao and Scott (1981, 1984) showed that, asymptotically, X^2_P and X^2_{LR} are distributed as
    X^2 \sim \sum_{k=1}^{(R-1)(C-1)} \delta_k W_k        (1)

where the W_k are independent \chi^2_1 variables and the \delta_k are the eigenvalues of

    \Delta = (\tilde{X}_2' V_{srs} \tilde{X}_2)^{-1} (\tilde{X}_2' V \tilde{X}_2)        (2)
where V is the variance of the \hat{p}_{rc} under the survey design and V_{srs} is the variance of the \hat{p}_{rc} that one would have if the design were simple random sampling; namely, V_{srs} has diagonal elements p_{rc}(1-p_{rc})/n and off-diagonal elements -p_{rc}p_{st}/n.

\tilde{X}_2 is calculated as follows. Rao and Scott do their development in a log-linear modeling context, so consider [1 | X_1 | X_2] as predictors for the cell counts of the R x C table in a log-linear model. The X_1 matrix of dimension RC x (R+C-2) contains the R-1 "main effects" for the rows and the C-1 "main effects" for the columns. The X_2 matrix of dimension RC x (R-1)(C-1) contains the row and column "interactions". Hence, fitting [1 X_1 X_2] gives the fully saturated model (i.e., fits the observed values perfectly) and [1 X_1] gives the independence model. The \tilde{X}_2 matrix is the projection of X_2 onto the orthogonal complement of the space spanned by the columns of X_1, where the orthogonality is defined with respect to V_{srs}; i.e., \tilde{X}_2' V_{srs} X_1 = 0.

See Rao and Scott (1984) for the proof justifying equations (1) and (2). However, even without a full understanding, one can get a feeling for \Delta. It is like a "ratio" (although remember that it is a matrix) of two variances. The variance in the "numerator" involves the variance under the true survey design, and the variance in the "denominator" involves the variance assuming that the design was simple random sampling. Recall that the design effect deff for the estimated proportions is defined as deff = V(\hat{p}_{rc})/V_{srs}(\hat{p}_{rc}); see the Methods and Formulas section of [R] svymean. Hence, \Delta can be regarded as a design-effects matrix, and Rao and Scott call its eigenvalues, the \delta_k's, the "generalized design effects".
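For intuition, the uncorrected X^2_P and X^2_{LR} defined earlier can be computed directly with numpy from a plain table of counts. This only illustrates the formulas; svytab of course works from weighted survey data, not raw counts:

```python
# Uncorrected Pearson and likelihood-ratio chi-squared statistics, as
# defined above, from a table of counts (illustration only).
import numpy as np

def uncorrected_stats(counts):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n                                  # observed proportions
    p0 = np.outer(p.sum(axis=1), p.sum(axis=0))     # p0_rc = p_r. * p_.c
    X2_P = n * ((p - p0) ** 2 / p0).sum()           # Pearson
    X2_LR = 2 * n * (p * np.log(p / p0)).sum()      # likelihood ratio
    return X2_P, X2_LR

X2_P, X2_LR = uncorrected_stats([[20, 30], [40, 10]])
```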
It is easy to compute an estimate for \Delta using estimates for V and V_{srs}. Rao and Scott (1984) derive a simpler formula for \hat{\Delta}:

    \hat{\Delta} = (C' D_{\hat{p}}^{-1} \hat{V}_{srs} D_{\hat{p}}^{-1} C)^{-1} (C' D_{\hat{p}}^{-1} \hat{V} D_{\hat{p}}^{-1} C)

Here C is a contrast matrix that is any RC x (R-1)(C-1) full-rank matrix that is orthogonal to [1 X_1]; i.e., C'1 = 0 and C'X_1 = 0. D_{\hat{p}} is a diagonal matrix with the estimated proportions \hat{p}_{rc} on the diagonal. Note that when one of the \hat{p}_{rc} is zero, the corresponding variance estimate is also zero; hence, the corresponding element of D_{\hat{p}}^{-1} is immaterial for the computation of \hat{\Delta}.
Unfortunately, equation (1) is not practical for the computation of a p-value. However, one can compute simple first-order and second-order corrections based on it. A first-order correction is based on downweighting the iid statistics by the average eigenvalue of \hat{\Delta}; namely, one computes

    X^2_P(\hat{\delta}_.) = X^2_P / \hat{\delta}_.   \qquad \text{and} \qquad   X^2_{LR}(\hat{\delta}_.) = X^2_{LR} / \hat{\delta}_.

where \hat{\delta}_. is the mean generalized deff

    \hat{\delta}_. = \frac{1}{(R-1)(C-1)} \sum_{k=1}^{(R-1)(C-1)} \hat{\delta}_k
These corrected statistics are asymptotically distributed as \chi^2_{(R-1)(C-1)}. Thus, to first order, one can view the iid statistics X^2_P and X^2_{LR} as being "too big" by a factor of \hat{\delta}_. for the true survey design.

A better second-order correction can be obtained by using the Satterthwaite approximation to the distribution of a weighted sum of \chi^2_1 variables. Here the Pearson statistic becomes

    X^2_P(\hat{\delta}_., \hat{a}) = \frac{X^2_P}{\hat{\delta}_. (\hat{a}^2 + 1)}        (3)

where \hat{a} is the coefficient of variation of the eigenvalues:

    \hat{a}^2 = \frac{\sum \hat{\delta}_k^2}{(R-1)(C-1)\,\hat{\delta}_.^2} - 1

Since \sum \hat{\delta}_k = \mathrm{tr}\,\hat{\Delta} and \sum \hat{\delta}_k^2 = \mathrm{tr}\,\hat{\Delta}^2, equation (3) can be written in an easily computable form as

    X^2_P(\hat{\delta}_., \hat{a}) = \frac{\mathrm{tr}\,\hat{\Delta}}{\mathrm{tr}\,\hat{\Delta}^2} \, X^2_P

These corrected statistics are asymptotically distributed as \chi^2 with

    d = \frac{(R-1)(C-1)}{\hat{a}^2 + 1} = \frac{(\mathrm{tr}\,\hat{\Delta})^2}{\mathrm{tr}\,\hat{\Delta}^2}
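Given the eigenvalues \hat{\delta}_k, the first- and second-order corrections are a few lines of numpy. The eigenvalues below are hypothetical, chosen only to exercise the formulas:

```python
# Rao-Scott first- and second-order corrections, as derived above, from a
# chi-squared statistic X2 and the generalized design effects (eigenvalues
# of Delta-hat). Illustration only.
import numpy as np

def rao_scott(X2, deltas):
    deltas = np.asarray(deltas, dtype=float)
    d_bar = deltas.mean()                     # mean generalized deff
    first = X2 / d_bar                        # first-order correction
    tr1 = deltas.sum()                        # tr(Delta)
    tr2 = (deltas ** 2).sum()                 # tr(Delta^2)
    second = X2 * tr1 / tr2                   # equation (3), trace form
    df = tr1 ** 2 / tr2                       # noninteger df in general
    return first, second, df

f_eq, s_eq, df_eq = rao_scott(10.0, [2.0, 2.0, 2.0])   # equal eigenvalues
f_ne, s_ne, df_ne = rao_scott(10.0, [1.0, 2.0, 3.0])   # unequal eigenvalues
```

With equal eigenvalues (\hat{a} = 0) the two corrections coincide and the degrees of freedom equal the full (R-1)(C-1); dispersion among the eigenvalues shrinks both the second-order statistic and its degrees of freedom.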
i.e., a \chi^2 with, in general, noninteger degrees of freedom. The likelihood-ratio X^2_{LR} statistic can also be given this second-order correction in an identical manner.

We would be done if it were not for two outstanding issues. First, there are two possible ways to compute the variance estimate \hat{V}_{srs}, which is used in the computation of \hat{\Delta}. V_{srs} has diagonal elements p_{rc}(1-p_{rc})/n and off-diagonal elements -p_{rc}p_{st}/n; but note that here p_{rc} is the true, not estimated, proportion. Hence, the question is what to use to estimate p_{rc}: the observed proportions \hat{p}_{rc} or the proportions estimated under the null hypothesis of independence, \hat{p}_{0rc} = \hat{p}_{r.}\hat{p}_{.c}? Rao and Scott (1984, 53) leave this as an open question. Because of the question of whether to use \hat{p}_{rc} or \hat{p}_{0rc} to compute \hat{V}_{srs}, svytab can compute both corrections. By default, when the null option is not specified, only the correction based on \hat{p}_{rc} is displayed. If null is specified, two corrected statistics and corresponding p-values are displayed, one computed using \hat{p}_{rc} and the other using \hat{p}_{0rc}.
The second outstanding issue concerns the degrees of freedom resulting from the variance estimate \hat{V} of the cell proportions under the survey design. The customary degrees of freedom for t statistics resulting from this variance estimate are \nu = n_{PSU} - L, where n_{PSU} is the total number of PSUs in the sample and L is the total number of strata.

Rao and Thomas (1989) suggest turning the corrected \chi^2 statistic into an F statistic by dividing it by its degrees of freedom, d_0 = (R-1)(C-1). The F statistic is then taken to have numerator degrees of freedom equal to d_0 and denominator degrees of freedom equal to \nu d_0. Hence, the corrected Pearson F statistic is

    F_P = \frac{X^2_P(\hat{\delta}_., \hat{a})}{d}   \quad \text{with} \quad   F_P \sim F(d, \nu d),   \quad \text{where} \quad   d = \frac{(\mathrm{tr}\,\hat{\Delta})^2}{\mathrm{tr}\,\hat{\Delta}^2}   \quad \text{and} \quad   \nu = n_{PSU} - L        (4)

This is the corrected statistic that svytab displays by default or when the pearson option is specified. When the lr option is specified, an identical correction is produced for the likelihood-ratio statistic X^2_{LR}. When null is specified, equation (4) is also used. For the statistic labeled "D-B (null)", \hat{\Delta} is computed using \hat{p}_{0rc}. For the statistic labeled "Design-based", \hat{\Delta} is computed using \hat{p}_{rc}.

The Wald statistics computed by svytab with the wald and llwald options were developed by Koch et al. (1975). The statistic given by the wald option is similar to the Pearson statistic since it is based on

    \hat{Y}_{rc} = \hat{N}_{rc} - \hat{N}_{r.}\hat{N}_{.c}/\hat{N}_{..}

where r = 1, ..., R-1 and c = 1, ..., C-1. The delta method can be used to estimate the variance of \hat{Y} (which is \hat{Y}_{rc} stacked into a vector), and a Wald statistic can be constructed in the usual manner:

    W = \hat{Y}'\{J_N \hat{V}(\hat{N}) J_N'\}^{-1} \hat{Y}   \quad \text{where} \quad   J_N = \partial \hat{Y}/\partial \hat{N}'

The statistic given by the llwald option is based on the log-linear model with predictors [1 X_1 X_2] that was mentioned earlier. This Wald statistic is

    W_{LL} = (\tilde{X}_2' \ln \hat{p})' \{\tilde{X}_2' J_{\hat{p}} \hat{V} J_{\hat{p}}' \tilde{X}_2\}^{-1} (\tilde{X}_2' \ln \hat{p})

where J_{\hat{p}} is the matrix of first derivatives of \ln \hat{p} with respect to \hat{p}, which is, of course, just a diagonal matrix with \hat{p}_{rc}^{-1} on the diagonal. Note that this log-linear Wald statistic is undefined when there is a zero cell in the table.

Unadjusted F statistics (noadjust option) are produced using

    F_{unadj} = W/d_0
with F_{unadj} \sim F(d_0, \nu). Adjusted F statistics are produced using

    F_{adj} = (\nu - d_0 + 1)\,W/(\nu d_0)   \quad \text{with} \quad   F_{adj} \sim F(d_0, \nu - d_0 + 1)
The other svy estimators also use this adjustment procedure for F statistics. See Korn and Graubard (1990) for a justification of the procedure.
References

Fuller, W. A., W. Kennedy, D. Schnell, G. Sullivan, and H. J. Park. 1986. PC CARP. Ames, IA: Statistical Laboratory, Iowa State University.

Koch, G. G., D. H. Freeman, Jr., and J. L. Freeman. 1975. Strategies in the multivariate analysis of data from complex surveys. International Statistical Review 43: 59-78.

Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270-276.

McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.

Rao, J. N. K. and A. J. Scott. 1981. The analysis of categorical data from complex sample surveys: chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American Statistical Association 76: 221-230.

——. 1984. On chi-squared tests for multiway contingency tables with cell proportions estimated from survey data. Annals of Statistics 12: 46-60.

Rao, J. N. K. and D. R. Thomas. 1989. Chi-squared tests for contingency tables. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and T. M. F. Smith, Ch. 4, 89-114. New York: John Wiley & Sons.

Shah, B. V., B. G. Barnwell, and G. S. Bieler. 1997. SUDAAN User's Manual, Release 7.5. Research Triangle Park, NC: Research Triangle Institute.

Sribney, W. M. 1998. svy7: Two-way contingency tables for survey or clustered data. Stata Technical Bulletin 45: 33-49. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 297-322.

Thomas, D. R. and J. N. K. Rao. 1987. Small-sample comparisons of level and power for simple goodness-of-fit statistics under cluster sampling. Journal of the American Statistical Association 82: 630-636.
Also See

Complementary:  [R] svydes, [R] svymean, [R] svyset
Related:        [R] svy estimators, [R] svytest, [R] tabulate
Background:     [U] 30 Overview of survey estimation,
                [R] svy
Title

svytest — Test linear hypotheses after survey estimation
Syntax

The command svytest can be used with three different syntaxes:

(1)  svytest exp = exp [, noadjust accumulate notest ]

(2)  svytest coefficientlist [, noadjust accumulate notest ]

(3)  svytest [varlist | coefficientlist] , bonferroni

In the above, exp is a linear expression that is valid for the test command; exp = exp is a linear equation that is valid for the test command; and coefficientlist is a valid coefficient list for the test command; see [R] test.
Description

svytest tests multidimensional linear hypotheses after a svy estimation command. See [R] svymean and [R] svy estimators for an introduction to the svy estimation commands.

In addition to computing point estimates for linear combinations, svylc computes t statistics and p-values; see [R] svylc. Thus, svylc can be used for testing one-dimensional hypotheses; it gives the same results as svytest.

Syntax (1) for svytest allows you to build up a multidimensional hypothesis consisting of any number of linear equations. Syntax (2) tests hypotheses of the form β_1 = 0, β_2 = 0, β_3 = 0, etc. Syntax (3) is only available after the svy commands described in [R] svy estimators. It computes a Bonferroni adjustment for hypotheses of the form β_1 = 0, β_2 = 0, β_3 = 0, etc. See the following examples and the Methods and Formulas section for details.

By default, svytest used with syntax (1) or (2) carries out an adjusted Wald test. Specifically, it uses the approximate F statistic (d - k + 1)W/(kd), where W is the Wald test statistic, k is the dimension of the hypothesis test, and d = total number of sampled PSUs minus the total number of strata. Under the null hypothesis, (d - k + 1)W/(kd) ~ F(k, d - k + 1), where F(k, d - k + 1) is an F distribution with k numerator degrees of freedom and d - k + 1 denominator degrees of freedom.
Options

noadjust specifies that the Wald test be carried out as W/k ~ F(k, d) (notation as described above). This gives the same result as the test command.

accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.

notest suppresses the output. This option is useful when you are interested only in the joint test of a number of hypotheses.
bonferroni can be specified only after estimating a model with any of the svy estimation commands described in [R] svy estimators. When this option is specified, svytest displays adjusted p-values for each of the specified coefficients. Adjusted p-values are computed as p_adj = min(kp, 1), where k is the number of coefficients specified, and p is the unadjusted p-value (i.e., the p-value shown in the output of the estimation command) obtained from the statistic t = \hat{\beta}/\{\hat{V}(\hat{\beta})\}^{1/2}, which is assumed to have a t distribution with d degrees of freedom. If no argument list is specified with the bonferroni option, adjustments are made for all terms in the model, excluding the constant term(s) and any ancillary parameters.
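This adjustment is easily sketched in Python. The coefficients and standard errors below are rounded, illustrative values (they loosely mimic the region example that follows), and d = 31 is the variance degrees of freedom:

```python
# Sketch of the Bonferroni adjustment described above: p_adj = min(k*p, 1),
# where p is the two-sided p-value of t = b/se with d degrees of freedom.
from scipy import stats

def bonferroni_adjust(coefs, ses, d):
    k = len(coefs)
    padj = []
    for b, se in zip(coefs, ses):
        p = 2 * stats.t.sf(abs(b / se), d)   # unadjusted two-sided p-value
        padj.append(min(k * p, 1.0))         # Bonferroni adjustment
    return padj

p_adj = bonferroni_adjust([-0.024, -0.165, -0.036],
                          [0.038, 0.055, 0.038], d=31)
```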
Remarks

Testing hypotheses with svytest

Joint hypothesis tests can be performed after svy estimation commands using the svytest command. Here we estimate a linear regression of loglead (log of blood lead).
> Example

. svyreg loglead age female black orace region2-region4

Survey linear regression

pweight:  leadwt                          Number of obs    =      4948
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.129e+08
                                          F(   7,     25)  =    186.18
                                          Prob > F         =    0.0000
                                          R-squared        =    0.2321

------------------------------------------------------------------------------
     loglead |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0027842   .0004318     6.45   0.000      .0019036    .0036649
      female |  -.3645445   .0110947   -32.86   0.000     -.3871724   -.3419167
       black |   .1783735   .0321995     5.54   0.000      .1127022    .2440447
       orace |  -.0473781   .0383677    -1.23   0.226     -.1256295    .0308733
     region2 |  -.0242082   .0384767    -0.63   0.534     -.1026819    .0542655
     region3 |  -.1646067   .0549628    -2.99   0.005      -.276704   -.0525094
     region4 |  -.0361289   .0377054    -0.96   0.345     -.1130296    .0407717
       _cons |   2.696084   .0236895   113.81   0.000      2.647769    2.744399
------------------------------------------------------------------------------
We can use svytest to test the joint significance of the region dummies: region1 is the Northeast, region2 is the Midwest, region3 is the South, and region4 is the West. We test the hypothesis that region2 = 0, region3 = 0, and region4 = 0.

. svytest region2 region3 region4
Adjusted Wald test
 ( 1)  region2 = 0.0
 ( 2)  region3 = 0.0
 ( 3)  region4 = 0.0
       F(  3,    29) =        2.97
            Prob > F =      0.0480
The noadjust option on svytest produces an unadjusted Wald test.

. svytest region2 region3 region4, noadjust
Unadjusted Wald test
 ( 1)  region2 = 0.0
 ( 2)  region3 = 0.0
 ( 3)  region4 = 0.0
       F(  3,    31) =        3.18
            Prob > F =      0.0377
Note that for one-dimensional tests, the adjusted and unadjusted F statistics are identical, but for higher-dimensional tests, they differ. Using the noadjust option is not recommended since the unadjusted F statistic can produce extremely anticonservative p-values (i.e., p-values that are too small) when the variance degrees of freedom (equal to the number of sampled PSUs minus the number of strata) are not large relative to the dimension of the test.

Bonferroni-adjusted p-values can also be computed:

. svytest region2 region3 region4, bonferroni
Bonferroni adjustment for 3 comparisons

------------------------------------------------------------
     loglead |      Coef.   Std. Err.         t      Adj. P
-------------+----------------------------------------------
     region2 |  -.0242082   .0384767     -0.629      1.0000
     region3 |  -.1646067   .0549628     -2.995      0.0161
     region4 |  -.0361289   .0377054     -0.958      1.0000
------------------------------------------------------------
The smallest adjusted p-value is a p-value for a test of the same joint hypothesis that we tested before; namely, region2 = 0, region3 = 0, and region4 = 0. See Korn and Graubard (1990) for a discussion of these three different procedures for conducting joint hypothesis tests.
The examples given above show how to use svytest to test hypotheses in which the coefficients are jointly hypothesized to be zero. We will now illustrate the use of svytest to test general hypotheses.
> Example

Let us run the same regression model as in the previous example, only this time we will include the other region dummy region1 and omit the constant term.
. svyreg loglead age female black orace region1-region4, nocons

Survey linear regression

pweight:  leadwt                          Number of obs    =      4948
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.129e+08
                                          F(   8,     24)  =   5148.74
                                          Prob > F         =    0.0000
                                          R-squared        =    0.9806

------------------------------------------------------------------------------
     loglead |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0027842   .0004318     6.45   0.000      .0019036    .0036649
      female |  -.3645445   .0110947   -32.86   0.000     -.3871724   -.3419167
       black |   .1783735   .0321995     5.54   0.000      .1127022    .2440447
       orace |  -.0473781   .0383677    -1.23   0.226     -.1256295    .0308733
     region1 |   2.696084   .0236895   113.81   0.000      2.647769    2.744399
     region2 |   2.671876   .0420415    63.55   0.000      2.586132     2.75762
     region3 |   2.531477   .0601017    42.12   0.000      2.408899    2.654055
     region4 |   2.659955   .0405778    65.55   0.000      2.577196    2.742714
------------------------------------------------------------------------------
In order to test the joint hypothesis that region1 = region2 = region3 = region4, we must enter the equations of the hypothesis one at a time and use the accumulate option.

. svytest region1 = region2
Adjusted Wald test
 ( 1)  region1 - region2 = 0.0
       F(  1,    31) =        0.40
            Prob > F =      0.5338

. svytest region2 = region3, accum
Adjusted Wald test
 ( 1)  region1 - region2 = 0.0
 ( 2)  region2 - region3 = 0.0
       F(  2,    30) =        4.41
            Prob > F =      0.0209

. svytest region3 = region4, accum
Adjusted Wald test
 ( 1)  region1 - region2 = 0.0
 ( 2)  region2 - region3 = 0.0
 ( 3)  region3 - region4 = 0.0
       F(  3,    29) =        2.97
            Prob > F =      0.0480

As expected, we get the same answer as before. Note that the Bonferroni adjustment procedure is not available for use with the above syntax. The svytest command can only use the Bonferroni procedure to test whether a group of coefficients are simultaneously equal to zero.
Using svytest after estimators that estimate multiple-equation models is straightforward. Users merely need to refer to the coefficients using the syntax for multiple equations; see [U] 16.5 Accessing coefficients and standard errors for a description and [R] test for examples of its use.
> Example

Here we estimate a multiple-equation model using svymlogit:

. svymlogit health female black age age2

Survey multinomial logistic regression

pweight:  finalwgt                        Number of obs    =     10335
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.170e+08
                                          F(  16,     16)  =     36.41
                                          Prob > F         =    0.0000

------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
poor         |
      female |  -.1983735   .1072747    -1.85   0.074     -.4171617    .0204147
       black |   .8964694   .1797728     4.99   0.000      .5298203    1.263119
         age |   .0990246    .032111     3.08   0.004      .0335338    .1645155
        age2 |  -.0004749   .0003209    -1.48   0.149     -.0011294    .0001796
       _cons |  -5.475074   .7468576    -7.33   0.000       -6.9983   -3.951848
-------------+----------------------------------------------------------------
fair         |
      female |   .1782371   .0726556     2.45   0.020       .030055    .3264193
       black |   .4429445    .122667     3.61   0.001      .1927635    .6931256
         age |   .0024576   .0172236     0.14   0.887     -.0326702    .0375353
        age2 |   .0002875   .0001684     1.71   0.098     -.0000559     .000631
       _cons |  -1.819561   .4018153    -4.53   0.000     -2.639069   -1.000053
-------------+----------------------------------------------------------------
good         |
      female |  -.0458251    .074169    -0.62   0.541     -.1970938    .1054437
       black |  -.7532011   .1105444    -6.81   0.000     -.9786579   -.5277443
         age |   -.061369    .009794    -6.27   0.000      -.081344   -.0413939
        age2 |   .0004166   .0001077     3.87   0.001       .000197    .0006363
       _cons |   1.815323   .1996917     9.09   0.000      1.408049    2.222597
-------------+----------------------------------------------------------------
excellent    |
      female |   -.222799   .0754205    -2.95   0.006     -.3766202   -.0689778
       black |   -.991647   .1238806    -8.00   0.000     -1.244303   -.7389909
         age |  -.0293573   .0137789    -2.13   0.041     -.0574595   -.0012551
        age2 |  -.0000674   .0001505    -0.45   0.657     -.0003744    .0002396
       _cons |   1.499683    .286143     5.24   0.000      .9160909    2.083276
------------------------------------------------------------------------------
(Outcome health==average is the comparison group)
Suppose we want to do a joint test of whether the coefficients for female and black are the same for the "good" and "excellent" categories. We do so using first the notest option to suppress the output, and then accum to add the second equation to the test.

. svytest [good]female = [excellent]female, notest
 ( 1)  [good]female - [excellent]female = 0.0

. svytest [good]black = [excellent]black, accum
Adjusted Wald test
 ( 1)  [good]female - [excellent]female = 0.0
 ( 2)  [good]black - [excellent]black = 0.0
       F(  2,    30) =        4.32
            Prob > F =      0.0225
The svytest command can also be used after svymean, svytotal, and svyratio. svytest and svylc use the same syntax to reference the estimates. The only difference is that you use svytest either with a full equation (i.e., you include an equal sign and a right-hand side for the equation) or with a list of estimates that you wish to test simultaneously equal to zero. Here is an example of the former.
> Example

. svymean bpsystol bpdiast, by(rural)

Survey mean estimation

pweight:  finalwgt                        Number of obs    =     10351
Strata:   stratid                         Number of strata =        31
PSU:      psuid                           Number of PSUs   =        62
                                          Population size  = 1.172e+08

------------------------------------------------------------------------------
Mean        Subpop.     Estimate   Std. Err.   [95% Conf. Interval]       Deff
------------------------------------------------------------------------------
bpsystol    rural==0    126.6065    .5503138    125.4841   127.7289   4.655704
            rural==1    127.6753    1.261624    125.1022   130.2484   11.52492

bpdiast     rural==0    80.90864    .4990564    79.89081   81.92648   10.94774
            rural==1    81.25081    .9476732    79.31802    83.1836   17.36593
------------------------------------------------------------------------------
. svytest [bpsystol]0 = [bpsystol]1
Adjusted Wald test
 ( 1)  [bpsystol]0 - [bpsystol]1 = 0.0
       F(  1,    31) =        0.71
            Prob > F =      0.4064

. svytest [bpdiast]0 = [bpdiast]1, accumulate
Adjusted Wald test
 ( 1)  [bpsystol]0 - [bpsystol]1 = 0.0
 ( 2)  [bpdiast]0 - [bpdiast]1 = 0.0
       F(  2,    30) =        0.65
            Prob > F =      0.5300
Saved Results

svytest saves in r():

Scalars
    r(df)     F numerator degrees of freedom (i.e., dimension of hypothesis test)
    r(df_r)   F denominator degrees of freedom (or t statistic degrees of freedom for bonferroni)
    r(F)      F statistic (or maximal t statistic for bonferroni)
Methods and Formulas

svytest is implemented as an ado-file.
svytest tests the null hypothesis H_0: C\theta = c, where \theta is a q x 1 vector of parameters, C is any k x q matrix of constants, and c is a k x 1 vector of constants. The Wald test statistic is

    W = (C\hat{\theta} - c)' \{C \hat{V}(\hat{\theta}) C'\}^{-1} (C\hat{\theta} - c)

By default, svytest uses

    \frac{d - k + 1}{kd}\, W \sim F(k, d - k + 1)

to compute the p-value. Here d = total number of sampled PSUs minus the total number of strata, and F(k, d - k + 1) is an F distribution with k numerator degrees of freedom and d - k + 1 denominator degrees of freedom. If the noadjust option is specified, the p-value is computed using W/k ~ F(k, d). Note that the noadjust option gives the same results as the test command.

When the bonferroni option is specified, svytest displays adjusted p-values for each of the coefficients corresponding to the specified variables. Adjusted p-values are computed as p_adj = min(kp, 1), where k is the number of variables specified, and p is the unadjusted p-value (i.e., the p-value shown in the output of the estimation command) obtained from the statistic t = \hat{\beta}/\{\hat{V}(\hat{\beta})\}^{1/2}, which is assumed to have a t distribution with d degrees of freedom.

See Korn and Graubard (1990) for a detailed description of the Bonferroni adjustment technique and a discussion of the relative merits of it and of the adjusted and unadjusted Wald tests.
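The computation above can be sketched with numpy/scipy. All inputs here are invented for illustration; they are not output from any svy command:

```python
# Sketch of the svytest computation: Wald statistic for H0: C theta = c
# and its adjusted-F p-value, as described above.
import numpy as np
from scipy import stats

def adjusted_wald_test(theta, V, C, c, d):
    """W = (C theta - c)'{C V C'}^{-1}(C theta - c);
    p-value from (d - k + 1)W/(kd) ~ F(k, d - k + 1)."""
    r = C @ theta - c
    W = float(r @ np.linalg.solve(C @ V @ C.T, r))
    k = C.shape[0]                        # dimension of the hypothesis
    F_adj = (d - k + 1) * W / (k * d)
    return W, F_adj, stats.f.sf(F_adj, k, d - k + 1)

theta = np.array([0.5, -0.2, 0.1])        # hypothetical point estimates
V = 0.04 * np.eye(3)                      # hypothetical covariance matrix
C = np.eye(3)                             # H0: all three coefficients zero
c = np.zeros(3)
W, F_adj, p = adjusted_wald_test(theta, V, C, c, d=31)
```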
Acknowledgment The svytest command was developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University.
References

Eltinge, J. L. and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin 31: 31-42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246-259.

Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270-276.
Also See

Complementary:  [R] lincom, [R] svy estimators, [R] svydes, [R] svylc, [R] svymean
Related:        [R] test
Background:     [U] 30 Overview of survey estimation,
                [R] svy
Title

sw — Stepwise maximum-likelihood estimation
Syntax

sw estimation_command term [term [...]] [weight] [if exp] [in range],
    { pr(#) | pe(#) | pr(#) pe(#) } [ forward lr hier lockterm1 cmd_options ]

where term is { varname | (varlist) } and where estimation_command is

    clogit | cloglog | cnreg | cox | ereg | gamma | glm | gompertz | hetprob |
    llogistic | lnormal | logistic | logit | nbreg | ologit | oprobit |
    poisson | probit | qreg | regress | scobit | tobit | weibull
For example,

. sw regress mpg weight displ, pr(.2)

would perform backward-selection linear regression.

by ... : may be used with sw; see [R] by.

Weights are allowed if estimation_command allows them; see [U] 14.1.6 weight.

sw shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

predict after sw behaves the same as predict after estimation_command; see the Syntax for predict section in the manual entry for estimation_command.
Description

sw performs stepwise estimation, the flavor of which is determined by the options specified:

    backward selection                   pr(#)
    backward hierarchical selection      pr(#) hier
    backward stepwise                    pr(#) pe(#)
    forward selection                    pe(#)
    forward hierarchical selection       pe(#) hier
    forward stepwise                     pr(#) pe(#) forward
Options

pr(#) specifies the significance level for removal from the model; terms with p >= pr() are eligible for removal.

pe(#) specifies the significance level for addition to the model; terms with p < pe() are eligible for addition.
forward specifies the forward-stepwise method and may only be specified when both pr() and pe() are also specified. Specifying both pr() and pe() without forward results in backward stepwise. Note that specifying only pr() results in backward selection, and specifying only pe() results in forward selection.

lr specifies that the test of term significance is to be the likelihood-ratio test. The default is the less computationally expensive Wald test; i.e., the test is based on the estimated variance-covariance matrix of the estimators.

hier specifies hierarchical selection.

lockterm1 specifies that the first term is to be included in the model and not subjected to the selection criteria.

cmd_options refers to any of the options of estimation_command. sw operates by iterating the estimation command to obtain results.
Remarks

sw performs stepwise estimation, the flavor of which is determined by the options specified:

    backward selection                   pr(#)
    backward hierarchical selection      pr(#) hier
    backward stepwise                    pr(#) pe(#)
    forward selection                    pe(#)
    forward hierarchical selection       pe(#) hier
    forward stepwise                     pr(#) pe(#) forward

Typing
. sw regress y1 x1 x2 d1 d2 d3 x4 x5, pr(.10)

performs a backward selection search for the regression model y1 on x1, x2, d1, d2, d3, x4, and x5. In this search, each explanatory variable is said to be a term. Typing

. sw regress y1 x1 x2 (d1 d2 d3) (x4 x5), pr(.10)

performs a similar backward selection search, but the variables d1, d2, and d3 are treated as a single term, as are x4 and x5. That is, d1, d2, and d3 may or may not appear in the final model, but they appear or not together.
> Example

Using the automobile dataset, we estimate a backward selection model of mpg:

. gen weight2 = weight*weight
. sw regress mpg weight weight2 displ gear turn headroom foreign price, pr(.2)
                      begin with full model
p = 0.7116 >= 0.2000  removing headroom
p = 0.6138 >= 0.2000  removing displacement
p = 0.3278 >= 0.2000  removing price

  Source |       SS       df       MS               Number of obs =        74
---------+------------------------------           F(  5,    68) =     33.39
   Model |  1736.31455     5  347.262911           Prob > F      =    0.0000
Residual |  707.144906    68  10.3991898           R-squared     =    0.7106
---------+------------------------------           Adj R-squared =    0.6893
   Total |  2443.45946    73  33.4720474           Root MSE      =    3.2248
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0158002   .0039169    -4.03   0.000     -.0236162   -.0079842
     weight2 |   1.77e-06   6.20e-07     2.86   0.006      5.37e-07    3.01e-06
     foreign |  -3.615107   1.260844    -2.87   0.006     -6.131082   -1.099131
  gear_ratio |   2.011674   1.468831     1.37   0.175      -.919332     4.94268
        turn |  -.3087038   .1763099    -1.75   0.084     -.6605248    .0431172
       _cons |   59.02133     9.3903     6.29   0.000      40.28327    77.75938
------------------------------------------------------------------------------
In this estimation, each variable was treated as its own term and so considered separately. The engine displacement and gear ratio should really be considered together:

. sw regress mpg weight weight2 (displ gear) turn headroom foreign price, pr(.2)
                      begin with full model
p = 0.7116 >= 0.2000  removing headroom
p = 0.3944 >= 0.2000  removing displacement gear_ratio
p = 0.2798 >= 0.2000  removing price

  Source |       SS       df       MS               Number of obs =        74
---------+------------------------------           F(  4,    69) =     40.76
   Model |  1716.80842     4  429.202105           Prob > F      =    0.0000
Residual |  726.651041    69  10.5311745           R-squared     =    0.7026
---------+------------------------------           Adj R-squared =    0.6854
   Total |  2443.45946    73  33.4720474           Root MSE      =    3.2452

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0160341   .0039379    -4.07   0.000     -.0238901   -.0081782
     weight2 |   1.70e-06   6.21e-07     2.73   0.008      4.58e-07    2.94e-06
     foreign |  -2.758668   1.101772    -2.50   0.015     -4.956643   -.5606925
        turn |  -.2862724    .176658    -1.62   0.110     -.6386955    .0661507
       _cons |   65.39216   8.208778     7.97   0.000       49.0161    81.76823
------------------------------------------------------------------------------
Search logic for a step

Before giving the complete search logic, let's consider the logic for a single step (the first step) in detail. The other steps follow the same logic. If you type

. sw regress y1 x1 x2 (d1 d2 d3) (x4 x5), pr(.20)

the logic is

1. Estimate the model y1 on x1 x2 d1 d2 d3 x4 x5.
2. Consider dropping x1.
3. Consider dropping x2.
4. Consider dropping d1 d2 d3.
5. Consider dropping x4 x5.
6. Find the term above that is least significant. If its significance level is > .20, remove that term.
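The per-step logic above can be sketched outside Stata. In this illustrative Python fragment, term_pvalue is a hypothetical stand-in for the significance level sw would obtain by re-estimating the model; real p-values change after each removal, but they are held fixed here purely for illustration.

```python
# Hedged sketch of the single backward step described above. term_pvalue
# is a hypothetical stand-in for the significance level sw would compute
# for each term; a real implementation re-estimates the model after every
# removal, but the p-values are held fixed here for illustration.

def backward_step(terms, term_pvalue, pr=0.20):
    """Drop the least significant term if its p-value is >= pr."""
    worst = max(terms, key=term_pvalue)      # least significant term
    if term_pvalue(worst) >= pr:
        return [t for t in terms if t != worst], True
    return terms, False

def backward_select(terms, term_pvalue, pr=0.20):
    """Repeat the step until every remaining term has p < pr."""
    removed = True
    while removed and terms:
        terms, removed = backward_step(terms, term_pvalue, pr)
    return terms

# toy p-values standing in for the terms of the example above
pvals = {"x1": 0.01, "x2": 0.31, "d1 d2 d3": 0.05, "x4 x5": 0.52}
print(backward_select(list(pvals), pvals.get))
```

With the toy table, x4 x5 is removed first (p = .52), then x2 (p = .31), leaving x1 and d1 d2 d3.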
If you typed

. sw regress y1 x1 x2 (d1 d2 d3) (x4 x5), pr(.20) hier

the logic is different because the hier option states that the terms are ordered. The initial logic becomes

1. Estimate the model y1 on x1 x2 d1 d2 d3 x4 x5.
2. Consider dropping x4 x5, the last term.
3. If the significance of this last term is > .20, remove the term.

The process would then stop or continue. It would stop if x4 x5 were not dropped; otherwise, sw would continue to consider the significance of the next-to-last term, d1 d2 d3.

Specifying pe() rather than pr() switches to forward estimation. If you typed

. sw regress y1 x1 x2 (d1 d2 d3) (x4 x5), pe(.20)
sw performs a forward selection search. The logic for the first step is

1. Estimate a model of y1 on nothing (meaning a constant).
2. Consider adding x1.
3. Consider adding x2.
4. Consider adding d1 d2 d3.
5. Consider adding x4 x5.
6. Find the term above that is most significant. If its significance level is < .20, add that term.
As with backward estimation, if you specified hier,

. sw regress y1 x1 x2 (d1 d2 d3) (x4 x5), pe(.20) hier

the search for the most significant term is restricted to the next term:

1. Estimate a model of y1 on nothing (meaning a constant).
2. Consider adding x1, the first term.
3. If the significance is < .20, add the term.

If x1 were added, sw would next consider x2; otherwise, the search process would stop.

sw can also employ a stepwise selection logic that alternates between adding and removing terms. The full logic for all the possibilities is given below.
Full search logic

pr()  (backward selection)
    Estimate full model on all explanatory variables. While the least
    significant term is "insignificant", remove it and re-estimate.

pr() hier  (backward hierarchical selection)
    Estimate full model on all explanatory variables. While the last term
    is "insignificant", remove it and re-estimate.

pr() pe()  (backward stepwise)
    Estimate full model on all explanatory variables. If the least
    significant term is "insignificant", remove it and re-estimate;
    otherwise stop. Do that again: if the least significant term is
    "insignificant", remove it and re-estimate; otherwise stop.
    Repeatedly, if the most significant excluded term is "significant",
    add it and re-estimate; if the least significant included term is
    "insignificant", remove it and re-estimate; until neither is possible.

pe()  (forward selection)
    Estimate "empty" model. While the most significant excluded term is
    "significant", add it and re-estimate.

pe() hier  (forward hierarchical selection)
    Estimate "empty" model. While the next term is "significant", add it
    and re-estimate.

pr() pe() forward  (forward stepwise)
    Estimate "empty" model. If the most significant excluded term is
    "significant", add it and re-estimate; otherwise stop. Do that again:
    if the most significant excluded term is "significant", add it and
    re-estimate; otherwise stop. Repeatedly, if the least significant
    included term is "insignificant", remove it and re-estimate; if the
    most significant excluded term is "significant", add it and
    re-estimate; until neither is possible.
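The pr() pe() backward-stepwise alternation above can be sketched the same way (illustrative Python, not Stata; pvalue is a hypothetical stand-in for re-estimating the model, so the toy p-values are fixed):

```python
# Hedged sketch of the pr() pe() (backward stepwise) logic: alternate
# between removing "insignificant" included terms and re-adding
# "significant" excluded terms until neither move is possible.
# pvalue(term) stands in for the p-value sw would compute by
# re-estimating the model at each step.

def backward_stepwise(terms, pvalue, pr=0.20, pe=0.10):
    included = list(terms)
    excluded = []
    while True:
        # removal move: drop the least significant included term if p >= pr
        if included:
            worst = max(included, key=pvalue)
            if pvalue(worst) >= pr:
                included.remove(worst)
                excluded.append(worst)
                continue
        # addition move: re-add the most significant excluded term if p < pe
        if excluded:
            best = min(excluded, key=pvalue)
            if pvalue(best) < pe:
                excluded.remove(best)
                included.append(best)
                continue
        return included

# toy p-values; x2 is removed and never re-admitted since pe < pr
pv = {"x1": 0.03, "x2": 0.45, "x3": 0.18, "x4": 0.09}
print(backward_stepwise(list(pv), pv.get))
```

Note that requiring pe() < pr(), as sw does, is what prevents a term from being dropped and immediately re-admitted.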
Examples

The following two statements are equivalent; both include solely one-variable terms:

. sw reg price mpg weight displ, pr(.2)
. sw reg price (mpg) (weight) (displ), pr(.2)
The following two statements are equivalent; the last term in each is r1, ..., r4:

. sw reg price mpg weight displ (r1-r4), pr(.2) hier
. sw reg price (mpg) (weight) (displ) (r1-r4), pr(.2) hier
If one also wished to group variables weight and displ into a single term, one might type

. sw reg price mpg (weight displ) (r1-r4), pr(.2) hier
sw can be used with commands other than regress; for instance,

. sw logit outcome (sex weight) treated1 treated2, pr(.2)
. sw logistic outcome (sex weight) treated1 treated2, pr(.2)
Either statement would estimate the same model because logistic and logit both estimate maximum-likelihood logistic regression; they differ only in how they report results; see [R] logit and [R] logistic.
If one wished that variables treated1 and treated2 be included in the model no matter what, one could type

. sw logistic outcome (treated1 treated2) ..., pr(.2) lockterm1
After sw estimation, you can type sw without arguments to redisplay results,

. sw
(output from logistic appears)

or you can type the underlying estimation command:

. logistic
(output from logistic appears)
At estimation time, you may specify options unique to the command being stepped:

. sw logit outcome (sex weight) treated1 treated2, pr(.2) or

or is logit's option to report odds ratios rather than coefficients; see [R] logit.
Estimation sample considerations

Whether you use backward or forward estimation, sw forms an estimation sample by taking observations with nonmissing values of all the variables specified. The estimation sample is held constant throughout the stepping. Thus, if you type

. sw regress amount sk edul sval, pr(.2) hier
and variable sval is missing in half the data, that half of the data will not be used in the reported model, even if sval is not included in the final model.

The function e(sample) identifies the sample that was used. e(sample) contains 1 for observations used and 0 otherwise. If e(sample) is defined, it is dropped and recreated. For instance, if you type
. sw logistic outcome x1 x2 (x3 x4) (x5 x6 x7), pr(.2) pe(.10)
and the final model is outcome on x1, x5, x6, and x7, you could recreate the final regression by typing

. logistic outcome x1 x5 x6 x7 if e(sample)
You could obtain summary statistics within the estimation sample of the independent variables by typing

. summarize x1 x5 x6 x7 if e(sample)
If you estimate another model, e(sample) will automatically be redefined. Typing

. sw logistic outcome (x1 x2) (x3 x4) (x5 x6 x7), lockterm1 pr(.2)

would automatically drop e(sample) and recreate it.
Messages

Informatory message: dropped due to collinearity

Each term is checked for collinearity, and variables within the term are dropped if collinearity is found. For instance, say you type

. sw regress y x1 x2 (r1-r4) (x3 x4), pr(.2)
and assume that variables r1 through r4 are mutually exclusive and exhaustive dummy variables; perhaps r1, ..., r4 indicate in which of four regions the subject resides. One of the r1, ..., r4 variables will be automatically dropped to identify the model. This message should cause you no concern.

Error message: between-term collinearity, variable ____
After removing any within-term collinearity, if sw still finds collinearity between terms, it refuses to continue. For instance, assume you type

. sw regress y1 x1 x2 (d1-d8) (r1-r4), pr(.2)
Assume that r1, ..., r4 identify in which of four regions the subject resides and that d1, ..., d8 identify the same sort of information, but more finely. r1, say, amounts to d1 and d2; r2 to d3, d4, and d5; r3 to d6 and d7; and r4 to d8. One can estimate the d* variables or the r* variables, but not both. It is your responsibility to specify noncollinear terms.

Informatory messages: dropped due to estimability and obs. dropped due to estimability
You probably received this message in estimating a logistic or probit model. Regardless of estimation strategy, sw checks that the full model can be estimated. The indicated variable had a 0 or infinite standard error. In the case of logistic, logit, and probit, this is typically caused by one-way causation. Assume you type

. sw logistic outcome (x1 x2 x3) d1, pr(.2)
and assume that variable d1 is an indicator (dummy) variable. Further assume that whenever d1 = 1, outcome = 1 in the data. Then the coefficient on d1 is infinite. One (conservative) solution to this problem is to drop the d1 variable and the d1==1 observations. The underlying estimation commands probit, logit, and logistic report the details of the difficulty and solution; sw simply accumulates such problems and reports the above summary messages. Thus, if you see this message, you could type

. logistic outcome x1 x2 x3 d1
to see the details. While you should think carefully about such situations, Stata's solution of dropping the offending variables and observations is, in general, appropriate.
Saved Results

sw saves whatever is saved by the underlying estimation command.
Methods and Formulas

sw is implemented as an ado-file.

Some statisticians do not recommend stepwise procedures; see Sribney (1998) for a summary.
References

Beale, E. M. L. 1970. Note on procedures for variable selection in multiple regression. Technometrics 12: 909-914.
Bendel, R. B. and A. A. Afifi. 1977. Comparison of stopping rules in forward "stepwise" regression. Journal of the American Statistical Association 72: 46-53.
Berk, K. N. 1978. Comparing subset regression procedures. Technometrics 20: 1-6.
Draper, N. and H. Smith. 1998. Applied Regression Analysis. 3d ed. New York: John Wiley & Sons.
Efroymson, M. A. 1960. Multiple regression analysis. In Mathematical Methods for Digital Computers, ed. A. Ralston and H. S. Wilf, 191-203. New York: John Wiley & Sons.
Gorman, J. W. and R. J. Toman. 1966. Selection of variables for fitting equations to data. Technometrics 8: 27-51.
Hocking, R. R. 1976. The analysis and selection of variables in linear regression. Biometrics 32: 1-50.
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2000.)
Kennedy, W. J. and T. A. Bancroft. 1971. Model-building for prediction in regression based on repeated significance tests. Annals of Mathematical Statistics 42: 1273-1284.
Mantel, N. 1970. Why stepdown procedures in variable selection. Technometrics 12: 621-625.
——. 1971. More on variable selection and an alternative approach (letter to the editor). Technometrics 13: 455-457.
Sribney, W. M. 1998. FAQ: What are some problems with stepwise regression? http://www.stata.com/support/faqs/stat
Wang, Z. 2000. sg134: Model selection using the Akaike information criterion. Stata Technical Bulletin 54: 47-49. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 335-337.
Also See

Complementary: [R] clogit, [R] cloglog, [R] cox, [R] glm, [R] hetprob, [R] logistic, [R] logit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] qreg, [R] regress, [R] scobit, [R] tobit, [R] weibull

Background: [U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands
swilk — Shapiro-Wilk and Shapiro-Francia tests for normality
Methods and Formulas

swilk and sfrancia are implemented as ado-files.

The Shapiro-Wilk test is based on Shapiro and Wilk (1965), with a new approximation accurate for 3 < n < 2000 (Royston 1992). The interested reader is directed to the original papers for a discussion of the theory behind the test. The calculations made by swilk are based on Royston (1982, 1992, 1993).

The Shapiro-Francia test (Shapiro and Francia 1972; Royston 1983) is an approximate test that is similar to the Shapiro-Wilk test for very large samples.
Acknowledgment

swilk and sfrancia were written by Patrick Royston of the MRC Clinical Trials Unit, London.
References

Gould, W. W. 1992. sg3.7: Final summary of tests of normality. Stata Technical Bulletin 5: 10-11. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 114-115.
Royston, P. 1982. An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics 31: 115-124.
——. 1983. A simple method for evaluating the Shapiro-Francia W' test of non-normality. Applied Statistics 32: 297-300.
——. 1991a. Estimating departure from normality. Statistics in Medicine 10: 1283-1293.
——. 1991b. sg3.2: Shapiro-Wilk and Shapiro-Francia tests. Stata Technical Bulletin 3: 19. Reprinted in Stata Technical Bulletin Reprints, vol. 1, p. 105.
——. 1992. Approximating the Shapiro-Wilk W-test for non-normality. Statistics and Computing 2: 117-119.
——. 1993. A toolkit for testing for non-normality in complete and censored samples. Statistician 42: 37-43.
Shapiro, S. S. and R. S. Francia. 1972. An approximate analysis of variance test for normality. Journal of the American Statistical Association 67: 215-216.
Shapiro, S. S. and M. B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52: 591-611.
Also See

Related: [R] lnskew0, [R] lv, [R] sktest
Title

symmetry — Symmetry and marginal homogeneity tests
Syntax

symmetry casevar controlvar [weight] [if exp] [in range] [, notable contrib exact mh trend cc ]

symmi #11 #12 [...] \ #21 #22 [...] [\ ...] [if exp] [in range] [, notable contrib exact mh trend cc ]

fweights are allowed; see [U] 14.1.6 weight.
Description

symmetry performs asymptotic symmetry and marginal homogeneity tests, and an exact symmetry test, on K x K tables where there is a 1-to-1 matching of cases and controls (nonindependence). It is used to analyze matched-pair case-control data with multiple discrete levels of the exposure (outcome) variable. In genetics, the test is known as the transmission/disequilibrium test (TDT) and is used to test the association between transmitted and nontransmitted parental marker alleles to an affected child (Spielman, McGinnis, and Ewens 1993). In the case of 2 x 2 tables, the asymptotic test statistics reduce to the McNemar test statistic, and the exact symmetry test produces an exact McNemar test; see [R] epitab. For numeric exposure variables, symmetry can optionally perform a test for linear trend in the log relative risk.

symmetry expects the data to be in the wide format; that is, each observation contains the matched case and control values in variables casevar and controlvar. Variables can be numeric or string.

symmi is the immediate form of symmetry. The symmi command uses the values specified on the command line; rows are separated by '\'; options are the same as for symmetry. See [U] 22 Immediate commands for a general introduction to immediate commands.
Options

notable suppresses the output of the contingency table. By default, symmetry displays the n x n contingency table at the top of the output.

contrib reports the contribution of each off-diagonal cell-pair to the overall symmetry chi-squared.

exact performs an exact test of table symmetry. This option is recommended for sparse tables. CAUTION: the exact test requires substantial amounts of time and computer memory for large tables.
mh performs two marginal homogeneity tests that do not require the inversion of the variance-covariance matrix. By default, symmetry produces the Stuart-Maxwell test statistic, which requires the inversion of the nondiagonal variance-covariance matrix V. When the table is sparse, the matrix may not be of full rank, and in that case the command substitutes a generalized inverse V* for V^-1. mh calculates optional marginal homogeneity statistics that do not require the inversion of the variance-covariance matrix. These tests may be preferred in certain situations. See Methods and Formulas and Bickenboller and Clerget-Darpoux (1995) for details on these test statistics.

trend performs a test for linear trend in the (log) relative risk (RR). This option is only allowed for numeric exposure (outcome) variables, and its use should be restricted to measurements on the ordinal or interval scales.

cc specifies that the continuity correction be used when calculating the test for linear trend. This correction should only be specified when the levels of the exposure variable are equally spaced.
Remarks

symmetry and symmi may be used to analyze 1-to-1 matched case-control data with multiple discrete levels of the exposure (outcome) variable.
Example

Consider a survey of 344 individuals (BMDP 1990, 267-270) who were asked in October 1986 whether they agreed with President Reagan's handling of foreign affairs. In January 1987, after the "Iran-Contra" affair became public, these same individuals were surveyed again and asked the same question. We would like to know if public opinion changed over this time period.

We first describe the dataset and list a few observations.

. describe

Contains data from iran.dta
  obs:           344
 vars:             2                            1 Aug 2000 15:42
 size:         2,064 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
before          byte   %8.0g       vlab       Public Opinion before IC
after           byte   %8.0g       vlab       Public Opinion after IC
-------------------------------------------------------------------------------
Sorted by:

. list in 1/5

          before      after
  1.       agree      agree
  2.       agree   disagree
  3.       agree     unsure
  4.    disagree      agree
  5.    disagree   disagree
Each observation corresponds to one of the 344 individuals. The data are in wide form, so each observation has a before and an after measurement. We now perform the test without options.

. symmetry before after

 Public Opinion |      Public Opinion after IC
      before IC |    agree  disagree    unsure |     Total
----------------+------------------------------+----------
          agree |       47        56        38 |       141
       disagree |       28        61        31 |       120
         unsure |       26        47        10 |        83
----------------+------------------------------+----------
          Total |      101       164        79 |       344

                                             chi2    df   Prob>chi2
-------------------------------------------------------------------
Symmetry (asymptotic)                       14.87     3      0.0019
Marginal homogeneity (Stuart-Maxwell)       14.78     2      0.0006
The test first tabulates the data in a K x K table and then performs Bowker's (1948) test for table symmetry and the Stuart-Maxwell (Stuart 1955; Maxwell 1970) test for marginal homogeneity. Both the symmetry test and the marginal homogeneity test are highly significant, thus indicating a shift in public opinion.

An exact test of symmetry is provided for use on sparse tables. This test is computationally intensive and so should not be used on large tables. Since we are working on a fast computer, we will run the symmetry test again, and this time include the exact option. We will suppress the output of the contingency table by specifying notable, and also include the contrib option so that we may further examine the cells responsible for the significant result.

. symmetry before after, contrib exact mh notable

                  Contribution
                   to symmetry
      Cells        chi-squared
------------------------------
n1_2 & n2_1             9.3333
n1_3 & n3_1             2.2500
n2_3 & n3_2             3.2821

                                             chi2    df   Prob>chi2
-------------------------------------------------------------------
Symmetry (asymptotic)                       14.87     3      0.0019
Marginal homogeneity (Stuart-Maxwell)       14.78     2      0.0006
Marginal homogeneity (Bickenboller)         13.53     2      0.0012
Marginal homogeneity (no diagonals)         15.25     2      0.0005

Symmetry (exact significance probability)                    0.0018

The largest contribution to the symmetry chi-squared is due to cells n_12 and n_21. These correspond to changes between the agree and disagree categories. Of the 344 individuals, 56 (16.3%) changed from the agree to the disagree response, while only 28 (8.1%) changed in the opposite direction. For these data, the results from the exact test are similar to those from the asymptotic test.
Example

Breslow and Day (1980, 163) reprinted data from Mack et al. (1976) from a case-control study of the effect of exogenous estrogen on the risk of endometrial cancer. The data consist of 59 elderly women diagnosed with endometrial cancer and 59 disease-free controls living in the same community as the cases. Cases and controls were matched on age, marital status, and time living in the community. The data collected included information on the daily dose of conjugated estrogen therapy. Breslow and Day analyzed these data by creating four levels of the dose variable. Here are the data as entered into a Stata dataset:

. list, noobs

         case     control   count
            0           0       6
            0   0.1-0.299       2
            0   0.3-0.625       3
            0      0.626+       1
    0.1-0.299           0       9
    0.1-0.299   0.1-0.299       4
    0.1-0.299   0.3-0.625       2
    0.1-0.299      0.626+       1
    0.3-0.625           0       9
    0.3-0.625   0.1-0.299       2
    0.3-0.625   0.3-0.625       3
    0.3-0.625      0.626+       1
       0.626+           0      12
       0.626+   0.1-0.299       1
       0.626+   0.3-0.625       2
       0.626+      0.626+       1
This dataset is in a different format from that of the previous example. Instead of each observation representing a single matched pair, each observation represents possibly multiple pairs, indicated by the count variable. For instance, the first observation corresponds to 6 matched pairs in which neither the case nor the control was on estrogen; the second observation corresponds to 2 matched pairs in which the case was not on estrogen and the control was on 0.1 to 0.299 mg/day; etc.

In order to use symmetry to analyze this dataset, we must specify fweight to indicate that in our data there are observations corresponding to more than one matched pair.

. symmetry case control [fweight=count]

          |                  control
     case |        0  0.1-0.299  0.3-0.625     0.626+ |    Total
----------+------------------------------------------+---------
        0 |        6          2          3          1 |       12
0.1-0.299 |        9          4          2          1 |       16
0.3-0.625 |        9          2          3          1 |       15
   0.626+ |       12          1          2          1 |       16
----------+------------------------------------------+---------
    Total |       36          9         10          4 |       59

                                             chi2    df   Prob>chi2
-------------------------------------------------------------------
Symmetry (asymptotic)                       17.10     6      0.0089
Marginal homogeneity (Stuart-Maxwell)       16.96     3      0.0007
Both the test of symmetry and the test of marginal homogeneity are highly significant, thus leading us to reject the null hypothesis that there is no effect of exposure to estrogen on the risk of endometrial cancer.
Breslow and Day perform a test for trend, assuming that the estrogen exposure levels are equally spaced, by recoding the exposure levels as 1, 2, 3, and 4.

We can easily reproduce their results by recoding our data in this way and by specifying the trend option. Two new numeric variables were created, ca and co, corresponding to the variables case and control, respectively. Below, we list some of the data and our results from symmetry:

. list in 1/4, noobs

     case     control   count   ca   co
        0           0       6    1    1
        0   0.1-0.299       2    1    2
        0   0.3-0.625       3    1    3
        0      0.626+       1    1    4

. symmetry ca co [fw=count], notable trend cc

                                             chi2    df   Prob>chi2
-------------------------------------------------------------------
Symmetry (asymptotic)                       17.10     6      0.0089
Marginal homogeneity (Stuart-Maxwell)       16.96     3      0.0007

Linear trend in the (log) RR                14.43     1      0.0001
Note that we requested the continuity correction by specifying cc. This is appropriate because our coded exposure levels are equally spaced.

The test for trend was highly significant, indicating an increased risk of endometrial cancer with increased dosage of conjugated estrogen.

You must be cautious: the way in which you code the exposure variable affects the linear trend statistic. If instead of coding the levels as 1, 2, 3, and 4, we had used 0, .2, .46, and .7 (roughly the midpoints of the range of each level), we would have obtained a chi-squared statistic of 11.19 for these data.
Saved Results

symmetry saves in r():

Scalars
    r(N_pair)     number of matched pairs
    r(chi2)       asymptotic symmetry chi-squared
    r(df)         asymptotic symmetry degrees of freedom
    r(p)          asymptotic symmetry p-value
    r(chi2_sm)    MH (Stuart-Maxwell) chi-squared
    r(df_sm)      MH (Stuart-Maxwell) degrees of freedom
    r(p_sm)       MH (Stuart-Maxwell) p-value
    r(chi2_b)     MH (Bickenboller) chi-squared
    r(df_b)       MH (Bickenboller) degrees of freedom
    r(p_b)        MH (Bickenboller) p-value
    r(chi2_nd)    MH (no diagonals) chi-squared
    r(df_nd)      MH (no diagonals) degrees of freedom
    r(p_nd)       MH (no diagonals) p-value
    r(chi2_t)     chi-squared for linear trend
    r(p_trend)    p-value for linear trend
    r(p_exact)    exact symmetry p-value
Methods and Formulas

symmetry is implemented as an ado-file.

Asymptotic tests

Consider a square table with K exposure categories, that is, K rows and K columns. Let n_ij be the count corresponding to row i and column j of the table, and let N_ij = n_ij + n_ji for i,j = 1, 2, ..., K. Let n_i. and n_.j be the marginal totals for row i and column j, respectively. Asymptotic tests for symmetry and marginal homogeneity for this K x K table are calculated as follows:

The null hypothesis of complete symmetry, p_ij = p_ji, i != j, is tested by calculating the test statistic (Bowker 1948)

    T_cs = sum_{i<j} (n_ij - n_ji)^2 / (n_ij + n_ji)

which is asymptotically distributed as chi-squared with K(K-1)/2 - R degrees of freedom, where R is the number of off-diagonal cells with N_ij = 0.

The null hypothesis of marginal homogeneity, p_i. = p_.i, is tested by calculating the Stuart-Maxwell test statistic (Stuart 1955; Maxwell 1970)

    T_sm = d' V^-1 d

where d is a column vector with elements equal to the differences d_i = n_i. - n_.i for i = 1, 2, ..., K, and V is the variance-covariance matrix with elements

    v_ii = n_i. + n_.i - 2 n_ii
    v_ij = -(n_ij + n_ji),   i != j
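As a check, the following Python sketch (not part of Stata) applies the two statistics above to the Iran-Contra table of the first example and reproduces the reported values of 14.87 and 14.78. Because the rows of V sum to zero, only a (K-1) x (K-1) submatrix is inverted, written out by hand here for K = 3.

```python
# Illustrative check of Bowker's symmetry test and the Stuart-Maxwell
# marginal homogeneity test on the 3x3 Iran-Contra table.

n = [[47, 56, 38],
     [28, 61, 31],
     [26, 47, 10]]
K = 3

# Bowker's test for complete symmetry
T_cs = sum((n[i][j] - n[j][i]) ** 2 / (n[i][j] + n[j][i])
           for i in range(K) for j in range(i + 1, K))

# Stuart-Maxwell test, using the first K-1 components of d and the
# corresponding 2x2 submatrix of V (V itself is singular)
row = [sum(r) for r in n]
col = [sum(n[i][j] for i in range(K)) for j in range(K)]
d = [row[i] - col[i] for i in range(K)]
v11 = row[0] + col[0] - 2 * n[0][0]
v22 = row[1] + col[1] - 2 * n[1][1]
v12 = -(n[0][1] + n[1][0])
det = v11 * v22 - v12 * v12
T_sm = (v22 * d[0] ** 2 - 2 * v12 * d[0] * d[1] + v11 * d[1] ** 2) / det

print(round(T_cs, 2), round(T_sm, 2))   # 14.87 14.78
```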
T_sm is asymptotically chi-squared with K - 1 degrees of freedom. This test statistic properly accounts for the dependence between the table's rows and columns. When the matrix V is not of full rank, a generalized inverse V* is substituted for V^-1.

The Bickenboller and Clerget-Darpoux (1995) marginal homogeneity test statistic is calculated by

    T_mh = sum_i (n_i. - n_.i)^2 / (n_i. + n_.i)

This statistic is asymptotically distributed, under the assumption of marginal independence, as chi-squared with K - 1 degrees of freedom.

The marginal homogeneity (no diagonals) test statistic T0_mh is calculated in the same way as T_mh, except that the diagonal elements do not enter into the calculation of the marginal totals. Unlike the previous test statistic, T0_mh reduces to a McNemar test statistic for 2 x 2 tables. The test statistic {(K-1)/K} T0_mh is asymptotically distributed as chi-squared with K - 1 degrees of freedom (Cleves et al. 1997; Spielman and Ewens 1996).

Breslow and Day's test statistic for linear trend in the (log) relative risk is

    T_bd = { |sum_{i<j} (n_ij - n_ji)(x_j - x_i)| - cc }^2
           / sum_{i<j} (n_ij + n_ji)(x_j - x_i)^2

where the x_j are the "doses" associated with the various levels of exposure and cc is the continuity correction; it is asymptotically distributed as chi-squared with 1 degree of freedom. The continuity-correction option is only applicable when the levels of the exposure variable are equally spaced.

Exact symmetry test

The exact test is based on a permutation algorithm applied to the null distribution. The distribution of the off-diagonal elements n_ij, i != j, conditional on the sums of the complementary off-diagonal cells, N_ij = n_ij + n_ji, can be written as the product of K(K-1)/2 binomial random variables,

    P(n) = prod_{i<j} C(N_ij, n_ij) (pi_ij)^(n_ij) (1 - pi_ij)^(n_ji)

where n is a vector with elements n_ij and pi_ij = E(n_ij / N_ij | N_ij). Under the null hypothesis of complete symmetry, pi_ij = pi_ji = 1/2, and thus the permutation distribution is given by

    P_0(n) = prod_{i<j} C(N_ij, n_ij) (1/2)^(N_ij)

The exact significance test is performed by evaluating

    P_cs = sum_{n in p} P_0(n)

where p = {n : P_0(n) <= P_0(n*)} and n* is the observed contingency table data vector. The algorithm evaluates P_cs exactly. For information about permutation tests, see Good (1999).
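The permutation sum above can be evaluated directly for small tables. The following Python sketch (not part of Stata) enumerates all tables for the Iran-Contra data, comparing exact integer weights so that the common factor 2^(sum of N_ij) cancels and no floating-point ties arise; it should reproduce the reported exact significance probability of 0.0018.

```python
# Illustrative enumeration of the exact symmetry test for the 3x3
# Iran-Contra table. Each complementary off-diagonal pair contributes a
# Binomial(N_ij, 1/2) term; the p-value sums P0 over every table no more
# probable than the observed one.

from math import comb

# (N_ij, observed n_ij) for the three pairs of the 3x3 table
pairs = [(84, 56), (64, 38), (78, 31)]

obs_weight = 1   # product of binomial coefficients at the observed table
for N, k in pairs:
    obs_weight *= comb(N, k)

wA = [comb(84, a) for a in range(85)]
wB = [comb(64, b) for b in range(65)]
wC = [comb(78, c) for c in range(79)]

total = 0
for a in wA:
    for b in wB:
        ab = a * b
        for c in wC:
            w = ab * c
            if w <= obs_weight:   # table no more probable than observed
                total += w

p_exact = total / 2 ** (84 + 64 + 78)
print(round(p_exact, 4))   # the manual reports 0.0018
```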
References

Bickenboller, H. and F. Clerget-Darpoux. 1995. Statistical properties of the allelic and genotypic transmission/disequilibrium test for multiallelic markers. Genetic Epidemiology 12: 865-870.
BMDP. 1990. BMDP Statistical Software Manual, Example 4F2.9. Los Angeles: BMDP Statistical Software, Inc.
Bowker, A. H. 1948. A test for symmetry in contingency tables. Journal of the American Statistical Association 43: 572-574.
Breslow, N. E. and N. E. Day. 1980. Statistical Methods in Cancer Research, vol. 1, 182-198. Lyon: International Agency for Research on Cancer.
Cleves, M. A. 1997. sg74: Symmetry and marginal homogeneity test/TDT. Stata Technical Bulletin 40: 23-27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 193-197.
——. 1999. sg110: Hardy-Weinberg equilibrium test and allele frequency estimation. Stata Technical Bulletin 48: 34-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 280-284.
Cleves, M. A., J. M. Olson, and K. B. Jacobs. 1997. Exact transmission-disequilibrium tests with multiallelic markers. Genetic Epidemiology 14: 337-347.
Cui, J. 2000. sg150: Hardy-Weinberg equilibrium test in case-control studies. Stata Technical Bulletin 57: 17-19.
Good, P. 1999. Resampling Methods: A Practical Guide to Data Analysis. New York: Birkhäuser Boston.
Mack, T. M., M. C. Pike, B. E. Henderson, R. I. Pfeffer, V. R. Gerkins, B. S. Arthur, and S. E. Brown. 1976. Estrogens and endometrial cancer in a retirement community. New England Journal of Medicine 294: 1262-1267.
Mander, A. 2000. sbe38: Haplotype frequency estimation using an EM algorithm and log-linear modeling. Stata Technical Bulletin 57: 5-7.
Maxwell, A. E. 1970. Comparing the classification of subjects by two independent judges. British Journal of Psychiatry 116: 651-655.
Spielman, R. S. and W. J. Ewens. 1996. The TDT and other family-based tests for linkage disequilibrium and association. American Journal of Human Genetics 59: 983-989.
Spielman, R. S., R. E. McGinnis, and W. J. Ewens. 1993. Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus. American Journal of Human Genetics 52: 506-516.
Stuart, A. 1955. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42: 412-416.
Also See

Related: [R] epitab
table — Tables of summary statistics

Syntax

table rowvar [colvar [supercolvar]] [if exp] [in range] [weight]
    [, contents(clist) by(superrowvarlist) cw row col scol replace
    name(string) format(%fmt) center left concise missing
    cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#) ]

where the elements of clist may be

    freq              frequency
    mean varname      mean of varname
    sd varname        standard deviation
    sum varname       sum
    rawsum varname    sum ignoring optionally specified weight
    count varname     count of nonmissing observations
    n varname         same as count
    max varname       maximum
    min varname       minimum
    median varname    median
    p1 varname        1st percentile
    p2 varname        2nd percentile
    ...               3rd-49th percentiles
    p50 varname       50th percentile (median)
    ...               51st-97th percentiles
    p98 varname       98th percentile
    p99 varname       99th percentile
    iqr varname       interquartile range

by ... : may be used with table; see [R] by.

fweights, aweights, iweights, and pweights are allowed; see [U] 14.1.6 weight. sd (standard deviation) is not allowed with pweights.
Rows, columns, supercolumns, and superrows are defined thus:

              col 1   col 2
      row 1
      row 2

                 supercol 1       supercol 2
              col 1   col 2    col 1   col 2
      row 1
      row 2

                 supercol 1       supercol 2
              col 1   col 2    col 1   col 2
  superrow 1:
      row 1
      row 2
  superrow 2:
      row 1
      row 2
Description

table calculates and displays tables of statistics.
Options

contents(clist) specifies the contents of the table's cells; if not specified, contents(freq) is used by default. contents(freq) produces a table of frequencies. contents(mean mpg) produces a table of the means of variable mpg. contents(freq mean mpg sd mpg) produces a table of frequencies together with the mean and standard deviation of variable mpg. Up to five statistics may be specified.

by(superrowvarlist) specifies numeric or string variables to be treated as superrows. Up to four variables may be specified in superrowvarlist. Note that the by() option may be specified with the by ... : prefix.

cw specifies casewise deletion. If not specified, all observations possible are used to calculate each of the specified statistics. cw is relevant only when you request a table containing statistics on multiple variables. For instance, contents(mean mpg mean weight) would produce a table reporting the means of variables mpg and weight. Consider an observation in which mpg is known but weight is missing. Should that observation be used in the calculation of the mean of mpg? By default, it will be. Specify cw and the observation will be excluded in the calculation of the means of both mpg and weight.

row specifies that a row is to be added to the table reflecting the total across rows.

col specifies that a column is to be added to the table reflecting the total across columns.

scol specifies that a supercolumn is to be added to the table reflecting the total across supercolumns.

replace specifies that the data in memory are to be replaced with data containing one observation per cell (row, column, supercolumn, and superrow) and with variables containing the statistics designated in contents(). This option is rarely specified. If you do not specify this option, the data in memory remain unchanged. If you do specify this option, the first statistic will be named table1, the second table2, and so on.
For instance, if contents (mean mpg sd mpg) was specified, the means of mpg would be in variable tablel and the standard deviations in table2. is relevant only if you specify replace. nameO allows changing the default stub name replace uses to name the new variables associated with the statistics. Specify name(stat) and the first statistic will be placed in variable statl, the second in stat2. and so on. format ('/.fmt) specifies the display formal: for presenting numbers in the table's cells, format C/,9 . Og) is the default; f ormat(7,9.2f ) and format C/,9.2f c) are popular alternatives. The width of the format you specify does not matter except that 7,fmr must be valid. The width of the cells is chosen by table to be what it thinks looks best. Option cellwidthQ allows you to override table's choice. center specifies results are to be centered in the table's cells. The default is to right align results. For centering to work well, you typically need to specify a display format as well, center format (*/,9 . 2f ) is popular. left specifies that column labels are to be left aligned. The default is to right align column labels to distinguish them from supercolumn labels, which are left aligned. If you specify left, both column and supercolumn labels are left aligned.
table — Tables of summary statistics
concise specifies that rows with all missing entries are not to be displayed.

missing specifies that missing statistics are to be shown in the table as periods (Stata's missing-value indicator). The default is that missing entries are left blank.

cellwidth(#) specifies the width of the cells in units of digit widths; 10 means the space occupied by 10 digits, which is 1234567890. The default cellwidth() is not a fixed number but a number chosen by table to spread the table out while presenting a reasonable number of columns across the page.

csepwidth(#) specifies the separation between columns in units of digit widths. The default is not a fixed number but a number chosen by table according to what it thinks looks best.

scsepwidth(#) specifies the separation between supercolumns in units of digit widths. The default is not a fixed number but a number chosen by table according to what it thinks looks best.

stubwidth(#) specifies the width, in units of digit widths, to be allocated to the left stub of the table. The default is not a fixed number but a number chosen by table according to what it thinks looks best.
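The effect of casewise deletion (cw) can be sketched outside Stata. The following Python fragment is an illustration only; the data and helper function are made up and are not part of Stata:

```python
# Illustration of casewise deletion (cw): None stands in for Stata's
# missing value; the observations are made up.
rows = [
    {"mpg": 22, "weight": 2930},
    {"mpg": 17, "weight": None},   # weight is missing here
    {"mpg": 25, "weight": 2640},
]

def mean(values):
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals)

# Default: each statistic uses every observation where ITS variable is
# known, so the mean of mpg still uses all three rows.
default_mpg = mean(r["mpg"] for r in rows)

# cw: first drop any observation missing on ANY requested variable,
# then compute every statistic on the remaining complete cases.
complete = [r for r in rows if all(v is not None for v in r.values())]
cw_mpg = mean(r["mpg"] for r in complete)

print(default_mpg, cw_mpg)
```

Under cw the mean of mpg changes even though no mpg value is missing, because the observation with a missing weight is dropped entirely.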
Limits

Up to 4 variables may be specified in by(), so with the three row, column, and supercolumn variables, seven-way tables may be displayed. Up to 5 statistics may be displayed in each cell of the table.

The sum of the number of rows, columns, supercolumns, and superrows is called the number of margins. A table may contain up to 3,000 margins. Thus, a one-way table may contain 3,000 rows. A two-way table could contain 2,998 rows and 2 columns; 2,997 rows and 3 columns; ...; 1,500 rows and 1,500 columns; ...; or 2 rows and 2,998 columns. A three-way table is similarly limited by the sum of the number of rows, columns, and supercolumns: an r x c x d table is feasible if r + c + d <= 3,000. Note that the limit is set in terms of the sum of the rows, columns, supercolumns, and superrows and not, as you might expect, their product.
Remarks

One-way tables

Using the automobile dataset, here is a simple one-way table:

. table rep78, contents(mean mpg)

      Repair |
      Record |
        1978 | mean(mpg)
  -----------+-----------
           1 |        21
           2 |    19.125
           3 |   19.4333
           4 |   21.6667
           5 |   27.3636
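What table computes for each cell can be mimicked in a few lines of Python. This sketch (made-up observations, not the auto dataset) groups values by a key and averages within each group:

```python
# Sketch of a one-way table of means: group mpg by rep78 and average.
from collections import defaultdict

data = [(3, 22), (3, 17), (4, 20), (4, 25), (5, 30)]  # (rep78, mpg), made up

acc = defaultdict(lambda: [0.0, 0])          # key -> [running sum, count]
for key, mpg in data:
    acc[key][0] += mpg
    acc[key][1] += 1

means = {key: total / n for key, (total, n) in sorted(acc.items())}
print(means)
```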
We are not limited to including only a single statistic:

. table rep78, c(n mpg mean mpg sd mpg median mpg)

      Repair |
      Record |
        1978 |    N(mpg)  mean(mpg)    sd(mpg)   med(mpg)
  -----------+--------------------------------------------
           1 |         2         21    4.24264         21
           2 |         8     19.125   3.758324         18
           3 |        30    19.4333   4.141325         19
           4 |        18    21.6667    4.93487       22.5
           5 |        11    27.3636   8.732385         30
Note that we abbreviated contents() as c(). The format() option allows us to better format the numbers in the table:

. table rep78, c(n mpg mean mpg sd mpg median mpg) format(%9.2f)

      Repair |
      Record |
        1978 |    N(mpg)  mean(mpg)    sd(mpg)   med(mpg)
  -----------+--------------------------------------------
           1 |         2      21.00       4.24      21.00
           2 |         8      19.12       3.76      18.00
           3 |        30      19.43       4.14      19.00
           4 |        18      21.67       4.93      22.50
           5 |        11      27.36       8.73      30.00
The center option will center the results under the headings:

. table rep78, c(n mpg mean mpg sd mpg median mpg) format(%9.2f) center

      Repair |
      Record |
        1978 |   N(mpg)   mean(mpg)    sd(mpg)   med(mpg)
  -----------+--------------------------------------------
           1 |      2       21.00       4.24       21.00
           2 |      8       19.12       3.76       18.00
           3 |     30       19.43       4.14       19.00
           4 |     18       21.67       4.93       22.50
           5 |     11       27.36       8.73       30.00
Two-way tables When we typed 'table rep78, ...', we obtained a one-way table. If we type 'table rep78 foreign, ...', we obtain a two-way table:
. table rep78 foreign, c(mean mpg)

      Repair |
      Record |       Car type
        1978 |  Domestic    Foreign
  -----------+----------------------
           1 |        21
           2 |    19.125
           3 |        19    23.3333
           4 |   18.4444    24.8889
           5 |        32    26.3333
Note the missing cells. Certain combinations of repair record and car type do not exist in our dataset. As with one-way tables, we can specify a display format for the cells and center the numbers within the cells if we wish.

. table rep78 foreign, c(mean mpg) format(%9.2f) center

      Repair |
      Record |      Car type
        1978 |  Domestic   Foreign
  -----------+---------------------
           1 |    21.00
           2 |    19.12
           3 |    19.00      23.33
           4 |    18.44      24.89
           5 |    32.00      26.33
We can obtain row totals by specifying the row option and column totals by specifying the col option. We specify both below:

. table rep78 foreign, c(mean mpg) format(%9.2f) center row col

      Repair |
      Record |      Car type
        1978 |  Domestic   Foreign      Total
  -----------+--------------------------------
           1 |    21.00                 21.00
           2 |    19.12                 19.12
           3 |    19.00      23.33      19.43
           4 |    18.44      24.89      21.67
           5 |    32.00      26.33      27.36
             |
       Total |    19.54      25.29      21.29
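A Total cell in a table of means is the mean over all contributing observations, which works out to a count-weighted average of the cell means, not a simple average of the cells. A quick Python check, using cell means and counts like those in the Foreign column of the example:

```python
# Each tuple is (cell mean, cell count); numbers follow the example above.
cells = [(23.33, 3), (24.89, 9), (26.33, 9)]

n_total = sum(n for _, n in cells)
weighted = sum(m * n for m, n in cells) / n_total   # what table reports
naive = sum(m for m, _ in cells) / len(cells)       # NOT what table reports

print(round(weighted, 2), round(naive, 2))
```

The two answers differ because the cells contribute in proportion to their counts.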
table can display multiple statistics within cells but, once we move beyond one-way tables, the table becomes busy:
. table foreign rep78, c(mean mpg n mpg) format(%9.2f) center

             |           Repair Record 1978
    Car type |      1        2        3        4        5
  -----------+--------------------------------------------
    Domestic |  21.00    19.12    19.00    18.44    32.00
             |      2        8       27        9        2
             |
     Foreign |                    23.33    24.89    26.33
             |                        3        9        9
This two-way table with two statistics per cell works well in this case. That was, in part, helped along by our interchanging the rows and columns: we turned the table around by typing table foreign rep78 rather than table rep78 foreign. Another way to display two-way tables is to specify a row and superrow rather than a row and column. We do that below and display three statistics per cell:

. table foreign, by(rep78) c(mean mpg sd mpg n mpg) format(%9.2f) center
      Repair |
      Record |
    1978 and |
    Car type |  mean(mpg)   sd(mpg)    N(mpg)
  -----------+--------------------------------
  1          |
    Domestic |     21.00      4.24         2
  -----------+--------------------------------
  2          |
    Domestic |     19.12      3.76         8
  -----------+--------------------------------
  3          |
    Domestic |     19.00      4.09        27
     Foreign |     23.33      2.52         3
  -----------+--------------------------------
  4          |
    Domestic |     18.44      4.59         9
     Foreign |     24.89      2.71         9
  -----------+--------------------------------
  5          |
    Domestic |     32.00      2.83         2
     Foreign |     26.33      9.37         9
Three-way tables

We have data on the prevalence of byssinosis, a form of pneumoconiosis to which workers exposed to cotton dust are subject. The dataset is on 5,419 workers in a large cotton mill. We know whether each worker smokes, his or her race, and the dustiness of the work area. The categorical variables are

    smokes     Smoker or nonsmoker in the last five years.
    race       White or other.
    workplace  1 (most dusty), 2 (less dusty), 3 (least dusty).
Moreover, this dataset includes a frequency-weight variable pop. Here is a three-way table showing the fraction of workers with byssinosis:
. table workplace smokes race [fw=pop], c(mean prob)

  Dustiness  |             Race and Smokes
         of  |         white                 other
  workplace  |       no        yes         no        yes
  -----------+--------------------------------------------
      least  | .0107527   .0101523   .0081549   .0162774
       less  |      .02   .0081633   .0136612   .0143149
       most  | .0820896   .1679105   .0833333   .2295082
This table would look better if we showed the fraction to four digits:

. table workplace smokes race [fw=pop], c(mean prob) format(%9.4f)

  Dustiness  |          Race and Smokes
         of  |       white              other
  workplace  |     no      yes        no      yes
  -----------+--------------------------------------
      least  | 0.0108   0.0102    0.0082   0.0163
       less  | 0.0200   0.0082    0.0137   0.0143
       most  | 0.0821   0.1679    0.0833   0.2295
In this table, the rows are the dustiness of the workplace, the columns are whether the worker smokes, and the supercolumns are race.

Here is what happens if we request that the table include the supercolumn totals, which we request by specifying the scol option, abbreviated sc:

. table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) sc

  Dustiness  |                  Race and Smokes
         of  |      white             other             Total
  workplace  |     no      yes      no      yes       no      yes
  -----------+-----------------------------------------------------
      least  | 0.0108   0.0102   0.0082   0.0163   0.0090   0.0145
       less  | 0.0200   0.0082   0.0137   0.0143   0.0159   0.0123
       most  | 0.0821   0.1679   0.0833   0.2295   0.0826   0.1929
The supercolumn total is the total over race, divided into its columns based on smokes. Here is the table with the column rather than the supercolumn totals:

. table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) col

  Dustiness  |                   Race and Smokes
         of  |         white                      other
  workplace  |     no      yes    Total       no      yes    Total
  -----------+------------------------------------------------------
      least  | 0.0108   0.0102   0.0104   0.0082   0.0163   0.0129
       less  | 0.0200   0.0082   0.0136   0.0137   0.0143   0.0140
       most  | 0.0821   0.1679   0.1398   0.0833   0.2295   0.1835
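A frequency weight such as [fw=pop] says that each record stands for pop identical observations, so every weighted mean here has the form sum(w*x)/sum(w). A minimal Python sketch with made-up numbers:

```python
# Made-up records: (byssinosis indicator, number of workers the record
# represents). A frequency-weighted mean is equivalent to literally
# duplicating each record w times.
records = [(1, 3), (0, 37)]   # 3 affected workers out of 40

w_mean = sum(x * w for x, w in records) / sum(w for _, w in records)
expanded = [x for x, w in records for _ in range(w)]   # literal expansion

print(w_mean, sum(expanded) / len(expanded))
```

Both expressions give the same prevalence, 0.075, which is the sense in which fweights are equivalent to duplicated observations.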
And here is the table with both column and supercolumn totals:
Also See

Related:    [R] adjust, [R] collapse, [R] tabstat, [R] tabulate, [P] tabdisp
Background: [U] 15.6 Dataset, variable, and value labels,
            [U] 28 Commands for dealing with categorical variables
Title tabstat — Display table of summary statistics
Syntax

tabstat varlist [weight] [if exp] [in range] [, statistics(statname [...]) by(varname) nototal missing nosep columns(variables|statistics) longstub labelwidth(#) format[(%fmt)] casewise save ]

by ...: may be used with tabstat (but do not confuse the by ...: prefix with the by() option).

aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

tabstat displays summary statistics for a series of numeric variables in a single table, possibly broken down on (conditioned by) another variable.

Without the by() option, tabstat is a useful alternative to summarize because it allows you to specify the list of statistics to be displayed.

With the by() option, tabstat resembles tabulate varname, summarize(varlist) in that both report statistics of varlist for the different values of varname. tabstat allows more flexibility in terms of the statistics presented and the format of the table.

tabstat is sensitive to the display linesize: it widens the table if possible and wraps when necessary.
Options

statistics(statname [...]) specifies the statistics to be displayed; the default is equivalent to specifying statistics(mean). (stats() is a synonym for statistics().) Multiple statistics may be requested when you specify the option, e.g., statistics(mean sd). Available statistics are
tabstat — Display table of summary statistics
    statname   definition
    ----------------------------------------------------
    mean       mean
    count      count of nonmissing observations
    n          same as count
    sum        sum
    max        maximum
    min        minimum
    range      range = max - min
    sd         standard deviation
    skewness   skewness
    kurtosis   kurtosis
    median     median (same as p50)
    p1         1st percentile
    p5         5th percentile
    p10        10th percentile
    p25        25th percentile
    p50        50th percentile (same as median)
    p75        75th percentile
    p90        90th percentile
    p95        95th percentile
    p99        99th percentile
    iqr        interquartile range = p75 - p25
    q          equivalent to specifying p25 p50 p75
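Several of these definitions are easy to check directly. The following Python sketch (stdlib only, made-up data) computes a few of them; note that statistics.quantiles uses an interpolation rule that need not match Stata's percentile definition exactly:

```python
# Made-up sample; compute a few of tabstat's statistics by hand.
import statistics

x = [12, 14, 17, 21, 22, 25, 28, 30, 34, 41]

p25, p50, p75 = statistics.quantiles(x, n=4)   # quartiles
stats = {
    "mean": statistics.mean(x),
    "sd": statistics.stdev(x),                 # sample standard deviation
    "range": max(x) - min(x),                  # range = max - min
    "median": statistics.median(x),            # same as p50
    "iqr": p75 - p25,                          # iqr = p75 - p25
}
print(stats)
```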
by(varname) specifies that the statistics are to be displayed separately for each unique value of varname; varname may be numeric or string. For instance, tabstat height would present the overall mean of height. tabstat height, by(sex) would present the mean height of males, the mean height of females, and the overall mean height.

nototal is for use with by(); it specifies that the overall statistics not be reported.

missing specifies that missing values of the by() variable be treated just like any other value; statistics should be displayed for them. The default is not to report the statistics for the group in which by() is missing. If the by() variable is a string variable, by()=="" is considered to mean missing.

nosep specifies that a separator line between the by() categories not be displayed.

columns(variables|statistics) specifies whether variables or statistics are displayed in the columns of the table. columns(variables) is the default.

longstub specifies that the left stub of the table is to be made wider so that it can fully document the numbers reported in the table.

labelwidth(#) specifies the maximum width to be used within the stub to display the labels of the by() variable. The default is labelwidth(16).

format and format(%fmt) specify how the statistics are to be formatted. The default is to use a %9.0g format. format specifies that each variable's statistics be formatted with the variable's display format; see [R] format. format(%fmt) specifies that the given format be used for all statistics. The maximum width of the specified format should not exceed 9 characters.

casewise specifies casewise deletion of observations. Statistics are to be computed for the sample that is not missing for any of the variables in varlist. The default is to use all the nonmissing values for each variable.
save specifies that the summary statistics are to be returned in r(). The overall (unconditional) statistics are returned in matrix r(StatTotal) (rows are statistics, columns are variables). The conditional statistics are returned in the matrices r(Stat1), r(Stat2), ..., and the names of the corresponding variables are returned in the macros r(name1), r(name2), .... Because Stata does not allow missing values to be stored in matrices, missing values in the returned results are indicated by the "magic number" 10^300.
Remarks

Example

You have data on the price, weight, mileage rating, and repair record of 22 foreign and 52 domestic 1978 automobiles. You wish to obtain the minimum, mean, and maximum of these variables for foreign and domestic cars.

. tabstat price weight mpg rep78, by(foreign) stats(min mean max) col(stat) longstub

  foreign   variable |      min       mean        max
  -------------------+---------------------------------
  Domestic     price |     3291   6072.423      15906
              weight |     1800   3317.115       4840
                 mpg |       12   19.82692         34
               rep78 |        1   3.020833          5
  -------------------+---------------------------------
  Foreign      price |     3748   6384.682      12990
              weight |     1760   2315.909       3420
                 mpg |       14   24.77273         41
               rep78 |        3   3.285714          5
  -------------------+---------------------------------
  Total        price |     3291   6165.257      15906
              weight |     1760   3019.459       4840
                 mpg |       12    21.2973         41
               rep78 |        1   3.405797          5
In 1978, on average, foreign cars cost more, weighed less, got more miles per gallon, and had a better repair record than domestic cars.
Acknowledgments

The tabstat command was written by Jeroen Weesie and Vincent Buskens of Utrecht University in the Netherlands.
Also See

Related: [R] collapse, [R] summarize, [R] table, [R] tabsum
Title tabsum — One- and two-way tables of summary statistics
Syntax

tabulate varname1 [varname2] [weight] [if exp] [in range] , summarize(varname3) [ [no]means [no]standard [no]freq [no]obs wrap nolabel missing ]

by ...: may be used with tabulate, summarize; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

tabulate, summarize() produces one- and two-way tables (breakdowns) of means and standard deviations. See [R] tabulate for frequency tables. See [R] table for a more flexible command that produces one-, two-, and n-way tables of frequencies and a wide variety of summary statistics. table is better but tabulate, summarize() is faster. Also see [R] tabstat for yet another alternative.
Options

summarize(varname3) identifies the name of the variable for which summary statistics are to be reported. If you do not specify this option, a table of frequencies is produced; see [R] tabulate. The description here concerns tabulate when this option is specified.

[no]means includes or suppresses only the means from the table. The summarize() table normally includes the mean, standard deviation, frequency, and, if the data are weighted, the number of observations. Individual elements of the table may be included or suppressed by the [no]means, [no]standard, [no]freq, and [no]obs options. For example, typing

    tabulate category, summarize(myvar) means standard

produces a summary table by category containing only the means and standard deviations of myvar. You could also achieve the same result by typing

    tabulate category, summarize(myvar) nofreq

[no]standard includes or suppresses only the standard deviations from the table; see the [no]means option above.

[no]freq includes or suppresses only the frequencies from the table; see the [no]means option above.

[no]obs includes or suppresses only the reported number of observations from the table. If the data are not weighted, the number of observations is identical to the frequency, and by default only the frequency is reported. If the data are weighted, the frequency refers to the sum of the weights. See the [no]means option above.

wrap requests that no action be taken on wide tables to make them readable. Unless wrap is specified, wide tables are broken into pieces to enhance readability.

nolabel causes the numeric codes to be displayed rather than the value labels.
tabsum — One- and two-way tables of summary statistics
missing requests that missing values of varname1 and varname2 be treated as categories rather than as observations to be omitted from the analysis.
Remarks

tabulate with the summarize() option produces one- and two-way tables of summary statistics. When combined with the by prefix, it can produce n-way tables as well.
One-way tables

Example

You have data on 74 automobiles. Included in your dataset are the variables foreign, which marks domestic and foreign cars, and mpg, the car's mileage rating. If you type tabulate foreign, you obtain a breakdown of the number of observations you have by the values of the foreign variable.

. tabulate foreign

     Car type |      Freq.     Percent        Cum.
  ------------+-----------------------------------
     Domestic |         52       70.27       70.27
      Foreign |         22       29.73      100.00
  ------------+-----------------------------------
        Total |         74      100.00
You discover that you have 52 domestic cars and 22 foreign cars in your dataset. If you add the summarize(varname) option, however, tabulate produces a table of summary statistics for varname:

. tabulate foreign, summarize(mpg)

              |   Summary of Mileage (mpg)
     Car type |      Mean   Std. dev.       Freq.
  ------------+-----------------------------------
     Domestic |     19.83        4.74          52
      Foreign |     24.77        6.61          22
  ------------+-----------------------------------
        Total |     21.30        5.79          74
In addition to discovering that you have 52 domestic and 22 foreign cars in your dataset, you discover that the average gas mileage for domestic cars is about 20 mpg and the average for foreign cars is almost 25 mpg. Overall, the average is 21 mpg in your dataset.
Technical Note

You might now wonder whether the difference in gas mileage between foreign and domestic cars is statistically significant. You can use the oneway command to find out; see [R] oneway. To obtain an analysis-of-variance table of mpg on foreign, type

. oneway mpg foreign

                          Analysis of Variance
      Source              SS        df        MS          F     Prob > F
  -----------------------------------------------------------------------
  Between groups     378.153515      1   378.153515     13.18     0.0005
  Within groups      2065.30594     72   28.6848048
  -----------------------------------------------------------------------
      Total          2443.45946     73   33.4720474

  Bartlett's test for equal variances:  chi2(1) = 3.4818   Prob>chi2 = 0.062
The F statistic is 13.18, and the difference between foreign and domestic cars' mileage ratings is significant at the 0.05% level. There are a number of ways that we could have statistically compared mileage ratings (see, for instance, [R] anova, [R] oneway, [R] regress, and [R] ttest), but oneway seemed the most convenient.
Two-way tables

Example

tabulate, summarize can be used to obtain two-way as well as one-way breakdowns. For instance, we obtained summary statistics on mpg decomposed by foreign by typing tabulate foreign, summarize(mpg). We can specify up to two variables before the comma:

. tabulate wgtcat foreign, summarize(mpg)

        Means, Standard Deviations and Frequencies of Mileage (mpg)

              |       Car type
       wgtcat |  Domestic    Foreign |      Total
  ------------+----------------------+-----------
            1 |     28.29      27.06 |     27.43
              |      3.09       5.98 |      5.23
              |         7         16 |        23
  ------------+----------------------+-----------
            2 |     21.75      19.60 |     21.24
              |      2.41       3.44 |      2.76
              |        16          5 |        21
  ------------+----------------------+-----------
            3 |     17.26      14.00 |     17.12
              |      1.86       0.00 |      1.94
              |        23          1 |        24
  ------------+----------------------+-----------
            4 |     14.67            |     14.67
              |      3.33            |      3.33
              |         6          0 |         6
  ------------+----------------------+-----------
        Total |     19.83      24.77 |     21.30
              |      4.74       6.61 |      5.79
              |        52         22 |        74
In addition to giving the means, standard deviations, and frequencies for each weight-mileage cell, the summary statistics by weight category, by car type, and overall are also reported. For instance, the last row of the table reveals that the average mileage of domestic cars is 19.83 and of foreign cars is 24.77: domestic cars yield poorer mileage than foreign cars. But we now see that domestic cars yield better gas mileage within each weight class; the reason domestic cars yield poorer gas mileage overall is that they are, on average, heavier.
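The pattern just described, better within every weight class yet worse overall, is a weighting effect, and a few lines of Python make it concrete (made-up numbers in the same spirit as the table, not the actual auto figures):

```python
# (mean mpg, number of cars) per weight class; made-up numbers.
domestic = {"light": (28.0, 2), "heavy": (17.0, 20)}
foreign  = {"light": (27.0, 20), "heavy": (14.0, 2)}

def overall(cells):
    n = sum(c for _, c in cells.values())
    return sum(m * c for m, c in cells.values()) / n

# Domestic wins inside each weight class...
better_within = all(domestic[k][0] > foreign[k][0] for k in domestic)
# ...but its fleet is mostly heavy cars, so it loses overall.
print(better_within, overall(domestic), overall(foreign))
```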
Example

When you do not specify the statistics to be included in a table, tabulate reports the mean, standard deviation, and frequency. You can specify the statistics you want to see using the means, standard, and freq options:
. tabulate wgtcat foreign, summarize(mpg) means

             Means of Mileage (mpg)

              |       Car type
       wgtcat |  Domestic    Foreign |      Total
  ------------+----------------------+-----------
            1 |     28.29      27.06 |     27.43
            2 |     21.75      19.60 |     21.24
            3 |     17.26      14.00 |     17.12
            4 |     14.67            |     14.67
  ------------+----------------------+-----------
        Total |     19.83      24.77 |     21.30
When we specify one or more of the means, standard, and freq options, only those statistics are displayed. Thus, we could obtain a table containing just the means and standard deviations by typing means standard after the summarize(mpg) option. We can also suppress selected statistics by placing no in front of the option name. Another way of obtaining only the means and standard deviations is to add the nofreq option:

. tabulate wgtcat foreign, summarize(mpg) nofreq
       Means and Standard Deviations of Mileage (mpg)

              |       Car type
       wgtcat |  Domestic    Foreign |      Total
  ------------+----------------------+-----------
            1 |     28.29      27.06 |     27.43
              |      3.09       5.98 |      5.23
  ------------+----------------------+-----------
            2 |     21.75      19.60 |     21.24
              |      2.41       3.44 |      2.76
  ------------+----------------------+-----------
            3 |     17.26      14.00 |     17.12
              |      1.86       0.00 |      1.94
  ------------+----------------------+-----------
            4 |     14.67            |     14.67
              |      3.33            |      3.33
  ------------+----------------------+-----------
        Total |     19.83      24.77 |     21.30
              |      4.74       6.61 |      5.79
Also See

Related:    [R] adjust, [R] oneway, [R] table, [R] tabstat, [R] tabulate
Background: [U] 15.6 Dataset, variable, and value labels,
            [U] 28 Commands for dealing with categorical variables
Title tabulate — One- and two-way tables of frequencies
Syntax

One-way tables

tabulate varname [weight] [if exp] [in range] [, generate(varname) matcell(matname) matrow(matname) missing nofreq nolabel plot subpop(varname) ]

tab1 varlist [weight] [if exp] [in range] [, missing nolabel plot ]

Two-way tables

tabulate varname1 varname2 [weight] [if exp] [in range] [, all cell chi2 column exact gamma lrchi2 matcell(matname) matcol(matname) matrow(matname) missing nofreq wrap nolabel row taub V ]

tab2 varlist [weight] [if exp] [in range] [, tabulate_options ]

tabi #11 #12 [...] \ #21 #22 [...] [\ ...] [, replace tabulate_options ]

by ...: may be used with tabulate, tab1, and tab2 (but not tabi); see [R] by.

fweights, aweights, and iweights are allowed by tabulate; fweights are allowed by tab1 and tab2. See [U] 14.1.6 weight.
Description

tabulate produces one- and two-way tables of frequency counts along with various measures of association, including the common Pearson chi-squared, the likelihood-ratio chi-squared, Cramer's V, Fisher's exact test, Goodman and Kruskal's gamma, and Kendall's tau-b.

tab1 produces a one-way tabulation for each variable specified in varlist.

tab2 produces all possible two-way tabulations of the variables specified in varlist.

tabi displays the r x c table using the values specified; rows are separated by '\'. If no options are specified, it is as if exact were specified for 2 x 2 tables and chi2 were specified otherwise. See [U] 22 Immediate commands for a general description of immediate commands. Specifics for tabi can be found toward the end of [R] tabulate.

Also see [R] table if you want one-, two-, or n-way tables of frequencies and a wide variety of summary statistics. See [R] tabsum for a description of tabulate with the summarize() option; it produces tables (breakdowns) of means and standard deviations. table is better than tabulate, summarize(), but tabulate, summarize() is faster. See [R] epitab for 2 x 2 tables with statistics of interest to epidemiologists.
tabulate — One- and two-way tables of frequencies
Options

Options are listed in alphabetical order.

all is equivalent to specifying chi2 lrchi2 V gamma taub. Note the omission of exact. When all is specified, no may be placed in front of the other options. all noV requests all measures of association but Cramer's V (and Fisher's exact). all exact requests all association measures including Fisher's exact test. all may not be specified if aweights or iweights are specified.

cell displays the relative frequency of each cell in a two-way table.

chi2 calculates and displays Pearson's chi-squared for the hypothesis that the rows and columns in a two-way table are independent. chi2 may not be specified if aweights or iweights are specified.

column displays in each cell of a two-way table the relative frequency of that cell within its column.

exact displays the significance calculated by Fisher's exact test and may be applied to r x c as well as to 2 x 2 tables. In the case of 2 x 2 tables, both one- and two-sided probabilities are displayed. exact may not be specified if aweights or iweights are specified.

gamma displays Goodman and Kruskal's gamma along with its asymptotic standard error. gamma is appropriate only when both variables are ordinal. gamma may not be specified if aweights or iweights are specified.

generate(varname) creates a set of indicator variables reflecting the observed values of the tabulated variable.

lrchi2 displays the likelihood-ratio chi-squared statistic. The request is ignored if any cell of the table contains no observations. lrchi2 may not be specified if aweights or iweights are specified.

matcell(matname) saves the reported frequencies in matname. This option is for use by programmers.

matcol(matname) saves the numeric values of the 1 x c column stub in matname. This option is for use by programmers. matcol() may not be specified if the column variable is a string.

matrow(matname) saves the numeric values of the r x 1 row stub in matname. This option is for use by programmers. matrow() may not be specified if the row variable is a string.

missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics.

nofreq suppresses the printing of the frequencies.

nolabel causes the numeric codes to be displayed rather than the value labels.

plot produces a bar chart of the relative frequencies in a one-way table. (Also see [G] histogram.)

replace indicates that the immediate data specified as arguments to the tabi command are to be left as the current data in place of whatever data were there.

row displays in each cell of a two-way table the relative frequency of that cell within its row.

subpop(varname) excludes observations for which varname = 0 in tabulating frequencies. The mathematical results of tabulate ..., subpop(myvar) are the same as tabulate ... if myvar~=0, but the table may be presented differently. The identities of the rows and columns will be determined from all the data, including the myvar = 0 group, so there may be entries in the table with frequency 0.
Consider tabulating answer, a variable that takes on values 1, 2, and 3, but consider tabulating it just for the male==1 subpopulation. Assume answer is never 2 in this group. tabulate answer if male==1 will produce a table with two rows: one for answer 1 and one for answer 3. There will be no row for answer 2 because answer 2 was never observed. tabulate answer, subpop(male) produces a table with three rows. The row for answer 2 will be shown as having 0 frequency.

taub displays Kendall's tau-b along with its asymptotic standard error. taub is appropriate only when both variables are ordinal. taub may not be specified if aweights or iweights are specified.

V (note capitalization) displays Cramer's V. V may not be specified if aweights or iweights are specified.

wrap requests that Stata take no action on wide, two-way tables to make them readable. Unless wrap is specified, wide tables are broken into pieces to enhance readability.
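The difference between if and subpop() described above amounts to where the row list comes from. Here is a small Python sketch of the two behaviors (made-up data and hypothetical variable names, not Stata internals):

```python
# answer takes values 1, 2, 3; answer is never 2 when male == 1.
from collections import Counter

answer = [1, 1, 3, 2, 3, 1]
male   = [1, 1, 1, 0, 1, 1]

# Like `tabulate answer if male==1`: rows come only from observed values.
if_counts = Counter(a for a, m in zip(answer, male) if m == 1)

# Like `tabulate answer, subpop(male)`: rows come from ALL the data,
# so the never-observed value 2 still appears, with frequency 0.
sub_counts = {v: if_counts.get(v, 0) for v in sorted(set(answer))}

print(dict(if_counts), sub_counts)
```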
Limits

One-way tables may have a maximum of 3,000 rows (Intercooled Stata) or 500 rows (Small Stata). Two-way tables may have a maximum of 300 rows and 20 columns (Intercooled Stata) or 150 rows and 20 columns (Small Stata). If larger tables are needed, see [R] table.
Remarks

Remarks are presented under the headings

    One-way tables
    Two-way tables
    Measures of association
    N-way tables
    Weighted data
    Tables with immediate data
    tab1 and tab2
For each value of a specified variable (or a set of values for a pair of variables), tabulate reports the number of observations with that value. The number of times a value occurs is called its frequency.
One-way tables

Example

You have data summarizing the speed limit, the number of access points (on-ramps and off-ramps) per mile, and the accident rate per million vehicle miles along various Minnesota highways in 1973. The variable containing the speed limit is called spdlimit. If you summarize the variable, you obtain its mean and standard deviation:

. summarize spdlimit

     Variable |     Obs        Mean   Std. Dev.       Min        Max
  ------------+------------------------------------------------------
     spdlimit |      39          55    5.848977         40         70
The average speed limit is 55 miles per hour. We can learn more about this variable by tabulating it:
. tabulate spdlimit

  Speed limit |      Freq.     Percent        Cum.
  ------------+-----------------------------------
           40 |          1        2.56        2.56
           45 |          3        7.69       10.26
           50 |          7       17.95       28.21
           55 |         15       38.46       66.67
           60 |         11       28.21       94.87
           65 |          1        2.56       97.44
           70 |          1        2.56      100.00
  ------------+-----------------------------------
        Total |         39      100.00
We see that one highway has a speed limit of 40 miles per hour, three have speed limits of 45, 7 of 50, and so on. The column labeled Percent shows the percent of highways in the dataset that have the indicated speed limit. For instance, 38.46% of highways in our dataset have a speed limit of 55 miles per hour. The final column shows the cumulative percent. We see that 66.67% of highways in our dataset have a speed limit of 55 miles per hour or less.
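The three columns of a one-way tabulation are straightforward to compute. This Python sketch (made-up values) builds Freq., Percent, and Cum. exactly as described:

```python
# Build the Freq., Percent, and Cum. columns of a one-way tabulation.
from collections import Counter

values = [40] + [45] * 3 + [55] * 5          # made-up speed limits

total = len(values)
cum = 0.0
table = []
for value, freq in sorted(Counter(values).items()):
    pct = 100 * freq / total
    cum += pct
    table.append((value, freq, round(pct, 2), round(cum, 2)))

print(table)
```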
Example

The plot option places a sideways histogram alongside the table:

. tabulate spdlimit, plot

  Speed limit |      Freq.
  ------------+------------+-----------------------------
           40 |          1 |  *
           45 |          3 |  ***
           50 |          7 |  *******
           55 |         15 |  ***************
           60 |         11 |  ***********
           65 |          1 |  *
           70 |          1 |  *
  ------------+------------+-----------------------------
        Total |         39

Of course, graph can produce better-looking histograms; see [G] histogram.
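The sideways histogram that plot draws is nothing more than one asterisk per observation in each row; a throwaway Python sketch:

```python
# One '*' per observation, like tabulate's plot option (made-up freqs).
freqs = {40: 1, 45: 3, 50: 7, 55: 15}

lines = [f"{value:>5} | {n:>4} | {'*' * n}"
         for value, n in sorted(freqs.items())]
print("\n".join(lines))
```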
Example

tabulate labels tables using variable and value labels if they exist. To demonstrate how this works, let's add a new variable to our dataset that categorizes spdlimit into three categories. We will call this new variable spdcat:

. generate spdcat=recode(spdlimit,50,60,70)

The recode() function divides spdlimit into 50 or below, 51-60, and above 60 miles per hour; see [U] 16.3.6 Special functions. We specified the break points in the arguments (spdlimit,50,60,70). The first argument is the variable to be recoded. The second argument is the first break point, the third argument the second break point, and so on. We can specify as many break points as we wish.

recode() used our arguments not only as the break points but to label the results as well. If spdlimit is less than or equal to 50, spdcat is set to 50; if spdlimit is between 51 and 60, spdcat is 60; otherwise, spdcat is arbitrarily set to 70. (See [U] 28 Commands for dealing with categorical variables.)
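The behavior of recode() just described can be mirrored in a few lines. This Python stand-in is a simplification (it ignores Stata's missing values): it returns the first break point the value does not exceed, or the last break point otherwise:

```python
def recode(x, *breaks):
    """Simplified stand-in for Stata's recode(x, b1, b2, ...)."""
    for b in breaks:
        if x <= b:
            return b
    return breaks[-1]   # values beyond the last break get the last break

print([recode(s, 50, 60, 70) for s in (40, 55, 70)])
```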
Since we just created the variable spdcat, it is not yet labeled. When we make a table using this variable, tabulate uses the variable's name to label it:

. tabulate spdcat

       spdcat |      Freq.     Percent        Cum.
  ------------+-----------------------------------
           50 |         11       28.21       28.21
           60 |         26       66.67       94.87
           70 |          2        5.13      100.00
  ------------+-----------------------------------
        Total |         39      100.00
Even though the table is not well labeled, recode()'s coding scheme provides us with clues as to the table's meaning. The first line of the table corresponds to 50 miles per hour and below, the next to 51 through 60 miles per hour, and the last to above 60 miles per hour. We can improve this table by labeling the values and variables:

. label define scat 50 "40 to 50" 60 "55 to 60" 70 "Above 60"
. label values spdcat scat
. label variable spdcat "Speed Limit Category"
We define a value label called scat that attaches labels to the numbers 50, 60, and 70 using the label define command; see [U] 15.6.3 Value labels. We label the value 50 as '40 to 50', since we looked back at our original tabulation in the first example and saw that the speed limit was never less than 40. Similarly, we could have labeled the last category '65 to 70' since the speed limit is never greater than 70 miles per hour. Next, we told Stata that it was to label the values of the new variable spdcat using the value label scat. Finally, we labeled our variable Speed Limit Category. We are now ready to tabulate the result: . tabulate spdcat Speed Limit Category
Freq.
Percent
Cum.
40 to 50 55 to 60 Above 60
11 26 2
28.21 66.67 5.13
28 . 21 94.87 100.00
Total
39
100.00
> Example

If you have missing values in your dataset, tabulate ignores them unless you explicitly indicate otherwise. We have no missing data in our example, so let's put some in:

. replace spdcat=. in 1
(1 real change made, 1 to missing)

We changed the first observation on spdcat to missing. Let's now tabulate the result:
. tabulate spdcat

Speed Limit |
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
   40 to 50 |         11       28.95       28.95
   55 to 60 |         26       68.42       97.37
   Above 60 |          1        2.63      100.00
------------+-----------------------------------
      Total |         38      100.00
If you compare this output with that in the previous example, you will find that the total frequency count is now one less than it was—38 rather than 39. You will also find that the 'Above 60' category now has only one observation where it used to have two, so we evidently changed a road with a high speed limit. If you want tabulate to treat missing values just as it treats numbers, specify the missing option:

. tabulate spdcat, missing

Speed Limit |
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
   40 to 50 |         11       28.21       28.21
   55 to 60 |         26       66.67       94.87
   Above 60 |          1        2.56       97.44
          . |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00
We now see our missing value—the last category, labeled '.', shows a frequency count of 1. The table sum is once again 39.

Let's put our dataset back as it was originally:

. replace spdcat=70 in 1
(1 real change made)
Technical Note

tabulate also has the ability to automatically create indicator variables from categorical variables. We will briefly review that capability here, but see [U] 28 Commands for dealing with categorical variables for a complete description. Let's begin by describing our highway dataset:

. describe
Contains data from hiway.dta
  obs:            39                          Minnesota Highway Data, 1973
 vars:             5                          21 Jul 2000 11:42
 size:           936 (99.5% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
acc_rate        float  %9.0g                  Accident rate
spdlimit        float  %9.0g                  Speed limit
acc_pts         float  %9.0g                  Access points per mile
rate            float  %9.0g       rcat       Accident rate per million vehicle
                                                miles
spdcat          float  %9.0g       scat       Speed Limit Category
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
Our dataset contains five variables. We will type tabulate spdcat, generate(spd), describe our data, and then explain what happened.

. tabulate spdcat, generate(spd)

Speed Limit |
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
   40 to 50 |         11       28.21       28.21
   55 to 60 |         26       66.67       94.87
   Above 60 |          2        5.13      100.00
------------+-----------------------------------
      Total |         39      100.00
. describe
Contains data from hiway.dta
  obs:            39                          Minnesota Highway Data, 1973
 vars:             8                          21 Jul 2000 11:42
 size:         1,053 (99.5% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
acc_rate        float  %9.0g                  Accident rate
spdlimit        float  %9.0g                  Speed limit
acc_pts         float  %9.0g                  Access points per mile
rate            float  %9.0g       rcat       Accident rate per million vehicle
                                                miles
spdcat          float  %9.0g       scat       Speed Limit Category
spd1            byte   %8.0g                  spdcat==40 to 50
spd2            byte   %8.0g                  spdcat==55 to 60
spd3            byte   %8.0g                  spdcat==Above 60
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
When we typed tabulate with the generate() option, Stata responded by producing a one-way frequency table, so it appeared that the option did nothing. Yet when we describe our dataset, we find that we now have eight variables instead of the original five. The new variables are named spd1, spd2, and spd3. When you specify the generate() option, you are telling Stata not only to produce the table but also to create a set of indicator variables that correspond to that table. Stata adds a numeric suffix to the name you specify in the parentheses. spd1 refers to the first line of the table, spd2 to the second line, and so on. In addition, Stata labels the variables so that you know what they mean. spd1 is an indicator variable that is true (takes on the value 1) when spdcat is between 40 and 50; otherwise, it is zero. (There is an exception: If spdcat is missing, so are the spd1, spd2, and spd3 variables. This did not happen in our dataset.)

We want to prove our claim to you. Since we have not yet introduced two-way tabulations, we will use the summarize statement:

. summarize spdlimit if spd1==1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    spdlimit |      11    47.72727   3.437758         40         50

. summarize spdlimit if spd2==1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    spdlimit |      26    57.11538   2.519157         55         60

. summarize spdlimit if spd3==1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    spdlimit |       2        67.5   3.535534         65         70
Notice the indicated minimum and maximum in each of the tables above. When we restrict the sample to spd1, spdlimit is between 40 and 50; when we restrict the sample to spd2, spdlimit is between 55 and 60; when we restrict the sample to spd3, spdlimit is between 65 and 70. Thus, tabulate provides an easy way to create indicator (sometimes called dummy) variables. We could now use these variables in, for instance, regression analysis. See [U] 28 Commands for dealing with categorical variables for an example.
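The indicator-variable rule described in this technical note—1 when the observation falls in the category, 0 otherwise, and missing when the source variable is itself missing—can be sketched as follows. This is an illustrative Python sketch (None stands in for Stata's missing value, and the data are a made-up fragment, not the highway dataset):

```python
spdcat = [50, 60, 60, 70, None]      # hypothetical category values
categories = [50, 60, 70]            # one indicator per table line

# spd1, spd2, spd3: 1 if in category, 0 otherwise, missing if source missing
indicators = {
    f"spd{k}": [None if v is None else int(v == c) for v in spdcat]
    for k, c in enumerate(categories, start=1)
}
print(indicators["spd1"])  # -> [1, 0, 0, 0, None]
```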
Two-way tables

> Example

tabulate will make two-way tables if you specify two variables following the word tabulate. In our highway dataset, we have a variable called rate that divides the accident rate into three categories: below 4, 4-7, and above 7 per million vehicle miles. Let's make a table of the speed limit category and the accident rate category:

. tabulate spdcat rate

      Speed |   Accident rate per million
      Limit |         vehicle miles
   Category |  Below 4        4-7    Above 7 |     Total
------------+---------------------------------+----------
   40 to 50 |        3          5          3 |        11
   55 to 60 |       19          6          1 |        26
   Above 60 |        2          0          0 |         2
------------+---------------------------------+----------
      Total |       24         11          4 |        39

The table indicates that 3 stretches of highway have an accident rate below 4 and a speed limit of 40 to 50 miles per hour. The table also shows the row and column sums (called the marginals). The number of highways with a speed limit of 40 to 50 miles per hour is 11, which is the same result we obtained in our previous one-way tabulations. Stata can present this basic table in a number of ways—16, to be precise—and we will show just a few below. It might be easier to read the table if we included the row percentages. For instance, out of 11 highways in the lowest speed limit category, 3 are also in the lowest accident rate category. Three-elevenths amounts to some 27.3%. We can ask Stata to fill in this information for us by using the row option:
. tabulate spdcat rate, row

      Speed |   Accident rate per million
      Limit |         vehicle miles
   Category |  Below 4        4-7    Above 7 |     Total
------------+---------------------------------+----------
   40 to 50 |        3          5          3 |        11
            |    27.27      45.45      27.27 |    100.00
------------+---------------------------------+----------
   55 to 60 |       19          6          1 |        26
            |    73.08      23.08       3.85 |    100.00
------------+---------------------------------+----------
   Above 60 |        2          0          0 |         2
            |   100.00       0.00       0.00 |    100.00
------------+---------------------------------+----------
      Total |       24         11          4 |        39
            |    61.54      28.21      10.26 |    100.00
The number listed below each frequency is the percentage of cases that each cell represents out of its row. That is easy to remember because we see 100% listed in the "Total" column. The bottom row is also informative. We see that 61.54% of all the highways in our dataset fall into the lowest accident rate category, that 28.21% are in the middle category, and that 10.26% are in the highest. tabulate can calculate column percentages and cell percentages as well. It does so when you specify the column or cell options, respectively. You can even specify them together. Below is a table that includes everything:

. tabulate spdcat rate, row column cell

      Speed |   Accident rate per million
      Limit |         vehicle miles
   Category |  Below 4        4-7    Above 7 |     Total
------------+---------------------------------+----------
   40 to 50 |        3          5          3 |        11
            |    27.27      45.45      27.27 |    100.00
            |    12.50      45.45      75.00 |     28.21
            |     7.69      12.82       7.69 |     28.21
------------+---------------------------------+----------
   55 to 60 |       19          6          1 |        26
            |    73.08      23.08       3.85 |    100.00
            |    79.17      54.55      25.00 |     66.67
            |    48.72      15.38       2.56 |     66.67
------------+---------------------------------+----------
   Above 60 |        2          0          0 |         2
            |   100.00       0.00       0.00 |    100.00
            |     8.33       0.00       0.00 |      5.13
            |     5.13       0.00       0.00 |      5.13
------------+---------------------------------+----------
      Total |       24         11          4 |        39
            |    61.54      28.21      10.26 |    100.00
            |   100.00     100.00     100.00 |    100.00
            |    61.54      28.21      10.26 |    100.00
The number at the top of each cell is the frequency count. The second number is the row percentage—they sum to 100% going across the table. The third number is the column percentage—they sum to 100% going down the table. The bottom number is the cell percentage—they sum to 100% going down all the columns and across all the rows. For instance, highways with a speed limit above 60 miles per hour and in the lowest accident rate category account for 100% of highways with a speed limit above 60 miles per hour; 8.33% of highways in the lowest accident rate category; and 5.13% of all our data.
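The row, column, and cell percentages are simple ratios of a cell count to its row total, column total, and grand total. Recomputing the three percentages shown for the (40 to 50, Below 4) cell in a short Python check:

```python
# Frequencies from the spdcat-by-rate table
freq = [[3, 5, 3],      # 40 to 50
        [19, 6, 1],     # 55 to 60
        [2, 0, 0]]      # Above 60
n = sum(map(sum, freq))                       # grand total: 39
row_tot = [sum(r) for r in freq]              # [11, 26, 2]
col_tot = [sum(c) for c in zip(*freq)]        # [24, 11, 4]

row_pct = round(100 * freq[0][0] / row_tot[0], 2)    # -> 27.27
col_pct = round(100 * freq[0][0] / col_tot[0], 2)    # -> 12.5
cell_pct = round(100 * freq[0][0] / n, 2)            # -> 7.69
print(row_pct, col_pct, cell_pct)
```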
There is a fourth option you will find useful—nofreq. That option tells Stata not to print the frequency counts. If we wish to construct a table consisting of only row percentages, we type

. tabulate spdcat rate, row nofreq

      Speed |   Accident rate per million
      Limit |         vehicle miles
   Category |  Below 4        4-7    Above 7 |     Total
------------+---------------------------------+----------
   40 to 50 |    27.27      45.45      27.27 |    100.00
   55 to 60 |    73.08      23.08       3.85 |    100.00
   Above 60 |   100.00       0.00       0.00 |    100.00
------------+---------------------------------+----------
      Total |    61.54      28.21      10.26 |    100.00
Measures of association

> Example

tabulate will calculate the Pearson chi2 test for the independence of the rows and columns if you specify the chi2 option. Suppose you have 1980 Census data on 956 cities in the U.S. and wish to compare the age distribution across regions of the country. Assume that agecat is the median age in each city and that region denotes the region of the country in which the city is located.

. tabulate region agecat, chi2

    Census |              agecat
    Region |     19-29      30-34        35+ |     Total
-----------+---------------------------------+----------
        NE |        46         83         37 |       166
   N Cntrl |       162         92         30 |       284
     South |       139         68         43 |       250
      West |       160         73         23 |       256
-----------+---------------------------------+----------
     Total |       507        316        133 |       956

          Pearson chi2(6) =  61.2877   Pr = 0.000
We obtain the standard two-way table and, at the bottom, a summary of the chi2 test. Stata informs us that the chi2 associated with this table has 6 degrees of freedom and is 61.29. The observed differences are quite significant. The table is, perhaps, easier to understand if we suppress the frequencies and print just the row percentages:

. tabulate region agecat, row nofreq chi2

    Census |              agecat
    Region |     19-29      30-34        35+ |     Total
-----------+---------------------------------+----------
        NE |     27.71      50.00      22.29 |    100.00
   N Cntrl |     57.04      32.39      10.56 |    100.00
     South |     55.60      27.20      17.20 |    100.00
      West |     62.50      28.52       8.98 |    100.00
-----------+---------------------------------+----------
     Total |     53.03      33.05      13.91 |    100.00

          Pearson chi2(6) =  61.2877   Pr = 0.000
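The Pearson statistic reported above is the familiar sum of (observed − expected)²/expected, with each expected count equal to (row total × column total)/n (see Methods and Formulas). A stdlib-only Python check against the frequencies in the table:

```python
# region-by-agecat frequencies from the table above
obs = [[46, 83, 37],      # NE
       [162, 92, 30],     # N Cntrl
       [139, 68, 43],     # South
       [160, 73, 23]]     # West
n = sum(map(sum, obs))
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

chi2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(4) for j in range(3))
print(round(chi2, 4))  # ~ 61.2877, with (4-1)(3-1) = 6 degrees of freedom
```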
> Example

You have data on dose level and outcome for a set of patients and wish to evaluate the association between the two variables. You can obtain all the association measures by specifying the all and exact options:

. tabulate dose function, all exact

           |            Function
    Dosage |    < 1 hr     1 to 4         4+ |     Total
-----------+---------------------------------+----------
     1/day |        20         10          2 |        32
     2/day |        16         12          4 |        32
     3/day |        10         16          6 |        32
-----------+---------------------------------+----------
     Total |        46         38         12 |        96

          Pearson chi2(4) =   6.7780   Pr = 0.148
 likelihood-ratio chi2(4) =   6.9844   Pr = 0.137
           Cramer's V     =   0.1879
                    gamma =   0.3689   ASE = 0.129
          Kendall's tau-b =   0.2378   ASE = 0.086
           Fisher's exact =             0.145

We find evidence of association, but not enough to be truly convincing. Had we not also specified the exact option, we would not have obtained Fisher's exact test. Stata can calculate this statistic both for 2 x 2 tables and for r x c. For 2 x 2 tables, the calculation is almost instant. On more general tables, however, the calculation can take a long time. In this case, on a 60 MHz Pentium, the calculation took 9.4 seconds. On a 25 MHz 80386 with an 80387, the calculation took 1.5 minutes. That is why all does not imply exact. It must be explicitly requested. Note that we carefully constructed our example so that all would be meaningful. Kendall's tau-b and Goodman and Kruskal's gamma are relevant only when both dimensions of the table can be ordered, say low-to-high or worst-to-best. The other statistics, however, are applicable in all cases.
Technical Note

You are warned that calculation of Fisher's exact test can not only run into the minutes, it can run into the hours for big tables. Using the older 25 MHz 80386 computer, a 9 x 2 table containing 40 observations took 30 seconds, but a 5 x 4 containing 29 observations took 49 minutes (5.1 minutes on a 60 MHz Pentium)!
N-way tables

If you need more than two-way tables, your best alternative is to use table, not tabulate; see [R] table. In the technical note below, we show you how to trick tabulate into doing a sequence of two-way tables that together form, in effect, a three-way table, but using table is easy and produces prettier results:
. table birthcat region agecat, c(freq)

----------------------------------------------------------------------
          |                  agecat and Census Region
          | -------- 19-29 ---------     -------- 30-34 ---------
 birthcat |  NE  N Cntrl  South  West     NE  N Cntrl  South  West
----------+-----------------------------------------------------------
   29-136 |  11     23      11    11      34     27      10     8
  137-195 |  31     97      46    65      48     58      45    42
  196-529 |   4     38      91    59       1      3      12    21
----------------------------------------------------------------------

-------------------------------------
          | agecat and Census Region
          | --------- 35+ ----------
 birthcat |  NE  N Cntrl  South  West
----------+--------------------------
   29-136 |  34     26      27    18
  137-195 |   3      4       7     4
  196-529 |   0      0       4     0
-------------------------------------
Technical Note

You can make n-way tables by combining the by varlist: prefix with tabulate. Continuing with the dataset of 956 cities, say we want to make a table of age category by birth rate category by region of the country. The birth rate category variable is named birthcat in our dataset. Below we make separate tables for each age category.

. by agecat, sort: tabulate birthcat region
-> agecat = 19-29

 birthcat |              Census Region
          |        NE    N Cntrl      South       West |     Total
----------+---------------------------------------------+----------
   29-136 |        11         23         11         11 |        56
  137-195 |        31         97         46         65 |       239
  196-529 |         4         38         91         59 |       192
----------+---------------------------------------------+----------
    Total |        46        158        148        135 |       487

-> agecat = 30-34

 birthcat |              Census Region
          |        NE    N Cntrl      South       West |     Total
----------+---------------------------------------------+----------
   29-136 |        34         27         10          8 |        79
  137-195 |        48         58         45         42 |       193
  196-529 |         1          3         12         21 |        37
----------+---------------------------------------------+----------
    Total |        83         88         67         71 |       309

-> agecat = 35+

 birthcat |              Census Region
          |        NE    N Cntrl      South       West |     Total
----------+---------------------------------------------+----------
   29-136 |        34         26         27         18 |       105
  137-195 |         3          4          7          4 |        18
  196-529 |         0          0          4          0 |         4
----------+---------------------------------------------+----------
    Total |        37         30         38         22 |       127
Weighted data

> Example

tabulate can process weighted as well as unweighted data. As with all Stata commands, you indicate the weight by specifying the [weight] modifier; see [U] 14.1.6 weight. Continuing with our dataset of 956 cities, we also have a variable called pop, the population of each city. We can make a table of region by age category, weighted by population, by typing

. tabulate region agecat [freq=pop]

    Census |            Age Category
    region |     19-29      30-34        35+ |      Total
-----------+---------------------------------+-----------
        NE |   4257167   17290828    5015443 |   26563438
   N Cntrl |  17161373    5548927    1348988 |   24059288
     South |  17607696    4809089    2612535 |   25029320
      West |  12862832    9089231    1856258 |   23808321
-----------+---------------------------------+-----------
     Total |  51889068   36738075   10833224 |   99460367
If we specify the cell, column, or row options, they will also be appropriately weighted. Below we repeat the table, suppressing the counts and substituting row percentages:

. tabulate region agecat [freq=pop], nofreq row

    Census |            Age Category
    region |     19-29      30-34        35+ |     Total
-----------+---------------------------------+----------
        NE |     16.03      65.09      18.88 |    100.00
   N Cntrl |     71.33      23.06       5.61 |    100.00
     South |     70.35      19.21      10.44 |    100.00
      West |     54.03      38.18       7.80 |    100.00
-----------+---------------------------------+----------
     Total |     52.17      36.94      10.89 |    100.00
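Each weighted row percentage is just the weighted cell count over its weighted row total. Checking the NE row from the two tables above in Python:

```python
ne = [4257167, 17290828, 5015443]            # NE row, weighted by pop
total = sum(ne)                              # 26563438
pcts = [round(100 * x / total, 2) for x in ne]
print(pcts)  # -> [16.03, 65.09, 18.88]
```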
Tables with immediate data

> Example

tabi ignores the dataset in memory and uses as the table the values you specify on the command line:

. tabi 30 18 \ 38 14

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         18 |        48
         2 |        38         14 |        52
-----------+----------------------+----------
     Total |        68         32 |       100

           Fisher's exact =                 0.289
   1-sided Fisher's exact =                 0.179
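The two Fisher's exact probabilities above can be reproduced from the hypergeometric distribution with the margins held fixed (the definition given in Methods and Formulas below). The following stdlib-only Python sketch is illustrative, not Stata's algorithm; note that the one-sided tail here is the lower tail, because the observed cell count lies below its expectation:

```python
from math import comb

def fisher_2x2(a, b, c, d):
    # One- and two-sided Fisher's exact p for the table [[a, b], [c, d]],
    # holding the row and column marginals fixed.
    r1, c1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    denom = comb(n, r1)
    probs = {x: comb(c1, x) * comb(n - c1, r1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # two-sided: all tables no more probable than the observed one
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    # one-sided: the tail containing the observed table (lower tail here)
    one_sided = sum(p for x, p in probs.items() if x <= a)
    return one_sided, two_sided

one, two = fisher_2x2(30, 18, 38, 14)
print(round(one, 3), round(two, 3))  # compare with 0.179 and 0.289 above
```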
You may specify any of the options of tabulate and are not limited to 2 x 2 tables:

. tabi 30 18 38 \ 13 7 22, chi2 exact

           |               col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        30         18         38 |        86
         2 |        13          7         22 |        42
-----------+---------------------------------+----------
     Total |        43         25         60 |       128

          Pearson chi2(2) =   0.7967   Pr = 0.671
           Fisher's exact =             0.707

. tabi 30 13 \ 18 7 \ 38 22, all exact col

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         13 |        43
           |     34.88      30.95 |     33.59
-----------+----------------------+----------
         2 |        18          7 |        25
           |     20.93      16.67 |     19.53
-----------+----------------------+----------
         3 |        38         22 |        60
           |     44.19      52.38 |     46.88
-----------+----------------------+----------
     Total |        86         42 |       128
           |    100.00     100.00 |    100.00

          Pearson chi2(2) =   0.7967   Pr = 0.671
 likelihood-ratio chi2(2) =   0.7985   Pr = 0.671
           Cramer's V     =   0.0789
                    gamma =   0.1204   ASE = 0.160
          Kendall's tau-b =   0.0630   ASE = 0.084
           Fisher's exact =             0.707
Note that, for 2 x 2 tables, both the one- and two-sided Fisher's exact probabilities are displayed; this is true of both tabulate and tabi. See Cumulative incidence data and Case-control data in [R] epitab for more discussion on the relationship between one- and two-sided probabilities.
Technical Note

tabi, as with all immediate commands, leaves any data in memory undisturbed. With the replace option, however, the data in memory are replaced by the data from the table:

. tabi 30 18 \ 38 14, replace

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         18 |        48
         2 |        38         14 |        52
-----------+----------------------+----------
     Total |        68         32 |       100

           Fisher's exact =                 0.289
   1-sided Fisher's exact =                 0.179
. list

        row        col        pop
 1.       1          1         30
 2.       1          2         18
 3.       2          1         38
 4.       2          2         14

With this dataset, one could recreate the above table by typing

. tabulate row col [freq=pop], exact

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         18 |        48
         2 |        38         14 |        52
-----------+----------------------+----------
     Total |        68         32 |       100

           Fisher's exact =                 0.289
   1-sided Fisher's exact =                 0.179
tab1 and tab2

tab1 and tab2 are convenience tools. Typing

. tab1 myvar thisvar thatvar, plot

is equivalent to typing

. tabulate myvar, plot
. tabulate thisvar, plot
. tabulate thatvar, plot

Typing

. tab2 myvar thisvar thatvar, chi2

is equivalent to typing

. tabulate myvar thisvar, chi2
. tabulate myvar thatvar, chi2
. tabulate thisvar thatvar, chi2
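The set of two-way tables tab2 produces is just every unordered pair of variables from the varlist, taken in order; in Python terms:

```python
from itertools import combinations

varlist = ["myvar", "thisvar", "thatvar"]
pairs = list(combinations(varlist, 2))   # one tabulate call per pair
print(pairs)
# -> [('myvar', 'thisvar'), ('myvar', 'thatvar'), ('thisvar', 'thatvar')]
```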
Saved Results

tabulate, tab1, tab2, and tabi save in r():

Scalars
    r(N)          number of observations
    r(r)          number of rows
    r(c)          number of columns
    r(chi2)       Pearson's chi2
    r(p)          significance of Pearson's chi2
    r(gamma)      gamma
    r(p1_exact)   one-sided Fisher's exact p
    r(p_exact)    Fisher's exact p
    r(chi2_lr)    likelihood-ratio chi2
    r(p_lr)       significance of likelihood-ratio chi2
    r(CramersV)   Cramer's V
    r(ase_gam)    ASE of gamma
    r(ase_taub)   ASE of tau-b
    r(taub)       tau-b

r(p1_exact) is defined only for 2 x 2 tables. In addition, the matrow(), matcol(), and matcell() options allow you to obtain the row values, column values, and frequencies, respectively.
Methods and Formulas
tab1, tab2, and tabi are implemented as ado-files.

Let n_{ij}, i = 1, ..., I, j = 1, ..., J, be the number of observations in the ith row and jth column. If the data are not weighted, n_{ij} is just a count. If the data are weighted, n_{ij} is the sum of the weights of all data corresponding to the (i, j) cell. Define the row and column marginals as

    n_{i.} = \sum_{j=1}^{J} n_{ij}, \qquad n_{.j} = \sum_{i=1}^{I} n_{ij}

and let n = \sum_i \sum_j n_{ij} be the overall sum. Also define the concordance and discordance as

    A_{ij} = \sum_{k>i} \sum_{l>j} n_{kl} + \sum_{k<i} \sum_{l<j} n_{kl}
    D_{ij} = \sum_{k>i} \sum_{l<j} n_{kl} + \sum_{k<i} \sum_{l>j} n_{kl}

along with twice the number of concordances P = \sum_i \sum_j n_{ij} A_{ij} and twice the number of discordances Q = \sum_i \sum_j n_{ij} D_{ij}.

The Pearson chi2 statistic with (I-1)(J-1) degrees of freedom (so called because it is based on Pearson 1900; see Conover 1999, 240 and Fienberg 1980, 9) is defined as

    X^2 = \sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}, \qquad \text{where } m_{ij} = \frac{n_{i.}\, n_{.j}}{n}

The likelihood-ratio chi2 statistic with (I-1)(J-1) degrees of freedom (Fienberg 1980, 40) is defined as

    G^2 = 2 \sum_i \sum_j n_{ij} \ln(n_{ij}/m_{ij})

Cramer's V (Cramer 1946; also see Agresti 1984, 23-24) is a measure of association designed so that the attainable upper bound is 1. For 2 x 2 tables, -1 <= V <= 1, and otherwise 0 <= V <= 1:

    V = \frac{n_{11} n_{22} - n_{12} n_{21}}{(n_{1.}\, n_{2.}\, n_{.1}\, n_{.2})^{1/2}}   \quad \text{for } 2 \times 2

    V = \left( \frac{X^2/n}{\min(I-1,\, J-1)} \right)^{1/2}   \quad \text{otherwise}

Gamma (Goodman and Kruskal 1954, 1959, 1963, 1972; also see Agresti 1984, 159-161) ignores tied pairs and is based only on the number of concordant and discordant pairs of observations,

    \gamma = \frac{P - Q}{P + Q}

with asymptotic variance

    \frac{16 \sum_i \sum_j n_{ij} \,(Q A_{ij} - P D_{ij})^2}{(P + Q)^4}
Kendall's tau-b (Kendall 1945; also see Agresti 1984, 161-163), -1 <= \tau_b <= 1, is similar to gamma except that it uses a correction for ties:

    \tau_b = \frac{P - Q}{\sqrt{w_r w_c}}

with asymptotic variance

    \frac{\sum_i \sum_j n_{ij} \bigl( 2\sqrt{w_r w_c}\, d_{ij} + \tau_b v_{ij} \bigr)^2 - n^3 \tau_b^2 (w_r + w_c)^2}{(w_r w_c)^2}

where

    w_r = n^2 - \sum_i n_{i.}^2, \qquad w_c = n^2 - \sum_j n_{.j}^2,
    d_{ij} = A_{ij} - D_{ij}, \qquad v_{ij} = n_{i.} w_c + n_{.j} w_r

Fisher's exact test (Fisher 1935; Finney 1948; and see Zelterman and Louis 1992, 293-301, for the 2 x 2 case) yields the probability of observing a table that gives at least as much evidence of association as the one actually observed under the assumption of no association. Holding row and column marginals fixed, the hypergeometric probability of every possible table with those marginals is computed, and the two-sided significance probability is calculated as

    p = \sum_{T \in A} \Pr(T)

where A is the set of all tables with the same marginals as the observed table, T^*, such that \Pr(T) <= \Pr(T^*). In the case of 2 x 2 tables, the one-sided probability is calculated by further restricting A to tables in the same tail as T^*. The first algorithm extending this calculation to r x c tables was Pagano and Halvorsen (1981); the one implemented here is a search-tree clipping method based on the ideas of Mehta and Patel (1983). Fisher's exact test is a permutation test. For more information on permutation tests, see Good (2000).
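The definitions above can be checked numerically against the dose-by-function example earlier in this entry. The Python sketch below is illustrative only (it recomputes the Pearson chi2, Cramer's V, gamma, and Kendall's tau-b from the published frequencies; it is not Stata's implementation):

```python
# dose-by-function frequencies from the earlier example
n_ij = [[20, 10, 2],       # 1/day
        [16, 12, 4],       # 2/day
        [10, 16, 6]]       # 3/day
I, J = len(n_ij), len(n_ij[0])
n = sum(map(sum, n_ij))
ni = [sum(r) for r in n_ij]            # row marginals n_i.
nj = [sum(c) for c in zip(*n_ij)]      # column marginals n_.j

def A(i, j):   # concordance for cell (i, j)
    return sum(n_ij[k][l] for k in range(I) for l in range(J)
               if (k > i and l > j) or (k < i and l < j))

def D(i, j):   # discordance for cell (i, j)
    return sum(n_ij[k][l] for k in range(I) for l in range(J)
               if (k > i and l < j) or (k < i and l > j))

P = sum(n_ij[i][j] * A(i, j) for i in range(I) for j in range(J))
Q = sum(n_ij[i][j] * D(i, j) for i in range(I) for j in range(J))

chi2 = sum((n_ij[i][j] - ni[i] * nj[j] / n) ** 2 / (ni[i] * nj[j] / n)
           for i in range(I) for j in range(J))
V = ((chi2 / n) / min(I - 1, J - 1)) ** 0.5
gamma = (P - Q) / (P + Q)
wr = n * n - sum(x * x for x in ni)
wc = n * n - sum(x * x for x in nj)
taub = (P - Q) / (wr * wc) ** 0.5

# chi2 ~ 6.7780, V ~ 0.1879, gamma ~ 0.3689, tau-b ~ 0.2378 (as published)
print(round(chi2, 4), round(V, 4), round(gamma, 4), round(taub, 4))
```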
References

Agresti, A. 1984. Analysis of Ordinal Categorical Data. New York: John Wiley & Sons.

Conover, W. J. 1999. Practical Nonparametric Statistics. 3d ed. New York: John Wiley & Sons.

Cox, N. J. 1996. sg57: An immediate command for two-way tables. Stata Technical Bulletin 33: 7-9. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 140-143.

——. 1999. sg113: Tabulation of modes. Stata Technical Bulletin 50: 26-27. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 180-181.

Cramer, H. 1946. Mathematical Methods of Statistics. Princeton: Princeton University Press.

Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. 2d ed. Cambridge, MA: MIT Press.

Finney, D. J. 1948. The Fisher-Yates test of significance in 2 x 2 contingency tables. Biometrika 35: 145-156.

Fisher, R. A. 1935. The logic of inductive inference. Journal of the Royal Statistical Society, Series A 98: 39-54.

Good, P. 2000. Permutation Tests. 2d ed. New York: Springer-Verlag.

Goodman, L. A. and W. H. Kruskal. 1954. Measures of association for cross classifications. Journal of the American Statistical Association 49: 732-764.
——. 1959. Measures of association for cross classifications II: further discussion and references. Journal of the American Statistical Association 54: 123-163.

——. 1963. Measures of association for cross classifications III: approximate sampling theory. Journal of the American Statistical Association 58: 310-364.

——. 1972. Measures of association for cross classifications IV: simplification of asymptotic variances. Journal of the American Statistical Association 67: 415-421.

Judson, D. H. 1992. sg12: Extended tabulate utilities. Stata Technical Bulletin 10: 22-23. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 140-141.

Kendall, M. G. 1945. The treatment of ties in rank problems. Biometrika 33: 239-251.

Mehta, C. R. and N. R. Patel. 1983. A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association 78: 421-434.

Pagano, M. and K. Halvorsen. 1981. An algorithm for finding the exact significance levels of r x c tables. Journal of the American Statistical Association 76: 931-934.

Pearson, K. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50: 157-175.

Selvin, S. 1995. Practical Biostatistical Methods. Belmont, CA: Duxbury Press.

Wolfe, R. 1999. sg118: Partitions of Pearson's chi2 for analyzing two-way tables that have ordered columns. Stata Technical Bulletin 51: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 203-207.

Zelterman, D. and T. A. Louis. 1992. Contingency tables in medical studies. In Medical Uses of Statistics, ed. J. C. Bailar III and F. Mosteller, 293-310. 2d ed. Boston: New England Journal of Medicine Books.
Also See

Complementary:  [R] encode

Related:        [R] epitab, [R] table, [R] tabstat, [R] tabsum, [R] xtdes, [R] xttab

Background:     [U] 15.6.3 Value labels,
                [U] 22 Immediate commands,
                [U] 28 Commands for dealing with categorical variables
Title test — Test linear hypotheses after model estimation
Syntax

(1)  test exp = exp [, accumulate notest ]

(2)  test [coefficientlist] [, accumulate notest ]

(3)  test [term [term [...]]] [/ term [term [...]]] [, symbolic ]

(4)  testparm varlist [, equal ]

Syntax 1 is available after all estimation commands. Syntax 2 is available after all estimation commands except anova. Syntax 3 is available only after anova. Syntax 4 is available after all estimation commands except anova and multiple-equation estimation commands such as mlogit or mvreg.

coefficientlist is one of
    coefficient [coefficient ...]
    [eqno]: coef [coef ...]

coefficient is
    coef
    [eqno]coef

coef is
    varname
    _b[varname]
or something more complicated in the case of anova

eqno is
    ##
    #
    name

term is
    varname[{*||}varname[...]]

Distinguish between [ ], which are to be typed, and [ ], which indicate optional arguments.
Description

test tests linear hypotheses about the estimated parameters from the most recently estimated model using a Wald test. Without arguments, test redisplays the results of the last test.

testparm provides a useful alternative to test that permits varlist rather than just a list of coefficients (which is often nothing more than a list of variables), allowing use of standard Stata notation including '-' and '*', which are given the expression interpretation by test.

For likelihood-ratio tests, see [R] lrtest. For tests of nonlinear hypotheses, see [R] testnl. To display estimates for one-dimensional tests, see [R] lincom.

If you have estimated a model with one of the svy commands, you should use svytest rather than test; see [R] svytest.
Options

accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.

notest suppresses the output. This option is useful when you are interested only in the joint test of a number of hypotheses.

symbolic requests the symbolic form of the test rather than the test statistic and is allowed only after anova estimation. When this option is specified without any terms (test, symbolic), the symbolic form of the estimable functions is displayed.

equal tests that the variables appearing in varlist that also appear in the previously estimated model are equal to each other rather than jointly equal to zero.
Remarks

test performs F or chi2 tests of linear restrictions applied to the most recently estimated model (e.g., anova, regress, ... in the linear regression case; cox, logit, ... in the single-equation maximum-likelihood case; mlogit, mvreg, ... in the multiple-equation maximum-likelihood case). test may be used after any estimation command, although in the case of maximum likelihood techniques, the test is performed on the estimate of the covariance matrix—you may prefer to use the more computationally expensive likelihood-ratio test; see [U] 23 Estimation and post-estimation commands and [R] lrtest.

There are three variations on the syntax for test. The first syntax,

. test exp = exp

is allowed after any form of estimation, although it is not that useful after anova unless you wish to concoct your own test. The anova case is discussed in [R] anova, so we will ignore it. Putting aside anova, after estimating a model of depvar on x1, x2, and x3, typing test x1+x2=x3 tests the restriction that the coefficients on x1 and x2 sum to the coefficient on x3. The expressions can be arbitrarily complicated; for instance, typing test x1+2*(x2+x3)=x2+3*x3 is the same as typing test x1+x2=x3.

Note that test understands that when you type x1, you are referring to the coefficient on x1. You could also (and more explicitly) type test _b[x1]+_b[x2]=_b[x3] (or test _coef[x1]+_coef[x2]=_coef[x3] or test [#1]x1+[#1]x2=[#1]x3 or many other things since there is more than one way to refer to an estimated coefficient; see [U] 16.5 Accessing coefficients and standard errors). The shorthand involves less typing. On the other hand, you must be more explicit after estimation of multiple-equation models since there may be more than one coefficient associated with an independent variable. You might type, for instance, test [#2]x1+[#2]x2=[#2]x3 to test the constraint in equation 2 or, more readably, test [ford]x1+[ford]x2=[ford]x3, meaning to test the constraint on the equation corresponding to ford, which might be equation 2. (ford would be an equation name after, say, sureg or, after mlogit, ford would be one of the outcomes. In the case of mlogit, you could also type test [2]x1+[2]x2=[2]x3—note the lack of the #—meaning not equation 2 but the equation corresponding to the numeric outcome 2.) You can even test constraints across equations: test [ford]x1+[ford]x2=[buick]x3.

The second syntax,

. test coefficientlist

is available after all estimation commands except anova and is a convenient way to test that multiple coefficients are zero following estimation. A coefficientlist can simply be a list of variable names

. test varname [varname ...]
test — Test linear hypotheses after model estimation
158
and it is most often specified that way. After estimating a model of depvar on x1, x2, and x3, typing test x1 x3 tests that the coefficients on x1 and x3 are jointly zero. After multiple-equation estimation, this would test that the coefficients on x1 and x3 are zero in all equations that contain them. Alternatively, you can be more explicit and type, for instance, test [ford]x1 [ford]x3 to test that the coefficients on x1 and x3 are zero in the equation corresponding to ford.

In the multiple-equation case, there are more alternatives. You could also test that the coefficients on x1 and x3 are zero in the equation corresponding to ford by typing test [ford]: x1 x3. You could test that all coefficients except the coefficient on the constant are zero in the equation corresponding to ford by typing test [ford]. You could test that the coefficients on x1 and x3 in the equation corresponding to ford are equal to the corresponding coefficients in the equation corresponding to buick by typing test [ford=buick]: x1 x3. You could test that all the corresponding coefficients except the constant are equal by typing test [ford=buick].

testparm is much like the second syntax of test except it cannot be used after multiple-equation commands and, like the second syntax, cannot be used after anova. Its usefulness will be demonstrated below.

Finally, syntax 3 of test is used for testing effects after anova models. It will not be discussed here; see [R] anova. In the examples below we will use regress, but what is said is equally applicable after any single-equation estimation command (such as logistic, etc.). It is also applicable after multiple-equation estimation commands if you bear in mind that references to coefficients must be qualified with an equation name or number in square brackets placed before them.
The convenient syntaxes for dealing with tests of many coefficients in multiple-equation models are demonstrated in Special syntaxes after multiple-equation estimation, below.
Example

You have 1980 Census data on the 50 states recording the birth rate in each state (brate), the median age (medage) and its square (medagesq), and the region of the country in which each state is located. The variable reg1 is 1 if the state is located in the Northeast and zero otherwise, whereas reg2 marks the North Central, reg3 the South, and reg4 the West. You estimate the following regression:

. regress brate medage medagesq reg2-reg4

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  5,    44) =  100.63
       Model |  38803.419     5  7760.68381            Prob > F      =  0.0000
    Residual |  3393.40095    44  77.1227489           R-squared     =  0.9196
-------------+------------------------------           Adj R-squared =  0.9104
       Total |    42196.82    49  861.159592           Root MSE      =   8.782

------------------------------------------------------------------------------
       brate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      medage |  -109.0957   13.52452    -8.07   0.000    -136.3526   -81.83886
    medagesq |   1.635208   .2290536     7.14   0.000     1.173581    2.096835
        reg2 |   15.00284   4.252068     3.53   0.001     6.433365    23.57233
        reg3 |   7.366435   3.953336     1.86   0.069    -.6009897    15.33386
        reg4 |   21.39679   4.650602     4.60   0.000     12.02412    30.76946
       _cons |    1947.61   199.8405     9.75   0.000     1544.858    2350.362
------------------------------------------------------------------------------
test can now be used to perform a variety of statistical tests. We can test the hypothesis that the coefficient on reg3 is zero by typing

. test reg3=0
 ( 1)  reg3 = 0.0
       F(  1,    44) =    3.47
            Prob > F =    0.0691
The F statistic with 1 numerator and 44 denominator degrees of freedom is 3.47. The significance level of the test is 6.91%; we can reject the hypothesis at the 10% level but not at the 5% level. This test result, obtained by using test, is identical to a test result presented in the output from regress. That output indicates that the t statistic on the reg3 coefficient is 1.863 and that its significance level is 0.069. The t statistic presented in the output tests the hypothesis that the corresponding coefficient is zero, although it states the test in slightly different terms. The F distribution with 1 numerator degree of freedom is, however, identical to the square of the t distribution. We note that 1.863² ≈ 3.47. We also note that the significance levels associated with each test agree, although one extra digit is presented by the test command.
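These tail probabilities are ordinary F distribution upper-tail areas, so they can be reproduced outside Stata. The Python sketch below is an illustration, not Stata's code: it implements the F upper tail via the standard incomplete-beta continued fraction and checks both the F = t² identity and the 0.0691 significance level reported above.

```python
import math

def betacf(a, b, x, eps=3e-12, maxit=300):
    # Continued-fraction evaluation for the incomplete beta function
    # (the standard modified-Lentz recurrence).
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1e-30 if abs(d) < 1e-30 else d
    d = 1.0 / d
    h = d
    for m in range(1, maxit + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1e-30 if abs(d) < 1e-30 else d
        c = 1.0 + aa / c
        c = 1e-30 if abs(c) < 1e-30 else c
        d = 1.0 / d
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1e-30 if abs(d) < 1e-30 else d
        c = 1.0 + aa / c
        c = 1e-30 if abs(c) < 1e-30 else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def ibeta(a, b, x):
    # Regularized incomplete beta function I_x(a, b).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(ln_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * betacf(a, b, x) / a
    return 1.0 - front * betacf(b, a, 1.0 - x) / b

def f_tail(F, df1, df2):
    # Upper-tail probability of the F(df1, df2) distribution.
    return ibeta(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * F))

t = 1.863                 # t statistic on reg3 from the regression output
F = t * t                 # F(1, d) is the square of t(d)
print(round(F, 2))        # 3.47
print(f_tail(F, 1, 44))   # close to the 0.0691 reported by test
```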
Technical Note

After all estimation commands, even maximum likelihood, the result reported by test of whether a single variable is zero is identical to the result reported by the command's output. The tests are performed in the same way, using the estimated covariance matrix. If the estimation command reports significance levels and confidence intervals using z rather than t statistics, test will report results using the χ² rather than the F statistic.
Example

If that were all test could do, it would be useless. We can use test, however, to perform other tests. For instance, we can test the hypothesis that the coefficient on reg2 is 21 by typing

. test reg2=21
 ( 1)  reg2 = 21.0
       F(  1,    44) =    1.99
            Prob > F =    0.1654
We find that we cannot reject that hypothesis, or at least cannot reject it at any significance level below 16.5%.
Example

The previous test is useful, but we could almost as easily perform it by hand using the results presented in the regression output, if we were well read on our statistics. We could type

. display Ftail(1,44,((_coef[reg2]-21)/4.252068)^2)
.16544972
So now let's test something a bit more difficult. Let's test whether the coefficient on reg2 is the same as the coefficient on reg4:

. test reg2=reg4
 ( 1)  reg2 - reg4 = 0.0
       F(  1,    44) =    2.84
            Prob > F =    0.0989

We find that we cannot reject the equality hypothesis at the 5% level, but we can at the 10% level.
test — Test linear hypotheses after model estimation
Example

You may notice that when we tested the equality of the reg2 and reg4 coefficients, Stata rearranged our algebra. When Stata repeated the test, it indicated that we were testing whether reg2 minus reg4 is zero. The rearrangement is innocuous and, in fact, offers an advantage. Stata can perform much more complicated algebra; for instance,

. test 2*(reg2-3*(reg3-reg4))=reg3+reg2+6*(reg4-reg3)
 ( 1)  reg2 - reg3 = 0.0
       F(  1,    44) =    5.06
            Prob > F =    0.0295
Although we requested what appeared to be a lengthy hypothesis, once Stata simplified the algebra it realized that all we wanted to do was test whether the coefficient on reg2 is the same as the coefficient on reg3.
Technical Note

Stata's ability to simplify and test complex hypotheses is limited to linear hypotheses. If you attempt to test a nonlinear hypothesis, you will be told that it is not possible:

. test reg2/reg3=reg2+reg3
not possible with test
If you want to test a nonlinear hypothesis, see [R] testnl.
Example

The real power of test is demonstrated when we test joint hypotheses. Perhaps we wish to test whether the region variables, taken as a whole, are significant. This amounts to a test of whether the coefficients on reg2, reg3, and reg4 are simultaneously zero. To perform tests of this kind, specify each constraint and accumulate it with the previous constraints:

. test reg2=0
 ( 1)  reg2 = 0.0
       F(  1,    44) =   12.45
            Prob > F =    0.0010

. test reg3=0, accumulate
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
       F(  2,    44) =    6.42
            Prob > F =    0.0036

. test reg4=0, accumulate
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
We will show you a more convenient way to perform tests of this kind in the next example, but you should first understand this one. We tested the hypothesis that the coefficient on reg2 was zero by typing test reg2=0. We then tested whether the coefficient on reg3 was also zero by typing test reg3=0, accumulate. The accumulate option told Stata that this was not the start of a new test but a continuation of a previous one. Stata responded by showing us the two equations and reporting an F statistic of 6.42. The significance level associated with those two coefficients being zero is 0.36%.
When we added the last constraint, test reg4=0, accumulate, we discovered that the three region variables are quite significant.
Technical Note

If all we wanted was the overall significance and we did not want to bother seeing the interim results, we could have used the notest option:

. test reg2=0, notest
 ( 1)  reg2 = 0.0

. test reg3=0, accumulate notest
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0

. test reg4=0, accumulate
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
Example

Typing separate test commands for each constraint can be tiresome. The second syntax allows us to perform our last test more conveniently:

. test reg2 reg3 reg4
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
Example

We will now show you the use of testparm. In its second syntax, test accepts a list of variable names, but not a varlist.

. test reg2-reg4
- not found
r(111);
In a varlist, reg2-reg4 means variables reg2 and reg4 and all the variables in between. Yet we received an error. test is mightily confused because the - has two meanings: it means subtraction in an expression and "through" in a varlist. Similarly, '*' means "any set of characters" in a varlist and multiplication in an expression. testparm avoids this confusion; it allows only a varlist.

. testparm reg2-reg4
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
testparm has another advantage. We have five variables in our dataset that start with the characters reg: region, reg1, reg2, reg3, and reg4. reg* thus means those five variables:

. describe reg*
              storage   display     value
variable name   type    format      label      variable label
-------------------------------------------------------------------------------
region          int     %8.0g       region     Census Region
reg1            byte    %9.0g                  region==NE
reg2            byte    %9.0g                  region==N Cntrl
reg3            byte    %9.0g                  region==South
reg4            byte    %9.0g                  region==West
We cannot type test reg* because in an expression, '*' means multiplication, but here is what would happen if we attempted to test all the variables that begin with reg:

. test region reg1 reg2 reg3 reg4
region not found
r(111);
The variable region was not included in our model and so was not found. However, with testparm:

. testparm reg*
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
That is, testparm took reg* to mean all the variables that start with reg that were in our model.
Technical Note

Actually, reg* means what it always does: all variables in our dataset that begin with reg, in this case region, reg1, reg2, reg3, and reg4. testparm just ignores any variables you specify that are not in the model.
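testparm's two-step behavior (expand the wildcard against the dataset, then silently drop anything not in the model) is easy to mimic. A hypothetical Python sketch, with variable lists taken from the example above:

```python
from fnmatch import fnmatch

# Hypothetical variable lists mirroring the example above.
dataset_vars = ["region", "reg1", "reg2", "reg3", "reg4", "medage", "medagesq"]
model_vars = ["medage", "medagesq", "reg2", "reg3", "reg4"]  # regressors in the fit

# reg* expands against the dataset, as in a Stata varlist...
expanded = [v for v in dataset_vars if fnmatch(v, "reg*")]
# ...and testparm then ignores anything that is not in the model.
tested = [v for v in expanded if v in model_vars]

print(expanded)  # ['region', 'reg1', 'reg2', 'reg3', 'reg4']
print(tested)    # ['reg2', 'reg3', 'reg4']
```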
Example

We just used test (testparm, actually, but it does not matter) to test the hypothesis that reg2, reg3, and reg4 are jointly zero. We can review the results of our last test by typing test without arguments:

. test
 ( 1)  reg2 = 0.0
 ( 2)  reg3 = 0.0
 ( 3)  reg4 = 0.0
       F(  3,    44) =    8.85
            Prob > F =    0.0001
Technical Note

test does not care how you build joint hypotheses; you may freely mix syntax 1 and syntax 2. (You can even start with testparm, but you cannot use it thereafter because it does not have an accumulate option.)

Say we type test reg2 reg3 reg4 to test that the coefficients on our region dummies are jointly zero. We could then add a fourth constraint, say that medage = 100, by typing test medage=100, accumulate. Or, if we had introduced the medage constraint first (our first test command had been test medage=100), we could then add the region dummy test by typing test reg2 reg3 reg4, accumulate.

Remember that all previous tests are cleared when you do not specify the accumulate option. No matter what tests we performed in the past, if we type test medage medagesq, omitting the accumulate option, we would test that medage and medagesq are jointly zero.
Example

Let's test the hypothesis that all the included regions have the same coefficient; that is, that it is the Northeast which is significantly different from the rest of the nation:

. test reg2=reg4
 ( 1)  reg2 - reg4 = 0.0
       F(  1,    44) =    2.84
            Prob > F =    0.0989

. test reg3=reg4, accumulate
 ( 1)  reg2 - reg4 = 0.0
 ( 2)  reg3 - reg4 = 0.0
       F(  2,    44) =    8.23
            Prob > F =    0.0009

We find that they are not all the same. We performed this test by imposing two constraints: region 2 has the same coefficient as region 4, and region 3 has the same coefficient as region 4. Alternatively, we could have tested that the coefficients on regions 2 and 3 are the same and that the coefficients on regions 3 and 4 are the same. We would obtain the same results in either case.
Example

This test is even easier with the testparm command. When you include the equal option, testparm tests that all the variables specified are equal:

. testparm reg*, equal
 ( 1)  - reg2 + reg3 = 0.0
 ( 2)  - reg2 + reg4 = 0.0
       F(  2,    44) =    8.23
            Prob > F =    0.0009
Technical Note

If you specify a set of inconsistent constraints, test will tell you by dropping the constraint or constraints that led to the inconsistency. For instance, let's test that the coefficients on region 2 and region 4 are the same, add the test that the coefficient on region 2 is 20, and finally add the test that the coefficient on region 4 is 21:

. test reg2=reg4
 ( 1)  reg2 - reg4 = 0.0
       F(  1,    44) =    2.84
            Prob > F =    0.0989

. test reg2=20, accumulate
 ( 1)  reg2 - reg4 = 0.0
 ( 2)  reg2 = 20.0
       F(  2,    44) =    1.63
            Prob > F =    0.2076

. test reg4=21, accumulate
 ( 1)  reg2 - reg4 = 0.0
 ( 2)  reg2 = 20.0
 ( 3)  reg4 = 21.0
       Constraint 2 dropped
       F(  2,    44) =    1.82
            Prob > F =    0.1737

Note that when we typed test reg4=21, accumulate, test informed us that it was dropping constraint 2. All three equations cannot be simultaneously true, so test drops whatever it takes to get back to something that makes sense.
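The dropping rule can be understood as a rank condition: the constraints Rb = r are inconsistent when the augmented matrix [R r] has higher rank than R alone. A pure-Python sketch (not Stata's code; coefficients ordered as reg2, reg4, and the pivot tolerance is an arbitrary choice):

```python
def rank(M, tol=1e-9):
    # Matrix rank via Gaussian elimination with partial pivoting.
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        piv = max(range(r, rows), key=lambda i: abs(M[i][c]))
        if abs(M[piv][c]) < tol:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(rows):
            if i != r:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r

# Constraints on (reg2, reg4): reg2 - reg4 = 0, reg2 = 20, reg4 = 21.
R = [[1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
r_vec = [0.0, 20.0, 21.0]
aug = [row + [ri] for row, ri in zip(R, r_vec)]

print(rank(R))    # 2
print(rank(aug))  # 3 > 2: the system is inconsistent, so a constraint must go

# Dropping constraint 2 (reg2 = 20) restores consistency:
R2, aug2 = [R[0], R[2]], [aug[0], aug[2]]
print(rank(R2), rank(aug2))  # 2 2
```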
Special syntaxes after multiple-equation estimation

Everything said above about tests after single-equation estimation applies to tests after multiple-equation estimation as long as you remember to specify the equation name. To demonstrate, let's estimate a seemingly unrelated regression using sureg; see [R] sureg.
. sureg (price foreign mpg displ) (weight foreign length)

Seemingly unrelated regression
----------------------------------------------------------------------
Equation             Obs  Parms        RMSE    "R-sq"      chi2       P
----------------------------------------------------------------------
price                 74      3    2165.321    0.4537   49.6383  0.0000
weight                74      2    245.2916    0.8990  661.8418  0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |    3058.25   685.7357     4.46   0.000     1714.233    4402.267
         mpg |  -104.9591   58.47209    -1.80   0.073    -219.5623    9.644042
displacement |   18.18098   4.286372     4.24   0.000     9.779842    26.58211
       _cons |   3904.336   1966.521     1.99   0.047      50.0263    7758.645
-------------+----------------------------------------------------------------
weight       |
     foreign |  -147.3481   75.44314    -1.95   0.051    -295.2139     .517755
      length |   30.94905   1.539895    20.10   0.000     27.93091    33.96718
       _cons |  -2763.064   303.9336    -9.09   0.000    -3348.763   -2157.365
------------------------------------------------------------------------------
If we wanted to test the significance of foreign in the price equation, we could type

. test [price]foreign
 ( 1)  [price]foreign = 0.0
           chi2(  1) =   19.89
         Prob > chi2 =    0.0000

which is the same result reported by sureg: 4.4602² = 19.89. If we wanted to test foreign in both equations, we could type

. test [price]foreign [weight]foreign
 ( 1)  [price]foreign = 0.0
 ( 2)  [weight]foreign = 0.0
           chi2(  2) =   31.61
         Prob > chi2 =    0.0000

or

. test foreign
 ( 1)  [price]foreign = 0.0
 ( 2)  [weight]foreign = 0.0
           chi2(  2) =   31.61
         Prob > chi2 =    0.0000

This last syntax, typing the variable name by itself, tests the coefficients in all equations in which they appear. The variable length appears in only the weight equation, so typing

. test length
 ( 1)  [weight]length = 0.0
           chi2(  1) =  403.94
         Prob > chi2 =    0.0000

yields the same result as typing test [weight]length. We may also specify a linear expression rather than a list of coefficients:
. test mpg=displ
 ( 1)  [price]mpg - [price]displ = 0.0
           chi2(  1) =    4.85
         Prob > chi2 =    0.0277

or

. test [price]mpg = [price]displ
 ( 1)  [price]mpg - [price]displ = 0.0
           chi2(  1) =    4.85
         Prob > chi2 =    0.0277

A variation on this syntax can be used to test cross-equation constraints:

. test [price]foreign = [weight]foreign
 ( 1)  [price]foreign - [weight]foreign = 0.0
           chi2(  1) =   23.07
         Prob > chi2 =    0.0000
Typing an equation name in square brackets by itself tests all the coefficients except the intercept in that equation:

. test [price]
 ( 1)  [price]foreign = 0.0
 ( 2)  [price]mpg = 0.0
 ( 3)  [price]displacement = 0.0
           chi2(  3) =   49.64
         Prob > chi2 =    0.0000
Typing an equation name in square brackets, a colon, and a list of variable names tests those variables in the specified equation:

. test [price]: foreign displ
 ( 1)  [price]foreign = 0.0
 ( 2)  [price]displacement = 0.0
           chi2(  2) =   25.19
         Prob > chi2 =    0.0000
test [eqname1=eqname2] tests that all the coefficients in the two equations are equal. We cannot use that syntax here because we have different variables in our model:

. test [price=weight]
[weight]mpg not found
r(111);
This syntax is, however, useful after mvreg or mlogit, where all the equations do share the same independent variables. We can, however, use a modification of this syntax with our model if we also type a colon and the names of the variables we want to test:

. test [price=weight]: foreign
 ( 1)  [price]foreign - [weight]foreign = 0.0
           chi2(  1) =   23.07
         Prob > chi2 =    0.0000
We have only one variable in common between our two equations, but if there had been more, we could have listed them.

Finally, you may use the accum and notest options just as you do after single-equation estimation. Earlier, we tested the joint significance of foreign by typing test foreign. We could also have typed
. test [price]foreign, notest
 ( 1)  [price]foreign = 0.0

. test [weight]foreign, accum
 ( 1)  [price]foreign = 0.0
 ( 2)  [weight]foreign = 0.0
           chi2(  2) =   31.61
         Prob > chi2 =    0.0000
Saved Results

test and testparm save in r():

Scalars
    r(p)       two-sided p-value
    r(F)       F statistic
    r(df)      test constraints degrees of freedom
    r(df_r)    residual degrees of freedom
    r(chi2)    chi-squared
    r(ss)      model sum of squares
    r(rss)     residual sum of squares

r(ss) and r(rss) are defined only when test is used for testing effects after anova.
Methods and Formulas

testparm is implemented as an ado-file.

test and testparm perform Wald tests. Let the estimated coefficient vector be b and the estimated variance-covariance matrix be V. Let Rb = r denote the set of q linear hypotheses to be tested jointly. The Wald test statistic is (Judge et al. 1985, 20-28)

    W = (Rb - r)' (RVR')⁻¹ (Rb - r)

If the estimation command reports its significance levels using Z statistics, a chi-squared distribution with q degrees of freedom is used for computation of the significance level of the hypothesis test.

If the estimation command reports its significance levels using t statistics with d degrees of freedom, an F statistic

    F = (1/q) W

is computed, and an F distribution with q numerator degrees of freedom and d denominator degrees of freedom is used to compute the significance level of the hypothesis test.
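The quadratic form is straightforward to spell out. The pure-Python sketch below is only an illustration of the formula; the coefficient vector and covariance matrix are made-up numbers, not estimates from any of the manual's examples.

```python
def solve(A, y):
    # Solve A z = y by Gauss-Jordan elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(n):
            if i != col:
                f = M[i][col] / M[col][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def wald(R, r, b, V):
    # W = (Rb - r)' (R V R')^{-1} (Rb - r)
    q, k = len(R), len(b)
    d = [sum(R[i][j] * b[j] for j in range(k)) - r[i] for i in range(q)]
    RV = [[sum(R[i][m] * V[m][j] for m in range(k)) for j in range(k)]
          for i in range(q)]
    RVRt = [[sum(RV[i][m] * R[j][m] for m in range(k)) for j in range(q)]
            for i in range(q)]
    z = solve(RVRt, d)            # z = (RVR')^{-1} (Rb - r)
    return sum(di * zi for di, zi in zip(d, z))

# Made-up two-coefficient "fit"; b and V are illustrative only.
b = [2.0, 3.0]
V = [[4.0, 0.0], [0.0, 9.0]]

# One constraint, b1 = 0: W = 2^2 / 4 = 1 (and F = W/q = 1).
W1 = wald([[1.0, 0.0]], [0.0], b, V)
# Two constraints, b1 = 0 and b2 = 0: W = 4/4 + 9/9 = 2 (F = W/2 = 1).
W2 = wald([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], b, V)
print(W1, W2)  # 1.0 2.0
```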
References

Beale, E. M. L. 1960. Confidence regions in nonlinear estimation. Journal of the Royal Statistical Society, Series B 22: 41-88.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
testnl — Test nonlinear hypotheses after model estimation
An implication of this model is that peak earnings occur at age -_b[age]/(2*_b[age2]), which here is equal to 33.1. Pretend we have a theory that peak earnings should occur at age 16 + 1/_b[grade] (we do not, but pretend we do).

. testnl -_b[age]/(2*_b[age2]) = 16 + 1/_b[grade]
 (1)  -_b[age]/(2*_b[age2]) = 16 + 1/_b[grade]
          chi2(1) =    0.62
      Prob > chi2 =    0.4305

These data do not reject our theory.
Using testnl to perform linear tests

testnl may be used to test linear constraints, but test is faster; see [R] test. You could type

. testnl _b[x4] = _b[z1]

but it would take less computer time if you typed

. test _b[x4] = _b[z1]
Specifying constraints

The algebraic form in which you specify the constraint does not matter; you could type

. testnl _b[mpg]*_b[weight] = 1

or

. testnl _b[mpg] = 1/_b[weight]

or you could express the constraint any other way you wished. You must, however, exercise one caution: users of test often refer to the coefficient on a variable by specifying the variable name. For example:

. test mpg = 0

More formally, they should type

. test _b[mpg] = 0

but test allows the _b[] surrounding the variable name to be omitted. testnl does not allow this shorthand. Typing

. testnl mpg=0

specifies the constraint that the value of variable mpg in the first observation is zero. If you make this mistake, in some cases testnl will catch it:

. testnl mpg=0
equation (1) contains reference to X rather than _b[X]
r(198);

In other cases testnl may not catch the mistake; in that case, the constraint will be dropped because it does not make sense:

. testnl mpg=0
Constraint (1) dropped
(There are reasons other than this for constraints being dropped.) The worst case, however, is

. testnl _b[weight]*mpg = 1

when what you mean is not that _b[weight] equals the reciprocal of the value of mpg in the first observation, but rather

. testnl _b[weight]*_b[mpg] = 1

Sometimes this mistake will be caught by the "contains reference to X rather than _b[X]" error and sometimes not. Be careful.

testnl, like test, can be used after any Stata estimation command. When used after a multiple-equation command such as mlogit or heckman, you refer to coefficients using Stata's standard syntax: [eqname]_b[varname].

Stata's single-equation estimation output looks like

    --------------------------
            |     Coef.   ...
    --------+-----------------
     weight |     12.27   ...     <- coefficient is _b[weight]
        mpg |      3.21   ...
    --------------------------

Stata's multiple-equation output looks like

    --------------------------
            |     Coef.   ...
    --------+-----------------
    cat1    |
     weight |     12.27   ...     <- coefficient is [cat1]_b[weight]
        mpg |      3.21   ...
    --------+-----------------
    8       |
     weight |      5.83   ...     <- coefficient is [8]_b[weight]
        mpg |      7.43   ...
    --------------------------
You must take care when specifying constraints that contain parentheses. If you specify a constraint that begins with an open parenthesis, then the entire constraint must also be enclosed in parentheses. For example,

. testnl (39*_b[mpg]^2) = _b[foreign]

will cause Stata to generate an error message indicating invalid syntax, whereas

. testnl ((39*_b[mpg]^2) = _b[foreign])

will execute without error. Even though the two expressions are mathematically equivalent, testnl cannot parse the first expression correctly because the leading parenthesis appears to be indicating that the second syntax is being used, so it searches for the pattern (... = ...), which it cannot resolve. Therefore an error message is issued.
Dropped constraints

testnl automatically drops constraints when

1. They are nonbinding, e.g., _b[mpg]=_b[mpg]. More subtle cases include

     _b[mpg]*_b[weight] = 4
     _b[weight] = 2
     _b[mpg] = 2

   In this example, the 3rd constraint is nonbinding since it is implied by the first two.
2. They are contradictory, e.g., _b[mpg]=2 and _b[mpg]=3. More subtle cases include

     _b[mpg]*_b[weight] = 4
     _b[weight] = 2
     _b[mpg] = 3

   The 3rd constraint contradicts the first two.
Output

testnl reports the constraints being tested followed by an F or chi-squared test:

. regress price mpg weight weightsq foreign
 (output omitted)
. testnl (39*_b[mpg]^2 = _b[foreign]) (_b[mpg]/_b[weight] = 4)
 (1)  39*_b[mpg]^2 = _b[foreign]
 (2)  _b[mpg]/_b[weight] = 4
       F(2, 69) =    0.08
       Prob > F =    0.9195

. logit foreign price weight mpg
 (output omitted)
. testnl (45*_b[mpg]^2 = _b[price]) (_b[mpg]/_b[weight] = 4)
 (1)  45*_b[mpg]^2 = _b[price]
 (2)  _b[mpg]/_b[weight] = 4
        chi2(2) =    2.45
    Prob > chi2 =    0.2945
Saved Results

testnl saves in r():

Scalars
    r(df)      degrees of freedom
    r(df_r)    residual degrees of freedom
    r(chi2)    chi-squared
    r(F)       F statistic
Methods and Formulas

testnl is implemented as an ado-file.

You have estimated a model. Define b as the resulting 1 x k parameter vector and V as the k x k covariance matrix. The (linear or nonlinear) hypothesis is given by R(b) = q, where R is a function returning a j x 1 vector. The Wald test formula is (Greene 2000, 154)

    W = (R(b) - q)' (GVG')⁻¹ (R(b) - q)

where G is the derivative matrix of R(b) with respect to b. W is distributed as chi-squared if V is an asymptotic covariance matrix. F = W/j is distributed as F in the case of linear regression.
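The delta-method recipe (evaluate R(b), form the derivative matrix G, and plug into the quadratic form) can be sketched in a few lines of Python. Everything numeric below is invented for illustration; the restriction merely mimics the _b[mpg]*_b[weight] = 4 example above, and this is not Stata's implementation.

```python
# Wald test of one nonlinear restriction R(b) = q via the delta method.
# All numbers are invented for illustration.

def restriction(b):
    # R(b) - q for the restriction b1 * b2 = 4  (cf. _b[mpg]*_b[weight] = 4)
    return b[0] * b[1] - 4.0

def gradient(fun, b, h=1e-6):
    # Numerical derivative matrix G (here a 1 x k row vector),
    # by central differences.
    g = []
    for i in range(len(b)):
        bp, bm = b[:], b[:]
        bp[i] += h
        bm[i] -= h
        g.append((fun(bp) - fun(bm)) / (2.0 * h))
    return g

b = [2.0, 3.0]                      # invented point estimates
V = [[0.04, 0.0], [0.0, 0.09]]      # invented covariance matrix

Rb = restriction(b)                 # 2*3 - 4 = 2
G = gradient(restriction, b)        # analytically [b2, b1] = [3, 2]
GVG = sum(G[i] * V[i][j] * G[j] for i in range(2) for j in range(2))
W = Rb * Rb / GVG                   # scalar case of (R(b)-q)'(GVG')^{-1}(R(b)-q)
print(round(W, 6))                  # 4 / 0.72 = 5.555556
```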
References

Gould, W. W. 1996. crc43: Wald test of nonlinear hypotheses after model estimation. Stata Technical Bulletin 29: 2-4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 15-18.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Also See

Related:      [R] lincom, [R] lrtest, [R] test

Background:   [U] 23 Estimation and post-estimation commands
Title

tobit — Tobit, censored-normal, and interval regression
Syntax

tobit depvar [indepvars] [weight] [if exp] [in range], ll[(#)] ul[(#)] [ level(#) offset(varname) maximize_options ]

cnreg depvar [indepvars] [weight] [if exp] [in range], censored(varname) [ level(#) offset(varname) maximize_options ]

intreg depvar1 depvar2 [indepvars] [weight] [if exp] [in range] [, noconstant robust cluster(varname) score(newvar1 newvar2) constraints(numlist) level(#) offset(varname) nolog maximize_options ]

by ...: may be used with tobit, cnreg, and intreg; see [R] by.

aweights and fweights are allowed by all three commands; intreg also allows pweights and iweights; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

tobit and cnreg may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    xb           x_j b, fitted values (the default)
    pr(a,b)      Pr(a < y_j < b)
    e(a,b)       E(y_j | a < y_j < b)
    ystar(a,b)   E(y*_j), y*_j = max(a, min(y_j, b))
    stdp         standard error of the prediction
    stdf         standard error of the forecast

where a and b may be numbers or variables; a equal to '.' means minus infinity; b equal to '.' means plus infinity.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

tobit estimates a model of depvar on indepvars where the censoring values are fixed.

cnreg estimates a model of depvar on indepvars where depvar contains both observations and censored observations on the process. Censoring values may vary from observation to observation.

intreg estimates a model of y = [depvar1, depvar2] on indepvars, where y for each observation is either point data, interval data, left-censored data, or right-censored data. depvar1 and depvar2 should have the following form:

    type of data                           depvar1    depvar2
    point data             a = [a,a]          a          a
    interval data              [a,b]          a          b
    left-censored data      (-inf,b]          .          b
    right-censored data     [a,+inf)          a          .
Options

Options unique to tobit

ll(#) and ul(#) indicate the censoring points in tobit. You may specify one or both. ll() indicates the lower limit for left censoring. Observations with depvar <= ll() are left-censored; observations with depvar >= ul() are right-censored; remaining observations are not censored.

    You do not have to specify the censoring values at all. It is enough to type ll, ul, or both. When you do not specify a censoring value, tobit assumes that the lower limit is the minimum observed in the data (if ll is specified) and the upper limit is the maximum (if ul is specified).
Option unique to cnreg

censored(varname) is not optional for cnreg. varname is a variable indicating if depvar is censored and, if so, whether the censoring is left or right. 0 indicates that depvar is not censored. -1 indicates left censoring; the true value is known only to be less than or equal to the value recorded in depvar. +1 indicates right censoring; the true value is known only to be greater than or equal to the value recorded in depvar.
Options unique to intreg

noconstant suppresses the constant term (intercept) in the estimation.

robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters).

    If you specify pweights, robust is implied; see [U] 23.11 Obtaining robust variance estimates.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily independent within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data.

    cluster() implies robust; that is, specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvar1 newvar2) creates two new variables containing, respectively, u1j = dlnL_j/d(x_j b) and u2j = dlnL_j/d(sigma) for each observation j in the sample, where L_j is the jth observation's contribution to the likelihood. The jth observation's contribution to the score vector is thus [dlnL_j/db  dlnL_j/d(sigma)] = [u1j x_j  u2j]. The score vector can be obtained by summing over j. See [U] 23.12 Obtaining scores.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

nolog suppresses the iteration log.
Common options

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them. Unlike most maximum likelihood commands, however, cnreg and tobit default to nolog; they suppress the iteration log (see Methods and Formulas below). log will display the iteration log.
Options for predict

xb, the default, calculates the linear prediction.

pr(a,b) calculates Pr(a < x_j b + u_j < b), the probability that y_j|x_j would be observed in the interval (a,b).

    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_j < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_j < ub);
    and pr(20,ub) calculates Pr(20 < x_j b + u_j < ub).

    a = . means minus infinity; pr(.,30) calculates Pr(x_j b + u_j < 30);
    pr(lb,30) calculates Pr(x_j b + u_j < 30) in observations for which lb = . (and calculates Pr(lb < x_j b + u_j < 30) elsewhere).

    b = . means plus infinity; pr(20,.) calculates Pr(x_j b + u_j > 20);
    pr(20,ub) calculates Pr(x_j b + u_j > 20) in observations for which ub = . (and calculates Pr(20 < x_j b + u_j < ub) elsewhere).

e(a,b) calculates E(x_j b + u_j | a < x_j b + u_j < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a,b), which is to say, y_j|x_j is censored. a and b are specified as they are for pr().
ystar(a,b) calculates E(y*_j), where y*_j = a if x_j b + u_j < a, y*_j = b if x_j b + u_j > b, and y*_j = x_j b + u_j otherwise, which is to say, y*_j is truncated. a and b are specified as they are for pr().

stdp calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

stdf calculates the standard error of the forecast. This is the standard error of the point prediction for a single observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see [R] regress, Methods and Formulas.

nooffset is relevant only if you specified offset(varname). It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
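Under the model y_j = x_j b + u_j with normal u_j, the pr() and ystar() calculations reduce to normal-distribution arithmetic. A rough Python sketch using the standard library (the xb and sigma inputs are invented; None stands in for a missing a or b, like '.' above). It is an illustration of the formulas, not Stata's code:

```python
from statistics import NormalDist

def pr(xb, sigma, a=None, b=None):
    # Pr(a < xb + u < b) with u ~ N(0, sigma^2); None means -inf or +inf,
    # like '.' in predict's pr(a,b).
    nd = NormalDist(mu=xb, sigma=sigma)
    lo = 0.0 if a is None else nd.cdf(a)
    hi = 1.0 if b is None else nd.cdf(b)
    return hi - lo

def ystar(xb, sigma, a=None, b=None):
    # E(y*), y* = max(a, min(xb + u, b)): the expectation censored at a and b.
    z = NormalDist()
    alpha = None if a is None else (a - xb) / sigma
    beta = None if b is None else (b - xb) / sigma
    Phi_a = 0.0 if alpha is None else z.cdf(alpha)
    Phi_b = 1.0 if beta is None else z.cdf(beta)
    phi_a = 0.0 if alpha is None else z.pdf(alpha)
    phi_b = 0.0 if beta is None else z.pdf(beta)
    # middle (uncensored) part plus the two censored mass points
    out = xb * (Phi_b - Phi_a) - sigma * (phi_b - phi_a)
    if a is not None:
        out += a * Phi_a
    if b is not None:
        out += b * (1.0 - Phi_b)
    return out

print(round(pr(0.0, 1.0, None, 0.0), 3))     # 0.5
print(round(pr(0.0, 1.0, -1.0, 1.0), 4))     # 0.6827
print(round(ystar(0.0, 1.0, 0.0, None), 4))  # E[max(0, Z)] = 0.3989
```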
Remarks

Remarks are presented under the headings

    tobit
    cnreg
    intreg
tobit

Tobit estimation was originally developed by Tobin (1958). A consumer durable was purchased if a consumer's desire was high enough, where desire was measured by the dollar amount spent by the purchaser. If no purchase was made, the measure of desire was censored at zero.
Example
We will demonstrate tobit using a more artificial example which, in the process, will allow us to emphasize the assumptions underlying the estimation. We have a dataset containing the mileage ratings and weights of 74 cars. There are no censored variables in this dataset, but we are going to create one. Before that, however, the relationship between mileage and weight in our complete data is

. gen wgt = weight/1000

. regress mpg wgt

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =  134.62
       Model |  1591.99024     1  1591.99024           Prob > F      =  0.0000
    Residual |  851.469221    72  11.8259614           R-squared     =  0.6515
-------------+------------------------------           Adj R-squared =  0.6467
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4389

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         wgt |  -6.008687   .5178782   -11.60   0.000    -7.041058   -4.976316
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------
(We divided weight by 1,000 simply to make discussing the resulting coefficients easier. We find that each additional 1,000 pounds of weight reduces mileage by 6 mpg.)

mpg in our data ranges from 12 to 41. Let us now pretend that our data were censored in the sense that we could not observe a mileage rating below 17 mpg. If the true mpg is 17 or less, all we know is that the mpg is less than or equal to 17:

. replace mpg=17 if mpg<=17
(14 real changes made)

. tobit mpg wgt, ll

Tobit estimates                                   Number of obs   =         74
                                                  LR chi2(1)      =      72.85
                                                  Prob > chi2     =     0.0000
Log likelihood = -164.25438                       Pseudo R2       =     0.1815

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         wgt |   -6.87305   .7002559    -9.82   0.000    -8.268658   -5.477442
       _cons |   41.49856    2.05838    20.16   0.000     37.39621     45.6009
-------------+----------------------------------------------------------------
         _se |   3.845701   .3663309          (Ancillary parameter)
------------------------------------------------------------------------------

Obs. summary:        18  left-censored observations at mpg<=17
                     56     uncensored observations
The replace before estimation was not really necessary—we remapped all the mileage ratings below 17 to 17 merely to reassure you that tobit was not somehow using uncensored data. We typed ll after tobit to inform tobit that the data were left-censored. tobit found the minimum of mpg in our data and assumed that was the censoring point. Alternatively, we could have dispensed with replace and typed ll(17), informing tobit that all values of the dependent variable 17 and below are really censored at 17. In either case, at the bottom of the table, we are informed that there are, as a result, 18 left-censored observations. On these data, our estimate is now a reduction of 6.9 mpg per 1,000 extra pounds of weight as opposed to 6.0. The parameter reported as _se is the estimated standard error of the regression; the resulting 3.8 is comparable to the estimated root mean square error reported by regress of 3.4.
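To make the equivalence just described concrete, the two approaches can be sketched as (assuming the original, uncensored mpg is in memory):

. replace mpg=17 if mpg<=17
. tobit mpg wgt, ll

produces the same results as

. tobit mpg wgt, ll(17)

with the second form leaving the recorded values of mpg untouched.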
Technical Note
It should be understood that one would never want to throw away information—which is to say, purposefully censor variables. The regress estimates are in every way preferable to those of tobit. Our example is solely designed to illustrate the relationship between tobit and regress. If you have uncensored data, use regress. If your data are censored, you have no choice but to use tobit.
> Example
tobit can also estimate models that are censored from above. This time, let's assume that we do not observe the actual mileage rating of cars yielding 24 mpg or better—we know only that it is at least 24. (Also assume that we have undone the change to mpg we made in the previous example.)
. tobit mpg wgt, ul(24)

Tobit estimates                                   Number of obs   =         74
                                                  LR chi2(1)      =      90.72
                                                  Prob > chi2     =     0.0000
Log likelihood = -129.8279                        Pseudo R2       =     0.2589

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         wgt |  -5.080645     .43493   -11.68   0.000    -5.947459   -4.213831
       _cons |   36.08037   1.432056    25.19   0.000     33.22628    38.93445
-------------+----------------------------------------------------------------
         _se |   2.385357   .2444604          (Ancillary parameter)
------------------------------------------------------------------------------

Obs. summary:        51     uncensored observations
                     23 right-censored observations at mpg>=24
> Example
tobit can also estimate models which are censored from both sides, the so-called two-limit tobit:

. tobit mpg wgt, ll(17) ul(24)

Tobit estimates                                   Number of obs   =         74
                                                  LR chi2(1)      =      77.60
                                                  Prob > chi2     =     0.0000
Log likelihood = -104.25976                       Pseudo R2       =     0.2712

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         wgt |  -5.764448   .7245417    -7.96   0.000    -7.208457   -4.320438
       _cons |   38.07469   2.255917    16.88   0.000     33.57865    42.57072
-------------+----------------------------------------------------------------
         _se |   2.886337   .3952143          (Ancillary parameter)
------------------------------------------------------------------------------

Obs. summary:        18  left-censored observations at mpg<=17
                     33     uncensored observations
                     23 right-censored observations at mpg>=24
cnreg
cnreg is a generalization of tobit. Rather than there being a constant lower- and/or upper-censoring point, each observation is allowed to be censored at a different point. With cnreg, you specify censored(varname) rather than the ll(#) and/or ul(#) options. varname records a 0 if the observation is not censored, a -1 if it is left-censored, and a +1 if it is right-censored.
> Example
You are estimating an earnings model. The data are censored from above in that, if the respondent's income was $60,000 or above, the interviewer recorded simply "in excess of $60,000". You might estimate the model using tobit, specifying ul(60000) (or, if you are working in natural logs, ul(11)). Later, you get another dataset to merge with your current data. In the second survey, earnings were again censored, but this time at $100,000. You can no longer use tobit since some observations are censored at $60,000 and others at $100,000. You could, however, use
. gen cens=0
. replace cens=1 if survey==1 & earnings==60000
. replace cens=1 if survey==2 & earnings==100000
. cnreg earnings etc., censored(cens)
> Example
Censored-normal regression has been used to estimate duration-time models, but such use has now declined in favor of models with more "reasonable" assumptions (e.g., exponential or Weibull; see [R] st streg) or no assumption except proportionality among groups (see [R] st stcox). There are, however, occasions when a normally distributed time variable may be reasonable.

You have (fictional) data on the date various newspapers adopted VDT (video display terminal) editing, stored in date as days since 1/1/60. You also know the newspaper's circulation (lncltn) and whether the newspaper is a subsidiary or family-owned (famown). Your data begin in 1982 and end in 1991. You have left censoring if the paper adopted VDT editing before 1982. You have right censoring if the paper had not adopted VDT editing by 1991. In your data, date is recorded as missing if the paper is still not using VDTs or if it began before 1982. The variable before82 is 1 if the paper started using VDTs before 1982:

. gen cnsrd=0
. replace cnsrd=-1 if before82
(24 real changes made)
. replace date=mdy(1,1,1982) if before82
(24 real changes made)
. replace cnsrd=1 if date==.
(11 real changes made)
. replace date=mdy(1,1,1991) if date==.
(11 real changes made)

. cnreg date lncltn famown, censored(cnsrd)

Censored normal regression                        Number of obs   =        100
                                                  LR chi2(2)      =     201.09
                                                  Prob > chi2     =     0.0000
Log likelihood = -519.74678                       Pseudo R2       =     0.1621

------------------------------------------------------------------------------
        date |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lncltn |  -1377.138   76.92723   -17.90   0.000    -1529.797   -1224.478
      famown |   576.8444   185.5287     3.11   0.002     208.6687    945.0201
       _cons |   24439.46   815.7107    29.96   0.000     22820.71    26058.21
-------------+----------------------------------------------------------------
         _se |   607.9846   53.19381          (Ancillary parameter)
------------------------------------------------------------------------------

Obs. summary:        24  left-censored observations
                     65     uncensored observations
                     11 right-censored observations

You estimate that family-owned newspapers adopted VDT editing 577 days later than nonfamily-owned newspapers (but remember, the data are fictional).
Technical Note
Censored-normal regression and tobit models achieve their results by assuming that the distribution of the error term is normal. This is a rather standard assumption but, in the case of these two models, it is vital. Results from censored-normal regression and tobit are not robust to other distributions of the errors; see Goldberger (1983). Moreover, heteroskedastic error terms can wreak havoc on such models; see Hurd (1979).
intreg
intreg is an obvious generalization of the models estimated by cnreg and tobit. For censored data, their likelihoods contain terms of the form Pr(Y_j <= y_j) for left-censored data and Pr(Y_j >= y_j) for right-censored data, where y_j is the observed censoring value and Y_j denotes the random variable representing the dependent variable in the model. This can be extended to include interval data. If we know that the value for the jth individual is somewhere in the interval [y_1j, y_2j], then the likelihood contribution from this individual is simply Pr(y_1j <= Y_j <= y_2j). Hence, intreg can estimate models for data where each observation represents either interval data, left-censored data, right-censored data, or point data.

Regardless of the type of observation, the data should be stored in the dataset as interval data; that is, two dependent variables, depvar1 and depvar2, are used to hold the endpoints of the interval. If the data are left-censored, the lower endpoint is -infinity and is represented by a missing value '.' in depvar1. If the data are right-censored, the upper endpoint is +infinity and is represented by a missing value '.' in depvar2. Point data are represented by the two endpoints being equal.

    type of data                          depvar1   depvar2
    point data            a = [a,a]          a         a
    interval data         [a,b]              a         b
    left-censored data    (-inf,b]           .         b
    right-censored data   [a,+inf)           a         .

Note: Truly missing values of the dependent variable must be represented by missing values in both depvar1 and depvar2.

Interval data arise naturally in many contexts. One common one is wage data. Often you only know that, for example, a person's salary is between $30,000 and $40,000. Below we give an example for wage data, and show how to set up depvar1 and depvar2.
> Example
We have a dataset that contains the yearly wages of working women. Women were asked via a questionnaire to indicate a category for their yearly income from employment. The categories were less than 5,000, 5,001-10,000, ..., 25,001-30,000, 30,001-40,000, 40,001-50,000, and above 50,000. The wage categories are stored in the variable wagecat.
. tab wagecat

        Wage |
    category |
     ($1000s)|      Freq.     Percent        Cum.
-------------+-----------------------------------
           5 |         14        2.87        2.87
          10 |         83       17.01       19.88
          15 |        158       32.38       52.25
          20 |        107       21.93       74.18
          25 |         57       11.68       85.86
          30 |         30        6.15       92.01
          40 |         19        3.89       95.90
          50 |         14        2.87       98.77
          51 |          6        1.23      100.00
-------------+-----------------------------------
       Total |        488      100.00

A value of 5 for wagecat represents the category less than 5,000, a value of 10 represents 5,001-10,000, ..., and a value of 51 represents greater than 50,000.

To use intreg, we must create two variables, wage1 and wage2, containing the lower and upper endpoints of the wage categories. Here's one way to do it. We first create a little dataset containing just the nine wage categories, then lag the wage categories into wage1, and finally match-merge this dataset with nine observations back into the main one.

. sort wagecat
. save womenwage
file womenwage.dta saved
. by wagecat: keep if _n==1
(479 observations deleted)
. gen wage1 = wagecat[_n-1]
(1 missing value generated)
. keep wagecat wage1
. save lagwage
file lagwage.dta saved
. use womenwage, clear
(Wages of women)
. merge wagecat using lagwage

. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |        488      100.00      100.00
------------+-----------------------------------
      Total |        488      100.00

. drop _merge

The variable _merge created by merge indicates that all the observations in the merge were matched (you should always check this; see [R] merge for more information). Now we create the upper endpoint:

. gen wage2 = wagecat
. replace wage2 = . if wagecat == 51
(6 real changes made, 6 to missing)
Let's list the new variables:
. sort age
. list wage1 wage2 in 1/10

         wage1     wage2
  1.         5        10
  2.         .         5
  3.        10        15
  4.         5        10
  5.        10        15
  6.        10        15
  7.        10        15
  8.        10        15
  9.        15        20
 10.        10        15
We can now run intreg:

. intreg wage1 wage2 age age2 nev_mar rural school tenure

Fitting constant-only model:
Iteration 0:   log likelihood = -967.24956
Iteration 1:   log likelihood =  -967.1368
Iteration 2:   log likelihood =  -967.1368
Fitting full model:
Iteration 0:   log likelihood = -866.65324
Iteration 1:   log likelihood = -856.33294
Iteration 2:   log likelihood = -856.33293

Interval regression                               Number of obs   =        488
                                                  LR chi2(6)      =     221.61
                                                  Prob > chi2     =     0.0000
Log likelihood = -856.33293

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .7914438   .4433604     1.79   0.074    -.0775265    1.660414
        age2 |  -.0132624   .0073028    -1.82   0.069    -.0275757    .0010509
     nev_mar |  -.2075022   .8116581    -0.26   0.798    -1.798911    1.383906
       rural |  -3.043044   .7757524    -3.92   0.000    -4.563452   -1.522637
      school |   1.334721   .1357873     9.83   0.000     1.068583    1.600859
      tenure |   .8000664   .1045077     7.66   0.000     .5952351    1.004898
       _cons |  -12.70238   6.367117    -1.99   0.046     -25.1817   -.2230583
-------------+----------------------------------------------------------------
      /sigma |   7.299626   .2529634                      6.803827    7.795425
------------------------------------------------------------------------------

Observation summary:      0     uncensored observations
                         14  left-censored observations
                          6 right-censored observations
                        468       interval observations
We could also model these data using an ordered probit model, and we do so using oprobit:
. oprobit wagecat age age2 nev_mar rural school tenure

Iteration 0:   log likelihood =  -881.1491
Iteration 1:   log likelihood = -764.31729
Iteration 2:   log likelihood = -763.31191
Iteration 3:   log likelihood = -763.31049

Ordered probit estimates                          Number of obs   =        488
                                                  LR chi2(6)      =     235.68
                                                  Prob > chi2     =     0.0000
Log likelihood = -763.31049                       Pseudo R2       =     0.1337

------------------------------------------------------------------------------
     wagecat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .1674519   .0620333     2.70   0.007     .0458689     .289035
        age2 |  -.0027983   .0010214    -2.74   0.006    -.0048001   -.0007964
     nev_mar |  -.0046417   .1126736    -0.04   0.967     -.225478    .2161946
       rural |  -.5270036   .1100448    -4.79   0.000    -.7426875   -.3113197
      school |   .2010587   .0201189     9.99   0.000     .1616263    .2404911
      tenure |   .0989916   .0147887     6.69   0.000     .0700063     .127977
-------------+----------------------------------------------------------------
       _cut1 |   2.650637   .8957242          (Ancillary parameters)
       _cut2 |   3.941018   .8979164
       _cut3 |   5.085205   .9056579
       _cut4 |   5.875534    .912093
       _cut5 |   6.468723   .9181166
       _cut6 |   6.922726   .9215452
       _cut7 |    7.34471   .9237624
       _cut8 |   7.963441   .9338878
------------------------------------------------------------------------------
We can directly compare the log likelihoods for the intreg and oprobit models since both likelihoods are discrete. If we had point data in our intreg estimation, the likelihood would be a mixture of discrete and continuous terms and we could not compare it directly with the oprobit likelihood. In this case, the oprobit log likelihood is significantly larger (i.e., less negative), so it fits better than the intreg model.

The intreg model assumes normality, but the distribution of wages is skewed and definitely nonnormal. Normality is more closely approximated if we model the log of wages:

. gen logwage1 = log(wage1)
(14 missing values generated)
. gen logwage2 = log(wage2)
(6 missing values generated)
. intreg logwage1 logwage2 age age2 nev_mar rural school tenure

Fitting constant-only model:
Iteration 0:   log likelihood = -889.23647
Iteration 1:   log likelihood = -889.06346
Iteration 2:   log likelihood = -889.06346
Fitting full model:
Iteration 0:   log likelihood = -773.81968
Iteration 1:   log likelihood = -773.36566
Iteration 2:   log likelihood = -773.36563

Interval regression                               Number of obs   =        488
                                                  LR chi2(6)      =     231.40
                                                  Prob > chi2     =     0.0000
Log likelihood = -773.36563

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0645589   .0249954     2.58   0.010     .0155689    .1135489
        age2 |  -.0010812   .0004115    -2.63   0.009    -.0018878   -.0002746
     nev_mar |  -.0058151   .0454867    -0.13   0.898    -.0949674    .0833371
       rural |  -.2098361   .0439454    -4.77   0.000    -.2959675   -.1237047
      school |   .0804832   .0076783    10.48   0.000     .0654341    .0955323
      tenure |   .0397144   .0058001     6.85   0.000     .0283464    .0510825
       _cons |   .7084023   .3593193     1.97   0.049     .0041495    1.412656
-------------+----------------------------------------------------------------
      /sigma |   .4037381   .0143838                      .3755464    .4319297
------------------------------------------------------------------------------

Observation summary:      0     uncensored observations
                         14  left-censored observations
                          6 right-censored observations
                        468       interval observations

The log likelihood of this intreg model is very close to the oprobit log likelihood, and the z statistics for both models are very similar.
Saved Results tobit and cnreg save in e(): Scalars • 00
e{N_unc) e(N_lc) e(N_rc) e(llopt) e(ulopt) Macros e(cmd) e(depvar) e(wtype) e(wexp)
number of observations number of unc^nsored observations number of left-censored obs'ervitions number of righi-censorei obseri'ations contents of 110, if specified contents of ul(p, if specified
|e(df_m) j e(df _r) i e (r2_p) : e (F) ie(ll) |e(ll_0)
model degrees of freedom residual degrees of freedom pseudo ^-squared F statistic log likelihood log likelihood, constant-only model
tobit or cnre|; name of dependent variable weight type : weight expression
;e(cM2tvpe) je (off set) je(censored) je (predict)
Maid or LR: type of mode! *2 test offset variable specified in censoredO program used to implement predict
coefficient vector
ie(V)
variance-covariance matrix of the estimators
Matrices e(b)
Functions e(sample) marks estimation sample
intreg saves in e():

Scalars
    e(N)           number of observations
    e(N_unc)       number of uncensored observations
    e(N_lc)        number of left-censored observations
    e(N_rc)        number of right-censored observations
    e(N_int)       number of interval observations
    e(df_m)        model degrees of freedom
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(sigma)       estimate of sigma
    e(se_sigma)    standard error of sigma
    e(chi2)        chi-squared
    e(rank)        rank of e(V)
    e(rank0)       rank of e(V) for constant-only model

Macros
    e(cmd)         intreg
    e(depvar)      name(s) of dependent variable(s)
    e(title)       title in estimation output
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(opt)         type of optimization
    e(offset)      offset
    e(predict)     program used to implement predict
    e(cnslist)     constraint numbers

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(ilog)        iteration log (up to 20 iterations)

Functions
    e(sample)      marks estimation sample
Methods and Formulas
cnreg and tobit are built-in commands; intreg is implemented as an ado-file using the ml commands (see [R] ml); its robust variance computation is performed by _robust (see [P] _robust).

Let y = Xβ + ε be the model. y represents continuous outcomes—either observed or not observed. Our model assumes ε ~ N(0, σ²I).

We will describe the likelihood for intreg as it subsumes the tobit and cnreg models. For observations j ∈ C, we observe y_j; i.e., point data. Observations j ∈ L are left-censored; we know only that the unobserved y_j is less than or equal to y_Lj, a censoring value that we do know. Similarly, observations j ∈ R are right-censored; we know only that the unobserved y_j is greater than or equal to y_Rj. Observations j ∈ I are intervals; we know only that the unobserved y_j is in the interval [y_1j, y_2j].
The log likelihood is

    \ln L = -\frac{1}{2}\sum_{j\in C} w_j \left\{ \left(\frac{y_j - \mathbf{x}_j\boldsymbol{\beta}}{\sigma}\right)^2 + \log 2\pi\sigma^2 \right\}
            + \sum_{j\in L} w_j \log \Phi\!\left(\frac{y_{Lj} - \mathbf{x}_j\boldsymbol{\beta}}{\sigma}\right)
            + \sum_{j\in R} w_j \log \left\{ 1 - \Phi\!\left(\frac{y_{Rj} - \mathbf{x}_j\boldsymbol{\beta}}{\sigma}\right) \right\}
            + \sum_{j\in I} w_j \log \left\{ \Phi\!\left(\frac{y_{2j} - \mathbf{x}_j\boldsymbol{\beta}}{\sigma}\right) - \Phi\!\left(\frac{y_{1j} - \mathbf{x}_j\boldsymbol{\beta}}{\sigma}\right) \right\}

where Φ() is the standard cumulative normal and w_j is the weight for the jth observation. If no weights are specified, w_j = 1. If aweights are specified, then w_j = 1 and σ is replaced by σ/√a_j in the above, where a_j are the aweights normalized to sum to N.

Maximization is as described in [R] maximize; the estimate reported as _se or _sigma is σ.

See [U] 23.11 Obtaining robust variance estimates and [P] _robust for a description of the computation performed when robust is specified as an option to intreg.

See Tobin (1958) for the original derivation of the tobit model and Amemiya (1973) for a generalization to variable, but known, cutoffs. An introductory description of the tobit model can be found in, for instance, Johnston and DiNardo (1997, 436-441), Kmenta (1997, 562-566), Long (1997, 196-210), and Maddala (1992, 338-342).
References
Amemiya, T. 1973. Regression analysis when the dependent variable is truncated normal. Econometrica 41: 997-1016.
——. 1984. Tobit models: a survey. Journal of Econometrics 24: 3-61.
Cong, R. 2000. sg144: Marginal effects of the tobit model. Stata Technical Bulletin 56: 27-34.
Goldberger, A. S. 1983. Abnormal selection bias. In Studies in Econometrics, Time Series, and Multivariate Statistics, ed. S. Karlin, T. Amemiya, and L. A. Goodman, 67-84. New York: Academic Press.
Hurd, M. 1979. Estimation in truncated samples when there is heteroscedasticity. Journal of Econometrics 11: 247-258.
Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.
Kendall, M. G. and A. Stuart. 1973. The Advanced Theory of Statistics, vol. 2. New York: Hafner.
Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Maddala, G. S. 1992. Introduction to Econometrics. 2d ed. New York: Macmillan.
McDonald, J. and R. Moffitt. 1980. The uses of tobit analysis. Review of Economics and Statistics 62: 318-321.
Stewart, M. B. 1983. On least squares estimation when the dependent variable is grouped. Review of Economic Studies 50: 737-753.
Tobin, J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26: 24-36.
Also See Complementary:
[R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] Irtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi
Related:
[R] heckman, [R] oprobit, [R] regress, [R] svy estimators, [R] xtintreg, [R] xttobit, [P] _robust
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [U] 23.12 Obtaining scores
translate — Print and translate logs and graphs
Syntax
    print filename [, like(ext) override_options]

    translate filename_in filename_out [, translator(tname) override_options replace]

    translator query
    translator set [tname setopt setval]
    translator reset tname

    transmap query [.ext]
    transmap define .ext_new .ext_old

filename in print, in addition to being a filename to be printed, may be specified as @Results to mean the Results window, @Viewer to mean the Viewer window, and @Graph to mean the Graph window.

filename_in in translate may be specified just as filename in print.

tname in translator specifies the name of a translator; see the translator() option under Options for translate.
Description
print prints log, SMCL, and graph files. Although there is considerable flexibility in how print (and translate, which print uses) can be set to work, they have already been set up and should just work:

. print mylog.smcl
. print mylog.log
. print mygraph.gph

Unix users may discover they need to do a bit of set-up before print works; that is discussed below. European Unix users may also wish to modify the default paper size. All users can tailor print and translate to their needs.

print may also be used to print the current contents of the Results window, the Viewer, or the Graph window. For instance, the current graph could be printed by typing

. print @Graph

translate translates log, SMCL, and graph files from one format to another, the other typically being suitable for printing. An additional use of translate is to translate SMCL logs (logs created by typing, say, 'log using mylog') to ASCII text:

. translate mylog.smcl mylog.log
Another cute use of translate is to recover a log when you have forgotten to start one. You may type

. translate @Results mylog.txt

to capture as ASCII text what is currently shown in the Results window. (Unix (console) users may not do this.)

This entry provides a general overview of print and translate, and covers in detail the printing and translation of text (nongraphic) files. See [G] translate for an in-depth discussion of printing and translating graph files.
Options for print
like(ext) specifies how the file should be translated to a form suitable for printing. The default is to determine the translation method from the extension of filename. Thus, mylog.smcl is translated according to the rule for translating smcl files, myfile.gph is translated according to the rule for translating gph files, and so on. (These rules are, in fact, the smcl2prn and gph2prn translators, but put that aside for the moment.) Rules for the following extensions are predefined:

    .txt     assume input file contains ASCII text
    .log     assume input file contains Stata ASCII text log
    .smcl    assume input file contains SMCL
    .gph     assume input file contains Stata graph

If you wish to print a file that has an extension different from those listed above, you can define a new extension, but you do not have to do that. Assume you wish to print the file read.me, a file you know to contain ASCII text. If you were just to type 'print read.me', you would be told that Stata cannot translate .me files. (You would literally be told that the translator for me2prn was not found.) You could type 'print read.me, like(txt)' to tell print to print read.me like a .txt file. On the other hand, you could type

. transmap define .me .txt

to tell Stata that .me files are always to be treated like .txt files. If you did that, Stata would remember the new rule even in future sessions.

When you specify option like(), you override the recorded rules, so if you were to type 'print mylog.smcl, like(txt)', the file would be printed as ASCII text (meaning all the SMCL commands would show).

override_options refers to translator override options. print uses translate to translate the file into a format suitable for sending to the printer. To find out what is overrideable for X files (files ending in .X), type 'translator query X2prn'. For instance, to find out what is overrideable for the printing of SMCL files, type

. translator query smcl2prn
 (output omitted)

In the omitted output (which varies across operating systems), we might learn that there is an 'rmargin #' tunable value which specifies the right margin in inches. We could specify the override_option rmargin(#) to temporarily override the default value, or we could type 'translator set smcl2prn rmargin #' beforehand to permanently reset the value.
Alternatively, on some computers with some translators, you might discover that nothing can be set.
Options for translate
translator(tname) specifies the name of the translator to be used to translate the file. The available translators are

    tname          input                       output
    ------------------------------------------------------------------
    smcl2ps        SMCL                        PostScript
    log2ps         Stata ASCII text log        PostScript
    txt2ps         generic ASCII text file     PostScript
    Viewer2ps      Viewer window               PostScript
    Results2ps     Results window              PostScript
    smcl2prn       SMCL                        default printer format
    log2prn        Stata ASCII text log        default printer format
    txt2prn        generic ASCII text file     default printer format
    Results2prn    Results window              default printer format
    Viewer2prn     Viewer window               default printer format
    smcl2txt       SMCL                        generic ASCII text file
    smcl2log       SMCL                        Stata ASCII text log
    Results2txt    Results window              generic ASCII text file
    Viewer2txt     Viewer window               generic ASCII text file
    gph2ps         Stata graph                 PostScript
    gph2prn        Stata graph                 default printer format
    Graph2prn      Graph window                default printer format
    Graph2ps       Graph window                PostScript
If translator() is not specified, translate determines which translator to use based on the extensions of the filenames specified. Typing 'translate myfile.smcl myfile.ps' would use the smcl2ps translator; typing 'translate myfile.smcl myfile.ps, translator(smcl2prn)' would override the default and use the smcl2prn translator.

Actually, when you type 'translate a.b c.d', translate looks up .b in the transmap extension-synonym table. If .b is not found, the translator b2d is used. If .b is found in the table, the mapped extension is used (call it b'), and then the translator b'2d is used. For example,

    command                              translator used
    -----------------------------------------------------------------------
    translate myfile.smcl myfile.ps      smcl2ps
    translate myfile.odd myfile.ps       odd2ps, which does not exist, so error
    transmap define .odd .txt
    translate myfile.odd myfile.ps       txt2ps

You can list the mappings that translate uses by typing 'transmap query'.
]
In the omitted output, w4 might learn that there is! an 'rmargin #' tunable value which, in the case of Iog2ps, specifies the right margin irt inches. We could specify the override ^option nnargin(t) to temporarily override the default value, or we could type 'translator set Iog2ps rmargin #' beforehand to permanently reset thi value. replace specifies that fileAameout might already eiist and, if so, that it is to be replaced.
Remarks
Remarks below are presented under the headings
    Printing files
    Printing files, Windows and Macintosh
    Printing files, Unix
    Translating files from one format to another
Printing files
As we said at the outset, printing should be easy; just type

. print mylog.smcl
. print mylog.log
. print mygraph.gph

print can print SMCL files, ASCII text files, and graph files, and even the contents of the Results, Viewer, and Graph windows:

. print @Results
. print @Viewer
. print @Graph

For a complete description of printing graph files, see [G] translate. We will confine the rest of our discussion to the printing of files containing some form of text.
Printing files, Windows and Macintosh
When you type print, you are using the same facility that you would be using had you pulled down File and selected Print. The only other thing to know is that if you try to print a file that Stata does not know about, Stata will complain:

. print read.me
translator me2prn not found
(perhaps you need to specify the like() option)
r(111);

In that case, you could type

. print read.me, like(txt)

to indicate that you wanted read.me sent to the printer in the same fashion as if the file were named readme.txt, or you could type

. transmap define .me .txt
. print read.me

In this case, you are telling Stata once and for all that you want files ending in .me to be treated in the same way as files ending in .txt. Stata will remember this mapping even across sessions.

If you want to clear the .me mapping, type

. transmap define .me

If you want to see all the mappings, type

. transmap query
If you want to print to a file, use the translate command, not print:

. translate mylog.smcl mylog.prn

translate prints to a file using the Windows or Macintosh print driver when the new filename ends in .prn.
Printing files, Unix
Stata assumes you have a PostScript printer attached to your Unix computer and that the Unix command lpr(1) can be used to send PostScript files to it, but you can change this. It is possible that, on your Unix system, typing

    mycomputer$ lpr < filename

is not sufficient to print PostScript files. For instance, perhaps on your system you would need to type

    mycomputer$ lpr -Plexmark < filename

or perhaps you need

    mycomputer$ lpr -Plexmark filename

or perhaps you need something else.

To set the print command to be 'lpr -Plexmark @' and to state that the printer expects to receive PostScript files, type

. printer define prn ps "lpr -Plexmark @"

To set the print command to 'lpr -Plexmark < @' and to state that the printer expects to receive ASCII text files, type

. printer define prn txt "lpr -Plexmark < @"

That is, just type the command necessary to send files to your printer and include an @ sign where the filename should be substituted. Two file formats are available: ps and txt. The default setting, as shipped from the factory, is

. printer define prn ps "lpr < @"

We will return to the printer command in the technical note that follows because it has some other capabilities you should know about.

In any case, perhaps after redefining the default printer, the following should just work:

. print mylog.smcl
. print mylog.log
. print mygraph.gph

The only other thing to know is that if you try to print a file that Stata does not know about, it will complain:

. print read.me
translator me2prn not found
r(111);

In that case, you could type

. print read.me, like(txt)
to indicate you wanted read.me sent to the printer in the same fashion as if the file were named readme.txt, or you could type

. transmap define .me .txt
. print read.me

In this case, you are telling Stata once and for all that you want files ending in .me to be treated in the same way as files ending in .txt. Stata will remember this setting for .me, even across sessions.

If you want to clear the .me setting, type

. transmap define .me

If you want to see all your settings, type

. transmap query
Q Technical Note The syntax of the printer command is printer define printemame [ [ p s j t x t ] "Unix command with @" ] printer query [printemame] and you may define multiple printers. By default, print uses the printer named prn, but print has the syntax print filename [, like (ex/) pr inter (printemame') override-options] and so, if you define multiple printers, you may route your output to them. For instance, if you have a second printer on your system, you might type . printer define lermark ps "Ipr -Plexmark < ®" After doing that, you could type . print myfile.smcl, printer(lexmark)
Any printers that you set will be remembered even across sessions. You can delete printers:
. printer define lexmark
You can list all the defined printers by typing 'printer query', and you can list the definition of a particular printer, say prn, by typing 'printer query prn'. The default printer prn we have predefined for you is
. printer define prn ps "lpr < @"
which means we assume it is a PostScript printer and that the Unix command lpr(1), without options, is sufficient to cause files to print. Feel free to change the default definition. If you change it, the change will be remembered across sessions.
Technical Note

When you pull down File and select Print, the result is exactly the same as if you were to issue the print command.
The translation to PostScript format is done by translate and, in particular, is performed by the translators smcl2ps, log2ps, and txt2ps. There are lots of tunable parameters in each of these translators. The current values of the tunable parameters for, say, smcl2ps can be shown by typing
. translator query smcl2ps
 (output omitted)
and you can set any of the tunable parameters (for instance, setting smcl2ps's rmargin value to 1) by typing
. translator set smcl2ps rmargin 1
 (output omitted)
Any settings you make will be remembered across sessions. You can reset smcl2ps to be as it was when Stata was shipped by typing
. translator reset smcl2ps
Translating files from one format to another

If you have a SMCL log, a log you might have created by previously typing 'log using mylog', you can translate it to an ASCII text log by typing
. translate myfile.smcl myfile.log
and you could translate it to a PostScript file by typing
. translate myfile.smcl myfile.ps
translate translates files from one format to another and, in fact, it is translate that print uses to produce a file suitable for sending to the printer. When you type
. translate a.b c.d
translate looks for the predefined translator b2d and uses that to perform the translation. If there is a transmap synonym for b, however, the mapped value b' is used: b'2d. Only certain translators exist, and they are listed under the description of the translate() option in Options for translate, above, or you can type
. translator query
for a complete (and perhaps more up-to-date) list.
Anyway, translate forms the name b2d or b'2d, and if the translator does not exist, translate issues an error. With the translator() option, you can specify exactly which translator to use and, in that case, it does not matter how your files are named.
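The naming rule just described (take each file's extension, apply any transmap synonym, and join the two results with a "2") can be sketched outside Stata. The helper below is a hypothetical Python illustration; translator_name is not a real Stata function:

```python
# Hypothetical sketch of how translate picks a translator name: take each
# file's extension, apply any transmap synonym, then join them with "2".

def translator_name(infile, outfile, transmap=None):
    """Return the translator name that '. translate infile outfile' implies."""
    transmap = transmap or {}
    src = infile.rsplit(".", 1)[-1]
    dst = outfile.rsplit(".", 1)[-1]
    # a transmap synonym replaces an extension before the name is formed
    return f"{transmap.get(src, src)}2{transmap.get(dst, dst)}"

print(translator_name("a.b", "c.d"))                         # b2d
print(translator_name("myfile.smcl", "myfile.ps"))           # smcl2ps
print(translator_name("read.me", "out.prn", {"me": "txt"}))  # txt2prn
```

With no synonym defined, read.me would imply the nonexistent translator me2prn, which is exactly the "translator me2prn not found" error shown earlier.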
The only other thing to know is that some translators have tunable parameters that affect how they perform their translation. You can type
. translator query translator_name
to find out what those parameters are. Some translators have no tunable parameters and some have lots:
. translator query smcl2ps
                  header  on
              headertext
                    logo  on
                    user
             projecttext
               cmdnumber  on
                fontsize  9
                pagesize  letter
               pagewidth  8.50
              pageheight  11.00
                 lmargin  1.00
                 rmargin  1.00
                 tmargin  1.00
                 bmargin  1.00
                  scheme  monochrome
      cust1_result_color  0 0 0
    cust1_standard_color  0 0 0
       cust1_error_color  255 0 0
       cust1_input_color  0 0 0
        cust1_link_color  0 0 255
      cust1_hilite_color  0 0 0
       cust1_result_bold  on
     cust1_standard_bold  off
        cust1_error_bold  on
        cust1_input_bold  off
         cust1_link_bold  off
       cust1_hilite_bold  on
    cust1_link_underline  on
  cust1_hilite_underline  off
      cust2_result_color  0 0 0
    cust2_standard_color  0 0 0
       cust2_error_color  255 0 0
       cust2_input_color  0 0 0
        cust2_link_color  0 0 255
      cust2_hilite_color  0 0 0
       cust2_result_bold  on
     cust2_standard_bold  off
        cust2_error_bold  on
        cust2_input_bold  off
         cust2_link_bold  off
       cust2_hilite_bold  on
    cust2_link_underline  on
  cust2_hilite_underline  off
You can temporarily override any setting by specifying the option setopt(setval) on the translate (or print) command. For instance,
. translate myfile.smcl myfile.ps, cmdnumber(off)
or you can reset the value permanently by typing
. translator set smcl2ps setopt setval
For instance,
. translator set smcl2ps cmdnumber off
If you reset a value, Stata will remember the change, even in future sessions.
Windows and Macintosh users: The smcl2ps (and the other *2ps) translators are not used by print even when you have a PostScript printer attached to your computer. Instead, the Windows or Macintosh print driver is used. Resetting smcl2ps values will not affect printing; instead, you change the defaults in the Printers Control Panel under Windows and by pulling down File and selecting Page Setup... under Macintosh. You can, however, translate files yourself using the smcl2ps translator and the other *2ps translators.
Saved Results

'transmap query .ext' saves in macro r(suffix) the mapped extension (without the leading period), or saves ext if the ext is not mapped.
'translator query translatorname' saves setval in macro r(setopt) for every setopt, setval pair.
'printer query printername' (Unix only) saves in macro r(suffix) the "filetype" of the input the printer expects (currently "ps" or "txt") and, in macro r(command), the command to send output to the printer.
Also See
Related:
[R] log, [G] translate, [P] smcl
Complementary:
[R] >ji
Background:
[U] 18 Printing and preserving output
treatreg — Treatment effects model
We will assume that the wife went to college if her educational attainment is more than 12 years. Let wc be the dummy variable indicating whether the individual went to college. With this definition, our sample contains the following distribution of college education.
. gen wc = 0
. replace wc = 1 if we > 12
(69 real changes made)
. tab wc
         wc |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        181       72.40       72.40
          1 |         69       27.60      100.00
------------+-----------------------------------
      Total |        250      100.00
We will model the wife's wage as a function of her age, whether the family was living in a big city, and whether she went to college. An ordinary least squares estimation produces the following results:

. regress ww wa cit wc

      Source |       SS       df       MS              Number of obs =     250
-------------+------------------------------           F(  3,   246) =    4.82
       Model |  93.2398668     3  31.0799523           Prob > F      =  0.0028
    Residual |  1587.08776   246  6.45157627           R-squared     =  0.0555
-------------+------------------------------           Adj R-squared =  0.0440
       Total |   1680.32762   249  6.74830369          Root MSE      =    2.54

------------------------------------------------------------------------------
          ww |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          wa |  -.0104985   .0192667    -0.54   0.586    -.0484472    .0274502
         cit |   .1278922   .3389058     0.38   0.706    -.5396351    .7954195
          wc |   1.332192   .3644344     3.66   0.000     .6143819    2.050001
       _cons |   2.278337   .8432385     2.70   0.007     .6174489    3.939225
------------------------------------------------------------------------------
Is 1.332 a consistent estimate of the marginal effect of a college education on wages? If individuals choose whether or not to attend college, and the error term of the model that gives rise to this choice is correlated with the error term in the wage equation, then the answer is no. (See Barnow et al. 1981 for a good discussion of the existence and sign of selectivity bias.) One might suspect that individuals with higher abilities, either innate or due to the circumstances of their birth, would be more likely to go to college and to earn higher wages. Such ability is, of course, unobserved. Furthermore, if the error term in our model for going to college is correlated with ability, and the error term in our wage equation is correlated with ability, then the two error terms should be positively correlated. These conditions make the problem of signing the selectivity bias equivalent to an omitted-variable problem. In the case at hand, because we would expect the correlation between the omitted variable and a college education to be positive, we suspect that OLS is biased upwards.
To account for the bias, we fit the treatment effects model. We model the wife's college decision as a function of her mother's and her father's educational attainment. Thus, we are interested in estimating the model

    ww  = β0 + β1 wa + β2 cit + δ wc + ε
    wc* = γ0 + γ1 wmed + γ2 wfed + u

where

    wc = 1, if wc* > 0; i.e., the wife went to college
    wc = 0, otherwise
and where ε and u have a bivariate normal distribution with zero mean and covariance matrix

    [ σ²   ρσ ]
    [ ρσ    1 ]
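The selection mechanism above is easy to simulate. The sketch below (Python rather than Stata, with made-up parameter values; an illustration, not part of treatreg) draws (ε, u) from a bivariate normal with correlation ρ and shows that the raw difference in group means overstates the true treatment effect δ when ρ > 0:

```python
# Simulate ww = b0 + d*wc + e, with wc = 1 when g0 + u > 0 and
# (e, u) bivariate normal: Var(e) = sigma^2, Var(u) = 1, Corr(e, u) = rho.
import random

random.seed(1)
b0, d, sigma, rho, g0 = 2.0, 1.3, 2.5, 0.5, 0.0

sums = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}
for _ in range(50_000):
    u = random.gauss(0, 1)
    # build e with the required correlation with u
    e = sigma * (rho * u + (1 - rho**2) ** 0.5 * random.gauss(0, 1))
    wc = 1 if g0 + u > 0 else 0
    counts[wc] += 1
    sums[wc] += b0 + d * wc + e

diff = sums[1] / counts[1] - sums[0] / counts[0]
print(round(diff, 2))  # noticeably larger than the true effect d = 1.3
```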
The following output gives the maximum likelihood estimates of the parameters of this model.

. treatreg ww wa cit, treat(wc = wmed wfed)
Iteration 0:  log likelihood = -707.07232
Iteration 1:  log likelihood = -707.07215
Iteration 2:  log likelihood = -707.07215

Treatment effects model — MLE                   Number of obs   =        250
                                                Wald chi2(3)    =       4.11
Log likelihood = -707.07215                     Prob > chi2     =     0.2501

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ww           |
          wa |  -.0110424   .0199652    -0.55   0.580    -.0501735    .0280887
         cit |    .127636   .3361938     0.38   0.704    -.5312917    .7866638
          wc |   1.271327   .7412951     1.72   0.086    -.1815842    2.724239
       _cons |   2.318638   .9397573     2.47   0.014     .4767478    4.160529
-------------+----------------------------------------------------------------
wc           |
        wmed |   .1198555   .0320056     3.74   0.000     .0570757    .1826352
        wfed |   .0961886   .0290868     3.31   0.001     .0391795    .1531977
       _cons |  -2.631876   .3309128    -7.95   0.000    -3.280453   -1.983299
-------------+----------------------------------------------------------------
     /athrho |   .0178668   .1899898     0.09   0.925    -.3545063    .3902399
    /lnsigma |   .9241584   .0447455    20.65   0.000     .8364588    1.011858
-------------+----------------------------------------------------------------
         rho |   .0178649   .1899291                     -.3403659     .371567
       sigma |   2.519747   .1127473                      2.308179    2.750707
      lambda |   .0450149   .4786442                     -.8931105    .9831404
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):  chi2(1) =     0.01   Prob > chi2 = 0.9251
In the input, we specified that the continuous dependent variable, ww (wife's wage), is a linear function of cit and wa. Note the syntax for the treatment variable. The treatment wc is not included in the first variable list; it is specified in the treat() option. In this example, wmed and wfed are specified as the exogenous covariates in the treatment equation.
The output has the form of many two-equation estimators in Stata. We note that our conjecture that the OLS estimate was biased upwards is verified. But perhaps more interesting is that the size of the bias is negligible, and the likelihood-ratio test at the bottom of the output indicates that we cannot reject the null hypothesis that the two error terms are uncorrelated. This result might be due to several specification errors. We ignored the selectivity bias due to the endogeneity of entering the labor market. We have also written both the wage equation and the college education equation in linear form, ignoring any higher-power terms or interactions.
The results for the two ancillary parameters require explanation. For numerical stability during optimization, treatreg does not directly estimate ρ or σ. Instead, treatreg estimates the inverse hyperbolic tangent of ρ,

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }
and ln σ. Also, treatreg reports λ = ρσ, along with an estimate of the standard error of the estimate and its confidence interval.
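Because /athrho and /lnsigma are what treatreg actually estimates, the rho, sigma, and lambda lines of the output can be recovered from them by hand, and the delta-method standard error of λ can be approximated as well. A quick numeric check in Python, with point estimates copied from the output; as a simplifying assumption, the covariance between the two ancillary estimates (which the output does not display) is taken as zero:

```python
# Recover rho, sigma, and lambda = rho*sigma from the directly estimated
# ancillary parameters, and approximate se(lambda) by the delta method.
import math

athrho, se_athrho = 0.0178668, 0.1899898    # /athrho row of the output
lnsigma, se_lnsigma = 0.9241584, 0.0447455  # /lnsigma row of the output

rho = math.tanh(athrho)    # inverse of atanh
sigma = math.exp(lnsigma)  # inverse of ln
lam = rho * sigma

# Delta method: Jacobian of lambda w.r.t. (atanh rho, ln sigma) is
# [sigma*(1 - rho^2), rho*sigma]; the cross-covariance is assumed zero.
d1, d2 = sigma * (1 - rho**2), rho * sigma
se_lam = math.sqrt(d1**2 * se_athrho**2 + d2**2 * se_lnsigma**2)

print(round(rho, 7), round(sigma, 6), round(lam, 7), round(se_lam, 4))
```

The results line up closely with the rho, sigma, and lambda rows reported by treatreg.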
Example

Stata will also produce a two-step estimator of the model with the twostep option. Maximum likelihood estimation of the parameters can be time-consuming with large datasets, and the two-step estimates may provide a good alternative in such cases. Continuing with the women's wage model, we can obtain the two-step estimates with consistent covariance estimates by typing

. treatreg ww wa cit, treat(wc = wmed wfed) twostep

Treatment effects model — two-step estimates    Number of obs   =        250
                                                Wald chi2(3)    =       3.67
                                                Prob > chi2     =     0.2998

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ww           |
          wa |  -.0111623    .020152    -0.55   0.580    -.0506594    .0283348
         cit |   .1276102     .33619     0.38   0.704      -.53131    .7865305
          wc |   1.257995   .8007428     1.57   0.116    -.3114319    2.827422
       _cons |   2.327482   .9610271     2.42   0.015     .4439031     4.21106
-------------+----------------------------------------------------------------
wc           |
        wmed |   .1198888   .0319859     3.75   0.000     .0571976    .1825801
        wfed |   .0960764   .0290581     3.31   0.001     .0391236    .1530292
       _cons |  -2.631496   .3308344    -7.95   0.000    -3.279919   -1.983072
-------------+----------------------------------------------------------------
hazard       |
      lambda |   .0548738   .5283928     0.10   0.917    -.9807571    1.090505
-------------+----------------------------------------------------------------
         rho |    0.02178
       sigma |  2.5198211
      lambda |  .05487379   .5283928
------------------------------------------------------------------------------
The reported lambda (λ) is the parameter estimate on the hazard from the augmented regression. The augmented regression is derived in Maddala (1983) and is presented in the Methods and Formulas section below.
The default statistic produced by predict after treatreg is the expected value of the dependent variable from the underlying distribution of the regression model. For the case at hand, this model is

    ww = β0 + β1 wa + β2 cit + δ wc
Several other interesting aspects of the treatment effects model can be explored with predict. Continuing with our wage model, the wife's expected wage, conditional on attending college, can be obtained with the yctrt option. The wife's expected wage, conditional on not attending college, can be obtained with the ycntrt option. Thus, the difference in expected wages between participants and nonparticipants is the difference between yctrt and ycntrt. For the case at hand, we have the following calculation:

. predict wwctrt, yctrt
. predict wwcntrt, ycntrt
. gen diff = wwctrt - wwcntrt
. summarize diff

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diff |     250    1.386912    .0134202    1.34558    1.420173
Technical Note

The difference in expected earnings between participants and nonparticipants is

    E(ww | wc = 1) − E(ww | wc = 0) = δ + ρσ φ(wγ) / [ Φ(wγ) { 1 − Φ(wγ) } ]

where φ is the standard normal density and Φ is the standard normal cumulative distribution function. If the correlation between the error terms, ρ, is zero, then the problem reduces to one estimable by OLS, and the difference is simply δ. Since ρ is positive in our example, we see that least squares overestimates the treatment effect, δ.
Saved Results

treatreg saves in e():

Scalars
    e(N)         number of observations
    e(k)         number of variables
    e(k_eq)      number of equations
    e(k_dv)      number of dependent variables
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(N_clust)   number of clusters
    e(lambda)    lambda
    e(selambda)  standard error of lambda
    e(rc)        return code
    e(chi2)      chi-squared
    e(chi2_c)    chi-squared for comparison test
    e(p_c)       p-value for comparison test
    e(p)         significance
    e(rho)       rho
    e(ic)        number of iterations
    e(rank)      rank of e(V)
    e(sigma)     estimate of sigma

Macros
    e(cmd)       treatreg
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(wtype)     weight type
    e(wexp)      weight expression
    e(clustvar)  name of cluster variable
    e(method)    requested estimation method
    e(vcetype)   covariance estimation method
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  Wald or LR; type of model chi-squared test
    e(chi2_ct)   Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(hazard)    variable containing hazard
    e(predict)   program used to implement predict
    e(cnslist)   constraint numbers

Matrices
    e(b)         coefficient vector
    e(ilog)      iteration log (up to 20 iterations)
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample
Methods and Formulas

treatreg is implemented as an ado-file.
Maddala (1983, 117-122) derives both the maximum likelihood and the two-step estimator implemented here. Greene (2000, 933-934) also provides an introduction to the treatment effects model.
The primary regression equation of interest is

    yj = xj β + δ zj + εj

where zj is a binary decision variable. The binary variable is assumed to stem from an unobservable latent variable

    zj* = wj γ + uj
The decision to obtain the treatment is made according to the rule

    zj = 1, if zj* > 0
    zj = 0, otherwise

where ε and u are bivariate normal with mean zero and covariance matrix
    [ σ²   ρσ ]
    [ ρσ    1 ]

The likelihood function for this model is given in Maddala (1983, 122). Greene (2000, 180) discusses the standard method of reducing a bivariate normal to a function of a univariate normal and the correlation ρ. The following is the log likelihood for observation j:

    ln Lj = ln Φ[ { wj γ + (yj − xj β − δ) ρ/σ } / √(1 − ρ²) ]
            − (1/2) { (yj − xj β − δ)/σ }² − ln( √(2π) σ ),     zj = 1

    ln Lj = ln Φ[ −{ wj γ + (yj − xj β) ρ/σ } / √(1 − ρ²) ]
            − (1/2) { (yj − xj β)/σ }² − ln( √(2π) σ ),         zj = 0

where Φ(·) is the cumulative distribution function of the standard normal distribution.
In the maximum likelihood estimation, σ and ρ are not directly estimated. Directly estimated are ln σ and atanh ρ, where

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }

The standard error of λ = ρσ is approximated through the delta method, which is given by

    Var(λ) ≈ D Var{ (atanh ρ, ln σ) } D′
where D is the Jacobian of λ with respect to atanh ρ and ln σ.
Maddala (1983, 120-122) also derives the two-step estimator. In the first stage, one obtains probit estimates of the treatment equation

    Pr(zj = 1 | wj) = Φ(wj γ)
From these estimates, the hazard, hj, for each observation j is computed as

    hj = φ(wj γ̂) / Φ(wj γ̂),             zj = 1
    hj = −φ(wj γ̂) / { 1 − Φ(wj γ̂) },    zj = 0

where φ is the standard normal density function. We also define

    dj = hj ( hj + wj γ̂ )

Then,

    E(yj | zj)   = xj β + δ zj + ρσ hj
    Var(yj | zj) = σ² ( 1 − ρ² dj )
The two-step parameter estimates of β and δ are obtained by augmenting the regression equation with the hazard h. Thus, the regressors become [ x z h ], and we obtain the additional parameter estimate βh on the variable containing the hazard. A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the hazard:

    σ̂² = ( e′e + βh² Σj dj ) / N

The two-step estimate of ρ is then

    ρ̂ = βh / σ̂
We will now describe how the consistent estimates of the coefficient covariance matrix based on the augmented regression are derived. Let A = [ x z h ], and let D be a square diagonal matrix of size N with ( 1 − ρ̂² dj ) on the diagonal elements. The covariance matrix estimate is

    V_twostep = σ̂² (A′A)⁻¹ (A′DA + Q) (A′A)⁻¹

where

    Q = ρ̂² (A′DA) Vp (A′DA)

and Vp is the variance-covariance estimate from the probit estimation of the treatment equation.
References
Barnow, B., G. Cain, and A. Goldberger. 1981. Issues in the analysis of selectivity bias. In Evaluation Studies Review Annual, vol. 5. Beverly Hills: Sage Publications.
Berndt, E. 1991. The Practice of Econometrics. New York: Addison-Wesley.
Cong, R. and D. M. Drukker. 2000. sg141: Treatment effects model. Stata Technical Bulletin 55: 25-33.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Also See Complementary:
[R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] xi
Related:
[R] heckman, [R] probit
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] maximize
Title
truncreg — Truncated regression
Syntax
truncreg depvar [varlist] [weight] [if exp] [in range] [, ll(varname | #) ul(varname | #) noconstant offset(varname) marginal at(matname) robust cluster(varname) score(newvarlist) level(#) constraints(numlist) noskip nolog maximize_options ]
by ... : may be used with truncreg; see [R] by.
aweights, fweights, and pweights are allowed; see [U] 14.1.6 weight.
This command shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

    xb        xj b, fitted values (the default)
    pr(a,b)   Pr(a < yj < b)
    stdp      standard error of the prediction
    stdf      standard error of the forecast

where a and b may be numbers or variables; a equal to '.' means −∞; b equal to '.' means +∞.
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
truncreg estimates a regression model of depvar on varlist from a sample drawn from a restricted part of the population. Under the normality assumption for the whole population, the error terms in the truncated regression model have a truncated normal distribution. The truncated normal distribution is a normal distribution that has been scaled upward so that the distribution integrates to one over the restricted range.
Options

ll(varname | #) and ul(varname | #) indicate the truncation points. You may specify one or both. ll() indicates the lower limit for left-truncation and ul() indicates the upper limit for right-truncation. Observations with depvar ≤ ll() are left-truncated, observations with depvar ≥ ul() are right-truncated, and the remaining observations are not truncated. See [R] tobit for a more detailed description.
noconstant suppresses the constant term (intercept) in the model.
offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.
marginal estimates the marginal effects in the model in the subpopulation. marginal may be specified when the model is estimated or when the results are redisplayed.
at(matname) specifies the point around which the marginal effect is to be estimated. The default is to estimate the effect around the mean of the independent variables. If there are k independent variables, matname should be 1 x k. at() may be specified when the model is estimated or when the results are redisplayed. at() implies marginal; specifying marginal at() is equivalent to typing at() by itself.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, then robust is implied; see [U] 23.13 Weighted estimation.
cluster(varname) specifies that the observations are independent across groups; varname specifies to which group each observation belongs. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.
score(newvarlist) creates two new variables containing the contributions to the scores for the equation and the ancillary parameter in the model. The first new variable specified will contain u1j = ∂lnLj/∂(xj b) for each observation j in the sample, where lnLj is the jth observation's contribution to the log likelihood. The second new variable specified will contain u2j = ∂lnLj/∂σ. The scores can be obtained by summing over j. See [U] 23.12 Obtaining scores.
level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.
noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.
nolog suppresses the iteration log.
maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict
See [R] tobit for a detailed description.
Remarks

Truncated regression

Truncated regression estimates a model of a dependent variable on independent variables from a restricted part of a population. Truncation is essentially a characteristic of the distribution from which the sample data are drawn. If x has a normal distribution with mean μ and standard deviation σ, then the density of the truncated normal distribution is

    f(x | a < x < b) = (1/σ) φ( (x − μ)/σ ) / [ Φ( (b − μ)/σ ) − Φ( (a − μ)/σ ) ]

where φ and Φ are the density and distribution functions of the standard normal distribution. Compared with the mean of the untruncated variable, the mean of the truncated variable is greater if the truncation is from below, and the mean of the truncated variable is smaller if the truncation is from above. Moreover, truncation reduces the variance compared with the variance in the untruncated distribution.
Example

We will demonstrate truncreg using a subset of the Mroz dataset distributed with Berndt (1991). This dataset contains 753 observations on women's labor supply. Our subsample is of 250 observations, with 150 market laborers and 100 nonmarket laborers.

. use laborsub, clear
. describe

Contains data from laborsub.dta
  obs:           250
 vars:             6                          28 Mar 2000 14:49
 size:         2,750 (99.6% of memory free)

variable name   storage type   display format   variable label
lfp             byte           %9.0g            1 if woman worked in 1975
whrs            int            %9.0g            Wife's hours of work
kl6             byte           %9.0g            # of children younger than 6
k618            byte           %9.0g            # of children between 6 and 18
wa              byte           %9.0g            Wife's age
we              byte           %9.0g            Wife's educational attainment

Sorted by:
. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         lfp |     250          .6    .4908807          0          1
        whrs |     250      799.84    915.6035          0       4950
         kl6 |     250        .236    .5112234          0          3
        k618 |     250       1.364    1.370774          0          8
          wa |     250       42.92    8.426483         30         60
          we |     250      12.352    2.164912          5         17
We first perform an ordinary least squares estimation on the market laborers.

. regress whrs kl6 k618 wa we if whrs > 0

      Source |       SS       df       MS              Number of obs =     150
-------------+------------------------------           F(  4,   145) =    2.80
       Model |  7326995.15     4  1831748.79           Prob > F      =  0.0281
    Residual |  94793104.2   145  653745.546           R-squared     =  0.0717
-------------+------------------------------           Adj R-squared =  0.0461
       Total |   102120099   149  685369.794           Root MSE      =  808.55

------------------------------------------------------------------------------
        whrs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         kl6 |  -421.4822   167.9734    -2.51   0.013    -753.4748   -89.48953
        k618 |  -104.4571   54.18616    -1.93   0.056    -211.5538    2.639666
          wa |  -4.784917   9.690502    -0.49   0.622     -23.9378    14.36797
          we |   9.353195   31.23793     0.30   0.765    -52.38731     71.0937
       _cons |   1629.817   615.1301     2.65   0.009     414.0371    2845.596
------------------------------------------------------------------------------
Now, we use truncreg to perform truncated regression with truncation from below zero.

. truncreg whrs kl6 k618 wa we, ll(0)
(note: 100 obs. truncated)
Fitting full model:
Iteration 0:  log likelihood = -1205.6992
Iteration 1:  log likelihood = -1200.9873
Iteration 2:  log likelihood = -1200.9159
Iteration 3:  log likelihood = -1200.9157
Iteration 4:  log likelihood = -1200.9157

Truncated regression
Limit:  lower =          0                      Number of obs   =        150
        upper =       +inf                      Wald chi2(4)    =      10.05
Log likelihood = -1200.9157                     Prob > chi2     =     0.0395

------------------------------------------------------------------------------
        whrs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
         kl6 |  -803.0042   321.3614    -2.50   0.012    -1432.861   -173.1474
        k618 |   -172.875   88.72898    -1.95   0.051    -346.7806    1.030578
          wa |  -8.821123   14.36848    -0.61   0.539    -36.98283    19.34059
          we |   16.52873   46.50375     0.36   0.722    -74.61695    107.6744
       _cons |    1586.26    912.355     1.74   0.082    -201.9233    3374.442
-------------+----------------------------------------------------------------
sigma        |
       _cons |   983.7262   94.44303    10.42   0.000     798.6213    1168.831
------------------------------------------------------------------------------
If we assume that our data were censored, the tobit model is

. tobit whrs kl6 k618 wa we, ll(0)

Tobit estimates                                 Number of obs   =        250
                                                LR chi2(4)      =      23.03
                                                Prob > chi2     =     0.0001
Log likelihood = -1367.0903                     Pseudo R2       =     0.0084

------------------------------------------------------------------------------
        whrs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         kl6 |  -827.7657   214.7407    -3.85   0.000    -1250.731   -404.8009
        k618 |  -140.0192   74.22303    -1.89   0.060    -286.2128    6.174543
          wa |  -24.97919   13.25639    -1.88   0.061    -51.08969    1.131316
          we |   103.6896   41.82393     2.48   0.014     21.31093    186.0683
       _cons |   589.0001   841.5467     0.70   0.485    -1068.556    2248.656
-------------+----------------------------------------------------------------
         _se |   1309.909   82.73335             (Ancillary parameter)
------------------------------------------------------------------------------
  Obs. summary:        100  left-censored observations at whrs<=0
                       150  uncensored observations

Technical Note
Whether truncated regression is more appropriate than ordinary least squares estimation depends on the purpose of that estimation. If we are interested in the mean of the wife's working hours conditional on the subsample of market laborers, least squares estimation is appropriate. However, if it is the mean of the wife's working hours regardless of market or nonmarket labor status that we are interested in, least squares estimates could be seriously misleading.
It should be understood that truncation and censoring are different concepts. A sample has been censored if no observations have been systematically excluded, but some of the information contained in them has been suppressed. In a truncated distribution, only the part of the distribution above (or below, or between) the truncation point(s) is relevant to our computations. We need to scale it up by the probability that an observation falls in the range that interests us to make the distribution integrate to one. The censored distribution used by tobit, however, is a mixture of discrete and continuous distributions. Instead of rescaling over the observable range, we simply assign the full probability from the censored region(s) to the censoring point(s).
The truncated regression model is sometimes less well behaved than the tobit model. Davidson and MacKinnon (1993) provide an example where truncation results in more inconsistency than censoring.
Marginal Effects

The marginal effects in the truncation model in the subpopulation can be obtained by specifying the marginal option. at(matname) specifies the points around which the marginal effects are to be estimated. The default is to estimate the effect around the mean of the independent variables.
Example

The marginal effects around the mean of the independent variables, conditional on the subpopulation, for our previous truncated regression are given by
. truncreg whrs kl6 k618 wa we, marginal ll(0) nolog
(note: 100 obs. truncated)

Marginal effects of truncated regression        Number of obs   =        150
Limit:  lower =          0                      Wald chi2(4)    =      10.05
        upper =       +inf                      Prob > chi2     =     0.0395
Log likelihood = -1200.9157

------------------------------------------------------------------------------
        whrs |      dF/dx   Std. Err.      z    P>|z|      X_at   [  95% C.I.  ]
-------------+----------------------------------------------------------------
         kl6 |  -521.9979    208.903    -2.50   0.012   .173333   -931.44  -112.556
        k618 |  -112.3785   57.67883    -1.95   0.051   1.31333  -225.427   .669934
          wa |  -5.734226   9.340322    -0.61   0.539   42.7867  -24.0409   12.5725
          we |    10.7446   30.23005     0.36   0.722     12.64  -48.5052   69.9944
       _cons |   1031.158   593.0821     1.74   0.082         1  -131.262   2193.58
------------------------------------------------------------------------------
The marginal effects around the mean of the independent variables of the entire population may be obtained by using the at() option. Note that the marginal and at() options may be specified when results are redisplayed.

. mat accum junk = kl6 k618 wa we, means(B) noc
(obs=250)
. truncreg, marginal at(B)

Marginal effects of truncated regression        Number of obs   =        150
Limit:  lower =          0                      Wald chi2(4)    =      10.05
        upper =       +inf                      Prob > chi2     =     0.0395
Log likelihood = -1200.9157

------------------------------------------------------------------------------
        whrs |      dF/dx   Std. Err.      z    P>|z|      X_at   [  95% C.I.  ]
-------------+----------------------------------------------------------------
         kl6 |  -201.7607    80.7444    -2.50   0.012      .236  -360.017  -43.5046
        k618 |  -43.43611    22.2938    -1.95   0.051     1.364  -87.1312    .25894
          wa |  -2.216371   3.610186    -0.61   0.539     42.92  -9.29221   4.85946
          we |   4.152964   11.68441     0.36   0.722    12.352  -18.7481    27.054
------------------------------------------------------------------------------
Technical Note

Whether the marginal effect or the coefficient itself is of interest depends on the purpose of the study. If the analysis is confined to the subpopulation, then the marginal effects are of interest. Otherwise, it is the coefficients themselves that are of interest.
Saved Results

truncreg saves in e():

Scalars
    e(N)         number of observations
    e(N_bf)      number of obs. before truncation
    e(llopt)     contents of ll(), if specified
    e(ulopt)     contents of ul(), if specified
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(N_clust)   number of clusters
    e(rc)        return code
    e(chi2)      chi-squared
    e(p)         significance
    e(ic)        number of iterations
    e(rank)      rank of e(V)
    e(sigma)     estimate of sigma

Macros
    e(cmd)       truncreg
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(wtype)     weight type
    e(wexp)      weight expression
    e(clustvar)  name of cluster variable
    e(vcetype)   covariance estimation method
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  Wald or LR; type of model chi-squared test
    e(offset)    offset
    e(predict)   program used to implement predict
    e(cnslist)   constraint numbers

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators
    e(V_dfdx)    variance-covariance matrix of the marginal effects
    e(dfdx)      marginal effects
    e(means)     means of independent variables
    e(at)        points for calculating marginal effects
    e(ilog)      iteration log (up to 20 iterations)

Functions
    e(sample)    marks estimation sample
Methods and Formulas

truncreg is implemented as an ado-file.

Greene (2000, 897-905) and Davidson and MacKinnon (1993, 534-537) provide introductions to the truncated regression model.

Let $y = X\beta + \epsilon$ be the model. $y$ represents continuous outcomes either observed or not observed. Our model assumes that $\epsilon \sim N(0, \sigma^2 I)$.

Let $a$ be the lower limit and $b$ the upper limit. The log likelihood is

$$\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_j (y_j - x_j\beta)^2 - \sum_j \log\Big\{\Phi\Big(\frac{b - x_j\beta}{\sigma}\Big) - \Phi\Big(\frac{a - x_j\beta}{\sigma}\Big)\Big\}$$

The marginal effects at $x_i$ with truncation from below are

$$\frac{\partial E(y_i \mid y_i > a)}{\partial x_i} = \beta\,(1 - \lambda_i^2 + \alpha_i\lambda_i)$$

where $\alpha_i = (a - x_i\beta)/\sigma$ and $\lambda_i = \phi(\alpha_i)/\{1 - \Phi(\alpha_i)\}$.
The marginal effects at $x_i$ with truncation from above are

$$\frac{\partial E(y_i \mid y_i < b)}{\partial x_i} = \beta\,(1 - \lambda_i^2 + \alpha_i\lambda_i)$$

where $\alpha_i = (b - x_i\beta)/\sigma$ and $\lambda_i = -\phi(\alpha_i)/\Phi(\alpha_i)$.

The marginal effects at $x_i$ with truncation from both above and below are

$$\frac{\partial E(y_i \mid a < y_i < b)}{\partial x_i} = \beta\left(1 - \lambda_i^2 + \frac{\alpha_{1i}\,\phi(\alpha_{1i}) - \alpha_{2i}\,\phi(\alpha_{2i})}{\Phi(\alpha_{2i}) - \Phi(\alpha_{1i})}\right)$$

where $\alpha_{1i} = (a - x_i\beta)/\sigma$, $\alpha_{2i} = (b - x_i\beta)/\sigma$, and $\lambda_i = \{\phi(\alpha_{1i}) - \phi(\alpha_{2i})\}/\{\Phi(\alpha_{2i}) - \Phi(\alpha_{1i})\}$.
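As a rough illustration of the arithmetic, the scaling factor for truncation from below can be computed in Stata with the normden() and norm() functions. The numbers below (limit 0, linear prediction, sigma) are hypothetical, chosen only to show the calculation; this is a sketch, not output from the example above.

```stata
* Sketch: marginal-effect scaling for truncation from below at a = 0.
* All values here are illustrative, not from the estimation above.
scalar a     = 0                         // lower truncation limit
scalar xb    = 1200                      // hypothetical linear prediction x_i*b
scalar sigma = 850                       // hypothetical estimate of sigma
scalar alpha = (a - xb)/sigma            // standardized limit
scalar lambda = normden(alpha)/(1 - norm(alpha))
* The marginal effect of a regressor is its coefficient times this factor:
scalar factor = 1 - lambda^2 + alpha*lambda
display "scaling factor = " factor
```

Multiplying each coefficient in e(b) by this factor reproduces the dF/dx column logic for the below-truncated case.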
References

Berndt, E. 1991. The Practice of Econometrics. New York: Addison-Wesley.
Cong, R. 1999. sg122: Truncated regression. Stata Technical Bulletin 52: 47-52. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 248-255.
Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Also See

Complementary:   [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] xi
Related:
[R] tobit
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] maximize
Title

tsreport — Report time-series aspects of dataset or estimation sample
Syntax

tsreport [if exp] [in range] [, report report0 list panel]

tsreport typed without options produces no output, but does provide its standard saved results.
Description

tsreport reports on time gaps in a sample of observations. A one-line statement displaying a count of the gaps is provided by the report option, and a full list of records that follow gaps is provided by the list option. A return value r(N_gaps) is set to the number of gaps in the sample.
Options

report specifies that a count of the number of gaps in the time series be reported, if any gaps exist.

report0 specifies that the count of gaps be reported even if there are no gaps.

list specifies that a tabular list of gaps be displayed.

panel specifies that panel changes are not to be counted as gaps. Whether panel changes are counted as gaps usually depends on how the calling command handles panels.
Remarks

Time-series commands sometimes require that observations be on a fixed time interval with no gaps, or the behavior of the command might be different if the time series contains gaps. tsreport provides a tool for reporting the gaps in a sample.
Example

The following monthly panel data have two panels and a missing month (March) in the second panel.

. list edlevel month income

       edlevel    month   income
    1.       1   1998m1      687
    2.       1   1998m2      783
    3.       1   1998m3      790
    4.       2   1998m1     1435
    5.       2   1998m2     1522
    6.       2   1998m4     1532
Invoking tsreport without the panel option gives us the following report:

. tsreport, report
Number of gaps in sample:  2
    (gap count includes panel changes)
We could get a list of gaps and better see what has been counted as a gap by using the list option:

. tsreport, report list
Number of gaps in sample:  2
    (gap count includes panel changes)

Observations with preceding time gaps
(gaps include panel changes)

  Record   edlevel    month
       4         2   1998m1
       6         2   1998m4
We now see why tsreport is reporting two gaps. It is counting the known gap in March of the second panel, and it is also counting the change from the first to the second panel. (If we are programmers writing a procedure that does not account for panels, a change from one panel to the next represents a break in the time series just as a gap in the data does.) We may prefer that the changes in panels not be counted as gaps. We obtain a count without the panel changes by using the panel option:

. tsreport, report panel
Number of gaps in sample:  1

To obtain a fuller report, we type

. tsreport, report list panel
Number of gaps in sample:  1

Observations with preceding time gaps

  Record   edlevel    month
       6         2   1998m4
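Because the gap count is saved in r(N_gaps), a calling program can test it directly. A minimal sketch of such a guard follows; the error message and the return code 498 are our own choices, not anything mandated by tsreport.

```stata
* Sketch: abort a time-series program if the sample contains gaps.
quietly tsreport, panel            // count gaps, ignoring panel changes
if r(N_gaps) > 0 {
        display as error "time series contains " r(N_gaps) " gap(s)"
        exit 498                   // arbitrary error code for this sketch
}
```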
Saved Results

tsreport saves in r():

Scalars
    r(N_gaps)    number of gaps in sample
Also See

Complementary:   [R] tsset

Background:      [U] 14.4.3 Time-series varlists,
                 [U] 15.5.4 Time-series formats,
                 [U] 27.3 Time-series dates,
                 [U] 29.12 Models with time-series data
Title tsset — Declare dataset to be time-series data
Syntax

tsset [panelvar] timevar [, format(%fmt) { daily | weekly | monthly | quarterly | halfyearly | yearly | generic } ]

tsset

tsset, clear

tsfill [, full]
Description

tsset declares the data a time series and designates that timevar represents time. timevar must take on integer values. If panelvar is also specified, the dataset is declared to be a cross section of time series (e.g., time series of different countries). tsset must be used before time-series operators may be used in expressions and varlists. After tsset, the data will be sorted on timevar or on panelvar timevar.

If you have annual data with the variable year representing time, using tsset is as simple as

. tsset year, yearly

although you could omit the yearly option because it affects only how results are displayed.

tsset without arguments displays how the dataset is currently tsset and re-sorts the data on timevar or on panelvar timevar if it is sorted differently from that.

tsset, clear is a rarely used programmer's command to declare that the data are no longer a time series.

tsfill is used after tsset to fill in missing times with missing observations. For instance, perhaps observations for timevar = 1, 3, 5, 6, ..., 22 exist. tsfill would create observations for timevar = 2 and timevar = 4 containing all missing values. There is seldom reason to do this because Stata's time-series operators work on the basis of timevar and not observation number. Referring to L.gnp to obtain lagged gnp values would correctly produce a missing value for timevar = 3 even if the data were not filled in. Referring to L2.gnp would correctly return the value of gnp in the first observation for timevar = 3 even if the data were not filled in.
Options

format(%fmt), daily, weekly, monthly, quarterly, halfyearly, yearly, and generic deal with how timevar will be subsequently displayed, that is, which %t format, if any, will be placed on timevar. Whether timevar is formatted is optional; all tsset requires is that timevar take on integer values.
m() is the function that returns the month equivalent; m(1995m6) evaluates to the constant 425, meaning 425 months after January 1960. We now have variable newt containing

. list newt income

        newt   income
  1.     426     1153
  2.     427     1181
  3.     428     1208
  (output omitted)
  9.     434     1282

If we put a %t format on newt, it will list more prettily:

. format newt %tm
. list newt income

          newt   income
  1.    1995m7     1153
  2.    1995m8     1181
  3.    1995m9     1208
  (output omitted)
  9.    1996m3     1282

We could now tsset newt rather than t:

. tsset newt
        time variable:  newt, 1995m7 to 1996m3
Example

Perhaps we have the same time-series data but no time variable:

. list income

       income
  1.     1153
  2.     1181
  3.     1208
  4.     1272
  5.     1236
  6.     1297
  7.     1265
  8.     1230
  9.     1282

We know that the first observation corresponds to July 1995 and continues without gaps. We can create a monthly time variable and format it by typing

. gen t = m(1995m7) + _n - 1
. format t %tm

We can now tsset our data:

. tsset t
        time variable:  t, 1995m7 to 1996m3
. list t income

             t   income
  1.    1995m7     1153
  2.    1995m8     1181
  3.    1995m9     1208
  4.   1995m10     1272
  (output omitted)
  9.    1996m3     1282
Technical Note

Your data do not have to be monthly. Stata understands daily, weekly, monthly, quarterly, halfyearly, and yearly data. Correspondingly, there are the d(), w(), m(), q(), h(), and y() functions, and there are the %td, %tw, %tm, %tq, %th, and %ty formats. Here is what we would have typed in the above examples had our data been on a different time scale:

Daily:       pretend your t variable had t=1 corresponding to 15mar1993:
             . gen newt = d(15mar1993) + t - 1
             . format newt %td
             . tsset newt

Weekly:      pretend your t variable had t=1 corresponding to 1994w1:
             . gen newt = w(1994w1) + t - 1
             . format newt %tw
             . tsset newt

Monthly:     pretend your t variable had t=1 corresponding to 2004m7:
             . gen newt = m(2004m7) + t - 1
             . format newt %tm
             . tsset newt

Quarterly:   pretend your t variable had t=1 corresponding to 1994q1:
             . gen newt = q(1994q1) + t - 1
             . format newt %tq
             . tsset newt

Halfyearly:  pretend your t variable had t=1 corresponding to 1921h2:
             . gen newt = h(1921h2) + t - 1
             . format newt %th
             . tsset newt

Yearly:      pretend your t variable had t=1 corresponding to 1842:
             . gen newt = y(1842) + t - 1
             . format newt %ty
             . tsset newt

In each of the above examples, we subtracted one from our time variable in constructing the new time variable newt because we assumed that our starting time value was 1. For the quarterly example, if our starting time value had been 5 and that corresponded to 1994q1, we would have typed

. gen newt = q(1994q1) + t - 5

Had our initial time value been t = 742 and this corresponded to 1994q1, we would have typed

. gen newt = q(1994q1) + t - 742

The %td, %tw, %tm, %tq, %th, and %ty formats can display the date in the form you want; when you type, for instance, %td, you are specifying the default daily format, which produces dates of the form 15apr2002. If you wanted that to be displayed as "April 15, 2002", %td can do that; see [U] 27.3 Time-series dates. Similarly, all the other %t formats can be modified to produce the results you want.
Example

Your data might include a time variable that is encoded into a string. Below, each monthly observation is identified by string variable yrmo containing the year and month of the observation, sometimes with punctuation in between:

. list yrmo income

           yrmo   income
  1.     1995 7     1153
  2.     1995 8     1181
  3.     1995,9     1208
  4.    1995 10     1272
  5.    1995/11     1236
  6.    1995,12     1297
  7.     1996-1     1265
  8.     1996.2     1230
  9.   1996 Mar     1282

The first step is to convert the string to a numeric representation. That is easy using the monthly() function; see [U] 27.3 Time-series dates.

. gen mdate = monthly(yrmo, "ym")
. list yrmo mdate income

           yrmo   mdate   income
  1.     1995 7     426     1153
  2.     1995 8     427     1181
  3.     1995,9     428     1208
  (output omitted)
  9.   1996 Mar     434     1282
Our new variable, mdate, contains the number of months from January 1960. Now having numeric variable mdate, we can tsset the data:

. format mdate %tm
. tsset mdate
        time variable:  mdate, 1995m7 to 1996m3

In fact, we can combine the two and type

. tsset mdate, format(%tm)
        time variable:  mdate, 1995m7 to 1996m3

or type

. tsset mdate, monthly
        time variable:  mdate, 1995m7 to 1996m3
We do not have to bother to format the time variable at all, but formatting makes it display more prettily:

. list yrmo mdate income

           yrmo     mdate   income
  1.     1995 7    1995m7     1153
  2.     1995 8    1995m8     1181
  3.     1995,9    1995m9     1208
  4.    1995 10   1995m10     1272
  5.    1995/11   1995m11     1236
  6.    1995,12   1995m12     1297
  7.     1996-1    1996m1     1265
  8.     1996.2    1996m2     1230
  9.   1996 Mar    1996m3     1282
Technical Note

In addition to the monthly() function for translating strings to monthly dates, Stata has daily(), weekly(), quarterly(), halfyearly(), and yearly(). Stata also has the yw(), ym(), yq(), and yh() functions to convert from two numeric time variables to a Stata time variable. For example, gen qdate = yq(year,qtr) takes the variable year containing year values and the variable qtr containing quarter values (1-4), and produces the variable qdate containing the number of quarters since 1960q1. See [U] 27.3 Time-series dates.
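Putting the pieces of that technical note together, quarterly data recorded as separate year and quarter variables could be declared as a time series like this; the variable names year and qtr are the ones used in the note above.

```stata
* Sketch: build a quarterly time variable from year and qtr (1-4).
gen qdate = yq(year, qtr)     // quarters since 1960q1
format qdate %tq              // display as, e.g., 1994q1
tsset qdate
```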
Example

Gaps in the time series cause no difficulties:

. list yrmo income

           yrmo   income
  1.     1995 7     1153
  2.     1995 8     1181
  3.    1995/11     1236
  4.    1995,12     1297
  5.     1996-1     1265
  6.   1996 Mar     1282

. gen mdate = monthly(yrmo,"ym")
. tsset mdate, monthly
        time variable:  mdate, 1995m7 to 1996m3, but with gaps

Once the dataset has been tsset, we can use the time-series operators. The D operator specifies first (or higher order) differences:

. list mdate income d.income

         mdate   income   D.income
  1.    1995m7     1153          .
  2.    1995m8     1181         28
  3.   1995m11     1236          .
  4.   1995m12     1297         61
  5.    1996m1     1265        -32
  6.    1996m3     1282          .
You can use the operators in an expression or varlist context; you do not have to create a new variable to hold D.income. You can use D.income with the list command, or with the regress command, or with any other Stata command that allows time-series varlists.
Example

We stated above that gaps were no problem, and that is true as far as operators are concerned. You might, however, need to fill in the gaps for some analysis, say by interpolation. This is easy to do with tsfill and ipolate. tsfill will create the missing observations, and then ipolate (see [R] ipolate) will fill them in. Staying with the example above, we can fill in the time series by typing

. tsfill
. list mdate income

         mdate   income
  1.    1995m7     1153
  2.    1995m8     1181
  3.    1995m9        .     <- new
  4.   1995m10        .     <- new
  5.   1995m11     1236
  6.   1995m12     1297
  7.    1996m1     1265
  8.    1996m2        .     <- new
  9.    1996m3     1282

We listed the data after tsfill just to show you the role tsfill plays in this. tsfill created the observations. We can now use ipolate to fill them in:

. ipolate income mdate, gen(ipinc)
. list mdate income ipinc

         mdate   income      ipinc
  1.    1995m7     1153       1153
  2.    1995m8     1181       1181
  3.    1995m9        .   1199.333
  4.   1995m10        .   1217.667
  5.   1995m11     1236       1236
  6.   1995m12     1297       1297
  7.    1996m1     1265       1265
  8.    1996m2        .     1273.5
  9.    1996m3     1282       1282
Panel data

Example

Now let us assume that we have time series on annual income and that we have the series for two groups: individuals who have not completed high school (edlevel = 1) and individuals who have (edlevel = 2).

. list edlevel year income

       edlevel   year   income
    1.       1   1988    14500
    2.       1   1989    14750
    3.       1   1990    14950
    4.       1   1991    15100
    5.       2   1989    22100
    6.       2   1990    22200
    7.       2   1992    22800

We declare the data to be a panel by typing

. tsset edlevel year, yearly
       panel variable:  edlevel, 1 to 2
        time variable:  year, 1988 to 1992, but with a gap

Having tsset the data, we can now use time-series operators. The difference operator, for example, can be used to list annual changes in income:

. list edlevel year income d.income

       edlevel   year   income   D.income
    1.       1   1988    14500          .
    2.       1   1989    14750        250
    3.       1   1990    14950        200
    4.       1   1991    15100        150
    5.       2   1989    22100          .
    6.       2   1990    22200        100
    7.       2   1992    22800          .
We see that in addition to producing missing values due to missing times, the difference operator correctly produced a missing value at the start of each panel. Once we have tsset our panel data, we can use time-series operators and be assured that they will handle missing time periods and panel changes correctly.
Example

As with nonpanel time series, we can use tsfill to fill in gaps in a panel time series. Continuing with our example data,

. tsfill
. list edlevel year income

       edlevel   year   income
    1.       1   1988    14500
    2.       1   1989    14750
    3.       1   1990    14950
    4.       1   1991    15100
    5.       2   1989    22100
    6.       2   1990    22200
    7.       2   1991        .     <- new
    8.       2   1992    22800

We could instead ask tsfill to produce fully balanced panels using the full option:

. tsfill, full
. list edlevel year income

       edlevel   year   income
    1.       1   1988    14500
    2.       1   1989    14750
    3.       1   1990    14950
    4.       1   1991    15100
    5.       1   1992        .     <- new
    6.       2   1988        .     <- new
    7.       2   1989    22100
    8.       2   1990    22200
    9.       2   1991        .     <- new
   10.       2   1992    22800
Saved Results

tsset saves in r():

Scalars
    r(tmin)      minimum elapsed time
    r(tmax)      maximum elapsed time
    r(imin)      minimum panel id
    r(imax)      maximum panel id

Macros
    r(timevar)   elapsed time variable
    r(panvar)    panel variable
    r(tmins)     formatted minimum elapsed time
    r(tmaxs)     formatted maximum elapsed time
    r(tsfmt)     format for the current time variable
    r(unit)      daily, weekly, monthly, quarterly, halfyearly, yearly, or generic
    r(unit1)     d, w, m, q, h, y, or ""
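Since tsset without arguments redisplays and re-saves these settings, a program can query the declared time range without changing anything. A small sketch (the display string is our own; we assume here that the quiet redisplay refreshes r()):

```stata
* Sketch: query the declared time range after the data have been tsset.
quietly tsset
display "sample runs from " r(tmins) " to " r(tmaxs)
```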
Methods and Formulas tsset is implemented as an ado-file.
References Baum, C. F. 2000. sts!7: Compacting time series data. Stata Technical Bulletin 57: 44-45.
Also See

Background:      [U] 14.4.3 Time-series varlists,
                 [U] 15.5.4 Time-series formats,
                 [U] 27.3 Time-series dates,
                 [U] 29.12 Models with time-series data
Title ttest — Mean comparison tests
Syntax

ttest varname = # [if exp] [in range] [, level(#)]

ttest varname1 = varname2 [if exp] [in range] [, unequal unpaired welch level(#)]

ttest varname [if exp] [in range] , by(groupvar) [unequal welch level(#)]

ttesti #obs #mean #sd #val [, level(#)]

ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, unequal welch level(#)]
Description

ttest performs t tests on the equality of means. In the first form, ttest tests that varname has a mean of #. In the second form, ttest tests that varname1 and varname2 have the same mean. Data are assumed to be paired, but unpaired changes this assumption. In the third form, ttest tests that varname has the same mean within the two groups defined by groupvar.

ttesti is the immediate form of ttest; see [U] 22 Immediate commands.

For the equivalent of a two-sample t test with sampling weights (pweights), use the svymean command with the by() option and then use svylc; see [R] svymean and [R] svylc.
Options

unequal indicates that the unpaired data are not to be assumed to have equal variances.

unpaired indicates that the data are to be treated as unpaired.

welch indicates that the approximate degrees of freedom for the test should be obtained from Welch's formula rather than from Satterthwaite's approximation formula (1946), which is the default when unequal is specified. This option is not appropriate unless unequal is also specified.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

by(groupvar) specifies the groupvar that defines the two groups that ttest is to use to test the hypothesis that the means of the two groups are equal. Do not confuse the by() option with the by ... : prefix; both may be specified.
Remarks

Example

In the first form, ttest tests whether the mean of the sample is equal to a known constant under the assumption of unknown variance. Assume you have a sample of 74 automobiles. You know each automobile's average mileage rating and wish to test whether the overall average for the sample is 20 miles per gallon.

. ttest mpg=20

One-sample t test

Variable |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
     mpg |     74     21.2973    .6725511    5.785503      19.9569    22.63769

Degrees of freedom: 73

                          Ho: mean(mpg) = 20

     Ha: mean < 20           Ha: mean ~= 20            Ha: mean > 20
       t =  1.9289             t =  1.9289               t =  1.9289
   P < t =  0.9712         P > |t| =  0.0576         P > t =  0.0288
The test indicates that the underlying mean is not 20 with a significance level of 5.8%.
Example

You are testing the effectiveness of a new fuel additive. You run an experiment with 12 cars. You run the cars without and with the fuel treatment. The results of the experiment are

      Without      With           Without      With
     Treatment   Treatment       Treatment   Treatment
        20          24              18          17
        23          25              24          28
        21          21              20          24
        25          22              24          27
        18          23              23          21
        17          18              19          23

By creating two variables called mpg1 and mpg2 representing mileage without and with the treatment, respectively, we can test the equality of means by typing

. ttest mpg1=mpg2

Paired t test

Variable |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
    mpg1 |     12          21    .7881701    2.730301     19.26525    22.73475
    mpg2 |     12       22.75    .9384465    3.250874     20.68449    24.81551
---------+-------------------------------------------------------------------
    diff |     12       -1.75    .7797144     2.70101     -3.46614   -.0338602

              Ho: mean(mpg1 - mpg2) = mean(diff) = 0

 Ha: mean(diff) < 0     Ha: mean(diff) ~= 0      Ha: mean(diff) > 0
    t = -2.2444             t = -2.2444              t = -2.2444
P < t =  0.0232         P > |t| =  0.0463        P > t =  0.9768
You find that the means are statistically different from each other at any level greater than 4.6%.
Example

Let's pretend that the preceding data were collected not by running 12 cars but 24 cars: 12 cars with the additive and 12 without. Although you might be tempted to enter the data in the same way, you should resist (see the technical note below). Instead, you enter the data as 24 observations on mpg with an additional variable, treated, taking on 1 if the car received the fuel treatment and 0 otherwise:

. ttest mpg, by(treated)

Two-sample t test with equal variances

   Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
       0 |     12          21    .7881701    2.730301     19.26525    22.73475
       1 |     12       22.75    .9384465    3.250874     20.68449    24.81551
---------+-------------------------------------------------------------------
combined |     24      21.875    .6264476    3.068954     20.57909    23.17091
---------+-------------------------------------------------------------------
    diff |              -1.75    1.225518                -4.291568    .7915684

Degrees of freedom: 22

                 Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0            Ha: diff ~= 0             Ha: diff > 0
       t = -1.4280             t = -1.4280               t = -1.4280
   P < t =  0.0837         P > |t| =  0.1673         P > t =  0.9163
Ha: diff > 0 t = -1.4280 P > t = 0.9163
This time you do not find a statistically significant difference. If you were not willing to assume that the variances were equal and wanted to use Welch's formula, you could type . ttest mpg, by(treated) "anequal welch Two-sample t test with unequal variances Group
Obs
Mean
Std. Er|r.
Std. Dev.
[957. Conf . Interval]
0 1
12 12
21 22.75
.788170JL . 938446fe
2.730301 3 . 250874
19.26525 20 . 68449
22.73475 24.81551
combined
24
21.875
.626447^
3.068954
20 . 57909
23.17091
-1.75
1.22551&
-4.28369
.7836902
diff
Welch's degrees of freedom: 23.2465 : Ho: mean(O) - mean(p = diff = 0 Ha: diff < 0 Ha: diff f= 0 t = -1.4280 t = -4.4280 P < t = 0.0833 P > |t| = t>.1666
Ha: diff > 0 t = -1.4280 P > t = 0.9167
Technical Note

In two-group randomized designs, subjects will sometimes refuse the assigned treatment but still be measured for an outcome. In this case, care must be taken to specify the group properly. One might be tempted to let varname contain missing where the subject refused and thus let ttest drop such observations from the analysis. Zelen (1979) argues that it would be better to specify that the subject belongs to the group in which he or she was randomized, even though such inclusion will dilute the measured effect.
Technical Note

There is a second, inferior way the data could have been organized in the preceding example. Remember, we ran a test on 24 cars, 12 without the additive and 12 with. Nevertheless, we could have entered the data in the same way as we did when we had 12 cars, each run without and with the additive, by creating two variables, mpg1 and mpg2. This is inferior because it suggests a connection that is not there. In the case of the 12-car experiment, there was most certainly a connection: it was the same car. In the 24-car experiment, however, it is arbitrary which mpg results appear next to which. Nevertheless, if your data are organized like this, ttest can accommodate you.

. ttest mpg1=mpg2, unpaired

Two-sample t test with equal variances

Variable |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
    mpg1 |     12          21    .7881701    2.730301     19.26525    22.73475
    mpg2 |     12       22.75    .9384465    3.250874     20.68449    24.81551
---------+-------------------------------------------------------------------
combined |     24      21.875    .6264476    3.068954     20.57909    23.17091
---------+-------------------------------------------------------------------
    diff |              -1.75    1.225518                -4.291568    .7915684

Degrees of freedom: 22

            Ho: mean(mpg1) - mean(mpg2) = diff = 0

     Ha: diff < 0            Ha: diff ~= 0             Ha: diff > 0
       t = -1.4280             t = -1.4280               t = -1.4280
   P < t =  0.0837         P > |t| =  0.1673         P > t =  0.9163
Example

ttest can be used to test the equality of a pair of means; see [R] oneway for testing the equality of more than two means.

Suppose you have data on the 50 states. The dataset contains the median age of the population (medage) and the region of the country (region) for each state. Region 1 refers to the Northeast, region 2 to the North Central, region 3 to the South, and region 4 to the West. Using oneway, you can test the equality of all four means:

. oneway medage region

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      46.3961903      3    15.4653968     7.56     0.0003
 Within groups      94.1237947     46    2.04616945
------------------------------------------------------------------------
    Total           140.519985     49     2.8677548

Bartlett's test for equal variances:  chi2(3) =  10.5757  Prob>chi2 = 0.014
You find that the means are different. You, however, are only interested in testing whether the means for the Northeast (region==1) and West (region==4) are different. You could use oneway:

. oneway medage region if region==1 | region==4

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      46.241247       1    46.241247     20.02     0.0002
 Within groups      46.1969169     20    2.30984584
------------------------------------------------------------------------
    Total           92.4381638     21    4.40181733

Bartlett's test for equal variances:  chi2(1) =   2.4679  Prob>chi2 = 0.116
Or you could use ttest:

. ttest medage if region==1 | region==4, by(region)

Two-sample t test with equal variances

   Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
      NE |      9    31.23333    .3411581    1.023474     30.44662    32.02005
    West |     13    28.28462    .4923577    1.775221     27.21186    29.35737
---------+-------------------------------------------------------------------
combined |     22    29.49091    .4473059    2.098051     28.56069    30.42113
---------+-------------------------------------------------------------------
    diff |           2.948718    .6590372                  1.57399    4.323445

Degrees of freedom: 20

               Ho: mean(NE) - mean(West) = diff = 0

     Ha: diff < 0            Ha: diff ~= 0             Ha: diff > 0
       t =  4.4743             t =  4.4743               t =  4.4743
   P < t =  0.9999         P > |t| =  0.0002         P > t =  0.0001
Note that the significance levels of both tests are the same.
Immediate form

Example

ttesti is like ttest, except that you specify summary statistics rather than variables as arguments. For instance, you are reading an article that reports the mean number of sunspots per month as 62.6 with a standard deviation of 15.8. There are 24 months of data. You wish to test whether the mean is 75:

. ttesti 24 62.6 15.8 75

One-sample t test

         |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
       x |     24        62.6    3.225161        15.8     55.92825    69.27175

Degrees of freedom: 23

                          Ho: mean(x) = 75

     Ha: mean < 75           Ha: mean ~= 75            Ha: mean > 75
       t = -3.8448             t = -3.8448               t = -3.8448
   P < t =  0.0004         P > |t| =  0.0008         P > t =  0.9996
Example

There is no immediate form of ttest with paired data, since the test is also a function of the covariance, a number unlikely to be reported in any published source. For nonpaired data, however:

. ttesti 20 20 5 32 15 4

Two-sample t test with equal variances

         |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
       x |     20          20    1.118034           5     17.65993    22.34007
       y |     32          15    .7071068           4     13.55786    16.44215
---------+-------------------------------------------------------------------
combined |     52    16.92308    .6943785    5.007235     15.52905      18.3171
---------+-------------------------------------------------------------------
    diff |                  5    1.256135                 2.476979    7.523021

Degrees of freedom: 50

                 Ho: mean(x) - mean(y) = diff = 0

     Ha: diff < 0            Ha: diff ~= 0             Ha: diff > 0
       t =  3.9805             t =  3.9805               t =  3.9805
   P < t =  0.9999         P > |t| =  0.0002         P > t =  0.0001
Had we typed ttesti 20 20 5 32 15 4, unequal, the test would have been under the assumption of unequal variances.
Saved Results

ttest and ttesti save in r():

Scalars
    r(N_1)     sample size n_1
    r(N_2)     sample size n_2
    r(p_l)     lower one-sided p-value
    r(p_u)     upper one-sided p-value
    r(p)       two-sided p-value
    r(se)      estimate of standard error
    r(t)       t statistic
    r(sd_1)    standard deviation for first variable
    r(sd_2)    standard deviation for second variable
    r(mu_1)    mean for population 1
    r(mu_2)    mean for population 2
    r(df_t)    degrees of freedom
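These saved results let you act on a test programmatically. A small sketch, following the mpg example earlier in this entry (the 0.05 cutoff is our own choice):

```stata
* Sketch: use the t statistic and two-sided p-value saved by ttest.
quietly ttest mpg = 20
display "t = " r(t) "   two-sided p = " r(p)
if r(p) < 0.05 {
        display "reject Ho: mean(mpg) = 20 at the 5% level"
}
```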
Methods and Formulas

ttest and ttesti are implemented as ado-files.

See, for instance, Hoel (1984, 140-161) or Dixon and Massey (1983, 121-130) for an introduction and explanation of the calculation of these tests.

The test for $\mu = \mu_0$ for unknown $\sigma$ is given by

$$t = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s}$$

The statistic is distributed as Student's t with $n-1$ degrees of freedom (Gosset 1908).
The test for $\mu_x = \mu_y$ when $\sigma_x$ and $\sigma_y$ are unknown but $\sigma_x = \sigma_y$ is given by

$$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}} \qquad\text{where}\qquad s_p = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}$$

The result is distributed as Student's t with $n_x + n_y - 2$ degrees of freedom.

One could perform ttest (without the unequal option) in a regression setting, given that regression assumes a homoskedastic error model. In order to compare with the ttest command, denote the underlying observations on x and y by $x_j$, $j = 1,\ldots,n_x$, and $y_j$, $j = 1,\ldots,n_y$. In a regression framework, ttest without the unequal option is equivalent to creating a new variable $z_j$ that represents the stacked observations on x and y (so that $z_j = x_j$ for $j = 1,\ldots,n_x$ and $z_{n_x+j} = y_j$ for $j = 1,\ldots,n_y$) and then estimating the equation $z_j = \beta_0 + \beta_1 d_j + \epsilon_j$, where $d_j = 0$ for $j = 1,\ldots,n_x$ and $d_j = 1$ for $j = n_x+1,\ldots,n_x+n_y$ (i.e., $d_j = 0$ when the z observations represent x, and $d_j = 1$ when the z observations represent y). The estimated value of $\beta_1$, $b_1$, will equal $\bar{y} - \bar{x}$, and the reported t statistic will be the same t statistic as given by the formula above.

The test for $\mu_x = \mu_y$ when $\sigma_x$ and $\sigma_y$ are unknown and unequal is given by

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n_x + s_y^2/n_y}}$$

The result is distributed as Student's t with $\nu$ degrees of freedom, where $\nu$ is given by (using Satterthwaite's formula)

$$\nu = \frac{\left(s_x^2/n_x + s_y^2/n_y\right)^2}{\dfrac{\left(s_x^2/n_x\right)^2}{n_x - 1} + \dfrac{\left(s_y^2/n_y\right)^2}{n_y - 1}}$$

Using Welch's formula (1947), the number of degrees of freedom is given by

$$\nu = -2 + \frac{\left(s_x^2/n_x + s_y^2/n_y\right)^2}{\dfrac{\left(s_x^2/n_x\right)^2}{n_x + 1} + \dfrac{\left(s_y^2/n_y\right)^2}{n_y + 1}}$$

The test for $\mu_x = \mu_y$ for matched observations (also known as paired observations, correlated pairs, or permanent components) is given by

$$t = \frac{\bar{d}\sqrt{n}}{s_d}$$

where $\bar{d}$ represents the mean of $x_i - y_i$ and $s_d$ represents the standard deviation. The test statistic t is distributed as Student's t with $n - 1$ degrees of freedom.

Note that ttest without the unpaired option may also be performed in a regression setting, since a paired comparison includes the assumption of constant variance: $(x_i - y_i) = \beta_0 + \epsilon_i$. The ttest with an unequal variance assumption does not lend itself to an easy representation in regression settings and is not discussed here.
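The one-sample formula above can be verified by hand from summarize's saved results; the sketch below rebuilds $t = (\bar{x} - \mu_0)\sqrt{n}/s$ for the earlier mpg example and should agree with the t statistic reported by ttest mpg=20.

```stata
* Sketch: compute the one-sample t statistic directly from the formula.
quietly summarize mpg
scalar tstat = (r(mean) - 20) * sqrt(r(N)) / sqrt(r(Var))
display "hand-computed t = " tstat      // compare with ttest mpg=20
```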
References

Dixon, W. J. and F. J. Massey, Jr. 1983. Introduction to Statistical Analysis. 4th ed. New York: McGraw-Hill.
Gleason, J. R. 1999. sg101: Pairwise comparisons of means, including the Tukey wsd method. Stata Technical Bulletin 47: 31-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 225-233.
Gosset, W. S. [Student, pseud.] 1908. The probable error of a mean. Biometrika 6: 1-25.
Hoel, P. G. 1984. Introduction to Mathematical Statistics. 5th ed. New York: John Wiley & Sons.
Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110-114.
Welch, B. L. 1947. The generalization of Student's problem when several different population variances are involved. Biometrika 34: 28-35.
Zelen, M. 1979. A new design for randomized clinical trials. New England Journal of Medicine 300: 1242-1245.
Also See Related:
[R] bitest, [R] ci, [R] hotel, [R] oneway, [R] sdtest, [R] signrank, [R] svylc, [R] svymean
Background:
[U] 22 Immediate commands
Title tutorials — Quick reference for Stata tutorials
Syntax tutorial name
Description tutorial intro presents an introductory tutorial to Stata. tutorial contents lists the available official Stata tutorials. tutorial followed by the name of a tutorial file presents that particular tutorial.
Remarks
Stata tutorials are introduced in [U] 9 Stata's on-line tutorials and sample datasets, and a list of the available official tutorials (as of the time of manual printing) is provided there. A current listing of official tutorials can be obtained with the tutorial contents command.

The tutorial command can also be used to execute tutorials written by others. For instance, Verbeek and Weesie (1998) provide a tutorial demonstrating Gaussian and Cauchy random walks. Since this tutorial was presented in the Stata Technical Bulletin, you can easily obtain it; see [R] net.

Tutorial files end in the suffix .tut and must be placed in the Stata directory. You can determine the Stata directory by typing the command sysdir; see [P] sysdir. The Stata directory is the first one listed. Stata's official tutorials are already in this directory. You only need to be concerned with this detail if you install additional tutorials.
Acknowledgments

We thank Jeroen Weesie, Utrecht University, Netherlands, for making public the random walk tutorial of the late Albert Verbeek. Albert Verbeek contributed a great deal to the early development of Stata, for which he will always be remembered.
References

Newton, H. J. and J. L. Harvill. 1997. StatConcepts: A Visual Tour of Statistical Ideas. Pacific Grove, CA: Duxbury Press.
Verbeek, A. and J. Weesie. 1998. tt7: Random walk tutorial. Stata Technical Bulletin 41: 46. Reprinted in Stata Technical Bulletin Reprints, vol. 7, p. 301.
Also See
Background:
[U] 9 Stata's on-line tutorials and sample datasets
Title type — Display contents of files

Syntax
type "filename" [, asis showtabs ]
Note: double quotes must be used to enclose filename if the name contains blanks.

Description
type lists the contents of a file stored on disk. This command is similar to the DOS TYPE and Unix more or pg(1) commands. In Stata for Unix, cat is a synonym for type.

Options
asis specifies that the file should be shown exactly as it is. The default is to display files with suffixes .smcl (it stands for SMCL) in rendered form, meaning the SMCL directives are interpreted and properly rendered. Thus, type can be used to look at files created by the log using command.
showtabs requests that any tabs be displayed; tabs are shown as <T> rather than being expanded.
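For instance, assuming a SMCL log named mylog.smcl exists in the current directory (the filename is hypothetical), the options can be contrasted as follows:

```stata
. type mylog.smcl            /* SMCL directives are interpreted and rendered */
. type mylog.smcl, asis      /* the raw SMCL directives are shown exactly as stored */
. type mylog.smcl, showtabs  /* any tab characters are displayed as <T> */
```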
Remarks

Example
You have raw data containing the level of Lake Victoria Nyanza and the number of sunspots during the years 1902-1921 stored in a file called sunspots.raw. You want to read this dataset into Stata using infile, but you cannot remember the order in which you entered the variables. You can find out by typing the dataset:

. type sunspots.raw
1902 -10   5  1903  13  24  1904  18  42
1905  15  63  1906  29  54  1907  21  62
1908  10  49  1909   8  44  1910   1  19
1911  -7   6  1912 -11   4  1913  -3   1
1914  -2  10  1915   4  47  1916  15  57
1917  35 104  1918  27  81  1919   8  64
1920   3  38  1921  -5  25

Looking at this output, you now remember that the variables are entered year, level, and number of sunspots. You can read this dataset by typing infile year level spots using sunspots.
If you had wanted to see the tabs in sunspots.raw, you could have typed

. type sunspots.raw, showtabs
1902 -10 5<T>1903 13 24<T>1904 18 42
1905 15 63<T>1906 29 54<T>1907 21 62
1908 10 49<T>1909 8 44<T>1910 1 19
1911 -7 6<T>1912 -11 4<T>1913 -3 1
1914 -2 10<T>1915 4 47<T>1916 15 57
1917 35 104<T>1918 27 81<T>1919 8 64
1920 3 38<T>1921 -5 25
Example
In a previous Stata session, you typed log using myres, and so created myres.smcl containing your results. You can use type to list the log:

. type myres.smcl
       log:  /work/peb/dof/myres.smcl
  log type:  smcl
 opened on:  20 Sep 2000, 15:37:48

. use lbw
(Hosmer & Lemeshow data)

. xi: logistic low age lwt i.race smoke ptl ht ui
i.race            _Irace_1-3      (naturally coded; _Irace_1 omitted)

Logit estimates                            Number of obs   =        189
                                           LR chi2(8)      =      33.22
                                           Prob > chi2     =     0.0001
Log likelihood = -100.724                  Pseudo R2       =     0.1416
 (output omitted)

. lfit, group(10) table

Logistic model for low, goodness-of-fit test
  (Table collapsed on quantiles of estimated probabilities)
 (output omitted)

. log close
       log:  /work/peb/dof/myres.smcl
  log type:  smcl
 closed on:  20 Sep 2000, 16:38:30
You could also use view to look at the log; see [R] view.
Also See Complementary:
[R] translate
Related:
[R] cd, [R] copy, [R] dir, [R] erase, [R] mkdir, [R] shell, [R] translate, [R] view
Background:
[U] 14.6 File-naming conventions
Title update — Update Stata

Syntax
update
update from location
update query [, from(location) ]
update ado [, from(location) into(dirname) ]
update executable [, from(location) into(dirname) force ]
update all [, from(location) ]
Description
The update command reports on the current update level and installs official updates to Stata. Official updates are updates to Stata as it was originally shipped from StataCorp, not the additions to Stata published in, for instance, the Stata Technical Bulletin (STB). Those additions are installed using the net command; see [R] net.
update without arguments reports on the update level of the currently installed Stata.
update from sets an update source; location is a directory name or URL. If you are on the Internet, type update from http://www.stata.com. Updates are also available on STB media obtained from Stata Corporation; installation instructions are included with the media. (There is a charge for the STB media.)
update query compares the update level of the currently installed Stata with that available from the update source and displays a report.
update ado compares the update level of the official ado-files of the currently installed Stata with those available from the update source. If the currently installed ado-files need updating, update ado copies and installs those files from the update source that are necessary to bring the ado-files up to date.
update executable compares the update level of the currently installed Stata executable with that available from the update source. If the currently installed Stata needs updating, update executable copies the new executable from the update source, but the last step of the installation (erasing the old executable and renaming the new executable) is left for the user to perform; update executable displays instructions on how to do this.
update all does the same as update ado followed by update executable.
Options
from(location) specifies the location of the update source. The from() option may be specified on the individual update commands, or it may be set by the update from command; which you do makes no difference.
into(dirname) specifies the name of the directory into which the updates are to be copied. dirname may be specified as a directory name or as a sysdir codeword such as UPDATES or STATA; see [P] sysdir.
In the case of update ado, the default is into(UPDATES), the official update directory. Network computer managers might want to specify into() if they want to copy down the updates but leave the last step (copying the files into the official directory) to do themselves.
In the case of update executable, the default is into(STATA), the official Stata directory. Network computer managers might want to specify into() so that they can copy the update into a more accessible directory. In that case, the last step of copying the new executable over the existing executable would be left for them to perform.
force is used with update executable to force downloading a new executable even if, based on the date comparison, Stata does not think it necessary. There is seldom a reason to specify this option. There is no such option for update ado because, if one wanted to force the reinstallation of all ado-file updates, one need only erase the UPDATES directory. You can type sysdir list to see where the UPDATES directory is on your computer; see [P] sysdir.
Remarks
update is used to update the two official components of Stata, its binary executable and its ado-files, from either of two official sources: http://www.stata.com or official STB media.
Jumping ahead of the story, the easiest thing to do if you are connected to the Internet is type

. update all

and follow the instructions. If you are up to date, update all will do nothing. Otherwise, it will download whatever is necessary and display detailed instructions on what, if anything, needs to be done next.
If you first want to know what update all would do, type

. update query

update query will present a report comparing what you have installed with what is available and will recommend that you do nothing, or that you type update ado, or that you type update executable, or that you type update all.
If you want just a report on what you have installed, without comparing it with what is available, type

. update

update will show you what you have installed and where it is, and recommend that you type update query to compare that with what is available.
Before doing any of this, you can type

. update from http://www.stata.com

but that is not really necessary because http://www.stata.com is the default location. Updates are also available on official STB media, which can be obtained from Stata Corporation. The installation instructions are included with the STB media.
In addition to the update command, users may pull down Help and select Official Updates. The menu item does the same thing as the command, but it does not provide the file redirection option into(), which managers of networked computers may wish to use so that they can download the updates and then copy the files to the official locations themselves.
For examples of using update, see

Windows     from http://www.stata.com    [GSW] 20 Using the Internet and [U] 32 Using the Internet to keep up to date
Macintosh   from http://www.stata.com    [GSM] 20 Using the Internet and [U] 32 Using the Internet to keep up to date
Unix        from http://www.stata.com    [GSU] 20 Using the Internet and [U] 32 Using the Internet to keep up to date
Notes for multi-user system administrators
There are two types of updates that update downloads: ado-file updates and the binary executable update. Typically, there are only ado-file updates, but sometimes there are both, and even more occasionally, there is only a binary update.
By default, update handles installation of the ado-file updates. There can be lots of small files associated with an ado-file update, so update is very careful about how it does this. First, it downloads all the files you need to a temporary place; then it closes the connection to http://www.stata.com; then it checks the files to make sure they are complete; and only after all that does update copy them to the official UPDATES directory; see [P] sysdir. This is all designed so that, should anything go wrong at any step along the way, no damage is done.
Updated binary executables, on the other hand, are just copied down, and then it is left to the user to (1) exit Stata, (2) rename the current executable, (3) rename the updated executable, (4) try Stata, and (5) erase the old executable. update displays simple, detailed instructions on how to do this.
In order for update to work as it typically would, however, update must have write access to both the STATA and UPDATES directories. The names of these directories can be obtained by typing sysdir. As system administrator, you must decide whether you are going to fire up Stata with such permissions (Unix users could first become superuser) and trust Stata to do the right thing. That is what we recommend you do, but we provide the into() option for those who do not trust our recommendation.
If you wish to perform the final copying by hand, obtain the new executable, if any, by typing

. update executable, into(.)

That will place the new executable in the current directory. You need no special permissions to do this. Later, you can copy the file into the appropriate place and give it the appropriate name.
Type update without arguments; the default output will make it clear where the file goes and what its name must be. When you copy the file, be sure to make it executable by everybody.
To obtain the ado-file updates, make a new, empty directory, and then place the updates into it:

. mkdir mydir
. update ado, into(mydir)

In this example, we chose to place the new, empty directory in the current directory under the name mydir. You need no special permissions to perform this step. Later, you can copy all the files in mydir to the official place. Type update without arguments; the default output will make it clear where the files go. When you copy the files, be sure to copy all of them and to make all of them readable by everybody.
Also See
Related:       [R] net, [P] sysdir
Background:    [U] 32 Using the Internet to keep up to date, [GSM] 20 Using the Internet, [GSU] 20 Using the Internet, [GSW] 20 Using the Internet
Title vce — Display covariance matrix of the estimators

Syntax
vce [, corr | rho ]
Description
vce displays the variance-covariance matrix of the estimators (VCE) after model estimation. vce may be used after any estimation command. To obtain a copy of the covariance matrix for manipulation, type matrix V = e(V); vce merely displays the matrix, it does not fetch it.
Options corr and rho are synonyms. They display the matrix as a correlation matrix rather than a covariance matrix.
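As a sketch of the manipulation route mentioned in the Description (the matrix name V is arbitrary):

```stata
. matrix V = e(V)            /* fetch a copy of the VCE into matrix V */
. matrix list V              /* display the copy */
. display sqrt(el(V,1,1))    /* standard error of the first coefficient */
```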
Remarks

Example
Using the automobile dataset, we run a regression of mpg on weight and displacement.

. regress mpg weight displ

      Source |       SS       df       MS               Number of obs =      74
-------------+------------------------------            F(  2,    71) =   66.79
       Model |  1595.40969     2  797.704846            Prob > F      =  0.0000
    Residual |  848.049768    71  11.9443629            R-squared     =  0.6529
-------------+------------------------------            Adj R-squared =  0.6432
       Total |  2443.45946    73  33.4720474            Root MSE      =  3.4561

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0065671   .0011662    -5.63   0.000    -.0088926   -.0042417
displacement |   .0052808   .0098696     0.54   0.594    -.0143986    .0249602
       _cons |   40.08452    2.02011    19.84   0.000     36.05654    44.11251
------------------------------------------------------------------------------

To display the covariance matrix,

. vce

             |   weight  displa~t     _cons
-------------+------------------------------
      weight |  1.4e-06
displacement |  -.00001    .000097
       _cons | -.002075    .011884   4.08085

To display the correlation matrix,

. vce, corr

             |  weight  displa~t    _cons
-------------+----------------------------
      weight |  1.0000
displacement | -0.8949    1.0000
       _cons | -0.8806    0.5960   1.0000
Methods and Formulas vce is implemented as an ado-file.
Also See
Related:
[R] saved results
Background:
[U] 23 Estimation and post-estimation commands
view — View files and logs
Also See Complementary:
[R] type, [R] help, [R] net, [R] news, [R] search, [R] update
Background:
[GS] 3 Using the Viewer
Title vwls — Variance-weighted least squares

Syntax
vwls depvar [indepvars] [weight] [if exp] [in range] [, sd(varname) noconstant level(#) ]

by ... : may be used with vwls; see [R] by.
fweights are allowed; see [U] 14.1.6 weight.
vwls shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, xb stdp ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
vwls estimates a linear regression using variance-weighted least squares. It differs from ordinary least squares (OLS) regression in that homogeneity of variance is not assumed, but the conditional variance of depvar must be estimated prior to the regression. The estimated variance need not be constant across observations. vwls treats the estimated variance as if it were the true variance when it computes the standard errors of the coefficients.
An estimate of the conditional standard deviation of depvar must be supplied to vwls using the sd(varname) option, or you must have grouped data with the groups defined by the indepvars variables. In the latter case, all indepvars are treated as categorical variables; the mean and standard deviation of depvar are computed separately for each subgroup; and the regression of the subgroup means on indepvars is computed.
regress with analytic weights can be used to produce another kind of "variance-weighted least squares"; see the following remarks for an explanation of the difference.
Options
sd(varname) is an estimate of the conditional standard deviation of depvar (that is, it can vary observation by observation). All values of varname must be > 0. If sd() is specified, fweights cannot be used.
If sd() is not given, the data will be grouped by indepvars. In this case, indepvars are treated as categorical variables, and the means and standard deviations of depvar for each subgroup are calculated and used for the regression. Any subgroup for which the standard deviation is zero is dropped.
noconstant suppresses the constant term (intercept) in the regression.
level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction.
Remarks
The vwls command is intended for use with two special (and very different) types of data. The first consists of measurements from physical science experiments in which (1) all error is due solely to measurement errors and (2) the sizes of the measurement errors are known.
Variance-weighted least squares linear regression can also be used for certain problems in categorical data analysis. It can be used when all the independent variables are categorical and the outcome variable is either continuous or a quantity that can sensibly be averaged. If each of the subgroups defined by the categorical variables contains a reasonable number of subjects, then the variance of the outcome variable can be estimated independently within each subgroup. For the purposes of estimation, each subgroup is treated as a single observation, with the dependent variable being the subgroup mean of the outcome variable.
The vwls command estimates the model

    y_i = x_i b + e_i

where the errors e_i are independent normal random variables with the distribution e_i ~ N(0, v_i). The independent variables x_i are assumed to be known without error.
As described above, we assume that we already have estimates s_i^2 for the variances v_i. The error variance is not estimated in the regression; the estimates s_i^2 are used for the computation of the standard errors of the coefficients; see Methods and Formulas below.
In comparison, weighted ordinary least squares regression assumes that the errors have the distribution e_i ~ N(0, σ²/w_i), where the w_i are known weights and σ² is an unknown parameter that is estimated in the regression. This is the difference from variance-weighted least squares: in weighted OLS, the magnitude of the error variance is estimated in the regression using all the data.
Example
An artificial, but informative, example illustrates the difference between variance-weighted least squares and weighted OLS. An experimenter measures the quantities x_i and y_i, and estimates that the standard deviation of y_i is s_i. He enters the data into Stata:

. input x y s

           x          y          s
  1. 1 1.2 0.5
  2. 2 1.9 0.5
  3. 3 3.2 1
  4. 4 4.3 1
  5. 5 4.9 1
  6. 6 6.0 2
  7. 7 7.2 2
  8. 8 7.9 2
  9. end

Since the experimenter wants observations with smaller variance to carry larger weight in the regression, he computes an OLS regression with analytic weights proportional to the inverse of the squared standard deviations:
. regress y x [aweight=s^(-2)]
(sum of wgt is   1.1750e+01)

      Source |       SS       df       MS               Number of obs =       8
-------------+------------------------------            F(  1,     6) =  702.26
       Model |  22.6310183     1  22.6310183            Prob > F      =  0.0000
    Residual |  .193355117     6  .032225853            R-squared     =  0.9915
-------------+------------------------------            Adj R-squared =  0.9901
       Total |  22.8243734     7  3.26062477            Root MSE      =  .17952

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9824683   .0370739    26.50   0.000     .8917517    1.073185
       _cons |   .1138554   .1120078     1.02   0.349    -.1602179    .3879288
------------------------------------------------------------------------------
If he computes a variance-weighted least-squares regression using vwls, he gets the same results for the coefficient estimates, but very different standard errors:

. vwls y x, sd(s)

Variance-weighted least-squares regression        Number of obs    =         8
Goodness-of-fit chi2(6)  =      0.28              Model chi2(1)    =     33.24
Prob > chi2              =    0.9996              Prob > chi2      =    0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9824688    .170409     5.77   0.000     .6484728    1.316464
       _cons |   .1138554     .51484     0.22   0.825    -.8952124    1.122923
------------------------------------------------------------------------------
Despite the fact that the values of y_i were nicely linear with x_i, the vwls regression used the experimenter's large estimates for the standard deviations to compute large standard errors for the coefficients. For weighted OLS regression, however, the scale of the analytic weights has no effect on the standard errors of the coefficients; only the relative proportions of the analytic weights affect the regression.
If the experimenter is sure of the sizes of his error estimates for y_i, then the use of vwls is valid. However, if he can only estimate the relative proportions of error among the y_i, then vwls is not appropriate.
Example
Let us now consider an example of the use of vwls with categorical data. Suppose that we have blood pressure data for n = 400 subjects, categorized by gender and race (black or white). Here is a description of the data:
. table gender race, c(mean bp sd bp freq) row col format(%8.1f)

----------+-----------------------------
          |             Race
   Gender |    White     Black     Total
----------+-----------------------------
   Female |    117.1     118.5     117.8
          |     10.3      11.6      10.9
          |      100       100       200
          |
     Male |    122.1     125.8     124.0
          |     15.5      13.3      14.6
          |      100       100       200
          |
    Total |    119.6     122.2     120.9
          |     13.4      13.0      13.2
          |      200       200       400
----------+-----------------------------
Performing a variance-weighted regression using vwls gives

. vwls bp gender race

Variance-weighted least-squares regression        Number of obs    =       400
Goodness-of-fit chi2(1)  =      0.88              Model chi2(2)    =     27.11
Prob > chi2              =    0.3486              Prob > chi2      =    0.0000

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |   5.876522   1.170241     5.02   0.000     3.582892    8.170151
        race |   2.372818   1.191683     1.99   0.046     .0371631    4.708473
       _cons |   116.6486   .9296297   125.48   0.000     114.8266    118.4707
------------------------------------------------------------------------------
By comparison, an OLS regression gives the following result:

. regress bp gender race

      Source |       SS       df       MS               Number of obs =     400
-------------+------------------------------            F(  2,   397) =   15.24
       Model |  4485.66639     2  2242.83319            Prob > F      =  0.0000
    Residual |  58442.7305   397  147.210908            R-squared     =  0.0713
-------------+------------------------------            Adj R-squared =  0.0666
       Total |  62928.3969   399   157.71528            Root MSE      =  12.133

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |     6.1775   1.213305     5.09   0.000     3.792194    8.562806
        race |     2.5875   1.213305     2.13   0.034     .2021939    4.972806
       _cons |   116.4862   1.050753   110.86   0.000     114.4205     118.552
------------------------------------------------------------------------------
Note the larger value for the race coefficient (and smaller p-value) in the OLS regression. The assumption of homogeneity of variance in OLS means that the mean for black men is allowed to pull the regression line higher than in the vwls regression, which takes into account the larger variance for black men and reduces its effect on the regression.
Saved Results
vwls saves in e():

Scalars
    e(N)          number of observations
    e(df_m)       model degrees of freedom
    e(chi2)       model chi-squared
    e(df_gf)      goodness-of-fit degrees of freedom
    e(chi2_gf)    goodness-of-fit chi-squared

Macros
    e(cmd)        vwls
    e(depvar)     name of dependent variable

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas
vwls is implemented as an ado-file.
Let y = (y_1, y_2, ..., y_n)' be the vector of observations of the dependent variable, where n is the number of observations. For the case when sd() is specified, let s_1, s_2, ..., s_n be the standard deviations supplied by sd(). For categorical data, when sd() is not given, the means and standard deviations of y for each subgroup are computed, and n becomes the number of subgroups, y is the vector of subgroup means, and the s_i are the standard deviations for the subgroups.
Let V = diag(s_1^2, s_2^2, ..., s_n^2) denote the estimate of the variance of y. Then the estimated regression coefficients are

    b = (X'V^-1 X)^-1 X'V^-1 y

and their estimated covariance matrix is

    Var(b) = (X'V^-1 X)^-1

A statistic for the goodness of fit of the model is

    Q = (y - Xb)' V^-1 (y - Xb)

where Q has a chi-squared distribution with n - k degrees of freedom (k is the number of independent variables plus the constant, if any).
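Under the notation above, the point estimates can be reproduced with Stata's matrix commands. This is only a sketch: it assumes variables y, x, and s (the standard deviations supplied to sd()) exist in memory and that the model includes a constant.

```stata
. generate double w = 1/s^2                /* inverse-variance weights 1/s_i^2 */
. matrix accum XVX = x [iweight=w]         /* forms X'V^-1 X (constant included) */
. matrix vecaccum yVX = y x [iweight=w]    /* forms y'V^-1 X */
. matrix b = yVX * invsym(XVX)             /* b' = y'V^-1 X (X'V^-1 X)^-1 */
. matrix list b
```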
References
Grizzle, J. E., C. F. Starmer, and G. G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489-504.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. 2d ed. Cambridge: Cambridge University Press.
Also See Complementary:
[R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce
Related:
[R] regress
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands
Title weibull — Estimate Weibull and other parametric survival-time models

Syntax
{ weibull[het] | ereg[het] } depvar [varlist] [weight] [if exp] [in range] [,
    hazard hr tr dead(varname) t0(varname) frailty(gamma | invgaussian)
    ancillary(varlist) strata(varname) robust cluster(varname) score(newvar(s))
    noconstant constraints(numlist) level(#) nocoef noheader nolog maximize-options ]

{ lnormal[het] | llogistic[het] } depvar [varlist] [weight] [if exp] [in range] [,
    tr dead(varname) t0(varname) frailty(gamma | invgaussian) ancillary(varlist)
    strata(varname) robust cluster(varname) score(newvar(s)) noconstant
    constraints(numlist) level(#) nocoef noheader nolog maximize-options ]

gompertz[het] depvar [varlist] [weight] [if exp] [in range] [, hr dead(varname)
    t0(varname) frailty(gamma | invgaussian) ancillary(varlist) strata(varname)
    robust cluster(varname) score(newvar(s)) noconstant constraints(numlist)
    level(#) nocoef noheader nolog maximize-options ]

gamma[het] depvar [varlist] [weight] [if exp] [in range] [, tr dead(varname)
    t0(varname) frailty(gamma | invgaussian) ancillary(varlist) anc2(varlist)
    strata(varname) robust cluster(varname) score(newvar(s)) noconstant
    constraints(numlist) level(#) nocoef noheader nolog maximize-options ]

by ... : may be used with these commands; see [R] by.
fweights, iweights, and pweights are allowed; ereg also allows aweights; see [U] 14.1.6 weight.
These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
These commands may be used with sw to perform stepwise estimation; see [R] sw.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

    median time      predicted median survival time (the default)
    median lntime    predicted median ln(survival time)
    mean time        predicted mean survival time
    mean lntime      predicted mean ln(survival time)
    hazard           predicted hazard
    hr               predicted hazard ratio
    xb               linear prediction x_j b
    stdp             standard error of the linear prediction; SE(x_j b)
    surv             predicted S(depvar) or S(depvar|t0)
    csnell           (partial) Cox-Snell residuals
    mgale            (partial) martingale-like residuals
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
When no option is specified, the predicted median survival time is calculated for all models. The predicted hazard ratio option hr is only available for the exponential, Weibull, and Gompertz models. The mean time and mean lntime options are not available for the Gompertz model, and the mean time option is not available for the generalized log-gamma model. The mean time and mean lntime options are not available if frailty(distname) is specified.
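For example, each statistic is requested by name after estimation; the new variable names below are arbitrary:

```stata
. weibull t x1 x2, dead(d)
. predict medtime                 /* predicted median survival time, the default */
. predict lnmed, median lntime    /* predicted median ln(survival time) */
. predict h, hazard               /* predicted hazard */
. predict mg, mgale               /* martingale-like residuals */
```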
Description
weibull estimates maximum-likelihood Weibull distribution (survival time) models. ereg estimates maximum-likelihood exponential distribution (survival time) models. lnormal estimates maximum-likelihood lognormal distribution (survival time) models. llogistic estimates maximum-likelihood log-logistic distribution (survival time) models. gamma estimates maximum-likelihood generalized log-gamma distribution (survival time) models. gompertz estimates maximum-likelihood Gompertz distribution (survival time) models. See [R] st streg for a detailed discussion of these parametric survival models.
Using the het variant of a parametric estimation command allows the user to fit a model with unexplained heterogeneity, or frailty, when the data display overdispersion. For example, the weibullhet command will fit a Weibull distribution with overdispersion that acts multiplicatively on the individual hazard function. The functional form of the overdispersion is specified as either that from a gamma distribution (frailty(gamma)) or that from an inverse-Gaussian distribution (frailty(invgaussian)). See [R] st streg for more information on frailty models.
In all cases, the dependent variable depvar represents the time of failure or censoring and varlist represents the independent variables. These commands allow estimation with fixed or time-varying covariates, allow for left truncation (delayed entry) and gaps, and may be used with single- or multiple-failure data.
We advise use of streg over these commands, but only because we think using the st commands is easier; the choice is yours. If you take our advice, see [R] st streg and skip reading this entry altogether. streg produces the same results, and this is assured because the streg command calls the commands in this entry to perform the estimation. Also see [R] st stcox (or [R] cox) for estimation of proportional hazards models.
Options
hazard specifies that the model be estimated in the log hazard-rate parameterization rather than the default log-time (accelerated failure-time) parameterization.
hr reports the estimated coefficients transformed to hazard ratios, i.e., e^b rather than b, and implies hazard if issued at estimation time. Standard errors and confidence intervals are similarly transformed. Hazard estimates may be redisplayed in either form.
tr reports the estimated coefficients transformed to time ratios, i.e., e^b rather than b, and may not be combined with hazard. Standard errors and confidence intervals are similarly transformed. Time-to-failure estimates may be redisplayed in either form.
dead(varname) specifies the name of a variable recording 0 if the observation is censored and a value other than 0 (typically 1) if the observation represents a failure.
t0(varname) specifies the variable that indicates when the observation became at risk. t0() can be used to handle left truncation, gaps, time-varying covariates, and recurring failures.
In the following data, each subject has only one record, but the third subject was observed starting at time 5, not 0:

    id   t0    t   d   x1   x2
    55    0   12   0    3    0
    56    0   30   1    2    1
    57    5   22   1    1    0
    58    0   16   0    2    0

The interpretation of the data is that subject 55 had x1 = 3 and x2 = 0 over the interval (0,12] and then, at time 12, was lost due to censoring; subject 56 had x1 = 2 and x2 = 1 over the interval (0,30] and then, at time 30, failed; subject 57 had x1 = 1 and x2 = 0 over the interval (5,22] and then, at time 22, failed.
One could estimate a Weibull regression on these data by typing

. weibull t x1 x2, dead(d) t0(t0)
Here is an example of data in which the covariate x1 varies over time:

    id   t0    t   d   x1   x2
    91    0   15   0    2    1
    91   15   22   0    1    1
    91   22   31   1    3    1
    92    0   11   0    3    0
    92   11   52   0    .    0
    92   52  120   1    2    0

The interpretation here is that subject 91 had x1 = 2 over the interval (0,15], x1 = 1 over the interval (15,22], and x1 = 3 over the interval (22,31]; the value of x2 never varied from 1; and at time 31 a failure was observed.
While one could estimate a Weibull regression by typing the same thing as before

. weibull t x1 x2, dead(d) t0(t0)

we would strongly recommend

. weibull t x1 x2, dead(d) t0(t0) robust cluster(id)

That is because the observations are no longer independent. This issue is discussed under the cluster() option below. Note the missing value of x1 in subject 92's second record. That causes no difficulty.
In the following data, some subjects fail more than once (and have time-varying regressors):

    id   t0    t   d   x1   x2
    23    0   12   1    2    1
    23   12   18   0    1    1
    23   18   22   1    3    1
    24    0    8   1    3    0
    24    8   22   1    1    0
    24   22   31   1    2    0

Subject 23 has x2 = 1 at all times. Between (0,12], x1 = 2 and a failure is observed at time 12. Between (12,18], x1 = 1 and no failure is observed. Between (18,22], x1 = 3 and a failure is observed at time 22. Again, the estimation command is the same

. weibull t x1 x2, dead(d) t0(t0) robust cluster(id)

and note that again, since subjects appear more than once in the data, we also specified options robust and cluster(id).
frailty(gamma | invgaussian) specifies the assumed distribution of the observation-level frailty or heterogeneity. The estimated model will, in addition to the standard parameter estimates, produce an estimate of the variance of the frailties and a likelihood-ratio test of the null hypothesis that this variance is zero. When this null hypothesis is true, the model reduces to the model with frailty(distname) not specified. This option is only valid if the het variant of the estimation command is used, and in that case, the specification of frailty() is required.
ancillary(varlist) specifies that the ancillary parameter for the Weibull, lognormal, Gompertz, and log-logistic distributions and the first ancillary parameter (sigma) of the generalized log-gamma distribution are to be estimated as a linear combination of varlist. This option is not available if the het variant of the estimation command is used.
anc2(varlist) specifies that the second ancillary parameter (kappa) for the generalized log-gamma distribution is to be estimated as a linear combination of varlist. This option is not available if the het variant of the estimation command is used.
strata(varname) specifies a stratification variable. Observations with equal values of the variable are assumed to be in the same stratum. Stratified estimates (equal coefficients across strata but intercepts and ancillary parameters unique to each stratum) are then estimated. This option is not available if the het variant of the estimation command is used.
robust specifies that the robust method of calculating the variance-covariance matrix is to be used instead of the conventional inverse-matrix-of-second-derivatives method.
cluster(varname) implies robust and specifies a variable on which clustering is to be based. By default, each observation in the data is assumed to represent a cluster. Consider the following data:
Consider the following data:

        t0     t     d    x1    x2
         0    15     0     2     1
        15    22     0     1     1
        22    31     1     3     1
Does this represent three subjects or just one? Perhaps three subjects were observed: one over (0,15], another from (15,22], and a third from (22,31]. In that case, the three observations are presumably independent and the conventional variance calculation is appropriate, as is the robust calculation. If you wanted the robust estimate of variance, you would specify robust but not cluster(). On the other hand, if the data are
        id    t0     t     d    x1    x2
        91     0    15     0     2     1
        91    15    22     0     1     1
        91    22    31     1     3     1
that is, if the data represent the same subject, then these records do not amount to independent observations and the conventional variance calculation is inappropriate. To obtain the robust standard errors, you would specify robust and cluster(id), although you could omit robust because cluster() implies robust.

score(newvar(s)) requests that newvar(s) be created containing the score function(s). One new variable is specified in the case of ereg; two are specified in the case of weibull, lnormal, llogistic, and gompertz; and three are specified in the case of gamma. If a frailty model is estimated (the het variant is used), then one additional variable is specified. The first new variable will contain the derivative of ln L_j with respect to the linear prediction. The second and third new variables, if they exist, will contain the derivatives of ln L_j with respect to the ancillary parameters. See Table 1 in [R] st streg for a list of ancillary parameters.
noconstant suppresses the constant term (intercept) in the model.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nocoef is for use by programmers. It prevents the displaying of results but still allows the display of the iteration log.

noheader is for use by programmers. It causes the display of the coefficient table only; the table above the coefficients reporting chi-squared tests and the like is suppressed. The code for streg, for instance, uses this option, since streg wants to substitute its own (more informative) header.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for predict

median time calculates the predicted median survival time in analysis-time units. Note that this is the prediction from time 0 conditional on constant covariates. When no options are specified with predict, the predicted median survival time is calculated for all models.

median lntime calculates the natural logarithm of what median time produces.

mean time calculates the predicted mean survival time in analysis-time units. Note that this is the prediction from time 0 conditional on constant covariates. This option is not available for the Gompertz and gamma regressions, and is not available when frailty(distname) is used.

mean lntime calculates the mean of the natural logarithm of time. This option is not available for Gompertz regression, and is not available when frailty(distname) is used.

hazard calculates the predicted hazard.

hr calculates the hazard ratio. This option is valid only for models having a proportional-hazards parameterization, i.e., Weibull, exponential, and Gompertz.
xb calculates the linear prediction from the estimated model. That is, all models can be thought of as estimating a set of parameters b1, b2, ..., bk, and the linear prediction is yhat_j = b1*x1j + b2*x2j + ... + bk*xkj, often written in matrix notation as yhat_j = x_j b. It is important to understand that the x1j, x2j, ..., xkj used in the calculation are obtained from the data currently in memory and do not have to correspond to the data on the independent variables used in estimating the model (obtaining the b1, b2, ..., bk).

stdp calculates the standard error of the prediction; that is, the standard error of yhat_j.

surv calculates each observation's predicted survivor probability S(t|t0). If you did not specify t0() when you estimated the model, t0 = 0 and thus surv calculates the predicted survivor function at the time of failure or censoring, S(t). Otherwise, it is the probability of surviving through t given survival through t0. In such cases, you may wish to also see the help for streg.

csnell calculates the (partial) Cox-Snell residual. If you have single observations per subject, then csnell calculates the usual Cox-Snell residual. Otherwise, csnell calculates the additive contribution of this observation to the subject's overall Cox-Snell residual. In such cases, you may wish to also see the help for streg.

mgale calculates the (partial) martingale-like residual. The issues are the same as with csnell above.
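The arithmetic of the xb calculation can be sketched in a few lines; the numbers below are hypothetical, not from any estimated model, and this is only an illustration of yhat_j = x_j b, not of predict itself.

```python
import numpy as np

# Hypothetical coefficients b1, b2, b3 and two observations' covariates.
b = np.array([0.5, -0.2, 1.0])
X = np.array([[1.0, 2.0, 0.0],    # covariates x_1j for observation 1
              [0.0, 1.0, 3.0]])   # covariates x_2j for observation 2
yhat = X @ b                      # linear prediction yhat_j = x_j b
# yhat is [0.1, 2.8]
```

The same product is computed whatever data are in memory, which is why the covariates used need not be those used in estimation.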
Remarks

See [R] st streg for a discussion appropriate to these commands. streg uses the commands in this entry to perform the estimation but has an easier-to-use syntax. In reading the streg entry, it is just a matter of translating from one syntax to the other. In [R] st streg, if you see an example such as

. streg drug age, dist(weibull)

the equivalent weibull command might be

. weibull timevar drug age

or

. weibull timevar drug age, dead(failvar)

or

. weibull timevar drug age, dead(failvar) t0(t0var)

depending on the context. streg fills in the identities of the time-of-censoring-or-failure variable (timevar), the outcome variable (failvar), and the entry-time variable (t0var) for you. Users of streg first stset their data, and that is how the commands know what to fill in.

Another difference between these commands and streg concerns an implied cluster() option when you specify robust. With the commands in this entry, it is your responsibility to specify clustering if you want it. With streg, specifying robust implies cluster() if a subject-id variable has been set.

The final difference concerns the default metric for the Weibull and exponential models. weibull and ereg default to the accelerated-time (log expected-time) metric but will, if hazard is specified, use the log relative-hazard metric. With streg, it is the other way around: streg defaults to the log relative-hazard metric but will, if the time option is specified, use the log expected-time metric.
Saved Results

These commands save in e():

Scalars
    e(N)          number of observations
    e(k)          number of parameters
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(df_m)       model degrees of freedom
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(rc)         return code
    e(theta)      frailty parameter
    e(chi2_c)     chi-squared, comparison model
    e(p)          significance
    e(ic)         number of iterations
    e(sigma)      ancillary parameter (gamma, lnormal)
    e(kappa)      ancillary parameter (gamma)
    e(aux_p)      ancillary parameter (weibull)
    e(gamma)      ancillary parameter (llogistic, gompertz)
    e(ll_c)       log likelihood, comparison model
    e(p_c)        significance, comparison model
    e(rank)       rank of e(V)

Macros
    e(cmd)        name of command
    e(dead)       variable indicating failure
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(fr_title)   title in output identifying frailty
    e(t0)         name of variable marking entry time
    e(frm2)       hazard or time
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   LR; type of model chi-squared test
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(ilog)       iteration log (up to 20 iterations)
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

See [R] st streg.

References

See [R] st streg.

Also See

Complementary:   [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] ltable, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce

Related:         [R] st streg, [R] st stcox, [R] cox, [R] glm, [P] _robust

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.12 Obtaining scores
Title which — Display location and version for an ado-file
Syntax

which command_name
Description

which searches the S_ADO path for command_name.ado and, if found, displays the full path and filename together with all lines in the file that begin with "*!" in the first column. If command_name.ado is not found, which checks to see whether the command is internal. If it is, the message "built-in command" is displayed; if not, the message "command not found as either built-in or ado-file" is displayed and the return code is set to 111. For information on internal version control, see [P] version.
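The search-and-display rule just described can be sketched outside Stata; the following Python sketch is an illustration only (it is not Stata's implementation, and the directory list passed in is hypothetical):

```python
import os

def which(command, adopath):
    """Find command + '.ado' along a list of directories; print its path
    and any lines whose first two columns are '*!' (which's display rule)."""
    filename = command + ".ado"
    for directory in adopath:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            print(path)
            with open(path) as f:
                for line in f:
                    if line.startswith("*!"):
                        print(line.rstrip())
            return path
    print("command %s not found as either built-in or ado-file" % command)
    return None
```

The first directory on the path containing the file wins, which is why there is "no question as to which notes Stata would choose to execute" in the example below.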
Remarks If you write programs, you know that you make changes to the programs over time. If you are like us, you also end up with multiple versions of the program stored on your disk, perhaps in different directories. It is even possible that you have given copies of your programs to other Stata users. This leads to the problem of knowing which version of a program you or your friends are using. The which command helps you solve this problem.
> Example

Lines that start with an '*' are comments—Stata ignores them—so Stata also ignores lines that begin with '*!'. Such lines, however, are of special interest to which. The notes command, described in [R] notes, is an ado-file written by StataCorp. Here is what happens when we type which notes:

. which notes
c:\stata\ado\base\n\notes.ado
*! version 1.0.2 13dec1998

which informs us that notes, if executed, would be obtained from c:\stata\ado\base\n. There is now no question as to which notes, if we had more than one, Stata would choose to execute. The second line is from the notes.ado file: when we revised notes, we included a line that read '*! version 1.0.2 13dec1998'. This is how we, at StataCorp, do version control—see [U] 21.11.1 Version for an explanation of our version-control numbers. You, however, can be less formal. Anything typed after lines that begin with '*!' will be displayed by which. For instance, you might write myprog.ado:

. which myprog
.\myprog.ado
*! first written 1/03/2000
*! bug fix on 1/05/2000 (no variance case)
*! updated 1/24/2000 to include noconstant option
*! still suspicious if variable takes on only two values
It does not matter where in the program the lines beginning with *! are—they will be listed (in particular, our "still suspicious" comment was buried about fifty lines down in the code). All that is important is that the *! marker appear in the first two columns of a line.

> Example

If we type which command, where command is a built-in command rather than an ado-file, Stata responds with

. which regress
built-in command:  regress

If command were neither built-in nor an ado-file, Stata would respond with

. which junk
command junk not found as either built-in or ado-file
r(111);

Also See

Related:      [P] version

Background:   [U] 20 Ado-files,
              [U] 21.11.1 Version

Title

wntestb — Bartlett's periodogram-based test for white noise

Syntax

wntestb varname [if exp] [in range] [, table level(#) graph_options ]

wntestb is for use with time-series data; see [R] tsset. You must tsset your data before using wntestb. In addition, the time series must be dense (nonmissing with no gaps in the time variable) in the specified sample.

varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description

wntestb performs Bartlett's periodogram-based test for white noise. The result is presented graphically by default, but may optionally be presented as text (table output).

Options

table specifies that the test results should be printed as a table instead of as the default graph.

level(#) specifies the confidence level, in percent, for the confidence bands included on the graph. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

graph_options are any of the options allowed with graph, twoway; see [G] graph options.

Remarks

Bartlett's test is a test of the null hypothesis that the data come from a white noise process of uncorrelated random variables having a constant mean and a constant variance. For a discussion of this test, see Bartlett (1955, 92-94), Newton (1988, 172), or Newton (1996).

> Example

In this example, we generate two time series and show the graphical and statistical tests that can be obtained from this command. The first time series is a white noise process and the second is a white noise process with an embedded deterministic cosine curve.

. drop _all
. set seed 12393
. set obs 100
. gen x1 = invnorm(uniform())
. gen x2 = invnorm(uniform()) + cos(2*_pi*(_n-1)/10)
. gen time = _n
. tsset time
We can then submit the white noise data to the wntestb command by typing

. wntestb x1

[Graph omitted: cumulative periodogram white noise test for x1, with Bartlett's (B) statistic = 0.73 and Prob > B = 0.6591]
We can see in the graph that the values never appear outside the confidence bands. We also note that the test statistic has a p-value of 0.66, so we would conclude that the process is not different from white noise. If we had wanted only the statistic without the plot, we could have used the table option.

Turning our attention to the other series (x2), we type

. wntestb x2

[Graph omitted: cumulative periodogram white noise test for x2, with Bartlett's (B) statistic = 2.34 and Prob > B = 0.0000]
Here the process does appear outside of the bands. In fact, it steps out of the bands at a frequency of .1 (exactly as we synthesized this process). We also have confirmation from the test statistic, at a p-value less than .0001, that the process is significantly different from white noise.
Saved Results

wntestb saves in r():

Scalars
    r(stat)   Bartlett's statistic
    r(p)      probability value
Methods and Formulas

wntestb is implemented as an ado-file.

If x(1), ..., x(T) is a realization from a white noise process with variance σ², the spectral distribution would be given by F(ω) = σ²ω for ω ∈ [0,1], and we would expect the cumulative periodogram (see [R] cumsp) of the data to be close to the points S_k = k/q for q = ⌊n/2⌋ + 1, k = 1, ..., q, where ⌊n/2⌋ is the greatest integer less than or equal to n/2.

Except for ω = 0 and ω = .5, the random variables 2f(ω_k)/σ² are asymptotically independently and identically distributed as χ² with 2 degrees of freedom. Since χ²(2) is the same as twice a random variable distributed exponentially with mean 1, the cumulative periodogram has approximately the same distribution as the ordered values from a uniform (on the unit interval) distribution. Feller (1948) shows that this results in (where U_k is the ordered uniform quantile)

    lim_{q→∞} Pr( max_{1≤k≤q} √q |U_k − k/q| ≤ a ) = Σ_{j=−∞}^{∞} (−1)^j e^{−2a²j²} = G(a)

The Bartlett statistic is computed as

    B = max_{1≤k≤q} √(n/2) |F̂_k − k/q|

where F̂_k is the cumulative periodogram defined in terms of the sample spectral density f̂ (see [R] pergram) as

    F̂_k = ( Σ_{j=1}^{k} f̂(ω_j) ) / ( Σ_{j=1}^{q} f̂(ω_j) )

The associated p-value for the Bartlett statistic and the confidence bands on the graph are computed as 1 − G(B) using Feller's result.
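As a check on the formulas, the statistic and the Feller p-value can be computed directly. The Python sketch below is an illustration of the formulas above, not Stata's wntestb implementation; the frequency bookkeeping (for example, the treatment of the ω = .5 ordinate) is simplified.

```python
import numpy as np

def bartlett_b(x):
    """Bartlett's statistic: max_k sqrt(n/2) |Fhat_k - k/q| over the
    cumulative periodogram (a sketch of the formulas above)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # periodogram ordinates at the positive Fourier frequencies
    p = np.abs(np.fft.rfft(x - x.mean())[1:]) ** 2
    q = p.size
    Fhat = np.cumsum(p) / p.sum()          # cumulative periodogram
    k = np.arange(1, q + 1)
    return np.sqrt(n / 2.0) * np.max(np.abs(Fhat - k / q))

def feller_pvalue(b, terms=100):
    """1 - G(B), with G the alternating-series limit from Feller (1948)."""
    j = np.arange(1, terms + 1)
    G = 1.0 + 2.0 * np.sum((-1.0) ** j * np.exp(-2.0 * b * b * j * j))
    return 1.0 - G
```

For a pure cosine with 20 cycles in 200 observations, the cumulative periodogram jumps from 0 to 1 at the cosine's frequency, so B is large and the p-value is essentially zero, just as in the x2 example.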
Acknowledgment wntestb is based on the wntestf command by H. Joseph Newton (1996), Department of Statistics, Texas A&M University.
References

Bartlett, M. S. 1955. An Introduction to Stochastic Processes with Special Reference to Methods and Applications. Cambridge: Cambridge University Press.

Feller, W. 1948. On the Kolmogorov-Smirnov theorems for empirical distributions. Annals of Mathematical Statistics 19: 177-189.

Newton, H. J. 1988. TIMESLAB: A Time Series Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.

——. 1996. sts12: A periodogram-based test for white noise. Stata Technical Bulletin 34: 36-39. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 203-207.
Also See

Complementary:   [R] tsset

Related:         [R] corrgram, [R] cumsp, [R] pergram, [R] wntestq

Background:      Stata Graphics Manual
Title

wntestq — Portmanteau (Q) test for white noise
Syntax

wntestq varname [if exp] [in range] [, lags(#) ]

wntestq is for use with time-series data; see [R] tsset. You must tsset your data before using wntestq. In addition, the time series must be dense (nonmissing with no gaps in the time variable) in the specified sample.

varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description wntestq performs the portmanteau (or Q) test for white noise.
Options

lags(#) specifies the number of autocorrelations to calculate. The default is to use min(⌊n/2⌋ − 2, 40), where ⌊n/2⌋ is the greatest integer less than or equal to n/2.
Remarks Box and Pierce (1970) developed a portmanteau test of white noise that was refined by Ljung and Box (1978). Also see Diggle (1990, section 2.5).
> Example

In the example shown in [R] wntestb, we generated two time series. One (x1) was a white noise process and the other (x2) was a white noise process with an embedded cosine curve. Here we compare the output of the two tests.

. wntestb x1, table

Cumulative periodogram white noise test
---------------------------------------
Bartlett's (B) statistic =     0.7311
Prob > B                 =     0.6591

. wntestq x1

Portmanteau test for white noise
--------------------------------
Portmanteau (Q) statistic =    27.2378
Prob > chi2(40)           =     0.9380

. wntestb x2, table

Cumulative periodogram white noise test
---------------------------------------
Bartlett's (B) statistic =     2.3364
Prob > B                 =     0.0000

. wntestq x2

Portmanteau test for white noise
--------------------------------
Portmanteau (Q) statistic =   182.2446
Prob > chi2(40)           =     0.0000
This example shows that both tests agree. For the first process, the Bartlett and portmanteau tests result in nonsignificant test statistics: a p-value of 0.6591 for wntestb and one of 0.9380 for wntestq. For the second process, each of the tests has a significant result at less than 0.0001.
Saved Results

wntestq saves in r():

Scalars
    r(stat)   Q statistic
    r(df)     degrees of freedom
    r(p)      probability value
Methods and Formulas

wntestq is implemented as an ado-file.

The portmanteau test relies on the fact that if x(1), ..., x(n) is a realization from a white noise process, then

    Q = n(n + 2) Σ_{j=1}^{m} (1/(n − j)) ρ̂²(j)  →  χ²(m)

where m is the number of autocorrelations calculated (equal to the number of lags specified) and → indicates convergence in distribution to a χ² distribution with m degrees of freedom. ρ̂(j) is the estimated autocorrelation for lag j; see [R] corrgram for details.
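The Q statistic above can be computed directly from the estimated autocorrelations. The following Python sketch illustrates the formula; it is not wntestq's source, and the final comparison against χ²(m) is omitted.

```python
import numpy as np

def portmanteau_q(x, lags):
    """Ljung-Box portmanteau Q = n(n+2) * sum_j rho_j^2 / (n - j)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    denom = np.sum(xc * xc)
    q = 0.0
    for j in range(1, lags + 1):
        rho_j = np.sum(xc[j:] * xc[:-j]) / denom  # autocorrelation at lag j
        q += rho_j * rho_j / (n - j)
    return n * (n + 2) * q
```

A perfectly alternating series has lag-1 autocorrelation −(n−1)/n, so for n = 100 and one lag, Q = 100 · 102 · (99/100)² / 99 = 100.98, far out in the χ²(1) tail.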
References

Box, G. E. P. and D. A. Pierce. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association 65: 1509-1526.

Diggle, P. J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.

Ljung, G. M. and G. E. P. Box. 1978. On a measure of lack of fit in time series models. Biometrika 65: 297-303.
Also See

Complementary:   [R] tsset

Related:         [R] corrgram, [R] cumsp, [R] wntestb
Title xcorr — Cross-correlogram for bivariate time series
Syntax

xcorr varname1 varname2 [if exp] [in range] [, generate(newvarname) lags(#) needle table noplot graph_options ]

xcorr is for use with time-series data; see [R] tsset. You must tsset your data before using xcorr.

varname1 and varname2 may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description xcorr plots the sample cross-correlation function.
Options

generate(newvarname) specifies a new variable to contain the cross-correlation values.

lags(#) indicates the number of lags and leads to include in the graph. The default is to use min(⌊n/2⌋ − 2, 20).

needle specifies that the cross-correlations should be depicted with vertical lines from zero instead of a connected line between successive estimates.

table requests that the results be presented as a table rather than the default graph.

noplot requests that the tabular output not include the character-based plot of the cross-correlations.

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
Remarks

> Example

We have a bivariate time series (Series J, Box, Jenkins, and Reinsel 1994) on the input and output of a gas furnace, where 296 paired observations on the input gas rate and output percent CO2 were recorded every 9 seconds. The cross-correlation function is given by

    ρ₁₂(k) = R₁₂(k) / { R₁₁(0) R₂₂(0) }^{1/2}

(see Methods and Formulas below).
. xcorr input output, xline(5) lags(40)

[Graph omitted: cross-correlogram of input and output for lags −40 to 40, with a vertical line at lag 5]
Note that we included a vertical line at lag 5, as there is a well-defined peak at this value. This peak indicates that the output lags the input by 5 time periods. Further, the fact that the correlations are negative indicates that as input (coded gas rate) is increased, output (%CO2) decreases.

We may obtain the table of cross-correlations and the character-based plot of the cross-correlations (analogous to the univariate time-series command corrgram) by specifying the table option.
. xcorr input output, table lags(20)

[Character-based plot column omitted]

                 LAG      CORR
                 -20   -0.1033
                 -19   -0.1027
                 -18   -0.0998
                 -17   -0.0932
                 -16   -0.0832
                 -15   -0.0727
                 -14   -0.0660
                 -13   -0.0662
                 -12   -0.0751
                 -11   -0.0927
                 -10   -0.1180
                  -9   -0.1484
                  -8   -0.1793
                  -7   -0.2059
                  -6   -0.2266
                  -5   -0.2429
                  -4   -0.2604
                  -3   -0.2865
                  -2   -0.3287
                  -1   -0.3936
                   0   -0.4845
                   1   -0.5985
                   2   -0.7251
                   3   -0.8429
                   4       ...
                   5       ...
                   6       ...
                   7   -0.8294
                   8   -0.7166
                   9   -0.5998
                  10   -0.4952
                  11   -0.4107
                  12   -0.3479
                  13   -0.3049
                  14   -0.2779
                  15   -0.2632
                  16   -0.2548
                  17   -0.2463
                  18   -0.2332
                  19   -0.2135
                  20   -0.1869
Once again, the well-defined peak: is apparent in the plot.
Methods and Formulas

The cross-covariance function of lag k for time series x1 and x2 is given by

    R₁₂(k) = Cov{ x1(t), x2(t + k) }

Note that this function is not symmetric about lag zero; that is,

    R₁₂(k) ≠ R₁₂(−k)

Define the cross-correlation function as

    ρ₁₂(k) = R₁₂(k) / { R₁₁(0) R₂₂(0) }^{1/2}

where R₁₁ and R₂₂ are the autocovariance functions for x1 and x2, respectively. The sequence ρ₁₂(k) is the cross-correlation function and is drawn for lags k ∈ (−Q, −Q+1, ..., −1, 0, 1, ..., Q−1, Q). Note that if ρ₁₂(k) = 0 for all lags, then we say that x1 and x2 are not cross-correlated.
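The sample version of ρ₁₂(k) can be sketched directly from these definitions. The Python sketch below is an illustration (not Stata's xcorr), using the normalization by the lag-0 autocovariances shown above.

```python
import numpy as np

def cross_correlogram(x1, x2, max_lag):
    """Sample rho_12(k) for k = -max_lag..max_lag, pairing x1(t) with
    x2(t + k) and normalizing by sqrt(R11(0) * R22(0))."""
    x1 = np.asarray(x1, dtype=float) - np.mean(x1)
    x2 = np.asarray(x2, dtype=float) - np.mean(x2)
    n = x1.size
    denom = np.sqrt(np.sum(x1 * x1) * np.sum(x2 * x2))
    rho = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            num = np.sum(x1[:n - k] * x2[k:])    # x1(t) with x2(t+k)
        else:
            num = np.sum(x1[-k:] * x2[:n + k])
        rho[k] = num / denom
    return rho
```

Because the pairing x1(t), x2(t+k) is directional, rho[k] and rho[-k] generally differ, which is the asymmetry about lag zero noted above.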
References Box, G. E. P.. G. M. Jenkins, and G. C. Reinsel. 1994. Time Series Analysis: Forecasting and Control. 3d ed. Englewood Cliffs, NJ: Prentice-Hall. Hamilton, J. 1994. Time Series Analysis. Princeton: Princeton University Press. Newton, H. J. 1988. TIMESLAB: A Time Series Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.
Also See Complementary:
[R] tsset
Related:
[R] corrgram, [R] pergram
Title xi — Interaction expansion
Syntax

xi [, prefix(string)] term(s)
xi [, prefix(string)] : any_stata_command varlist_with_terms

where a term is of the form

    i.varname                  or   I.varname
    i.varname1*i.varname2           I.varname1*I.varname2
    i.varname1*varname3             I.varname1*varname3
    i.varname1|varname3             I.varname1|varname3

varname, varname1, and varname2 denote numeric or string categorical variables. varname3 denotes a continuous, numeric variable.
Description

xi expands terms containing categorical variables into indicator (also called dummy) variable sets by creating new variables and, in the second syntax (xi: any_stata_command), executes the specified command with the expanded terms. The dummy variables created are

    i.varname               creates dummies for categorical variable varname

    i.varname1*i.varname2   creates dummies for categorical variables varname1
                            and varname2: all interactions and main effects

    i.varname1*varname3     creates dummies for categorical variable varname1
                            and continuous variable varname3: all interactions
                            and main effects

    i.varname1|varname3     creates dummies for categorical variable varname1
                            and continuous variable varname3: all interactions
                            and main effect of varname3, but no main effect
                            of varname1

Options

prefix(string) allows you to choose a prefix other than _I for the newly created interaction variables. The prefix cannot be longer than 4 characters. By default, xi creates interaction variables starting with _I. When you use xi, it first drops all previously created interaction variables starting with the prefix specified in the prefix(string) option, or with _I by default. Therefore, if you want to keep the variables starting with a certain prefix, specify a different prefix in the prefix(string) option.
Remarks

Remarks are presented under the headings

    Background
    Indicator variables for simple effects
    Controlling the omitted dummy
    Categorical variable interactions
    Interactions with continuous variables
    Using xi: Interpreting output
    How xi names variables
    xi as a command rather than a command prefix
    Warnings
xi provides a convenient way to include dummy or indicator variables when estimating a model (say with regress, logistic, etc.). For instance, assume the categorical variable agegrp contains 1 for ages 20-24, 2 for ages 25-39, 3 for ages 40-44, etc. Typing

. xi: logistic outcome weight i.agegrp bp
estimates a logistic regression of outcome on weight, dummies for each agegrp category, and bp. That is, xi searches out and expands terms starting with "i." but ignores the other variables. xi will expand both numeric and string categorical variables, so if you had a string variable race containing "white", "black", and "other", typing

. xi: logistic outcome weight bp i.agegrp i.race
would include indicator variables for the race groups as well. The i. indicator variables xi expands may appear anywhere in the varlist, so

. xi: logistic outcome i.agegrp weight i.race bp
would estimate the same model.

You can also create interactions of categorical variables; typing

. xi: logistic outcome weight bp i.agegrp*i.race

estimates a model with indicator variables for all agegrp and race combinations, including the agegrp and race main-effect terms (i.e., the terms that are created when you just type i.agegrp i.race).

You can interact dummy variables with continuous variables; typing

. xi: logistic outcome bp i.agegrp*weight i.race

estimates a model with indicator variables for all agegrp categories interacted with weight, plus the main-effect terms weight and i.agegrp.

You can get the interaction terms without the agegrp main effect (but with the weight main effect) by typing

. xi: logistic outcome bp i.agegrp|weight i.race

You can also include multiple interactions:

. xi: logistic outcome bp i.agegrp*weight i.agegrp*i.race
We will now back up and describe the construction of dummy variables in more detail.
Background
The terms continuous, categorical, and indicator or dummy variables are used below. Continuous variables measure something—such as height or weight—and at least conceptually can take on any real number over some range. Categorical variables, on the other hand, take on a finite number of values, each denoting membership in a subclass—for example, excellent, good, and poor—which might be coded 0, 1, 2, or 1, 2, 3, or even "Excellent", "Good", and "Poor". An indicator or dummy variable—the terms are used interchangeably—is a special type of two-valued categorical variable that contains values 0, denoting false, and 1, denoting true. The information contained in any k-valued categorical variable can be equally well represented by k indicator variables. Instead of a single variable recording values representing excellent, good, and poor, one can have three indicator variables, the first indicating the truth or falseness of "result is excellent", the second "result is good", and the third "result is poor".

xi provides a convenient way to convert categorical variables to dummy or indicator variables when estimating a model (say with regress, logistic, etc.).

For instance, assume the categorical variable agegrp contains 1 for ages 20-24, 2 for ages 25-39, and 3 for ages 40-44. (There is no one over 44 in our data.) As it stands, agegrp would be a poor candidate for inclusion in a model even if one thought age affected the outcome. The reason is that the coding would force the restriction that the effect of being in the second age group must be twice the effect of being in the first and, similarly, the effect of being in the third must be three times the first. That is, if one estimated the model
    y = β0 + β1 agegrp + Xβ2

the effect of being in the first age group would be β1; the second, 2β1; and the third, 3β1. If the coding 1, 2, and 3 is arbitrary, we could just as well have coded the age groups 1, 4, and 9, making the effects β1, 4β1, and 9β1.

The solution is to convert the categorical variable agegrp to a set of indicator variables a1, a2, and a3, where ai is 1 if the individual is a member of the ith age group and 0 otherwise. We can then estimate the model

    y = β0 + β11 a1 + β12 a2 + β13 a3 + Xβ2

The effect of being in age group 1 is now β11; 2, β12; and 3, β13; and these results are independent of our (arbitrary) coding. The only difficulty at this point is that the model is unidentified in the sense that there are an infinite number of (β0, β11, β12, β13) that fit the data equally well. To see this, pretend (β0, β11, β12, β13) = (1, 1, 3, 4). The predicted values of y for the various age groups are

        { 1 + 1 + Xβ2 = 2 + Xβ2   (age group 1)
    y = { 1 + 3 + Xβ2 = 4 + Xβ2   (age group 2)
        { 1 + 4 + Xβ2 = 5 + Xβ2   (age group 3)

Now pretend (β0, β11, β12, β13) = (2, 0, 2, 3). The predicted values of y are

        { 2 + 0 + Xβ2 = 2 + Xβ2   (age group 1)
    y = { 2 + 2 + Xβ2 = 4 + Xβ2   (age group 2)
        { 2 + 3 + Xβ2 = 5 + Xβ2   (age group 3)
These two sets of predictions are indistinguishable: for age group 1, y = 2 + Xβ2 regardless of which coefficient vector is used, and similarly for age groups 2 and 3. This arises because we have 3 equations and 4 unknowns. Any solution is as good as any other and, for our purposes, we merely need to choose one of them. The popular selection method is to set the coefficient on the first indicator variable to 0 (as we have done in our second coefficient vector). This is equivalent to estimating the model

    y = β0 + β12 a2 + β13 a3 + Xβ2

How one selects a particular coefficient vector (identifies the model) does not matter. It does, however, affect the interpretation of the coefficients. For instance, we could just as well choose to omit the second group. In our artificial example, this would yield (β0, β11, β12, β13) = (4, −2, 0, 1) instead of (2, 0, 2, 3). These coefficient vectors are the same in the sense that

        { 2 + 0 + Xβ2 = 2 + Xβ2 = 4 − 2 + Xβ2   (age group 1)
    y = { 2 + 2 + Xβ2 = 4 + Xβ2 = 4 + 0 + Xβ2   (age group 2)
        { 2 + 3 + Xβ2 = 5 + Xβ2 = 4 + 1 + Xβ2   (age group 3)

but what does it mean that β13 can just as well be 3 or 1? We obtain β13 = 3 when we set β11 = 0, so β13 = β13 − β11 and β13 measures the difference between age groups 3 and 1. In the second case, we obtain β13 = 1 when we set β12 = 0, so β13 = β13 − β12 and β13 measures the difference between age groups 3 and 2. There is no inconsistency. According to our β12 = 0 model, the difference between age groups 3 and 1 is β13 − β11 = 1 − (−2) = 3, exactly the same result we got in the β11 = 0 model.

The issue of interpretation, however, is important because it can affect the way one discusses results. Imagine you are studying recovery after a coronary bypass operation. Assume the age groups are (1) children under 13 (you have two of them), (2) young adults under 25 (you have a handful of them), (3) adults under 46 (of which you have more yet), (4) mature adults under 56, (5) older adults under 65, and (6) elder adults.
You follow the prescription of omitting the first group, so all your results are reported relative to children under 13. While there is nothing statistically wrong with this, readers will be suspicious when you make statements like "compared with young children, older and elder adults ...". Moreover, it is likely that you will have to end each statement with "although results are not statistically significant" because you have only two children in your comparison group. Of course, even with results reported in this way, you can make reasonable comparisons (say, with mature adults), but you will have to do extra work to perform the appropriate linear hypothesis test using Stata's test command. In this case, it would be better to force the omitted group to be a more reasonable one, such as mature adults.

There is, however, a generic rule for automatic comparison-group selection that, while less popular, tends to work better than the omit-the-first-group rule. That rule is to omit the most prevalent group. The most prevalent group is usually a reasonable baseline.

In any case, the prescription for categorical variables is

1. Convert each k-valued categorical variable to k indicator variables.

2. Drop one of the k indicator variables; any one will do, but dropping the first is popular, dropping the most prevalent is probably better in terms of having the computer guess at a reasonable interpretation, and dropping a specified one often eases interpretation the most.

3. Estimate the model on the remaining k - 1 indicator variables.

xi automates this procedure. We will now consider each of xi's features in detail.
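The three-step prescription above can be sketched in a few lines of Python. This is purely illustrative (xi itself is an ado-file and works differently internally); the helper name expand_indicators is our own invention:

```python
from collections import Counter

def expand_indicators(values, omit="first"):
    """Expand a k-valued categorical into k-1 indicator columns.

    omit="first" drops the smallest level (xi's default rule);
    omit="prevalent" drops the most common level. Returns the kept
    levels and one 0/1 row per observation. Hypothetical helper,
    not xi's actual implementation.
    """
    levels = sorted(set(values))
    if omit == "first":
        dropped = levels[0]
    else:  # "prevalent"
        dropped = Counter(values).most_common(1)[0][0]
    kept = [lv for lv in levels if lv != dropped]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows

agegrp = [1, 2, 2, 2, 3]
kept, rows = expand_indicators(agegrp)             # drops level 1
kept2, _ = expand_indicators(agegrp, "prevalent")  # drops level 2
```

Either way, k - 1 = 2 indicator columns remain; only which baseline the coefficients are measured against changes.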
Indicator variables for simple effects

When you type i.varname, xi internally tabulates varname (which may be a string or a numeric variable) and creates indicator (dummy) variables for each observed value, omitting the indicator for the smallest value. For instance, say agegrp takes on the values 1, 2, 3, and 4. Typing

. xi: logistic outcome i.agegrp

creates indicator variables named _Iagegrp_2, _Iagegrp_3, and _Iagegrp_4. (xi chooses the names and tries to make them readable; xi guarantees that the names are unique.) The expanded logistic model is

. logistic outcome _Iagegrp_2 _Iagegrp_3 _Iagegrp_4

Afterwards, you can drop the new variables xi leaves behind by typing 'drop _I*' (note the capitalization).

xi provides the following features when you type i.varname:

1. varname may be string or numeric.

2. Dummy variables are created automatically.

3. By default, the dummy-variable set is identified by dropping the dummy corresponding to the smallest value of the variable (how to specify otherwise is discussed below).

4. The new dummy variables are left in your dataset. By default, the names of the new dummy variables start with _I; therefore, you can drop them by typing 'drop _I*'. You do not have to drop them; each time you use xi, any previously created automatically generated dummies with the same prefix as the one specified in the prefix(string) option, or _I by default, are dropped and new ones created.

5. The new dummy variables have variable labels so that you can determine to what they correspond by typing 'describe'.

6. xi may be used with any Stata command (not just logistic).
Controlling the omitted dummy

By default, i.varname omits the dummy corresponding to the smallest value of varname; in the case of a string variable, this is interpreted as dropping the first in an alphabetical, case-sensitive sort. xi provides two alternatives to dropping the first: xi will drop the dummy corresponding to the most prevalent value of varname, or xi will let you choose the particular dummy to be dropped.

To change xi's behavior to dropping the most prevalent, type

. char _dta[omit] prevalent

although whether you type "prevalent" or "yes" or anything else does not matter. Setting this characteristic affects the expansion of all categorical variables in the dataset. If you resave your dataset, the prevalent preference will be remembered. If you want to change the behavior back to the default drop-the-first rule, type

. char _dta[omit]

to clear the characteristic.

Once you set _dta[omit], i.varname omits the dummy corresponding to the most prevalent value of varname. Thus, the coefficients on the dummies have the interpretation of change from the most prevalent group. For example,
. char _dta[omit] prevalent
. xi: regress y i.agegrp

might create _Iagegrp_1 through _Iagegrp_4, resulting in _Iagegrp_2 being omitted if agegrp = 2 is most common (as opposed to the default dropping of _Iagegrp_1). The model is then

y = b0 + b1 _Iagegrp_1 + b3 _Iagegrp_3 + b4 _Iagegrp_4 + u

Then,

Predicted y for agegrp 1 = b0 + b1
Predicted y for agegrp 2 = b0
Predicted y for agegrp 3 = b0 + b3
Predicted y for agegrp 4 = b0 + b4
Thus, the model's reported t or Z statistics are for a test of whether each group is different from the most prevalent group.

Perhaps you wish to omit the dummy for agegrp 3 instead. You do this by setting the variable's omit characteristic:

. char agegrp[omit] 3

This overrides _dta[omit] if you have set it. Now when you type

. xi: regress y i.agegrp

_Iagegrp_3 will be omitted, and you will estimate the model

y = b'0 + b'1 _Iagegrp_1 + b'2 _Iagegrp_2 + b'4 _Iagegrp_4 + u

Later, if you want to return to the default omission, type

. char agegrp[omit]

to clear the characteristic.

In summary, i.varname omits the first group by default, but if you define

. char _dta[omit] prevalent

then the default behavior changes to dropping the most prevalent group. Either way, if you define a characteristic of the form

. char varname[omit] #
or, if varname is a string,

. char varname[omit] string-literal

then the specified value will be omitted. Examples:

. char agegrp[omit] 1
. char race[omit] White        (for race a string variable)
. char agegrp[omit]            (to restore the default for agegrp)
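That the choice of omitted group changes the coefficients but not the predictions can be checked numerically. The following Python sketch (using ordinary least squares in place of Stata's regress; the fit helper is hypothetical) reproduces the text's artificial example, in which the three group means are 2, 4, and 5:

```python
import numpy as np

# Artificial data: age groups 1, 2, 3 with group means 2, 4, and 5,
# matching the example in the text (no X covariates).
group = np.array([1, 1, 2, 2, 3, 3])
y = np.array([2.0, 2.0, 4.0, 4.0, 5.0, 5.0])

def fit(omit):
    """Regress y on a constant plus indicators for all groups but `omit`."""
    kept = [g for g in (1, 2, 3) if g != omit]
    X = np.column_stack([np.ones_like(y)] +
                        [(group == g).astype(float) for g in kept])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, X @ b

b1, yhat1 = fit(omit=1)   # coefficients are differences from group 1
b2, yhat2 = fit(omit=2)   # coefficients are differences from group 2
```

b1 works out to (2, 2, 3) and b2 to (4, -2, 1), the two coefficient vectors discussed above, while the fitted values yhat1 and yhat2 are identical.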
Categorical variable interactions

i.varname1*i.varname2 creates the dummy variables associated with the interaction of the categorical variables varname1 and varname2. The identification rules—which categories are omitted—are the same as for i.varname. For instance, assume agegrp takes on four values and race takes on three values. Typing

. xi: regress y i.agegrp*i.race

results in the model

y = a + b2 _Iagegrp_2 + b3 _Iagegrp_3 + b4 _Iagegrp_4        (agegrp dummies)
      + c2 _Irace_2 + c3 _Irace_3                            (race dummies)
      + d22 _IageXrac_2_2 + d23 _IageXrac_2_3
      + d32 _IageXrac_3_2 + d33 _IageXrac_3_3                (agegrp*race dummies)
      + d42 _IageXrac_4_2 + d43 _IageXrac_4_3
      + u

That is, typing

. xi: regress y i.agegrp*i.race

is the same as typing

. xi: regress y i.agegrp i.race i.agegrp*i.race

While there are lots of other ways the interaction could have been parameterized, this method has the advantage that one can test the joint significance of the interactions by typing

. testparm _IageXrac*
When you perform the estimation step, whether you specify i.agegrp*i.race or i.race*i.agegrp makes no difference (other than in the names given to the interaction terms; in the first case, the names will begin with _IageXrac; in the second, _IracXage). Thus,

. xi: regress y i.race*i.agegrp

estimates the same model.

You may also include multiple interactions simultaneously:

. xi: regress y i.agegrp*i.race i.agegrp*i.sex

The model estimated is

y = a + b2 _Iagegrp_2 + b3 _Iagegrp_3 + b4 _Iagegrp_4        (agegrp dummies)
      + c2 _Irace_2 + c3 _Irace_3                            (race dummies)
      + d22 _IageXrac_2_2 + d23 _IageXrac_2_3
      + d32 _IageXrac_3_2 + d33 _IageXrac_3_3                (agegrp*race dummies)
      + d42 _IageXrac_4_2 + d43 _IageXrac_4_3
      + e2 _Isex_2                                           (sex dummy)
      + f22 _IageXsex_2_2 + f32 _IageXsex_3_2
      + f42 _IageXsex_4_2                                    (agegrp*sex dummies)
      + u

Note that the agegrp dummies are (correctly) included only once.
Interactions with continuous variables

i.varname1*varname2 (as distinguished from i.varname1*i.varname2; note the second i.) specifies an interaction of a categorical variable with a continuous variable. For instance,

. xi: regress y i.agegrp*wgt

results in the model

y = a + b2 _Iagegrp_2 + b3 _Iagegrp_3 + b4 _Iagegrp_4        (agegrp dummies)
      + c wgt                                                (continuous wgt effect)
      + d2 _IageXwgt_2 + d3 _IageXwgt_3 + d4 _IageXwgt_4     (agegrp*wgt interactions)
      + u

A variation on this notation, using | rather than *, omits the agegrp dummies. Typing

. xi: regress y i.agegrp|wgt

estimates the model

y = a' + c' wgt                                              (continuous wgt effect)
       + d'2 _IageXwgt_2 + d'3 _IageXwgt_3 + d'4 _IageXwgt_4 (agegrp*wgt interactions)
       + u
The predicted values of y are

    agegrp*wgt model                    agegrp|wgt model
    y = a + c wgt                       y = a' + c' wgt               if agegrp = 1
    y = a + c wgt + b2 + d2 wgt         y = a' + c' wgt + d'2 wgt     if agegrp = 2
    y = a + c wgt + b3 + d3 wgt         y = a' + c' wgt + d'3 wgt     if agegrp = 3
    y = a + c wgt + b4 + d4 wgt         y = a' + c' wgt + d'4 wgt     if agegrp = 4
That is, typing

. xi: regress y i.agegrp*wgt

is equivalent to typing

. xi: regress y i.agegrp i.agegrp|wgt
Also note that in either case, it is not necessary to specify separately the continuous variable wgt; it is included automatically.
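The difference between the * and | forms is only whether the agegrp intercept dummies are kept; a Python sketch of the regressors generated in each case (illustrative only, not xi's actual code, though the names follow xi's pattern):

```python
def expand_cont(agegrp, wgt, keep_main=True):
    """Regressors corresponding to i.agegrp*wgt (keep_main=True) or
    i.agegrp|wgt (keep_main=False), omitting group 1. Hypothetical
    helper; only the naming follows xi."""
    levels = sorted(set(agegrp))[1:]          # omit the first group
    cols = {}
    if keep_main:                             # intercept dummies (the * form)
        for l in levels:
            cols["_Iagegrp_%d" % l] = [1.0 if g == l else 0.0
                                       for g in agegrp]
    cols["wgt"] = list(wgt)                   # included automatically
    for l in levels:                          # slope interactions
        cols["_IageXwgt_%d" % l] = [w if g == l else 0.0
                                    for g, w in zip(agegrp, wgt)]
    return cols

star = expand_cont([1, 2, 3], [2.1, 3.0, 2.5])         # i.agegrp*wgt
bar = expand_cont([1, 2, 3], [2.1, 3.0, 2.5], False)   # i.agegrp|wgt
```

Both forms include wgt itself and the slope interactions; only the * form adds the _Iagegrp_# intercept shifts.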
Using xi: Interpreting output

. xi: regress mpg i.rep78
i.rep78           _Irep78_1-5         (naturally coded; _Irep78_1 omitted)
(output from regress appears)

Interpretation: i.rep78 expanded to the dummies _Irep78_1, _Irep78_2, ..., _Irep78_5. The numbers on the end are "natural" in the sense that _Irep78_1 corresponds to rep78 = 1, _Irep78_2 to rep78 = 2, and so on. Finally, the dummy for rep78 = 1 was omitted.

. xi: regress mpg i.make
i.make            _Imake_1-74         (_Imake_1 for make==AMC Concord omitted)
(output from regress appears)
Interpretation: i.make expanded to _Imake_1, _Imake_2, ..., _Imake_74. The coding is not natural because make is a string variable. _Imake_1 corresponds to one make, _Imake_2 to another, and so on. We can find out the coding by typing describe. _Imake_1 for the AMC Concord was omitted.
How xi names variables

By default, the names xi assigns to the dummy variables it creates are of the form

    _Istub_groupid

You may subsequently refer to the entire set of variables by typing '_Istub*'. For example,

    name            =  _I + stub    + _ + groupid     Entire set
    _Iagegrp_1      =  _I + agegrp  + _ + 1           _Iagegrp*
    _Iagegrp_2      =  _I + agegrp  + _ + 2           _Iagegrp*
    _IageXwgt_1     =  _I + ageXwgt + _ + 1           _IageXwgt*
    _IageXrac_1_2   =  _I + ageXrac + _ + 1_2         _IageXrac*
    _IageXrac_2_1   =  _I + ageXrac + _ + 2_1         _IageXrac*

If you specify a prefix in the prefix(string) option, say _S, xi will instead name the variables starting with that prefix:

    _Sstub_groupid
xi as a command rather than a command prefix

xi can be used as a command prefix or as a command by itself. In the latter form, xi merely creates the indicator and interaction variables. Typing

. xi i.agegrp*wgt
i.agegrp          _Iagegrp_1-4        (naturally coded; _Iagegrp_1 omitted)
i.agegrp*wgt      _IageXwgt_1-4       (coded as above)
. regress y _Iagegrp* _IageXwgt*
(output from regress appears)

is equivalent to typing

. xi: regress y i.agegrp*wgt
i.agegrp          _Iagegrp_1-4        (naturally coded; _Iagegrp_1 omitted)
i.agegrp*wgt      _IageXwgt_1-4       (coded as above)
(output from regress appears)
Warnings

1. xi creates new variables in your dataset; most are bytes, but interactions with continuous variables will have the storage type of the underlying continuous variable. You may get the message "insufficient memory". If so, you will need to increase the amount of memory allocated to Stata's data areas; see [U] 7 Setting the size of memory.
2. When using xi with an estimation command, you may get the message "matsize too small". If so, see [R] matsize.
Methods and Formulas

xi is implemented as an ado-file.
References

Hendrickx, J. 1999. dm73: Using categorical variables in Stata. Stata Technical Bulletin 52: 2-8. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 51-59.
——. 2000. dm73.1: Contrasts for categorical variables: update. Stata Technical Bulletin 54: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 60-61.
Also See

Background:    [U] 23 Estimation and post-estimation commands
Title    xpose — Interchange observations and variables
Syntax

    xpose , clear [ varname ]
Description

xpose transposes the data, changing variables into observations and observations into variables. All new variables—that is, after transposition—are made the default storage type. Thus, any original variables that were strings will result in observations containing missing values. (Transposing the data twice, therefore, will result in a loss of the contents of string variables.)
Options

clear is not optional. This is supposed to remind you that the untransposed data will be lost (unless you have saved them previously).

varname adds the new variable _varname to the transposed data, containing the original variable names. In addition, with or without the varname option, if the variable _varname exists in the dataset before transposition, those names will be used to name the variables after transposition. Thus, transposing the data twice will (almost) yield the original dataset.
Remarks

> Example

You have a dataset on something by county and year that contains

. list

            county      year1      year2      year3
  1.             1       57.2       11.3       19.5
  2.             2       12.5        8.2       28.9
  3.             3         18       14.2       33.2

Each observation reflects a county. To change this dataset so that each observation reflects a year:

. xpose, clear varname
. list

                v1         v2         v3   _varname
  1.             1          2          3     county
  2.          57.2       12.5         18      year1
  3.          11.3        8.2       14.2      year2
  4.          19.5       28.9       33.2      year3
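The transposition just shown, including the _varname bookkeeping, can be sketched in Python (purely illustrative; xpose itself is an ado-file):

```python
def xpose(data, varname=False):
    """Transpose a dataset held as {variable_name: column_of_values}.

    New variables are named v1, v2, ...; with varname=True a _varname
    column recording the original variable names is added. If _varname
    already exists, it supplies the new variable names, so transposing
    twice (almost) restores the original. Hypothetical sketch only.
    """
    names = data.get("_varname")
    cols = [(k, v) for k, v in data.items() if k != "_varname"]
    nobs = len(cols[0][1])
    if names is None:
        names = ["v%d" % (i + 1) for i in range(nobs)]
    out = {names[i]: [col[i] for _, col in cols] for i in range(nobs)}
    if varname:
        out["_varname"] = [k for k, _ in cols]
    return out

d = {"county": [1, 2, 3], "year1": [57.2, 12.5, 18],
     "year2": [11.3, 8.2, 14.2], "year3": [19.5, 28.9, 33.2]}
t = xpose(d, varname=True)   # first new variable holds county's column
back = xpose(t)              # _varname restores the original names
```

Note how the round trip recovers the original column names but not, in general, string contents or storage types, matching the caveat in the Description.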
It would now be necessary to drop the first observation (corresponding to the previous county variable) to make each observation correspond to one year. Had we not specified the varname option, the variable _varname would not have been created. The _varname variable is useful, however, if we want to transpose the dataset back to its original form.

. xpose, clear
. list

            county      year1      year2      year3
  1.             1       57.2       11.3       19.5
  2.             2       12.5        8.2       28.9
  3.             3         18       14.2       33.2

Methods and Formulas

xpose is implemented as an ado-file.

Also See

Related:    [R] reshape, [R] stack
Title    xt — Cross-sectional time-series analysis
Syntax

    xtcmd ... [, i(varname_i) t(varname_t) ... ]

    iis [varname_i] [, clear ]

    tis [varname_t] [, clear ]
Description

The xt series of commands provide tools for analyzing cross-sectional time-series datasets. Cross-sectional time-series (longitudinal) datasets are of the form x_it, where x_it is a vector of observations for unit i and time t. The particular commands (such as xtdes, xtsum, and xtreg) are documented in the [R] xt entries that follow this entry. This entry deals with concepts common across commands.

iis is related to the i() option on the other xt commands. Command iis or option i() sets the name of the variable corresponding to the unit index i. iis without an argument displays the current name of the unit variable.

tis is related to the t() option on the other xt commands. Command tis or option t() sets the name of the variable corresponding to the time index t. tis without an argument displays the current name of the time variable.

Some xt commands use time-series operators in their internal calculations and thus require that your data be tsset; see [R] tsset for details on this process. For instance, since xtabond uses time-series operators in its internal calculations, you must tsset your data before using it; the [R] xtabond manual entry explicitly says so. For these commands, iis and tis are neither sufficient nor recommended.

If your interest is general time-series analysis, see [U] 29.12 Models with time-series data.
Options

i(varname_i) specifies the variable name corresponding to index i in x_it. This must be a single, numeric variable, although whether it takes on the values 1, 2, 3 or 1, 7, 9, or even -2, √2, π is irrelevant. (If the identifying variable is a string, use egen's group() function to make a numeric variable from it; see [R] egen.) For instance, if the cross-sectional time-series data are of persons in the years 1991-1994, each observation is a person in one of the years; there are four observations per person (assuming no missing data). varname_i is the name of the variable that uniquely identifies the persons. All xt commands require that i() be specified, but after i() has been specified once with any of the xt commands, it need not be specified again except to change the variable's identity. The identity can also be set and examined using the iis command.

t(varname_t) specifies the variable name corresponding to index t in x_it. This must be a single, numeric variable, although whether it takes on the values 1, 2, 3 or 1, 7, 9, or even -2, √2, π is irrelevant.
For instance, if the cross-sectional time-series data are of persons in the years 1991-1994, each observation is a person in one of the years; there are four observations per person (assuming no missing data). varname_t is the name of the variable recording the year. Not all xt commands require that t() be specified. If the t() option is included in the syntax diagram for a command, it must be specified, but after t() has been specified once with any of the xt commands, it need not be specified again except to change the variable's identity. The identity can also be set and examined using the tis command.

clear removes the definition of i() or t(). For instance, typing tis, clear makes Stata forget the identity of the t() variable.
Remarks

Consider having data on n units—individuals, firms, countries, or whatever—over T time periods. The data might be income and other characteristics of n persons surveyed in each of T years, or the output and costs of n firms collected over T months, or the health and behavioral characteristics of n patients collected over T years. Such cross-sectional time-series datasets are sometimes called longitudinal datasets or panels, and we write x_it for the value of x for unit i at time t. The xt commands assume that such datasets are stored as a sequence of observations on (i, t, x).
> Example

If we had data on pulmonary function (measured by forced expiratory volume, or FEV) along with smoking behavior, age, sex, and height, a piece of the data might be

. list in 1/6

         pid   yr_visit    fev   age   sex   height   smokes
  1.    1071       1991   1.21    25     1       69        0
  2.    1071       1992   1.52    26     1       69        0
  3.    1071       1993   1.32    27     1       69        0
  4.    1072       1991   1.33    18     1       71        1
  5.    1072       1992   1.18    20     1       71        1
  6.    1072       1993   1.19    21     1       71        0

The other xt commands need to know the identities of the variables identifying patient and time. With these data, you would type

. iis pid
. tis yr_visit
Having made this declaration, you need not specify the i() and t() options on the other xt commands. If you resaved the data, you need not respecify iis and tis in future sessions.
Technical Note

Cross-sectional time-series data stored as shown above are said to be in the long form. Perhaps your data are in the wide form, with one observation per unit and multiple variables for the value in each year. For instance, a piece of the pulmonary function data might be

     pid   sex   fev91   fev92   fev93   age91   age92   age93
    1071     1    1.21    1.52    1.32      25      26      28
    1072     1    1.33    1.18    1.19      18      20      21

Data in this form can be converted to the long form by reshape; see [R] reshape.
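The wide-to-long conversion that reshape performs can be sketched as follows (illustrative Python only; Stata's reshape is far more general than this hypothetical helper):

```python
def wide_to_long(rows, id_var, stubs):
    """Convert wide-form records (fev91, fev92, ...) into long form,
    one record per unit-year. Illustrative sketch; assumes every stub
    appears with the same year suffixes within a record."""
    out = []
    for row in rows:
        # recover the year suffixes from the first stub's variable names
        suffixes = sorted(k[len(stubs[0]):] for k in row
                          if k.startswith(stubs[0]))
        for s in suffixes:
            rec = {id_var: row[id_var], "year": int(s)}
            for stub in stubs:
                rec[stub] = row[stub + s]
            out.append(rec)
    return out

wide = [{"pid": 1071, "fev91": 1.21, "fev92": 1.52,
         "age91": 25, "age92": 26}]
long_form = wide_to_long(wide, "pid", ["fev", "age"])
```

Each wide record expands into one long record per observed year, which is the (i, t, x) layout the xt commands expect.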
> Example

Data for some of the time periods might be missing. That is, we have cross-sectional time-series data on i = 1, ..., n and t = 1, ..., T, but only T_i of those observations are defined. With such missing periods—called unbalanced data—a piece of our pulmonary function data might be

. list in 1/6

         pid   yr_visit    fev   age   sex   height   smokes
  1.    1071       1991   1.21    25     1       69        0
  2.    1071       1992   1.52    26     1       69        0
  3.    1071       1993   1.32    28     1       68        0
  4.    1072       1991   1.33    18     1       71        1
  5.    1072       1993   1.19    21     1       71        0
  6.    1073       1991   1.47    24     1       64        0

Note that patient id 1072 is not observed in 1992. The xt commands are robust to this problem.
Technical Note

In many of the [R] xt entries, we will use data from a subsample of the NLSY data (Center for Human Resource Research 1989) on young women aged 14-26 in 1968. Women were surveyed in each of the 21 years 1968 through 1988, except for the six years 1974, 1976, 1979, 1981, 1984, and 1986. We use two different subsets: nlswork.dta and union.dta.

For nlswork.dta, our subsample is of 4,711 women in years when employed, not enrolled in school and evidently having completed their education, and with wages in excess of $1/hour but less than $700/hour.
. use nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. describe

Contains data from nlswork.dta
  obs:        28,534               National Longitudinal Survey.  Young
                                    Women 14-26 years of age in 1968
 vars:            21               1 Aug 2000 09:48
 size:     1,027,224 (1.7% of memory free)

              storage  display    value
variable name   type    format    label    variable label
idcode          int     %8.0g              NLS id
year            byte    %8.0g              interview year
birth_yr        byte    %8.0g              birth year
age             byte    %8.0g              age in current year
race            byte    %8.0g              1=white, 2=black, 3=other
msp             byte    %8.0g              1 if married, spouse present
nev_mar         byte    %8.0g              1 if never yet married
grade           byte    %8.0g              current grade completed
collgrad        byte    %8.0g              1 if college graduate
not_smsa        byte    %8.0g              1 if not SMSA
c_city          byte    %8.0g              1 if central city
south           byte    %8.0g              1 if south
ind_code        byte    %8.0g              industry of employment
occ_code        byte    %8.0g              occupation
union           byte    %8.0g              1 if union
wks_ue          byte    %8.0g              weeks unemployed last year
ttl_exp         float   %9.0g              total work experience
tenure          float   %9.0g              job tenure, in years
hours           int     %8.0g              usual hours worked
wks_work        byte    %8.0g              weeks worked last year
ln_wage         float   %9.0g              ln(wage/GNP deflator)

Sorted by:  idcode  year

. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
      idcode |   28534    2601.284    1487.359          1       5159
        year |   28534    77.95865    6.383879         68         88
    birth_yr |   28534    48.08509    3.012837         41         54
         age |   28510    29.04511    6.700584         14         46
        race |   28534    1.303392    .4822773          1          3
         msp |   28518    .6029175    .4893019          0          1
     nev_mar |   28518    .2296795    .4206341          0          1
       grade |   28532    12.53259    2.323905          0         18
    collgrad |   28534    .1680451    .3739129          0          1
    not_smsa |   28526    .2824441    .4501961          0          1
      c_city |   28526     .357218    .4791882          0          1
       south |   28526    .4095562    .4917605          0          1
    ind_code |   28193    7.692973    2.994025          1         12
    occ_code |   28413    4.777672    3.065435          1         13
       union |   19238    .2344319    .4236542          0          1
      wks_ue |   22830    2.548095    7.294463          0         76
     ttl_exp |   28534    6.215316    4.652117          0   28.88461
      tenure |   28101    3.123836    3.751409          0   25.91667
       hours |   28467    36.55956    9.869623          1        168
    wks_work |   27831    53.98944     29.0325          0        104
     ln_wage |   28534    1.674907    .4780935          0   5.263916
For union.dta, our subset was sampled only from those with union membership information from 1970 to 1988. Our subsample is of 4,434 women. The important variables are age (16-46), grade (years of schooling completed, ranging from 0 to 18), not_smsa (28% of the person-time was spent living outside an SMSA—standard metropolitan statistical area), south (41% of the person-time was in the South), and southXt (south interacted with year, treating 1970 as year 0). You also have the variable union. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union.

. describe

Contains data from union.dta
  obs:        26,200               NLS Women 14-24 in 1968
 vars:            10               1 Aug 2000 09:59
 size:       393,000 (62.3% of memory free)

              storage  display    value
variable name   type    format    label    variable label
idcode          int     %8.0g              NLS id
year            byte    %8.0g              interview year
age             byte    %8.0g              age in current year
grade           byte    %8.0g              current grade completed
not_smsa        byte    %8.0g              1 if not SMSA
south           byte    %8.0g              1 if south
union           byte    %8.0g              1 if union
t0              byte    %8.0g
southXt         byte    %9.0g
black           byte    %8.0g     race     black

Sorted by:

. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
      idcode |   26200    2611.582    1484.994          1       5159
        year |   26200    79.47137    5.965499         70         88
         age |   26200    30.43221    6.489056         16         46
       grade |   26200    12.76145    2.411715          0         18
    not_smsa |   26200    .2837023    .4508027          0          1
       south |   26200    .4130153    .4923849          0          1
       union |   26200    .2217939    .4154611          0          1
          t0 |   26200    9.471374    5.965499          0         18
     southXt |   26200     3.96874    6.057208          0         18
       black |   26200     .274542    .4462917          0          1

With both datasets, we have typed

. iis idcode
. tis year
Technical Note

The tis and iis commands, as well as other xt commands that set the t and i index for xt data, do so by declaring them as characteristics of the data; see [P] char. In particular, tis sets the characteristic _dta[tis] to the name of the t index variable, and iis sets the characteristic _dta[iis] to the name of the i index variable.
Technical Note

Throughout the xt entries, when random-effects models are estimated, a likelihood-ratio test that the variance of the random effects is zero is included. These tests occur on the boundary of the parameter space, invalidating the usual theory associated with such tests. However, these likelihood-ratio tests have been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual χ²(1), but rather a 50:50 mixture of a χ²(0) (point mass at zero) and a χ²(1), denoted as χ̄²(01). See Gutierrez et al. (2001) for a full discussion.
References

Center for Human Resource Research. 1989. National Longitudinal Survey of Labor Market Experience, Young Women 14-26 years of age in 1968. Ohio State University.
Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. On boundary-value likelihood ratio tests. Stata Technical Bulletin, forthcoming.
Also See

Complementary:
[R] xtabond, [R] xtclog, [R] xtdata, [R] xtdes, [R] xtgee, [R] xtgls, [R] xtintreg, [R] xtivreg, [R] xtlogit, [R] xtnbreg, [R] xtpcse, [R] xtpois, [R] xtprobit, [R] xtrchh, [R] xtreg, [R] xtregar, [R] xtsum, [R] xttab, [R] xttobit
Title    xtabond — Arellano-Bond linear, dynamic panel data estimator
Syntax

xtabond depvar [varlist] [if exp] [in range] [, lags(#) maxldep(#) maxlags(#)
    diffvars(varlist) inst(varlist) pre(varlist[, lagstruct(#,#)])
    [pre(varlist[, lagstruct(#,#)]) ...] artests(#) robust twostep
    noconstant small level(#) ]

You must tsset your data before using xtabond; see [R] tsset.

by ... : may be used with xtabond; see [R] by.

All varlists may contain time-series operators; see [U] 14.4.3 Time-series varlists. However, the specification of depvar may not contain time-series operators.

xtabond shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

    xb    a + bx_it, fitted values (the default)
    e     e_it, the residuals
Description

xtabond estimates a dynamic panel-data model via the Arellano-Bond estimator. Consider the model

    y_it = Σ_{j=1}^{p} α_j y_{i,t-j} + x_it β1 + w_it β2 + ν_i + ε_it        (1)

where

    the α_j are p parameters to be estimated,
    x_it is a 1 × k1 vector of strictly exogenous covariates,
    β1 is a k1 × 1 vector of parameters to be estimated,
    w_it is a 1 × k2 vector of predetermined covariates,
    β2 is a k2 × 1 vector of parameters to be estimated,
    the ν_i are the random effects that are independent and identically distributed (iid) over the individuals with variance σ²_ν, and
    the ε_it are iid over the whole sample with variance σ²_ε.
It is also assumed that the ν_i and the ε_it are independent for each i over all t. First differencing equation (1) removes the ν_i and produces an equation that is estimable by instrumental variables. Arellano and Bond (1991) derived a Generalized Method of Moments estimator for α_j, j ∈ {1, ..., p}, β1, and β2 using lagged levels of the dependent variable and the predetermined variables and differences of the strictly exogenous variables. xtabond implements this estimator, known as the Arellano-Bond dynamic panel data estimator. This methodology assumes that there is no second-order autocorrelation in the first-differenced idiosyncratic errors. xtabond includes the test for autocorrelation and the Sargan test of over-identifying restrictions for this model derived by Arellano and Bond (1991).
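That first differencing removes ν_i can be checked numerically: the effect enters every period identically, so it cancels from y_it - y_{i,t-1}. A small Python sketch (an illustration only, not part of xtabond):

```python
import random

random.seed(1)

nu = 10.0                       # individual effect nu_i (large)
y = [0.0]                       # y_{i0}
for t in range(1, 6):           # y_it = 0.5*y_{i,t-1} + nu_i + eps_it
    y.append(0.5 * y[-1] + nu + random.gauss(0, 0.1))

# First differences: Delta y_it = 0.5*Delta y_{i,t-1} + Delta eps_it,
# so nu_i has dropped out entirely.
dy = [y[t] - y[t - 1] for t in range(1, 6)]
resid = [dy[j] - 0.5 * dy[j - 1] for j in range(1, 5)]

assert all(abs(r) < 1 for r in resid)   # only the small eps shocks remain
```

The levels of y are dominated by the large ν_i, yet the differenced relation leaves residuals on the scale of the ε shocks alone; the remaining problem is that Δy_{i,t-1} is correlated with Δε_it, which is why instruments are needed.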
Options

lags(#) sets p, the number of lags of the dependent variable to be included in the model. The default is p = 1.

maxldep(#) sets the maximum number of lags of the dependent variable that can be used as instruments. The default is to use all T_i - p - 2 lags.

maxlags(#) sets the maximum number of lags of the dependent variable and all the predetermined variables that can be used as instruments. The default is to use all T_i - p - 2 lags.

diffvars(varlist) specifies a set of variables that have already been differenced to be included as strictly exogenous covariates.

inst(varlist) specifies a set of variables to be used as additional instruments. These instruments are not differenced by xtabond before being included in the instrument matrix.

pre(varlist[, lagstruct(prelags, premaxlags)]) specifies that a set of predetermined variables is to be included in the model. Optionally, one may specify that prelags lags of the specified variables are also included. The default for prelags is 0. Specifying premaxlags sets the maximum number of further lags of the predetermined variables to be used as instruments. The default is to include T_i - prelags - 2 lagged levels as instruments. The user may specify as many sets of predetermined variables as is feasible within the standard Stata limits on matrix size. Each set of predetermined variables may have its own number of prelags and premaxlags.

artests(#) specifies the maximum order of the autocorrelation test to be calculated and reported. The maximum order must be less than or equal to p + 1. The default is 2.

robust specifies that the robust estimator of the variance-covariance matrix of the parameter estimates is to be calculated and reported. This option is not available with two-step estimates.

twostep specifies that the two-step estimator is to be calculated and reported.

noconstant suppresses the constant term (intercept) in the regression.
small specifies that t statistics should be reported instead of z statistics, and F statistics instead of chi-squared statistics.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict

xb, the default, calculates the linear prediction from the first-differenced equation.

e calculates the residual error of the differenced dependent variable from the linear prediction.
Remarks

Anderson and Hsiao (1981, 1982) proposed using further lags of the level or the difference of the dependent variable to instrument the lagged dependent variables included in a dynamic panel-data model after the random effects have been removed by first differencing. A version of this estimator can be obtained from xtivreg; see [R] xtivreg for an example. Arellano and Bond (1991) built on this idea by noting that, in general, there are many more instruments available. Using the GMM framework developed by Hansen (1982), they identified how many lags of the dependent variable and the predetermined variables were valid instruments, and how to combine these lagged levels with first differences of the strictly exogenous variables into a potentially very large instrument matrix. Using this instrument matrix, Arellano and Bond (1991) derived the corresponding one-step and two-step GMM estimators. They also found the robust VCE estimator for the one-step model. In addition, they derived a test of autocorrelation of order m and the Sargan test of over-identifying restrictions for this estimator.
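For p = 1, the period-t first-differenced equation can use the levels y_{i1}, ..., y_{i,t-2} as instruments, since they are uncorrelated with Δε_it when the ε_it are serially uncorrelated; this is why the instrument count grows so quickly with T. A Python sketch of the counting (illustrative only, not xtabond's implementation):

```python
def ab_instruments(T):
    """Lagged levels available to instrument the first-differenced
    equation at each period t (p = 1, balanced panel, 1-based time).
    y_{i,1}, ..., y_{i,t-2} are valid at period t. Illustrative helper."""
    return {t: ["y%d" % s for s in range(1, t - 1)]
            for t in range(3, T + 1)}

inst = ab_instruments(T=5)
# period 3 gets ['y1'], period 4 gets ['y1', 'y2'],
# period 5 gets ['y1', 'y2', 'y3'] -- (T-1)(T-2)/2 moment conditions in all
```

This is the sense in which "there are many more instruments available" than the single lag used by Anderson and Hsiao; options such as maxldep() and maxlags() cap this set when it grows too large.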
> Example

In their article, Arellano and Bond (1991) applied their new estimators and test statistics to a model of dynamic labor demand that had previously been considered by Layard and Nickell (1986), using data from an unbalanced panel of firms from the United Kingdom. As is conventional, all variables are indexed over the firm i and time t. In this dataset, n_it is the log of employment in firm i inside the UK at time t, w_it is the natural log of the real product wage, k_it is the natural log of the gross capital stock, and ys_it is the natural log of industry output. The model also includes time dummies yr1980, yr1981, yr1982, yr1983, and yr1984. In Table 4 of Arellano and Bond (1991), the authors present the results they obtained from several specifications.

In column (a1) of Table 4, Arellano and Bond report the coefficients and their standard errors from the robust one-step estimation of a dynamic model of labor demand. In order to clarify some important issues, we will begin with the homoskedastic one-step version of this model and then consider the robust case. Here is the command using xtabond and the subsequent output for the homoskedastic case:
xtabond — Arellano-Bond linear, dynamic panel data estimator
. xtabond n l(0/1).w l(0/2).(k ys) yr1980-yr1984, lags(2)

Arellano-Bond dynamic panel data         Number of obs      =       611
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(15)      =    575.84
Time variable (t): year                  min number of obs  =         4
                                         max number of obs  =         6
                                         mean number of obs =  4.364286

One-step results
------------------------------------------------------------------------------
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .6862262   .1486163     4.62   0.000     .3949435    .9775088
         L2D |  -.0853582   .0444365    -1.92   0.055    -.1724523    .0017358
w            |
          D1 |  -.6078208   .0657694    -9.24   0.000    -.7367265   -.4789151
          LD |   .3926237   .1092374     3.59   0.000     .1785222    .6067251
k            |
          D1 |   .3568456   .0370314     9.64   0.000     .2842653    .4294259
          LD |  -.0580012   .0583051    -0.99   0.320     -.172277    .0562747
         L2D |  -.0199475   .0416274    -0.48   0.632    -.1015357    .0616408
ys           |
          D1 |   .6085073   .1345412     4.52   0.000     .3448115    .8722031
          LD |  -.7111651   .1844599    -3.86   0.000      -1.0727   -.3496304
         L2D |   .1057969   .1428568     0.74   0.459    -.1741974    .3857912
yr1980       |
          D1 |   .0029062   .0212705     0.14   0.891    -.0387832    .0445957
yr1981       |
          D1 |  -.0404378   .0354707    -1.14   0.254    -.1099591    .0290836
yr1982       |
          D1 |  -.0652767    .048209    -1.35   0.176    -.1597646    .0292111
yr1983       |
          D1 |  -.0690928   .0627354    -1.10   0.271    -.1920521    .0538664
yr1984       |
          D1 |  -.0650302   .0781322    -0.83   0.405    -.2181665    .0881061
       _cons |   .0095545   .0142073     0.67   0.501    -.0182912    .0374002
------------------------------------------------------------------------------
Sargan test of over-identifying restrictions:
        chi2(25) =    65.82     Prob > chi2 = 0.0000
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -3.94   Pr > z = 0.0001
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  -0.54   Pr > z = 0.5876
The coefficients are identical to those reported in column (al) of Table 4, as they should be. Of course, the standard errors are different because we are considering the homoskedastic case. Only in the case of a homoskedastic error term does the Sargan test have an asymptotic chi-squared distribution. In fact, Arellano and Bond (1991) found evidence that the one-step Sargan test over-rejects in the presence of heteroskedasticity. Since its asymptotic distribution is not known under the assumptions of the robust model, xtabond will not compute it when robust is specified. The Sargan test, reported by Arellano and Bond (1991) in column (al) of Table 4, comes from the one-step homoskedastic estimator and is the same as the one reported here. By default, xtabond calculates and reports the Arellano-Bond test for first and second-order autocorrelation in the first-differenced residuals. There are versions of this test for both the homoskedastic and the robust cases, although their values are different.
> Example

Now consider the output from the one-step robust estimator of the same model:

. xtabond n l(0/1).w l(0/2).(k ys) yr1980-yr1984, lags(2) robust

Arellano-Bond dynamic panel data         Number of obs      =       611
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(15)      =    618.58
Time variable (t): year                  min number of obs  =         4
                                         max number of obs  =         6
                                         mean number of obs =  4.364286

One-step results
------------------------------------------------------------------------------
             |               Robust
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .6862262   .1446942     4.75   0.000     .4028266    .9696257
         L2D |  -.0853582   .0560155    -1.52   0.128    -.1951467    .0244302
w            |
          D1 |  -.6078208   .1782055    -3.41   0.001    -.9570972   -.2585445
          LD |   .3926237   .1679931     2.34   0.019     .0633632    .7218842
k            |
          D1 |   .3568456   .0590203     6.05   0.000      .241168    .4725233
          LD |  -.0580012   .0731797    -0.79   0.428    -.2014308    .0854284
         L2D |  -.0199475   .0327126    -0.61   0.542    -.0840631    .0441681
ys           |
          D1 |   .6085073   .1725313     3.53   0.000     .2703522    .9466624
          LD |  -.7111651   .2317163    -3.07   0.002    -1.165321   -.2570095
         L2D |   .1057969   .1412021     0.75   0.454    -.1709542     .382548
yr1980       |
          D1 |   .0029062   .0158028     0.18   0.854    -.0280667    .0338791
yr1981       |
          D1 |  -.0404378   .0280582    -1.44   0.150    -.0954307    .0145552
yr1982       |
          D1 |  -.0652767   .0365451    -1.79   0.074    -.1369038    .0063503
yr1983       |
          D1 |  -.0690928    .047413    -1.46   0.145    -.1620205    .0238348
yr1984       |
          D1 |  -.0650302   .0576305    -1.13   0.259    -.1779839    .0479235
       _cons |   .0095545   .0102896     0.93   0.353    -.0106127    .0297217
------------------------------------------------------------------------------
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -3.60   Pr > z = 0.0003
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  -0.52   Pr > z = 0.6058
The coefficients are the same, but now the standard errors and the value of the test for second-order autocorrelation match the values reported in column (a1) of Table 4 in Arellano and Bond (1991). As one might suspect, most of the robust standard errors are higher than those that assume a homoskedastic error term. Note that xtabond does not report the Sargan statistic in this case. Overall, the test results from the one-step model are mixed. The Sargan test from the one-step homoskedastic estimator rejects the null hypothesis that the over-identifying restrictions are valid. However, this could be due to heteroskedasticity. In both the homoskedastic and the robust cases, the null of no first-order autocorrelation in the differenced residuals is rejected, but it is not possible to reject the null of no second-order autocorrelation. The presence of first-order autocorrelation in the differenced residuals does not imply that the estimates are inconsistent. However, the presence of second-order autocorrelation would imply that the estimates are inconsistent. (See Arellano and Bond
(1991, 281-282) for a discussion of this point.) The above output indicates that we have included several statistically insignificant variables. However, since we want to compare how the results change with the estimation procedure for a given model, we will not remove these variables.
> Example

xtabond reports the Wald statistic of the null that all the coefficients except the constant are zero. For the case at hand, the null is soundly rejected. In column (a1) of Table 4, Arellano and Bond report a chi-squared test of the null that all the coefficients, except the constant and the time dummies, are zero. Here is an example of how to perform this test in Stata:

. test ld.n l2d.n d.w ld.w d.k ld.k l2d.k d.ys ld.ys l2d.ys
 ( 1)  LD.n = 0.0
 ( 2)  L2D.n = 0.0
 ( 3)  D.w = 0.0
 ( 4)  LD.w = 0.0
 ( 5)  D.k = 0.0
 ( 6)  LD.k = 0.0
 ( 7)  L2D.k = 0.0
 ( 8)  D.ys = 0.0
 ( 9)  LD.ys = 0.0
 (10)  L2D.ys = 0.0

           chi2( 10) =  408.29
         Prob > chi2 =  0.0000
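The Wald statistic that test reports is the quadratic form b'V^{-1}b over the tested coefficients, where V is their estimated variance-covariance matrix. As a rough illustration outside Stata (with made-up numbers rather than the estimates above), the computation can be sketched in Python:

```python
import numpy as np

def wald_chi2(b, V):
    """Wald statistic b' V^{-1} b for the joint null that all
    coefficients in b are zero; chi2(len(b)) distributed under H0."""
    b = np.asarray(b, dtype=float)
    V = np.asarray(V, dtype=float)
    return float(b @ np.linalg.solve(V, b))

# Toy example: two estimates with unit variances and zero covariance.
stat = wald_chi2([1.0, 2.0], np.eye(2))  # 1^2 + 2^2 = 5
```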
> Example

Since the rejection of the null of the Sargan test may indicate the presence of heteroskedasticity, we might expect large efficiency gains from using the two-step estimator. The two-step estimator of the same model produces the following:
. xtabond n l(0/1).w l(0/2).(k ys) yr1980-yr1984, lags(2) twostep

Arellano-Bond dynamic panel data         Number of obs      =       611
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(15)      =   1035.56
Time variable (t): year                  min number of obs  =         4
                                         max number of obs  =         6
                                         mean number of obs =  4.364286

Two-step results
------------------------------------------------------------------------------
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .6287089   .0904543     6.95   0.000     .4514217    .8059961
         L2D |  -.0651882   .0265009    -2.46   0.014     -.117129   -.0132474
w            |
          D1 |  -.5267697   .0537692    -9.78   0.000    -.6311453    -.420374
          LD |   .3112899   .0940116     3.31   0.001     .1270305    .4955492
k            |
          D1 |   .2783619   .0449083     6.20   0.000     .1903432    .3663807
          LD |   .0140994   .0528046     0.27   0.789    -.0893957    .1175946
         L2D |  -.0402484   .0258038    -1.56   0.119    -.0908229     .010326
ys           |
          D1 |   .5919243   .1162114     5.09   0.000     .3641542    .8196943
          LD |  -.5659863   .1396738    -4.05   0.000    -.8397419   -.2922306
         L2D |   .1005433   .1126749     0.89   0.372    -.1202955     .321382
yr1980       |
          D1 |   .0006378   .0127959     0.05   0.960    -.0244417    .0257172
yr1981       |
          D1 |  -.0550044   .0235162    -2.34   0.019    -.1010953   -.0089135
yr1982       |
          D1 |   -.075978   .0302659    -2.51   0.012     -.135298   -.0166579
yr1983       |
          D1 |  -.0740708   .0370993    -2.00   0.046     -.146784   -.0013575
yr1984       |
          D1 |  -.0906606   .0453924    -2.00   0.046     -.179628   -.0016933
       _cons |   .0112155   .0077607     1.45   0.148    -.0039756    .0264066
------------------------------------------------------------------------------
Warning: Arellano and Bond recommend using one-step results for
         inference on coefficients
Sargan test of over-identifying restrictions:
        chi2(25) =    31.38     Prob > chi2 = 0.1767
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -3.00   Pr > z = 0.0027
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  -0.42   Pr > z = 0.6776
Note that Arellano and Bond recommend using the one-step results for inference on the coefficients. Several studies have found that the two-step standard errors tend to be biased downward in small samples. For this reason, the one-step results are generally recommended for inference. (See Arellano and Bond (1991) for details.) However, as this example illustrates, the two-step Sargan test may be better for inference on model specification. In terms of interpreting the output, the most important change is that we can no longer reject the null in the Sargan test. However, it remains safe to reject the null of no first-order serial correlation in the differenced residuals. It is also worth noting that the magnitudes of several of the coefficient estimates have changed and that one even switched its sign.
> Example

In some cases, the assumption of strict exogeneity is not tenable. Recall that a variable x_it is said to be strictly exogenous if E[x_it e_is] = 0 for all t and s. If E[x_it e_is] != 0 for s < t but E[x_it e_is] = 0 for all s >= t, the variable is said to be predetermined. Intuitively, if the error term at time t has some feedback on the subsequent realizations of x_it, then x_it is a predetermined variable. Since unforecastable errors today might affect future changes in the real wage and in the capital stock, one might suspect that the log of the real product wage and the log of the gross capital stock are not strictly exogenous but are predetermined. In this example, we treat w and k as predetermined.

. xtabond n l(0/1).ys yr1980-yr1984, lags(2) twostep pre(w,lag(1,.))
> pre(k,lag(2,.))

Arellano-Bond dynamic panel data         Number of obs      =       611
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(14)      =   8044.21
Time variable (t): year                  min number of obs  =         4
                                         max number of obs  =         6
                                         mean number of obs =  4.364286

Two-step results
------------------------------------------------------------------------------
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |    .848743   .0292901    28.98   0.000     .7913355    .9061505
         L2D |  -.0988995   .0163296    -6.06   0.000    -.1309049   -.0668941
w            |
          D1 |  -.6727914   .0215607   -31.20   0.000    -.7150495   -.6305333
          LD |   .5903514   .0365629    16.15   0.000     .5186894    .6620135
k            |
          D1 |   .3886398   .0319844    12.15   0.000     .3259515    .4513282
          LD |  -.1735626   .0277832    -6.25   0.000    -.2280166   -.1191086
         L2D |  -.0500805   .0155138    -3.23   0.001     -.080487     -.019674
ys           |
          D1 |   .6665421   .0608645    10.95   0.000       .54725    .7858343
          LD |   -.786225   .0704774   -11.16   0.000    -.9243582   -.6480919
yr1980       |
          D1 |   -.004441   .0082162    -0.54   0.589    -.0205445    .0116625
yr1981       |
          D1 |  -.0533043   .0153854    -3.46   0.001    -.0834591   -.0231496
yr1982       |
          D1 |  -.0941273   .0216129    -4.36   0.000    -.1364877   -.0517669
yr1983       |
          D1 |  -.1154812   .0283393    -4.07   0.000    -.1710252   -.0599372
yr1984       |
          D1 |  -.1374538   .0310743    -4.42   0.000    -.1983584   -.0765493
       _cons |   .0209066   .0052402     3.99   0.000     .0106359    .0311772
------------------------------------------------------------------------------
Warning: Arellano and Bond recommend using one-step results for
         inference on coefficients
Sargan test of over-identifying restrictions:
        chi2(75) =    74.92     Prob > chi2 = 0.4809
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -4.91   Pr > z = 0.0000
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  -0.80   Pr > z = 0.4251
The increase in the p-value of the Sargan test indicates that treating w and k as predetermined makes it more difficult to reject the null that the over-identifying restrictions are valid. This increase
in the Sargan test's p-value provides some evidence that w and k are better modeled as predetermined variables.
> Example

Treating variables as predetermined increases the size of the instrument matrix very quickly. (See the Methods and Formulas section for a discussion of how this matrix is created and what determines its size.) There are two potential problems with a very large instrument matrix. First, GMM estimators with too many over-identifying restrictions may perform poorly in small samples. (See Kiviet (1995) for a discussion of the dynamic panel-data case.) Second, the problem may become too large to estimate. The instrument matrix cannot exceed the current limit on matsize. For instance, to estimate the above model, the matsize must be at least 90. Here is what would have happened if we attempted the estimation with a smaller matsize:

. set matsize 50
. xtabond n l(0/1).ys yr1980-yr1984, lags(2) twostep pre(w,lag(1,.))
> pre(k,lag(2,.))
matsize too small
    type -help matsize-
matsize must be at least 90 (you have 90 instruments)
r(908);

To handle these problems, it is conventional to allow users to set a maximum number of lagged levels to be included as instruments for the predetermined variables. Here is an example in which a maximum of three lagged levels of the predetermined variables are included as instruments:
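To see why the instrument count balloons, note that a predetermined variable first usable at lag l0 contributes roughly t - l0 lagged levels as instruments in period t, so its columns grow quadratically with panel length unless capped. The Python sketch below is illustrative accounting only, not xtabond's exact bookkeeping (which is described in Methods and Formulas):

```python
def instrument_columns(T, lag0, max_lags=None):
    """Count instrument-matrix columns contributed by one predetermined
    variable: in each instrumented period t, levels dated t - lag0 and
    earlier are valid, i.e., t - lag0 of them; max_lags caps how many
    are kept.  (Illustrative accounting, not xtabond's exact rule.)"""
    total = 0
    for t in range(lag0 + 1, T + 1):
        avail = t - lag0
        if max_lags is not None:
            avail = min(avail, max_lags)
        total += avail
    return total

uncapped = instrument_columns(T=9, lag0=1)              # 1+2+...+8 = 36
capped = instrument_columns(T=9, lag0=1, max_lags=3)    # 1+2+3+3+3+3+3+3 = 21
```

With nine periods, capping at three lags cuts this variable's contribution from 36 columns to 21; with several predetermined variables the savings compound.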
. xtabond n l(0/1).ys yr1980-yr1984, lags(2) twostep pre(w,lag(1,3))
> pre(k,lag(2,3))

Arellano-Bond dynamic panel data         Number of obs      =       611
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(14)      =   1488.18
Time variable (t): year                  min number of obs  =         4
                                         max number of obs  =         6
                                         mean number of obs =  4.364286

Two-step results
------------------------------------------------------------------------------
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .9520503     .05054    18.84   0.000     .8529937    1.051107
         L2D |  -.0919391   .0215709    -4.26   0.000    -.1342172     -.049661
w            |
          D1 |  -.6535944   .0428962   -15.24   0.000    -.7376695   -.5695193
          LD |   .7704224   .0628764    12.25   0.000     .6471869    .8936579
k            |
          D1 |   .3843598   .0596985     6.44   0.000      .267353    .5013667
          LD |  -.2475157   .0486268    -5.09   0.000    -.3428225   -.1522089
         L2D |  -.0713522   .0207376    -3.44   0.001    -.1119971   -.0307073
ys           |
          D1 |   .8019601   .1047645     7.65   0.000     .5966254    1.007295
          LD |  -.9956238   .1026155    -9.70   0.000    -1.196751   -.7945061
yr1980       |
          D1 |  -.0034317    .012242    -0.28   0.779    -.0274256    .0205621
yr1981       |
          D1 |  -.0665268   .0206707    -3.22   0.001    -.1070407     -.026013
yr1982       |
          D1 |  -.1338765    .026055    -5.14   0.000    -.1849433   -.0828097
yr1983       |
          D1 |   -.184797    .033631    -5.49   0.000    -.2507125   -.1188814
yr1984       |
          D1 |  -.2415429    .039151    -6.17   0.000    -.3182776   -.1648083
       _cons |   .0311209    .006089     5.11   0.000     .0191868    .0430551
------------------------------------------------------------------------------
Warning: Arellano and Bond recommend using one-step results for
         inference on coefficients
Sargan test of over-identifying restrictions:
        chi2(55) =    49.33     Prob > chi2 = 0.6901
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -4.90   Pr > z = 0.0000
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  -1.24   Pr > z = 0.2142
> Example

xtabond handles data in which there are missing observations in the middle of the panels. In the following example, we deliberately set the dependent variable to missing in the year 1980:
. replace n=. if year==1980
(140 real changes made, 140 to missing)
. xtabond n l(0/1).w l(0/2).(k ys) yr1980-yr1984, lags(2) robust
note: yr1981 dropped due to collinearity
note: yr1982 dropped due to collinearity
note: the residuals and the L(1) residuals have no obs in common
      The AR(1) is trivially zero
note: the residuals and the L(2) residuals have no obs in common
      The AR(2) is trivially zero

Arellano-Bond dynamic panel data         Number of obs      =       115
Group variable (i): id                   Number of groups   =       101
                                         Wald chi2(11)      =     42.19
Time variable (t): year                  min number of obs  =         1
                                         max number of obs  =         2
                                         mean number of obs =  1.138614

One-step results
------------------------------------------------------------------------------
             |               Robust
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .1782912   .2213558     0.81   0.421    -.2655582    .6121406
         L2D |   .0305248   .0524455     0.58   0.561    -.0722665     .133316
w            |
          D1 |   -.265179   .1497362    -1.77   0.077    -.5586565    .0282986
          LD |   .1861825   .1430316     1.30   0.193    -.0941542    .4665193
k            |
          D1 |   .3971506   .0887184     4.48   0.000     .2232657    .5710354
          LD |  -.0310769   .0908215    -0.34   0.732    -.2090837      .14693
         L2D |  -.0374041   .0642649    -0.58   0.561     -.163361    .0885528
ys           |
          D1 |   .3872922   .3840567     1.01   0.313     -.365445     1.14003
          LD |  -.6809435   .4930012    -1.38   0.167    -1.647208    .2853211
         L2D |   .5562209   .4333161     1.28   0.199    -.2930632    1.405505
yr1980       |
          D1 |  (dropped)
yr1983       |
          D1 |   -.006329   .0257789    -0.25   0.806    -.0568547    .0441967
yr1984       |
          D1 |  (dropped)
       _cons |   .0015564   .0104385     0.15   0.881    -.0189027    .0220154
------------------------------------------------------------------------------
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  .   Pr > z =  .
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =  .   Pr > z =  .

There are two important aspects to this example. First, note the warnings and the missing values for the AR tests. We asked xtabond to compute tests for which it did not have sufficient data, so it issued warnings and set the values of the tests to missing. Second, since xtabond uses time-series operators in its computations, if statements and missing values are not equivalent. An if statement will cause the false observations to be excluded from the sample, but it will compute the time-series operators wherever possible. In contrast, missing data prohibit the evaluation of the time-series operators that involve missing observations. Thus, the above example is not equivalent to the following one:
. use abdata_man
. xtabond n l(0/1).w l(0/2).(k ys) yr1980-yr1984 if year!=1980, lags(2) robust
note: yr1980 dropped due to collinearity

Arellano-Bond dynamic panel data         Number of obs      =       473
Group variable (i): id                   Number of groups   =       140
                                         Wald chi2(14)      =    458.66
Time variable (t): year                  min number of obs  =         3
                                         max number of obs  =         5
                                         mean number of obs =  3.378571

One-step results
------------------------------------------------------------------------------
             |               Robust
           n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
n            |
          LD |   .9496527   .1587252     5.98   0.000     .6385569    1.260748
         L2D |    .008506   .0772524     0.11   0.912    -.1429059    .1599179
w            |
          D1 |  -.7122964   .1954866    -3.64   0.000    -1.095443   -.3291497
          LD |    .602949   .2116855     2.85   0.004     .1880531    1.017845
k            |
          D1 |   .3610396   .0783599     4.61   0.000     .2074569    .5146222
          LD |  -.1905422    .110239    -1.73   0.084    -.4066067    .0255223
         L2D |  -.0962995   .0579643    -1.66   0.097    -.2099075    .0173084
ys           |
          D1 |    .462348   .1875766     2.46   0.014     .0947047    .8299912
          LD |  -1.022322   .2758434    -3.71   0.000    -1.562966   -.4816793
         L2D |   .1513848   .1720838     0.88   0.379    -.1858933     .488663
yr1981       |
          D1 |  -.0767973   .0228655    -3.36   0.001    -.1216129   -.0319818
yr1982       |
          D1 |  -.1226698   .0374486    -3.28   0.001    -.1960676   -.0492719
yr1983       |
          D1 |  -.1383481   .0509257    -2.72   0.007    -.2381607   -.0385355
yr1984       |
          D1 |  -.1460515   .0638775    -2.29   0.022    -.2712491   -.0208539
       _cons |   .0216733   .0121816     1.78   0.075    -.0022022    .0455488
------------------------------------------------------------------------------
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
        H0: no autocorrelation   z =  -3.62   Pr > z = 0.0003
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
        H0: no autocorrelation   z =   0.34   Pr > z = 0.7327
In this case, the year 1980 is dropped from the sample, but when a lag or difference requires the value of a variable from 1980, the 1980 value is used.
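The distinction can be reproduced outside Stata. In the Python sketch below (a toy one-firm series with made-up values), filtering rows after computing the lag keeps lag values that reference the excluded year, while setting the value to missing first propagates the missing value into the lag, the analogue of `if year!=1980` versus `replace n=. if year==1980`:

```python
# Toy panel: one firm, years 1979-1982, n = log employment (made-up values).
years = [1979, 1980, 1981, 1982]
n = {1979: 1.0, 1980: 2.0, 1981: 3.0, 1982: 4.0}

def lag(series, year):
    """Value of the series in the previous year, or None if absent."""
    return series.get(year - 1)

# Analogue of `if year!=1980`: lags are computed on the full data and only
# the 1980 *row* is excluded, so 1981 still sees the 1980 value.
sample_if = {y: lag(n, y) for y in years if y != 1980}

# Analogue of `replace n=. if year==1980`: the 1980 *value* is missing,
# so the 1981 lag is missing as well.
n_missing = dict(n)
n_missing[1980] = None
sample_replace = {y: lag(n_missing, y) for y in years}
```

Here `sample_if[1981]` still holds the 1980 level, while `sample_replace[1981]` is missing.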
Saved Results

xtabond saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(Ti_mean)    mean group size
    e(Ti_max)     largest group size
    e(Ti_min)     smallest group size
    e(t_max)      maximum time in sample
    e(t_min)      minimum time in sample
    e(chi2)       model chi-squared statistic
    e(chi2_p)     p-value from model chi2
    e(chi2_df)    restrictions in model chi2
    e(F)          model F statistic (small only)
    e(F_p)        p-value from model F (small only)
    e(F_df)       restrictions in model F (small only)
    e(df_r)       denominator df in F (small only)
    e(sargan)     Sargan test statistic
    e(sar_df)     degrees of freedom in Sargan test
    e(arm#)       test for autocorrelation of order #
    e(artests)    number of AR tests performed
    e(sig2)       estimate of sigma_e^2
    e(zcols)      columns in instrument matrix Z_i
    e(df_m)       model degrees of freedom
    e(p)          number of lags of e(depvar)

Macros
    e(cmd)        xtabond
    e(depvar)     name of dependent variable
    e(ivar)       variable denoting groups
    e(inst_l)     additional level instruments
    e(twostep)    twostep, if two-step model specified
    e(robust)     robust, if robust specified
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

Consider dynamic panel-data models of the form

    y_{it} = \sum_{j=1}^{p} \alpha_j y_{i,t-j} + x_{it} \beta_1 + w_{it} \beta_2 + \nu_i + \epsilon_{it}

Note that x and w may contain lagged independent variables and time dummies. Let X_{it} = (y_{i,t-1}, y_{i,t-2}, ..., y_{i,t-p}, x_{it}, w_{it}) be the 1 x K vector of covariates for i at time t, where K = p + k_1 + k_2, and k_1 and k_2 are the numbers of variables in x and w, respectively.

Now rewrite this relationship as a set of T_i equations for each individual:

    y_i = X_i \delta + \nu_i \iota_i + \epsilon_i

Simplifying the notation, assume that there are no more than p lags of any of the covariates. Then for each i, y_i, \iota_i, and \epsilon_i are all (T_i - p) x 1 vectors: y_i is just the stacked values of y_{it} for person i, \iota_i is a vector of ones, and \epsilon_i contains the stacked values of \epsilon_{it} for person i. The matrix X_i contains the p lags of y_{it}, the values of x_{it}, and the w_{it}. And \delta is the K x 1 vector of coefficients.

Define the first-differenced versions y_i^*, X_i^*, and \epsilon_i^* by stacking the period-to-period differences; e.g.,

    y_i^* = \begin{pmatrix} y_{i,p+2} - y_{i,p+1} \\ \vdots \\ y_{i,T_i} - y_{i,T_i - 1} \end{pmatrix}

and analogously for X_i^* and \epsilon_i^*, so that first differencing removes the individual effect \nu_i.

The most difficult part of these estimators is defining and implementing the matrix of instruments for each i, Z_i. Begin by considering a simple balanced-panel example in which our model is
    y_{it} = y_{i,t-1} \alpha_1 + y_{i,t-2} \alpha_2 + x_{it} \beta + \nu_i + \epsilon_{it}

Note that there are no predetermined variables. Further simplify the situation by assuming that the data come from a balanced panel in which there are no missing values. After first differencing the equation, we have

    y_{it}^* = y_{i,t-1}^* \alpha_1 + y_{i,t-2}^* \alpha_2 + x_{it}^* \beta + \epsilon_{it}^*

The first three observations are lost to lags and differencing. Since x_{it} contains only strictly exogenous covariates, \Delta x_{it} will serve as its own instrument in estimating the first-differenced equation. Assuming that the \epsilon_{it} are not autocorrelated, for each i at t = 4, y_{i1} and y_{i2} are valid instruments for the lagged variables. Similarly, at t = 5, y_{i1}, y_{i2}, and y_{i3} are valid instruments. Continuing in this fashion, we obtain an instrument matrix with one row for each time period that we are instrumenting:
    Z_i = \begin{pmatrix}
          y_{i1} & y_{i2} & 0      & 0      & 0      & \cdots & 0 & \Delta x_{i4} \\
          0      & 0      & y_{i1} & y_{i2} & y_{i3} & \cdots & 0 & \Delta x_{i5} \\
          \vdots &        &        &        &        & \ddots &   & \vdots        \\
          0      & 0      & 0      & 0      & \cdots & y_{i1} \;\cdots\; y_{i,T-2} & \Delta x_{iT}
          \end{pmatrix}

Note that Z_i has T - p - 1 rows and \sum_{m=p}^{T-2} m columns of lagged-level instruments, plus the columns for the strictly exogenous variables. The extension to other lag structures with complete data is immediate. Unbalanced data and missing observations are handled by dropping the rows for which there are no data and filling in zeros in columns where missing data would be required. For instance, suppose that for some i the t = 1 observation was missing. Our instrument matrix would then be
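The block-diagonal construction above is mechanical for a balanced panel with complete data. The Python/NumPy sketch below builds the lagged-level block of Z_i for the p = 2 example (an illustration of the structure described in the text, not xtabond's internal code):

```python
import numpy as np

def lagged_level_instruments(y, p=2):
    """Build the lagged-level block of the Arellano-Bond instrument
    matrix for one panel unit with p = 2 lags of y: the row for
    differenced period t holds y_1, ..., y_{t-2} in its own column
    band and zeros elsewhere.  Illustrative sketch only."""
    T = len(y)                     # y[0] is y_1, ..., y[T-1] is y_T
    periods = range(p + 2, T + 1)  # differenced periods being instrumented
    ncols = sum(t - 2 for t in periods)
    Z = np.zeros((len(periods), ncols))
    col = 0
    for row, t in enumerate(periods):
        k = t - 2                  # instruments y_1, ..., y_{t-2}
        Z[row, col:col + k] = y[:k]
        col += k
    return Z

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # T = 5, p = 2
Z = lagged_level_instruments(y)          # 2 rows (t = 4, 5), 5 columns
```

For T = 5 this yields exactly the two rows shown above: (y_1, y_2, 0, 0, 0) for t = 4 and (0, 0, y_1, y_2, y_3) for t = 5, with the Delta x_it column appended separately.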
    Z_i = \begin{pmatrix}
          0      & y_{i2} & 0 & 0      & 0      & \cdots & 0 & \Delta x_{i4} \\
          0      & 0      & 0 & y_{i2} & y_{i3} & \cdots & 0 & \Delta x_{i5} \\
          \vdots &        &   &        &        & \ddots &   & \vdots        \\
          0      & 0      & 0 & 0      & \cdots & y_{i2} \;\cdots\; y_{i,T-2} & \Delta x_{iT}
          \end{pmatrix}

with zeros filled in for the columns where y_{i1} would have appeared. Note that Z_i has T_i - p - 1 rows and \sum_{m=p}^{\tau-2} m columns, where \tau = \max_i T_i.

Predetermined variables are treated similarly to the lagged dependent variables; see Arellano and Bond (1991, 290) for an example. Note that the number of columns in Z_i can grow very quickly for moderately long panels or for models with several predetermined variables.
Let H_i be the (T_i - p - 1) x (T_i - p - 1) matrix proportional to the covariance of the differenced idiosyncratic errors; i.e., E[\epsilon_i^* \epsilon_i^{*\prime}] = \sigma_\epsilon^2 H_i, where

    H_i = \begin{pmatrix}
           2 & -1 &  0 & \cdots &  0 &  0 \\
          -1 &  2 & -1 & \cdots &  0 &  0 \\
           \vdots &  &  & \ddots &  & \vdots \\
           0 &  0 &  0 & \cdots &  2 & -1 \\
           0 &  0 &  0 & \cdots & -1 &  2
          \end{pmatrix}

Then, for some instrument matrix Z_i, the one-step Arellano-Bond estimator of \delta, \hat{\delta}_1, is given by

    \hat{\delta}_1 = \left\{ \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) \right\}^{-1}
                     \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} y_i^{*} \right)

where

    A_1 = \left( \sum_{i=1}^{N} Z_i^{\prime} H_i Z_i \right)^{-1}

Using \hat{\delta}_1, the one-step residuals for i are

    \hat{\epsilon}_i^{*} = y_i^{*} - X_i^{*} \hat{\delta}_1

Assuming homoskedasticity, the variance-covariance estimator of the parameter estimator \hat{\delta}_1 is

    \widehat{Var}[\hat{\delta}_1] = \hat{\sigma}_1^2 \left\{ \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) \right\}^{-1}

where

    \hat{\sigma}_1^2 = \frac{1}{n_T - K} \sum_{i=1}^{N} \hat{\epsilon}_i^{*\prime} \hat{\epsilon}_i^{*}
    \qquad \text{and} \qquad
    n_T = \sum_{i=1}^{N} (T_i - p - 1)
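In matrix form the one-step estimator is an ordinary linear GMM computation. The Python/NumPy sketch below (with hypothetical per-unit arrays, not xtabond's internal code) evaluates the formula above directly:

```python
import numpy as np

def ab_one_step(Xs, ys, Zs, Hs):
    """One-step Arellano-Bond GMM estimate from per-unit arrays
    Xs[i] = X_i*, ys[i] = y_i*, Zs[i] = Z_i, Hs[i] = H_i.  Evaluates
    delta_1 = {(X*'Z) A1 (Z'X*)}^-1 (X*'Z) A1 (Z'y*)
    with A1 = (sum_i Z_i' H_i Z_i)^-1."""
    XZ = sum(X.T @ Z for X, Z in zip(Xs, Zs))
    Zy = sum(Z.T @ y for Z, y in zip(Zs, ys))
    A1 = np.linalg.inv(sum(Z.T @ H @ Z for Z, H in zip(Zs, Hs)))
    lhs = XZ @ A1 @ XZ.T
    rhs = XZ @ A1 @ Zy
    return np.linalg.solve(lhs, rhs)

# Sanity check: with Z_i = X_i* and H_i = I, the moment conditions are
# exactly identified and the estimator collapses to least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
beta = np.array([1.0, -2.0])
y = X @ beta                      # noiseless, so recovery is exact
delta1 = ab_one_step([X], [y], [X], [np.eye(20)])
```

The exactly identified case is a useful check on any implementation, since the weighting matrix A_1 then cancels out of the formula.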
The robust estimator of the VCE of \hat{\delta}_1 is given by

    \widehat{Var}_{robust}[\hat{\delta}_1] = M \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_1 \hat{V}_1 A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) M

where

    M = \left\{ \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) \right\}^{-1}

and

    \hat{V}_1 = \sum_{i=1}^{N} Z_i^{\prime} \hat{\epsilon}_i^{*} \hat{\epsilon}_i^{*\prime} Z_i

The two-step estimator of \delta, \hat{\delta}_2, is given by

    \hat{\delta}_2 = \left\{ \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_2 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) \right\}^{-1}
                     \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_2 \left( \sum_{i=1}^{N} Z_i^{\prime} y_i^{*} \right)

where

    A_2 = \hat{V}_1^{-1} = \left( \sum_{i=1}^{N} Z_i^{\prime} \hat{\epsilon}_i^{*} \hat{\epsilon}_i^{*\prime} Z_i \right)^{-1}

and the \hat{\epsilon}_i^{*} are the one-step residuals. The two-step VCE is

    \widehat{Var}[\hat{\delta}_2] = \left\{ \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A_2 \left( \sum_{i=1}^{N} Z_i^{\prime} X_i^{*} \right) \right\}^{-1}
The test for autocorrelation of order m in the differenced residuals is based on the statistic

    z(m) = \frac{ \sum_{i=1}^{N} \hat{\epsilon}_{i,-m}^{*\prime} \hat{\epsilon}_i^{*} }{ \hat{v}^{1/2} }

where \hat{\epsilon}_{i,-m}^{*} is \hat{\epsilon}_i^{*} lagged m periods, with zeros substituted for the missing initial lags, and \hat{v} is a consistent estimate of the variance of the numerator:

    \hat{v} = \sum_{i=1}^{N} \hat{\epsilon}_{i,-m}^{*\prime} \hat{\Omega}_i \hat{\epsilon}_{i,-m}^{*}
            - 2 \hat{\epsilon}_{-m}^{*\prime} X^{*} M \left( \sum_{i=1}^{N} X_i^{*\prime} Z_i \right) A \left( \sum_{i=1}^{N} Z_i^{\prime} \hat{\Omega}_i \hat{\epsilon}_{i,-m}^{*} \right)
            + \hat{\epsilon}_{-m}^{*\prime} X^{*} \, \widehat{Var}[\hat{\delta}] \, X^{*\prime} \hat{\epsilon}_{-m}^{*}

In the one-step homoskedastic case, \hat{\Omega}_i = \hat{\sigma}_1^2 H_i, A = A_1, and \widehat{Var}[\hat{\delta}] is the homoskedastic VCE; in the one-step robust and two-step cases, \hat{\Omega}_i = \hat{\epsilon}_i^{*} \hat{\epsilon}_i^{*\prime} (or its two-step analogue), with the corresponding weight matrix and VCE. Under the null of no autocorrelation of order m, z(m) is asymptotically distributed N(0,1); this is why the values of the test differ across the homoskedastic, robust, and two-step cases even though the idea is the same.

The Sargan test for the one-step (homoskedastic) model is

    S_1 = \frac{1}{\hat{\sigma}_1^2} \left( \sum_{i=1}^{N} \hat{\epsilon}_i^{*\prime} Z_i \right) A_1 \left( \sum_{i=1}^{N} Z_i^{\prime} \hat{\epsilon}_i^{*} \right)

The Sargan test for the two-step model is

    S_2 = \left( \sum_{i=1}^{N} \hat{\epsilon}_i^{**\prime} Z_i \right) A_2 \left( \sum_{i=1}^{N} Z_i^{\prime} \hat{\epsilon}_i^{**} \right)

where \hat{\epsilon}_i^{**} = y_i^{*} - X_i^{*} \hat{\delta}_2 are the two-step residuals. Under the null that the over-identifying restrictions are valid, the Sargan statistic is asymptotically distributed chi-squared with degrees of freedom equal to the number of columns of Z_i minus K.
References

Anderson, T. W. and C. Hsiao. 1981. Estimation of dynamic models with error components. Journal of the American Statistical Association 76: 598-606.

------. 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18: 47-82.

Arellano, M. and S. Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies 58: 277-297.

Baltagi, B. H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.

Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029-1054.

Kiviet, J. 1995. On bias, inconsistency, and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68: 53-78.

Layard, R. and S. J. Nickell. 1986. Unemployment in Britain. Economica 53: S121-S169.
Also See Complementary:
[R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:
[R] xtgee, [R] xtintreg, [R] xtivreg, [R] xtreg, [R] xtregar, [R] xttobit
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] xt
Title

xtclog — Random-effects and population-averaged cloglog models

Syntax

Random-effects model

    xtclog depvar [varlist] [weight] [if exp] [in range] [, re i(varname) quad(#)
        noconstant noskip level(#) offset(varname) nolog maximize_options ]

Population-averaged model

    xtclog depvar [varlist] [weight] [if exp] [in range] , pa [ i(varname) robust
        noconstant level(#) offset(varname) nolog xtgee_options maximize_options ]

by ... : may be used with xtclog; see [R] by.
iweights, fweights, and pweights are allowed for the population-averaged model, and iweights are allowed for the random-effects model; see [U] 14.1.6 weight. Note that weights must be constant within panels.
xtclog shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

Random-effects model

    predict [type] newvarname [if exp] [in range] [, { xb | pu0 | stdp } nooffset ]

Population-averaged model

    predict [type] newvarname [if exp] [in range] [, { mu | rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtclog estimates population-averaged and random-effects complementary log-log (cloglog) models. There is no command for a conditional fixed-effects model, as there does not exist a sufficient statistic allowing the fixed effects to be conditioned out of the likelihood. Unconditional fixed-effects cloglog models may be estimated with the cloglog command with indicator variables for the panels. The appropriate indicator variables can be generated using tabulate or xi. However, unconditional fixed-effects estimates are biased.

By default, the population-averaged model is an equal-correlation model; that is, xtclog, pa assumes corr(exchangeable). See [R] xtgee for details on how to fit other population-averaged models.
Note: xtclog, re, the default, is slow since it is calculated by quadrature; see Methods and Formulas. Computation time is roughly proportional to the number of points used for the quadrature. The default is quad (12). Simulations indicate that increasing it does not appreciably change the estimates for the coefficients or their standard errors. See [R] quadchk. See [R] logistic for a list of related estimation commands.
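The quadrature referred to in the note approximates the expectation of each panel's likelihood over the Gaussian random effect. A minimal Python/NumPy sketch of Gauss-Hermite quadrature follows (illustrative only; xtclog's actual quadrature and rescaling are described in its Methods and Formulas):

```python
import numpy as np

def gh_expectation(f, sigma, npoints=12):
    """Approximate E[f(v)] for v ~ N(0, sigma^2) by npoints-point
    Gauss-Hermite quadrature:
    E[f(v)] ~ (1/sqrt(pi)) * sum_k w_k * f(sqrt(2)*sigma*x_k)."""
    x, w = np.polynomial.hermite.hermgauss(npoints)
    return float(np.sum(w * f(np.sqrt(2.0) * sigma * x)) / np.sqrt(np.pi))

# Sanity check: the second moment of N(0, sigma^2) is sigma^2.
m2 = gh_expectation(lambda v: v**2, sigma=1.5)  # ~2.25
```

For a panel likelihood, f(v) would be the product of the per-observation cloglog likelihood contributions at linear prediction xb + v, evaluated at each quadrature node.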
Options

re requests the random-effects estimator. re is the default if neither re nor pa is specified.

pa requests the population-averaged estimator.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

quad(#) specifies the number of points to use in the quadrature approximation of the integral. The default is quad(12).

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the IRLS variance estimator; see [R] xtgee. This alternative produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the resulting standard errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

noconstant suppresses the constant term (intercept) in the model.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.

nolog suppresses the iteration log.

xtgee_options specifies any other options allowed by xtgee for family(binomial) link(cloglog), such as corr(); see [R] xtgee.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict

xb calculates the linear prediction. This is the default for the random-effects model.

pu0 calculates the probability of a positive outcome, assuming that the random effect for that observation's panel is zero (v = 0). Note that this may not be similar to the proportion of observed outcomes in the group.
stdp calculates the standard error of the linear prediction.

mu and rate both calculate the predicted probability of depvar. mu takes into account the offset(); rate ignores those adjustments. mu and rate are equivalent if you did not specify offset(). mu is the default for the population-averaged model.

nooffset is relevant only if you specified offset(varname) for xtclog. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_it*b rather than x_it*b + offset_it.
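For reference, the probability predictions pass the linear prediction through the inverse complementary log-log link, Pr(y = 1) = 1 - exp(-exp(xb)). A minimal Python sketch (illustrative; the function name is ours):

```python
import math

def invcloglog(xb):
    """Inverse complementary log-log link: Pr(y = 1) given the
    linear prediction xb."""
    return 1.0 - math.exp(-math.exp(xb))

p = invcloglog(0.0)  # 1 - exp(-1), about 0.632
```

Unlike the logit and probit links, this link is asymmetric: probabilities approach 1 much faster than they approach 0 as xb moves away from zero.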
Remarks

xtclog, pa is a convenience command if you want the population-averaged model. Typing

    . xtclog ..., pa ...

is equivalent to typing

    . xtgee ..., ... family(binomial) link(cloglog) corr(exchangeable)

Thus, also see [R] xtgee for information about xtclog. By default, or when re is specified, xtclog estimates a maximum-likelihood random-effects model.
▷ Example

You are studying unionization of women in the United States and are using the union dataset; see [R] xt. You wish to estimate a random-effects model of union membership:

. xtclog union age grade not_smsa south southXt, i(id) nolog

Random-effects complementary log-log    Number of obs      =     26200
Group variable (i) : idcode             Number of groups   =      4434

Random effects u_i ~ Gaussian           Obs per group: min =         1
                                                       avg =       5.9
                                                       max =        12

                                        Wald chi2(5)       =    221.84
Log likelihood  = -10559.721            Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .011177   .0032236     3.47   0.001     .0048588    .0174952
       grade |   .0577535    .012636     4.57   0.000     .0329875    .0825195
    not_smsa |  -.2122413   .0613769    -3.46   0.001    -.3325379   -.0919447
       south |  -.8724097   .0851594   -10.24   0.000    -1.039319   -.7055004
     southXt |   .0173595   .0059679     2.91   0.004     .0056626    .0290564
       _cons |  -3.066854   .1816358   -16.88   0.000    -3.422853   -2.710854
-------------+----------------------------------------------------------------
    /lnsig2u |   1.158696   .0392705                      1.081727    1.235665
-------------+----------------------------------------------------------------
     sigma_u |   1.784874   .0350464                      1.717489    1.854903
         rho |   .7610957   .0071405                      .7468207    .7748085
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) = 5968.96  Prob >= chibar2 = 0.000
The output includes the additional panel-level variance component. This is parameterized as the log of the panel-level variance, ln(σ_ν²) (labeled lnsig2u in the output). The standard deviation σ_ν is also included in the output, labeled sigma_u, together with ρ (labeled rho), which is the proportion of the total variance contributed by the panel-level variance component.

When rho is zero, the panel-level variance component is unimportant and the panel estimator is no different from the pooled estimator (cloglog). A likelihood-ratio test of this is included at the bottom of the output. This test formally compares the pooled estimator with the panel estimator.

As an alternative to the random-effects specification, you might want to fit an equal-correlation population-averaged cloglog model by typing

. xtclog union age grade not_smsa south southXt, i(id) pa
Iteration 1: tolerance = .06580809
Iteration 2: tolerance = .00606963
Iteration 3: tolerance = .00032265
Iteration 4: tolerance = .00001658
Iteration 5: tolerance = 8.864e-07

GEE population-averaged model           Number of obs      =     26200
Group variable:              idcode     Number of groups   =      4434
Link:                       cloglog     Obs per group: min =         1
Family:                    binomial                    avg =       5.9
Correlation:           exchangeable                    max =        12
                                        Wald chi2(5)       =    232.44
Scale parameter:                  1     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0045777   .0021754     2.10   0.035     .0003139    .0088415
       grade |   .0544267   .0095097     5.72   0.000      .035788    .0730654
    not_smsa |  -.1051731   .0430512    -2.44   0.015     -.189552   -.0207943
       south |  -.6578891    .061857   -10.64   0.000    -.7791266   -.5366515
     southXt |   .0142329    .004133     3.44   0.001     .0061325    .0223334
       _cons |  -2.074687   .1358008   -15.28   0.000    -2.340851   -1.808522
------------------------------------------------------------------------------
▷ Example

In [R] cloglog, we showed the above results and compared them with cloglog, robust cluster(). xtclog with the pa option allows the robust option (the random-effects estimator does not allow the robust specification), and so we can obtain the population-averaged cloglog estimator with the robust variance calculation by typing
. xtclog union age grade not_smsa south southXt, i(id) pa robust nolog

GEE population-averaged model           Number of obs      =     26200
Group variable:              idcode     Number of groups   =      4434
Link:                       cloglog     Obs per group: min =         1
Family:                    binomial                    avg =       5.9
Correlation:           exchangeable                    max =        12
                                        Wald chi2(5)       =    153.64
Scale parameter:                  1     Prob > chi2        =    0.0000

                          (standard errors adjusted for clustering on idcode)
------------------------------------------------------------------------------
             |             Semi-robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0045777    .003261     1.40   0.160    -.0018138    .0109692
       grade |   .0544267   .0117612     4.63   0.000     .0313948    .0774585
    not_smsa |  -.1051731   .0548342    -1.95   0.055    -.2126462    .0022999
       south |  -.6578891   .0793619    -8.26   0.000    -.8134355   -.5023427
     southXt |   .0142329    .005975     2.38   0.017     .0025221    .0259438
       _cons |  -2.074687   .1770236   -11.72   0.000    -2.421647   -1.727727
------------------------------------------------------------------------------

These standard errors are similar to those shown for cloglog, robust cluster() in [R] cloglog.
Saved Results

xtclog, re saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, comparison model
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       χ²
    e(chi2_c)     χ² for comparison test
    e(rho)        ρ
    e(sigma_u)    panel-level standard deviation
    e(N_cd)       number of completely determined obs.
    e(n_quad)     number of quadrature points

Macros
    e(cmd)        xtclog
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(ivar)       variable denoting groups
    e(wtype)      weight type
    e(wexp)       weight expression
    e(offset)     offset
    e(chi2type)   Wald or LR; type of model χ² test
    e(chi2_ct)    Wald or LR; type of model χ² test corresponding to e(chi2_c)
    e(distrib)    Gaussian; the distribution of the random effect
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
xtclog, pa saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       χ²
    e(df_pear)    degrees of freedom for Pearson χ²
    e(deviance)   deviance
    e(chi2_dev)   χ² test of deviance
    e(dispers)    deviance dispersion
    e(chi2_dis)   χ² test of deviance dispersion
    e(tol)        target tolerance
    e(dif)        achieved tolerance
    e(phi)        scale parameter

Macros
    e(cmd)        xtgee
    e(cmd2)       xtclog
    e(depvar)     name of dependent variable
    e(family)     binomial
    e(link)       cloglog; link function
    e(corr)       correlation structure
    e(scale)      x2, dev, phi, or #; scale parameter
    e(ivar)       variable denoting groups
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald; type of model χ² test
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(R)          estimated working correlation matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas

xtclog is implemented as an ado-file.

xtclog reports the population-averaged results obtained by using xtgee, family(binomial) link(cloglog) to obtain estimates.

Assuming a normal distribution, N(0, σ_ν²), for the random effects ν_i, we have that

    Pr(y_i1, ..., y_in_i | x_i1, ..., x_in_i)
        = ∫ [ exp(−ν²/2σ_ν²) / (√(2π) σ_ν) ] ∏_{t=1}^{n_i} F(y_it, x_it β + ν) dν

where

    F(y, z) = 1 − exp{−exp(z)}   if y ≠ 0
            = exp{−exp(z)}       otherwise

We can approximate the integral with M-point Gauss–Hermite quadrature

    ∫ e^{−x²} f(x) dx ≈ Σ_{m=1}^{M} w_m* f(a_m*)

where the w_m* denote the quadrature weights and the a_m* denote the quadrature abscissas. The log likelihood L, where ρ = σ_ν²/(σ_ν² + 1), is then calculated using the quadrature

    L = Σ_{i=1}^{n} w_i log{ (1/√π) Σ_{m=1}^{M} w_m* ∏_{t=1}^{n_i} F(y_it, x_it β + √2 σ_ν a_m*) }
where w_i is the user-specified weight for panel i; if no weights are specified, w_i = 1.

The quadrature formula requires that the integrated function be well-approximated by a polynomial. As the number of time periods becomes large (as panel size gets large),

    ∏_{t=1}^{n_i} F(y_it, x_it β + √2 σ_ν a_m*)

is no longer well-approximated by a polynomial. As a general rule of thumb, you should use this quadrature approach only for small to moderate panel sizes (based on simulations, 50 is a reasonably safe upper bound). However, if the data really come from random-effects cloglog and rho is not too large (less than, say, .3), then the panel size could be 500 and the quadrature approximation would still be fine. If the data are not random-effects cloglog or rho is large (bigger than, say, .7), then the quadrature approximation may be poor for panel sizes larger than 10. The quadchk command should be used to investigate the applicability of the numeric technique used in this command.
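The quadrature rule above is easy to exercise outside Stata. The following Python sketch (the integrand cos is a stand-in for illustration, not part of xtclog) uses NumPy's Gauss–Hermite nodes to approximate an expectation over a normal random effect, exactly as the likelihood does with the change of variables ν = √2·σ_ν·x.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(g, sigma, quad_points=12):
    """Approximate E[g(v)] for v ~ N(0, sigma^2) with M-point Gauss-Hermite
    quadrature: integral e^{-x^2} f(x) dx ~= sum_m w_m* f(a_m*)."""
    a, w = hermgauss(quad_points)            # abscissas a_m*, weights w_m*
    # change of variables v = sqrt(2)*sigma*x absorbs the normal density
    return (w * g(np.sqrt(2.0) * sigma * a)).sum() / np.sqrt(np.pi)

# check against a closed form: E[cos(v)] = exp(-sigma^2/2)
sigma = 1.0
approx = normal_expectation(np.cos, sigma)
print(abs(approx - np.exp(-sigma**2 / 2)) < 1e-10)   # True
```

A product of many panel terms F(y_it, x_it β + ν), unlike this smooth integrand, becomes less polynomial-like as n_i grows, which is exactly why the rule of thumb above limits panel size and why quadchk re-runs the model with different numbers of points.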
References

Liang, K.-Y. and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.

Neuhaus, J. M. 1992. Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1: 249-273.

Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1991. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59: 25-35.

Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. International Statistical Review 64: 89-118.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] quadchk, [R] test,
                 [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab

Related:         [R] cloglog, [R] xtgee, [R] xtlogit, [R] xtprobit

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [R] xt
Title xtdata — Faster specification searches with xt data
Syntax

xtdata [varlist] [if exp] [in range] [, be fe re ratio(#) clear i(varname_i) nodouble ]
Description

If you have not read [R] xt and [R] xtreg, please do so.

xtdata produces a converted dataset of the variables specified or, if varlist is not specified, all the variables in the data. Once converted, Stata's ordinary regress command may be used to perform specification searches more quickly than use of xtreg; see [R] regress and [R] xtreg. In the case of xtdata, re, a variable named constant is also created. When using regress after xtdata, re, specify nocons and include constant in the regression. After xtdata, be and xtdata, fe, you need not include constant or specify regress's nocons option.
Options

be specifies that the data are to be converted into a form suitable for between estimation.

fe specifies that the data are to be converted into a form suitable for fixed-effects (within) estimation.

re specifies that the data are to be converted into a form suitable for random-effects estimation. re is the default if none of be, fe, or re is specified. ratio() must also be specified.

ratio(#), used with xtdata, re only, specifies the ratio σ_ν/σ_ε, the ratio of the standard deviations of the random effect and the pure residual.
Remarks

If you have not read [R] xt, please do so.

The formal estimation commands of xtreg (see [R] xtreg) are not instant, especially with large datasets. Equations (2), (3), and (4) of [R] xtreg provide a description of the data necessary to estimate each of the models with OLS. The idea here is to transform the data once to the appropriate form and then use regress to estimate such models more quickly.
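The transform-once idea can be illustrated outside Stata. Here is a minimal Python sketch with synthetic data (the variable names and data-generating process are invented for illustration) of what xtdata, fe does conceptually: demean each variable within its panel once, after which any number of cheap OLS runs on the transformed data reproduce the fixed-effects point estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_panels, T = 50, 6
ids = np.repeat(np.arange(n_panels), T)
alpha = rng.normal(size=n_panels)[ids]          # panel effects
x = rng.normal(size=ids.size) + alpha           # regressor correlated with effects
y = 1.5 * x + alpha + rng.normal(size=ids.size)

def within(v, ids):
    """Demean a variable within each panel (the fixed-effects transform)."""
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]

# transform once ...
xw, yw = within(x, ids), within(y, ids)
# ... then each specification-search regression is a cheap OLS
b_within = np.linalg.lstsq(xw[:, None], yw, rcond=None)[0][0]

# same point estimate as least squares with a full set of panel dummies (LSDV)
D = (ids[:, None] == np.arange(n_panels)).astype(float)
b_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]
print(np.isclose(b_within, b_lsdv))     # True
```

The equality of the two estimates is the Frisch-Waugh-Lovell result; xtreg, fe does the demeaning (and the covariance correction discussed below) internally on every call, whereas the converted data pay that cost only once.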
▷ Example

Please see the example in [R] xtreg demonstrating between-effects regression. An alternative way to estimate the between equation is to convert the data in memory into the between data:

. use nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
. xtdata ln_w grade age* ttl_exp* tenure* black not_smsa south, be clear
. regress ln_w grade age* ttl_exp* tenure* black not_smsa south

      Source |       SS       df       MS              Number of obs =    4697
-------------+------------------------------           F( 10,  4686) =  450.88
       Model |  415.621613    10  41.5621613           Prob > F      =  0.0000
    Residual |  431.954995  4686  .092179896           R-squared     =  0.4907
-------------+------------------------------           Adj R-squared =  0.4896
       Total |  846.976608  4696  .180361288           Root MSE      =  .30361

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0607602   .0020006    30.37   0.000     .0568382    .0646822
              (output omitted)

The output is the same as produced by xtreg, be; the reported R-squared is the R² between. There are no time savings in using xtdata followed by just one regress. The use of xtdata is justified when you intend to explore the specification of the model by running many alternative regressions.
❑ Technical Note

It is important that when using xtdata you eliminate any variables that you do not intend to use and that have missing values. xtdata follows a casewise-deletion rule, which means that an observation is excluded from the conversion if it is missing on any of the variables. In the example above, we specified that the variables be converted on the command line. Alternatively, we could drop the variables first, and it might even be useful to preserve our estimation sample:

. use nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
. keep id year ln_w grade age* ttl_exp* tenure* black not_smsa south
. save regsmpl
▷ Example

xtdata with the fe option converts the data so that results are equivalent to estimating with xtreg with the fe option.

. use regsmpl
(NLS Women 14-26 in 1968)
. xtdata, fe
. regress ln_w grade age* ttl_exp* tenure* black not_smsa south

      Source |       SS       df       MS              Number of obs =   28091
-------------+------------------------------           F(  9, 28081) =  651.21
       Model |  412.443881     9  45.8270979           Prob > F      =  0.0000
    Residual |  1976.12232 28081   .07037222           R-squared     =  0.1727
-------------+------------------------------           Adj R-squared =  0.1724
       Total |   2388.5662 28090  .085032617           Root MSE      =  .26528

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |  -.0147051   4.97e+08    -0.00   1.000    -9.75e+08    9.75e+08
         age |   .0359987   .0030904    11.65   0.000     .0299414    .0420559
        age2 |   -.000723   .0000486   -14.88   0.000    -.0008183   -.0006277
              (output omitted)

The coefficients reported by regress after xtdata, fe are the same as those reported by xtreg, fe, but the standard errors are slightly smaller. This is because no adjustment has been made to the estimated covariance matrix for the estimation of the person means. The difference is small, however, and results are adequate for a specification search.
▷ Example

To use xtdata, re, you must specify the ratio σ_ν/σ_ε, the ratio of the standard deviations of the random effect and pure residual. Merely to show the relationship of regress after xtdata, re to xtreg, re, we will specify this ratio as .2579053/.2906892 = .8872201, which is the number xtreg reports when the model is estimated from the outset; see the random-effects example in [R] xtreg. For specification-search purposes, however, it is adequate to specify this number more crudely and, in fact, in performing the specification search for this manual entry, we used ratio(1).

. xtdata, clear re ratio(.8872201)

                       theta
     min      5%     median     95%      max
   0.2520   0.2520   0.5499   0.7016   0.7206
xtdata reports the distribution of θ based on the specified ratio. If this were balanced data, θ would have been constant. When running regressions with these data, you must specify the nocons option and include the variable constant:

. reg ln_w grade age* ttl_exp* tenure* black not_smsa south constant, nocons

      Source |       SS       df       MS              Number of obs =   28091
-------------+------------------------------           F( 11, 28080) =14302.56
       Model |  13271.7162    11  1206.51965           Prob > F      =  0.0000
    Residual |   2368.7421 28080  .084356913           R-squared     =  0.8486
-------------+------------------------------           Adj R-squared =  0.8485
       Total |  15640.4583 28091  .556778267           Root MSE      =  .29044

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0646499   .0017812    36.30   0.000     .0611587    .0681411
              (output omitted)
    constant |   .2387207    .049469     4.83   0.000     .1417591    .3356823
xtdata — faster specification searches with xt data
319
The coefficients and standard errors are the same as those previously estimated by xtreg, re. The summaries at the top, however, should be ignored: they are expressed in terms of equation (4) of [R] xtreg and, moreover, for a model without a constant.
❑ Technical Note

Obviously, some caution is required in using xtdata. The following guidelines will help:

1. xtdata is intended for use during the specification-search phase of analysis only. Final results should be estimated with xtreg on unconverted data.

2. After converting the data, you may use regress to obtain estimates of the coefficients and their standard errors. In the case of regress after xtdata, fe, the standard errors are too small, but only slightly.

3. You may loosely interpret the coefficients' significance tests and confidence intervals. However, for results after xtdata, fe and re, a wrong (but very close to correct) distribution is being assumed.

4. You should ignore the summary statistics reported at the top of regress's output.

5. After converting the data, you may form linear, but not nonlinear, combinations of regressors; that is, if your data contained age, it would not be correct to convert the data and then form age squared. All nonlinear transformations should be done before conversion. (For xtdata, be, you can get away with forming nonlinear combinations ex post, but the results will not be exact.)
❑ Technical Note

The xtdata command can be used to assist in examining data, especially with graph. The graphs below were produced by typing the following:

. use nlswork
. iis idcode
. tis year
. xtdata, be
. graph ln_wage age, various options saving(figure1)

. use nlswork, clear
. iis idcode
. tis year
. xtdata, fe
. graph ln_wage age, various options saving(figure2)

. use nlswork, clear
. graph ln_wage age, various options saving(figure3)
(Graphs omitted: three scatterplots of ln_wage against age in current year, one each for the between data, the within data, and the overall data.)

❑
Methods and Formulas

xtdata is implemented as an ado-file.

(This is a continuation of the Methods and Formulas of [R] xtreg.)

xtdata, be, fe, and re transform the data according to equations (2), (3), and (4), respectively, of [R] xtreg, except that xtdata, fe adds back in the overall mean, thus forming the transformation x_it − x̄_i + x̄.

xtdata, re requires the user to specify r as an estimate of σ_ν/σ_ε. θ_i is then calculated from

    θ_i = 1 − 1/√(T_i r² + 1)
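The random-effects weight θ_i = 1 − 1/√(T_i r² + 1), with r = σ_ν/σ_ε (this closed form is reconstructed from [R] xtreg, since the printed formula did not survive reproduction), varies with panel length in unbalanced data, which is why xtdata, re prints a distribution for theta. A small Python check using the ratio from the example above:

```python
import numpy as np

def theta(T_i, r):
    """Quasi-demeaning weight for the random-effects (GLS) transform:
    theta_i = 1 - 1/sqrt(T_i * r**2 + 1), where r = sigma_v / sigma_e."""
    return 1 - 1 / np.sqrt(T_i * r**2 + 1)

r = 0.8872201                     # the ratio specified in the example above
T = np.array([1, 5, 13, 15])      # panel lengths at the min/median/95%/max
print(np.round(theta(T, r), 4))   # .2520  .5499  .7016  .7206
```

These four values reproduce the min, median, 95th percentile, and max of theta reported by xtdata, re in the example, corresponding to women observed 1, 5, 13, and 15 years.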
Also See

Complementary:   [R] xtreg, [R] xtregar, [R] regress

Background:      [R] xt
Title xtdes — Describe pattern of xt data
Syntax

xtdes [if exp] [, patterns(#) i(varname_i) t(varname_t) ]

by ... : may be used with xtdes; see [R] by.
Description

xtdes describes the participation pattern of cross-sectional time-series (xt) data.
Options

patterns(#) specifies the maximum number of participation patterns to be reported; patterns(9) is the default. Specifying patterns(50) would list up to 50 patterns. Specifying patterns(1000) is taken to mean patterns(∞); all the patterns will be listed.

i(varname_i) specifies the variable name corresponding to i in x_it; see [R] xt.

t(varname_t) specifies the variable name corresponding to t in x_it; see [R] xt.
Remarks

If you have not read [R] xt, please do so.

xtdes does not have a simple-data counterpart. It describes the cross-sectional and time-series aspects of the data in memory.

▷ Example

In [R] xt, we introduced data based on a subsample of the NLSY data on young women aged 14-26 in 1968. Here is a description of the data used in many of the [R] xt examples:
. use nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
. xtdes, i(id) t(year)

  idcode:  1, 2, ..., 5159                                 n =       4711
    year:  68, 69, ..., 88                                 T =         15
           Delta(year) = 1; (88-68)+1 = 21
           (idcode*year uniquely identifies each observation)

Distribution of T_i:    min      5%     25%     50%     75%     95%    max
                          1       1       3       5       9      13     15

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------------
      136      2.89    2.89 |  1....................
      114      2.42    5.31 |  ....................1
       89      1.89    7.20 |  .................1.11
       87      1.85    9.04 |
       86      1.83   10.87 |
       61      1.29   12.16 |
       56      1.19   13.35 |
       54      1.15   14.50 |
       54      1.15   15.64 |
     3974     84.36  100.00 |  (other patterns)
 ---------------------------+-----------------------
     4711    100.00         |  XXXXXX.X.XX.X.XX.X.XX
xtdes tells us that we have 4,711 women in our data and that the idcode that identifies each ranges from 1 to 5,159. We are also told that the maximum number of individual years over which we observe any woman is 15. (The year variable, however, ranges over 21 years.) We are reassured that idcode and year, taken together, uniquely identify each observation in our data. We are also shown the distribution of T_i; 53% of our women are observed 5 years or less. Only 5% of our women are observed for 13 years or more.

Finally, we are shown the participation pattern. A 1 in the pattern means one observation that year; a dot means no observation. The largest fraction of our women, but still only 2.89%, was observed in the single year 1968 and not thereafter; the next largest fraction was observed in 1988 but not before; and the next largest fraction was observed in 1985, 1987, and 1988. At the bottom is the sum of the participation patterns, including the patterns that were not shown. We can see that none of the women were observed in six of the years (there are six dots). (The survey was not administered in those six years.)

If we wished, we could see more of the patterns by specifying the patterns() option and, in fact, we could see all the patterns by specifying patterns(1000).
▷ Example

The strange participation patterns shown above have to do with our subsampling of the data, not with the administrators of the survey. As an aside, here are the data from which we drew the sample used in the [R] xt examples:
. xtdes

  idcode:  1, 2, ..., 5159                                 n =       5159
    year:  68, 69, ..., 88                                 T =         15
           Delta(year) = 1; (88-68)+1 = 21
           (idcode*year does not uniquely identify observations)

Distribution of T_i:    min      5%     25%     50%     75%     95%    max
                          1       2      11      15      16      19     30

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------------
     1034    20.04    20.04 |  111111.1.11.1.11.1.11
      153     2.97    23.01 |  112111.1.11.1.11.1.11
      147     2.85    25.86 |  111112.1.11.1.11.1.11
      130     2.52    28.38 |  111211.1.11.1.11.1.11
      122     2.36    30.74 |  111111.1.11.1.11.1.12
      113     2.19    32.93 |  111111.1.12.1.11.1.11
       84     1.63    34.56 |  111111.1.11.1.11.1.1.
       79     1.53    36.09 |
       67     1.30    37.39 |
     3230    62.61   100.00 |  (other patterns)
 ---------------------------+-----------------------
     5159   100.00          |  XXXXXX.X.XX.X.XX.X.XX
Note that we have multiple observations per year. In the pattern, 2 is used to indicate that a woman appears twice in the year, 3 to indicate 3 times, and so on—X is used to mean 10 or more should that be necessary. In fact, this is a dataset which was itself extracted from the NLSY, in which t is not time but job number. In order to simplify exposition, we made a simpler dataset by selecting the last job in each year.
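The display rule just described, a dot for no observation, a digit for 1-9 observations in a period, and X for 10 or more, is simple to sketch. Here is an illustrative Python version (the observation times are made up; this is not part of xtdes itself):

```python
from collections import Counter

def participation_pattern(times, all_times):
    """xtdes-style pattern: '.' = not observed, '1'-'9' = count, 'X' = 10+."""
    counts = Counter(times)
    def cell(c):
        return "." if c == 0 else ("X" if c >= 10 else str(c))
    return "".join(cell(counts[t]) for t in all_times)

years = list(range(68, 89))                       # 68, 69, ..., 88
print(participation_pattern([68, 69, 70, 70, 88], years))
# twice in 70, once each in 68, 69, and 88, dots elsewhere
```

xtdes tabulates one such string per panel and then reports the most frequent strings, which is why selecting the last job in each year, as in the subsample, collapses every cell to a dot or a 1.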
Methods and Formulas

xtdes is implemented as an ado-file.

Also See

Related:         [R] xtsum, [R] xttab

Background:      [R] xt
Title xtgee — Estimate population-averaged panel-data models using GEE
Syntax

xtgee depvar [varlist] [weight] [if exp] [in range] [,
        family(family) link(link) corr(correlation) nmp rgf trace
        i(varname) t(varname) force robust score(newvar) eform level(#)
        exposure(varname) offset(varname) noconstant
        scale(x2 | dev | # | phi) tolerance(#) iterate(#) nolog ]

xtcorr [, compact ]

where family is one of

        { binomial[# | varname] | gaussian | gamma | igaussian | nbinomial[#] | poisson }

and link is one of

        { identity | cloglog | log | logit | nbinomial | opower # | power [#] | probit | reciprocal }

and correlation is one of

        { independent | exchangeable | ar # | stationary # | nonstationary # | unstructured | fixed matname }

For example,

        . xtgee y x1 x2, family(gauss) link(ident) corr(exchangeable) i(id)

would estimate a random-effects linear regression; note that the corr(exchangeable) option does not in general provide random effects. It actually fits an equal-correlation population-averaged model that is equal to the random-effects model for linear regression.

by ... : may be used with xtgee; see [R] by.

iweights, fweights, and pweights are allowed; see [U] 14.1.6 weight. Note that weights must be constant within panels.

xtgee shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { mu | rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtgee estimates cross-sectional time-series (panel-data) models. In particular, xtgee estimates generalized linear models and allows you to specify the within-group correlation structure for the panels.

xtcorr is for use after xtgee. It displays the estimated matrix of the within-group correlations.

See [R] logistic and [R] regress for lists of related estimation commands.
Options

Options for xtgee

family(family) specifies the distribution of depvar; family(gauss) is the default.

link(link) specifies the link function; the default is the canonical link for the family() specified.

corr(correlation) specifies the within-group correlation structure; the default corresponds to the equal-correlation model, corr(exchangeable). When you specify a correlation structure that requires a lag, you indicate the lag after the structure's name with or without a blank, e.g., corr(ar 1) or corr(ar1). If you specify the fixed correlation structure, you specify the name of the matrix containing the assumed correlations following the word fixed, e.g., corr(fixed myr).

nmp specifies that the divisor N − P is to be used instead of the default N, where N is the total number of observations and P is the number of coefficients estimated.

rgf specifies that the robust variance estimate is multiplied by N/(N − P), where N is the total number of observations and P is the number of coefficients estimated. This option can only be used with family(gaussian) when robust is either specified or implied by the use of pweights. Using this option implies that the robust variance estimate is not invariant to the scale of any weights used.

trace specifies that the current estimates should be printed at each iteration.

i(varname) specifies the variable that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

t(varname) specifies the variable that contains the time at which the observation was made. You can specify the t() option the first time you estimate, or you can use the tis command to set t() beforehand. After that, Stata will remember the variable's identity. xtgee does not need to know t() for the corr(independent) and corr(exchangeable) correlation structures; whether you specify t() makes no difference in these two cases.

force specifies that estimation is to be forced even though t() is not equally spaced. This is relevant only for correlation structures that require knowledge of t(). These correlation structures require that observations be equally spaced so that calculations based on lags correspond to a constant time change. If you specify a t() variable that indicates observations are not equally spaced, xtgee will refuse to estimate the (time-dependent) model. If you also specify force, xtgee will estimate the model and assume that the lags based on the data ordered by t() are appropriate.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the default IRLS variance estimator (see Methods and Formulas below). This produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the
resulting standard errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

score(newvar) creates newvar containing u_it = ∂lnL/∂(x_it β), where L is the quasi-likelihood function. Note that the scores for the independent panels can be obtained by u_i = Σ_t u_it or, equivalently in Stata, by egen ui=sum(uit), by(ivar), where ivar is the i() variable, assuming you specified score(uit). The score vector is ∂lnL/∂β = Σ u_it x_it, the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals as described in [R] maximize. For family(binomial) link(logit) (i.e., logistic regression), exponentiation results in odds ratios; for family(poisson) link(log) (i.e., Poisson regression), exponentiated coefficients are incidence-rate ratios.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with its coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with its coefficient constrained to be 1; thus, exposure is assumed to be e^varname. If you were estimating a Poisson regression model, family(poisson) link(log), for instance, you would account for exposure time by specifying offset() containing the log of exposure time.

noconstant specifies that the linear predictor has no intercept term, thus forcing it through the origin on the scale defined by the link function.

scale(x2 | dev | # | phi) overrides the default scale parameter. By default, scale(1) is assumed for the discrete distributions (binomial, negative binomial, and Poisson) and scale(x2) is assumed for the continuous distributions (gamma, Gaussian, and inverse Gaussian).

scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom, as recommended by McCullagh and Nelder (1989) as a good general choice for continuous distributions.

scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom. This provides an alternative to scale(x2) for continuous distributions and for over- or underdispersed discrete distributions.

scale(#) sets the scale parameter to #. For example, using scale(1) in family(gamma) models results in exponential-errors regression (if you assume the independent correlation structure). Additional use of link(log) rather than the default link(power -1) for family(gamma) essentially reproduces Stata's ereg command (see [R] weibull) if all the observations are uncensored (and if you again assume the independent correlation structure).

scale(phi) specifies that the variance matrix should not be rescaled at all. The default scaling that xtgee applies makes results agree with other estimators and has been recommended by McCullagh and Nelder (1989) in the context of GLM. If you are comparing results with calculations made by other software, you may find that the other packages do not offer this feature. In such cases, specifying scale(phi) should match their results.

tolerance(#) specifies the convergence criterion for the maximum change in the estimated coefficient vector between iterations; tol(1e-6) is the default and you should never have to specify this option.
xtgee — Estimate population-averaged panel-data models using GEE
iterate(#) specifies the maximum number of iterations allowed in estimating the model; iter(100) is the default. You should never have to specify this option.

nolog suppresses the iteration log.
Option for xtcorr

compact specifies that only the parameters (alpha) of the estimated matrix of within-group correlations be displayed, rather than the entire matrix.
Options for predict

mu, the default, and rate both calculate the predicted value of depvar. mu takes into account the offset() or exposure() together with the denominator if the family is binomial; rate ignores those adjustments. mu and rate are equivalent if (1) you did not specify offset() or exposure() when you estimated the xtgee model, and (2) you did not specify family(binomial #) or family(binomial varname), which is to say, the binomial family and a denominator.

Thus, mu and rate are the same for link(identity) family(gaussian).

mu and rate are not equivalent for link(logit) family(binomial pop). In that case, mu would predict the number of positive outcomes and rate would predict the probability of a positive outcome.

mu and rate are not equivalent for link(log) family(poisson) exposure(time). In that case, mu would predict the number of events given exposure time and rate would calculate the incidence rate, that is, the number of events given an exposure time of 1.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname), exposure(varname), family(binomial #), or family(binomial varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable and ignore the binomial denominator. Thus, predict ..., mu nooffset produces the same results as predict ..., rate.
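The distinction between mu and rate for a log link with an exposure offset can be sketched numerically. This is an illustration in Python, not Stata code; the names xb and exposure are hypothetical stand-ins for the linear prediction and the exposure variable:

```python
import math

def predict_rate(xb):
    # rate: inverse log link applied to the linear prediction alone,
    # i.e., events per one unit of exposure
    return math.exp(xb)

def predict_mu(xb, exposure):
    # mu: the exposure offset ln(exposure) enters the log-link function
    # with coefficient 1, so mu predicts a count of events
    return math.exp(xb + math.log(exposure))

xb = 0.4          # hypothetical linear prediction
exposure = 100.0  # hypothetical exposure time

rate = predict_rate(xb)
mu = predict_mu(xb, exposure)

# mu is rate scaled by exposure, and the two agree when exposure is 1
assert abs(mu - rate * exposure) < 1e-12
assert predict_mu(xb, 1.0) == rate
```

This mirrors the statement above that rate is the number of events given an exposure time of 1.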
Remarks

For a thorough introduction to GEE in the estimation of GLM, see Liang and Zeger (1986). Further information on linear models is presented in Nelder and Wedderburn (1972). Finally, there have been a number of illuminating articles on various applications of GEE in Zeger, Liang, and Albert (1988), Zeger and Liang (1986), and Liang (1987). Pendergast et al. (1996) provide a nice survey of the current methods for analyzing clustered data in regard to binary response data. Our implementation follows that of Liang and Zeger (1986).

xtgee fits generalized linear models of y_it with covariates x_it:

    g{E(y_it)} = x_it b,    y_it ~ F with parameters theta_it

for i = 1, ..., m and t = 1, ..., n_i, where there are n_i observations for each group identifier i. In the above, g() is called the link function and F the distributional family. Substituting various definitions for g() and F results in a surprising array of models. For instance, if y_it is distributed Gaussian (normal) and g() is the identity function, we have

    E(y_it) = x_it b,    y_it ~ Normal
yielding linear regression, random-effects regression, or other regression-related models depending on what we assume for the correlation structure. If g() is the logit function and y_it is distributed Bernoulli (binomial), we have

    logit{E(y_it)} = x_it b,    y_it ~ Bernoulli

or logistic regression. If g() is the natural log function and y_it is distributed Poisson, we have

    ln{E(y_it)} = x_it b,    y_it ~ Poisson

or Poisson regression, also known as the log-linear model. Other combinations are possible.

You specify the link function using the link() option, the distributional family using family(), and the assumed within-group correlation structure using corr().

The allowed link functions are
    Link function        xtgee option        Min. abbreviation
    ----------------------------------------------------------
    cloglog              link(cloglog)       l(cl)
    identity             link(identity)      l(i)
    log                  link(log)           l(log)
    logit                link(logit)         l(logi)
    negative binomial    link(nbinomial)     l(nb)
    odds power           link(opower)        l(opo)
    power                link(power)         l(pow)
    probit               link(probit)        l(p)
    reciprocal           link(reciprocal)    l(rec)
Link function cloglog is defined as ln{-ln(1 - y)}.

Link function identity is defined as y = y.

Link function log is defined as ln(y).

Link function logit is defined as ln{y/(1 - y)}, the natural log of the odds.

Link function nbinomial a is defined as ln{y/(y + a)}.

Link function opower k is defined as [{y/(1 - y)}^k - 1]/k. If k = 0, then this link is the same as the logit link.

Link function power k is defined as y^k. If k = 0, then this link is the same as the log link.

Link function probit is defined as PHI^(-1)(y), where PHI() is the cumulative Gaussian distribution.

Link function reciprocal is defined as 1/y.

The allowed distributional families are
    Family                  xtgee option         Min. abbreviation
    --------------------------------------------------------------
    Bernoulli/binomial      family(binomial)     f(b)
    gamma                   family(gamma)        f(gam)
    Gaussian (normal)       family(gaussian)     f(gau)
    inverse Gaussian        family(igaussian)    f(ig)
    negative binomial       family(nbinomial)    f(nb)
    Poisson                 family(poisson)      f(p)
family(normal) is allowed as a synonym for family(gaussian).
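The link functions defined above can be sketched numerically. The Python below is an illustration, not part of xtgee; it also checks the stated limiting behavior of the odds-power link as k approaches 0:

```python
import math

def cloglog(y):
    # ln{-ln(1 - y)}
    return math.log(-math.log(1 - y))

def logit(y):
    # natural log of the odds
    return math.log(y / (1 - y))

def opower(y, k):
    # odds power: [{y/(1-y)}^k - 1]/k; same as logit at k = 0
    if k == 0:
        return logit(y)
    return ((y / (1 - y)) ** k - 1) / k

def power(y, k):
    # power: y^k; taken as ln(y) at k = 0 per the definition above
    if k == 0:
        return math.log(y)
    return y ** k

y = 0.7
assert logit(0.5) == 0.0                       # even odds
assert abs(cloglog(1 - math.exp(-1))) < 1e-12  # cloglog(1 - 1/e) = 0
# the odds-power link approaches the logit link for small k
assert abs(opower(y, 1e-8) - logit(y)) < 1e-6
assert abs(power(y, 2) - 0.49) < 1e-12
```

The k = 0 case of power is a convention matching the text rather than a pointwise limit of y^k.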
The binomial distribution can be specified as (1) family(binomial), (2) family(binomial #), or (3) family(binomial varname). In case 2, # is the value of the binomial denominator N, the number of trials. Specifying family(binomial 1) is the same as specifying family(binomial); both mean that y has the Bernoulli distribution with values 0 and 1 only. In case 3, varname is the variable containing the binomial denominator, thus allowing the number of trials to vary across observations.

The negative binomial distribution must be specified as family(nbinomial #), where # denotes the value of the parameter a in the negative binomial distribution. The results will be conditional on this value.

You do not have to specify both family() and link(); the default link() is the canonical link for the specified family():

    Family                Default link
    ------------------------------------
    family(binomial)      link(logit)
    family(gamma)         link(reciprocal)
    family(gaussian)      link(identity)
    family(igaussian)     link(power -2)
    family(nbinomial)     link(log)
    family(poisson)       link(log)
If you do specify both family() and link(), note that not all combinations make sense. You may choose among the following combinations:

                       cloglog identity log logit nbinom opower power probit reciprocal
    binomial              x       x      x    x            x      x     x        x
    gamma                         x      x                        x              x
    Gaussian                      x      x                        x              x
    inverse Gaussian              x      x                        x              x
    negative binomial             x      x          x             x              x
    Poisson                       x      x                        x              x
You specify the assumed within-group correlation structure using the corr() option. The allowed correlation structures are

    Correlation structure    xtgee option             Min. abbreviation
    -------------------------------------------------------------------
    Independent              corr(independent)        c(ind)
    Exchangeable             corr(exchangeable)       c(exc)
    Autoregressive           corr(ar #)               c(ar #)
    Stationary               corr(stationary #)       c(sta #)
    Nonstationary            corr(nonstationary #)    c(non #)
    Unstructured             corr(unstructured)       c(uns)
    User-specified           corr(fixed matname)      c(fix matname)
Let us explain. Call R the working correlation matrix for modeling the within-group correlation, a square max{n_i} x max{n_i} matrix. Option corr() specifies the structure of R. Let R_{t,s} denote the t,s element.

The independent structure is defined as

    R_{t,s} = 1    if t = s
            = 0    otherwise
The corr(exchangeable) structure (corresponding to equal-correlation models) is defined as

    R_{t,s} = 1    if t = s
            = rho  otherwise

The corr(ar g) structure is defined as the usual correlation matrix for an AR(g) model. This is sometimes called multiplicative correlation. For example, an AR(1) model is given by

    R_{t,s} = 1            if t = s
            = rho^|t-s|    otherwise

The corr(stationary g) structure is a stationary(g) model. For example, a stationary(1) model is given by

    R_{t,s} = 1    if t = s
            = rho  if |t - s| = 1
            = 0    otherwise

The corr(nonstationary g) structure is a nonstationary(g) model that imposes only the constraints that the elements of the working correlation matrix along the diagonal are 1 and the elements outside the gth band are zero:

    R_{t,s} = 1         if t = s
            = rho_{ts}  if 0 < |t - s| <= g
            = 0         otherwise

The corr(unstructured) structure imposes only the constraint that the diagonal elements of the working correlation matrix are 1:

    R_{t,s} = 1         if t = s
            = rho_{ts}  otherwise,  rho_{ts} = rho_{st}

The corr(fixed matname) specification is taken from the user-supplied matrix, so that

    R = matname

In this case, the correlations are not estimated from the data. The user-supplied matrix must be a valid correlation matrix with 1s on the diagonal.

Full formulas for all the correlation structures are provided in the Methods and Formulas below.
Technical Note

Some family(), link(), and corr() combinations result in models already estimated by Stata. These are

    family()     link()       corr()          Other Stata estimation command
    ------------------------------------------------------------------------
    gaussian     identity     independent     regress
    gaussian     identity     exchangeable    xtreg, re (see note 1)
    gaussian     identity     exchangeable    xtreg, pa
    binomial     cloglog      independent     cloglog (see note 2)
    binomial     cloglog      exchangeable    xtclog, pa
    binomial     logit        independent     logit or logistic
    binomial     logit        exchangeable    xtlogit, pa
    binomial     probit       independent     probit (see note 3)
    binomial     probit       exchangeable    xtprobit, pa
    nbinomial    nbinomial    independent     nbreg (see note 4)
    poisson      log          independent     poisson
    poisson      log          exchangeable    xtpois, pa
    gamma        log          independent     ereg (see note 5)
    family       link         independent     glm, irls (see note 6)
Notes:

1. These methods produce the same results only in the case of balanced panels; see [R] xt.

2. For cloglog estimation, xtgee with corr(independent) and cloglog (see [R] cloglog) will produce the same coefficients, but the standard errors will be only asymptotically equivalent because cloglog is not the canonical link for the binomial family.

3. For probit estimation, xtgee with corr(independent) and probit will produce the same coefficients, but the standard errors will be only asymptotically equivalent because probit is not the canonical link for the binomial family. If the binomial denominator is not 1, the equivalent maximum-likelihood command is bprobit; see [R] probit and [R] glogit.

4. Fitting a negative binomial model using xtgee (or using glm) will yield results conditional on the specified value of a. The nbreg command, however, fits that parameter and provides unconditional estimates; see [R] nbreg.

5. xtgee with corr(independent) can be used to estimate exponential regressions, but this requires specifying scale(1). As with probit, the xtgee-reported standard errors will be only asymptotically equivalent to those produced by ereg (see [R] weibull) because log is not the canonical link for the gamma family. xtgee cannot be used to estimate exponential regressions on censored data.

6. Using the independent correlation structure, the xtgee command will estimate the same model as estimated with the glm, irls command if the family-link combination is the same.

7. If the xtgee command is equivalent to another command, then the use of corr(independent) and the robust option with xtgee corresponds to using both the robust option and the cluster(varname) option in the equivalent command, where varname corresponds to the i() group variable.

xtgee is a direct generalization of the glm, irls command and will give the same output whenever the same family and link are specified together with an independent correlation structure. What makes xtgee useful is

1. the number of statistical models that it generalizes for use with panel data, many of which are not otherwise available in Stata;

2. the richer correlation structure xtgee allows even when models are available through other xt commands; and

3. the availability of robust standard errors (see [U] 23.11 Obtaining robust variance estimates) even when the model and correlation structure are available through other xt commands.

In the following examples, we illustrate the relationships of xtgee with other Stata estimation commands. It is important to remember that although xtgee generalizes many other commands, the computational algorithm is different; therefore, the answers that you obtain will not be identical. The dataset we are using is a subset of the nlswork data (see [R] xt); we are looking at observations prior to 1980.
Example

We can use xtgee to perform the ordinary least-squares estimation performed by regress:

. gen age2 = age*age
(9 missing values generated)
. regress ln_w grade age age2

      Source |       SS       df       MS              Number of obs =   16085
-------------+------------------------------          F(  3, 16081) = 1413.68
       Model |  597.54468     3  199.18156            Prob > F      =  0.0000
    Residual | 2265.74584 16081   .1408958            R-squared     =  0.2087
-------------+------------------------------          Adj R-squared =  0.2085
       Total | 2863.29052 16084  .17802047            Root MSE      =  .37536

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0724483   .0014229   50.91   0.000      .0696592    .0752374
         age |   .1064874   .0083644   12.73   0.000      .0900922    .1228825
        age2 |  -.0016931   .0001655  -10.23   0.000     -.0020174   -.0013688
       _cons |  -.8681487   .1024896   -8.47   0.000      -1.06904   -.6672577

. xtgee ln_w grade age age2, i(id) corr(indep) nmp
Iteration 1: tolerance = 1.310e-12

GEE population-averaged model                     Number of obs      =   16085
Group variable:                     idcode        Number of groups   =    3913
Link:                             identity        Obs per group: min =       1
Family:                           Gaussian                       avg =     4.1
Correlation:                   independent                       max =       9
                                                  Wald chi2(3)       = 4241.04
Scale parameter:                  .1408958        Prob > chi2        =  0.0000

Pearson chi2(16081):               2265.75        Deviance           = 2265.75
Dispersion (Pearson):             .1408958        Dispersion         = .1408958

     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0724483   .0014229   50.91   0.000      .0696594    .0752372
         age |   .1064874   .0083644   12.73   0.000      .0900935    .1228812
        age2 |  -.0016931   .0001655  -10.23   0.000     -.0020174   -.0013688
       _cons |  -.8681487   .1024896   -8.47   0.000     -1.069025   -.6672728

When the nmp option is specified, the coefficients and standard errors produced by the two estimators are exactly the same. Moreover, the scale parameter estimate from the xtgee command equals the MSE calculation from regress; both are estimates of the variance of the residuals.
Example

The identity link and Gaussian family produce regression-type models. With the independent correlation structure, we reproduce ordinary least squares. With the exchangeable correlation structure, we produce an equal-correlation linear regression estimator.

xtgee, fam(gauss) link(ident) corr(exch) is asymptotically equivalent to the weighted-GLS estimator provided by xtreg, re and to the full maximum-likelihood estimator provided by xtreg, mle. In balanced data, xtgee, fam(gauss) link(ident) corr(exch) and xtreg, mle produce exactly the same results. With unbalanced data, the results are close but differ because the two estimators handle unbalanced data differently. For both balanced and unbalanced data, the results produced by xtgee, fam(gauss) link(ident) corr(exch) and xtreg, mle will differ from those produced by xtreg, re. Below, we demonstrate the use of the three estimators with unbalanced data. We begin with xtgee, then show the maximum likelihood estimator xtreg, mle, then show the GLS estimator xtreg, re, and finally show xtgee with the robust option.
. xtgee ln_w grade age age2, i(id) nolog

GEE population-averaged model                     Number of obs      =   16085
Group variable:                     idcode        Number of groups   =    3913
Link:                             identity        Obs per group: min =       1
Family:                           Gaussian                       avg =     4.1
Correlation:                  exchangeable                       max =       9
                                                  Wald chi2(3)       = 2918.26
Scale parameter:                  .1416586        Prob > chi2        =  0.0000

     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0717731     .00211   34.02   0.000      .0676377    .0759086
         age |   .1077645    .006885   15.65   0.000      .0942701    .1212589
        age2 |  -.0016381   .0001362  -12.03   0.000      -.001905   -.0013712
       _cons |  -.9480449   .0869277  -10.91   0.000      -1.11842   -.7776698

. xtreg ln_w grade age age2, i(id) mle

Fitting constant-only model:
Iteration 0:   log likelihood = -6035.2751
Iteration 1:   log likelihood = -5870.6718
Iteration 2:   log likelihood = -5858.9478
Iteration 3:   log likelihood = -5858.8244

Fitting full model:
Iteration 0:   log likelihood = -4591.9241
Iteration 1:   log likelihood = -4562.4406
Iteration 2:   log likelihood = -4562.3526

Random-effects ML regression                      Number of obs      =   16085
Group variable (i): idcode                        Number of groups   =    3913
Random effects u_i ~ Gaussian                     Obs per group: min =       1
                                                                 avg =     4.1
                                                                 max =       9
                                                  LR chi2(3)         = 2592.94
Log likelihood  = -4562.3526                      Prob > chi2        =  0.0000

     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0717747   .0021419   33.51   0.000      .0675766    .0759728
         age |   .1077899   .0068265   15.79   0.000      .0944102    .1211696
        age2 |  -.0016364    .000135  -12.12   0.000     -.0019011   -.0013718
       _cons |  -.9500833   .0863831  -11.00   0.000     -1.119391   -.7807755
-------------+----------------------------------------------------------------
    /sigma_u |   .2689639    .004085   65.84   0.000      .2609574    .2769704
    /sigma_e |   .2669944   .0017113  156.02   0.000      .2636404    .2703484
-------------+----------------------------------------------------------------
         rho |   .5036748   .0086443                       .486734    .5206089

Likelihood-ratio test of sigma_u=0: chibar2(01) = 4996.22  Prob >= chibar2 = 0.000
. xtreg ln_w grade age age2, i(id) re

Random-effects GLS regression                     Number of obs      =   16085
Group variable (i): idcode                        Number of groups   =    3913
R-sq:  within  = 0.0583                           Obs per group: min =       1
       between = 0.2946                                          avg =     4.1
       overall = 0.2076                                          max =       9
Random effects u_i ~ Gaussian                     Wald chi2(3)       = 2875.09
corr(u_i, X)   = 0 (assumed)                      Prob > chi2        =  0.0000

     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0717757   .0021665   33.13   0.000      .0675295     .076022
         age |   .1078042   .0068126   15.82   0.000      .0944518    .1211566
        age2 |  -.0016355   .0001347  -12.14   0.000     -.0018996   -.0013714
       _cons |  -.9512088   .0863141  -11.02   0.000     -1.120381   -.7820363
-------------+----------------------------------------------------------------
     sigma_u |   .2753336
     sigma_e |   .2672536
         rho |   .5140157   (fraction of variance due to u_i)
. xtgee ln_w grade age age2, i(id) nolog robust

GEE population-averaged model                     Number of obs      =   16085
Group variable:                     idcode        Number of groups   =    3913
Link:                             identity        Obs per group: min =       1
Family:                           Gaussian                       avg =     4.1
Correlation:                  exchangeable                       max =       9
                                                  Wald chi2(3)       = 2031.28
Scale parameter:                  .1416586        Prob > chi2        =  0.0000

                             (standard errors adjusted for clustering on idcode)

             |             Semi-robust
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0717731   .0023341   30.75   0.000      .0671983    .0763479
         age |   .1077645   .0098097   10.99   0.000      .0885379    .1269911
        age2 |  -.0016381   .0001964   -8.34   0.000      -.002023   -.0012532
       _cons |  -.9480449   .1196009   -7.93   0.000     -1.182462   -.7136274
In [R] regress, we noted the ability of regress, robust cluster() to produce inefficient coefficient estimates with valid standard errors for random-effects models. These standard errors are robust to model misspecification. The robust option of xtgee, on the other hand, requires that the model correctly specifies the mean.
Example

One of the features of xtgee is being able to estimate richer correlation structures. In the previous example, we estimated the model

. xtgee ln_w grade age age2, i(id)

After estimation, xtcorr will report the working correlation matrix R:
. xtcorr

Estimated within-idcode correlation matrix R:

         c1      c2      c3      c4      c5      c6      c7      c8      c9
 r1  1.0000
 r2  0.4851  1.0000
 r3  0.4851  0.4851  1.0000
 r4  0.4851  0.4851  0.4851  1.0000
 r5  0.4851  0.4851  0.4851  0.4851  1.0000
 r6  0.4851  0.4851  0.4851  0.4851  0.4851  1.0000
 r7  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  1.0000
 r8  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  1.0000
 r9  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  0.4851  1.0000
The equal-correlation model corresponds to an exchangeable correlation structure, meaning that the correlation of observations within person is a constant. The working correlation estimated by xtgee is 0.4851. (xtreg, re, by comparison, reports .5140.) We constrained the model to have this simple correlation structure. What if we relaxed the constraint? To go to the other extreme, let's place no constraints on the matrix (other than it being symmetric). This we do by specifying correlation(unstructured), although we can abbreviate the option.

. xtgee ln_w grade age age2, i(id) t(year) corr(unstr) nolog

GEE population-averaged model                     Number of obs      =   16085
Group and time vars:           idcode year       Number of groups   =    3913
Link:                             identity       Obs per group: min =       1
Family:                           Gaussian                      avg =     4.1
Correlation:                  unstructured                      max =       9
                                                 Wald chi2(3)       = 2405.20
Scale parameter:                  .1418513       Prob > chi2        =  0.0000

     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0720684    .002151   33.50   0.000      .0678525    .0762843
         age |   .1008095   .0081471   12.37   0.000      .0848416    .1167775
        age2 |  -.0015104   .0001617   -9.34   0.000     -.0018272   -.0011936
       _cons |  -.8645484   .1009488   -8.56   0.000     -1.062404   -.6666923

. xtcorr

Estimated within-idcode correlation matrix R:

         c1      c2      c3      c4      c5      c6      c7      c8      c9
 r1  1.0000
 r2  0.4355  1.0000
 r3  0.4280  0.5597  1.0000
 r4  0.3772  0.5012  0.5475  1.0000
 r5  0.4031  0.5301  0.5027  0.6216  1.0000
 r6  0.3664  0.4519  0.4783  0.5685  0.7306  1.0000
 r7  0.2820  0.3606  0.3918  0.4012  0.4643  0.5022  1.0000
 r8  0.3162  0.3446  0.4285  0.4389  0.4697  0.5223  0.6476  1.0000
 r9  0.2149  0.3078  0.3337  0.3584  0.4866  0.4613  0.5791  0.7387  1.0000

This correlation matrix looks quite different from the previously constrained one and shows, in particular, that the serial correlation of the residuals diminishes as the lag increases, although residuals separated by small lags are more correlated than, say, AR(1) would imply.
Example

In [R] xtprobit, we showed a random-effects model of unionization using the union data described in [R] xt. We estimated using xtprobit but said it could be estimated using xtgee as well, and here we estimate a population-averaged (equal-correlation) model for comparison:

. xtgee union age grade not_smsa south southXt, i(id) fam(bin) link(probit)
Iteration 1: tolerance = .04796083
Iteration 2: tolerance = .00362657
Iteration 3: tolerance = .00017886
Iteration 4: tolerance = 8.654e-06
Iteration 5: tolerance = 4.150e-07

GEE population-averaged model                     Number of obs      =   26200
Group variable:                     idcode        Number of groups   =    4434
Link:                               probit        Obs per group: min =       1
Family:                           binomial                       avg =     5.9
Correlation:                  exchangeable                       max =      12
                                                  Wald chi2(5)       =  241.66
Scale parameter:                         1        Prob > chi2        =  0.0000

       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0031597   .0014678    2.15   0.031      .0002829    .0060366
       grade |   .0329992   .0062334    5.29   0.000       .020782    .0452163
    not_smsa |  -.0721799   .0275189   -2.62   0.009     -.1261159   -.0182439
       south |   -.409029   .0372213  -10.99   0.000     -.4819815   -.3360765
     southXt |   .0081828    .002545    3.22   0.001      .0031946    .0131709
       _cons |  -1.184799   .0890117  -13.31   0.000     -1.359259    -1.01034
Let us look at the correlation structure and then relax it:

. xtcorr

Estimated within-idcode correlation matrix R:

          c1      c2      c3      c4      c5      c6      c7      c8      c9
 r1   1.0000
 r2   0.4630  1.0000
 r3   0.4630  0.4630  1.0000
 r4   0.4630  0.4630  0.4630  1.0000
 r5   0.4630  0.4630  0.4630  0.4630  1.0000
 r6   0.4630  0.4630  0.4630  0.4630  0.4630  1.0000
 r7   0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  1.0000
 r8   0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  1.0000
 r9   0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  1.0000
 r10  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630
 r11  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630
 r12  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630  0.4630

         c10     c11     c12
 r10  1.0000
 r11  0.4630  1.0000
 r12  0.4630  0.4630  1.0000

xtgee estimated the correlation between observations within person to be 0.4630. We have a lot of data (an average of 5.9 observations on 4,434 women), so estimating the full correlation matrix is feasible. Let's do that and then examine the results:
. xtgee union age grade not_smsa south southXt, i(id) t(t) fam(bin) link(probit)
> corr(unstr) nolog

GEE population-averaged model                     Number of obs      =   26200
Group and time vars:              idcode t       Number of groups   =    4434
Link:                               probit       Obs per group: min =       1
Family:                           binomial                      avg =     5.9
Correlation:                  unstructured                      max =      12
                                                 Wald chi2(5)       =  196.76
Scale parameter:                         1       Prob > chi2        =  0.0000

       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0020207   .0019768    1.02   0.307     -.0018539    .0058952
       grade |   .0349572   .0065627    5.33   0.000      .0220946    .0478198
    not_smsa |  -.0951058   .0291532   -3.26   0.001      -.152245   -.0379665
       south |  -.3891526   .0434868   -8.95   0.000     -.4743853     -.30392
     southXt |   .0078823   .0034032    2.32   0.021      .0012121    .0145524
       _cons |  -1.194276   .1000155  -11.94   0.000     -1.390303   -.9982495

. xtcorr

Estimated within-idcode correlation matrix R:

          c1      c2      c3      c4      c5      c6      c7      c8      c9
 r1   1.0000
 r2   0.6796  1.0000
 r3   0.6272  0.6628  1.0000
 r4   0.5365  0.5800  0.6170  1.0000
 r5   0.3377  0.3716  0.4037  0.4810  1.0000
 r6   0.3079  0.3771  0.4283  0.4591  0.6435  1.0000
 r7   0.3053  0.3630  0.3887  0.4299  0.4949  0.6407  1.0000
 r8   0.2807  0.3062  0.3251  0.3762  0.4691  0.5610  0.7000  1.0000
 r9   0.3045  0.3013  0.3042  0.3822  0.4620  0.5101  0.6093  0.6709  1.0000
 r10  0.2324  0.2630  0.2779  0.3655  0.3987  0.4921  0.5878  0.5957  0.6308
 r11  0.2369  0.2321  0.2716  0.3265  0.3555  0.4425  0.5094  0.5607  0.5740
 r12  0.2400  0.2374  0.2561  0.3153  0.3478  0.3835  0.4782  0.4985  0.5404

         c10     c11     c12
 r10  1.0000
 r11  0.5706  1.0000
 r12  0.5302  0.6406  1.0000

As before, we find that the correlation of residuals decreases as the lag increases, but more slowly than an AR(1) process would imply, and then it quickly decreases toward zero.
Example

In this example, we examine injury incidents among 20 airlines in each of 4 years. The data are fictional and, as a matter of fact, really are from a random-effects model.
. gen lnpm = ln(pmiles)

. xtgee i_cnt inprog, f(pois) i(airline) t(time) eform off(lnpm) nolog

GEE population-averaged model                     Number of obs      =      80
Group variable:                    airline        Number of groups   =      20
Link:                                  log        Obs per group: min =       4
Family:                            Poisson                       avg =     4.0
Correlation:                  exchangeable                       max =       4
                                                  Wald chi2(1)       =    5.27
Scale parameter:                         1        Prob > chi2        =  0.0217

       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .9059936   .0389528   -2.30   0.022      .8327758    .9856487
        lnpm |   (offset)

. xtcorr

Estimated within-airline correlation matrix R:

         c1      c2      c3      c4
 r1  1.0000
 r2  0.4606  1.0000
 r3  0.4606  0.4606  1.0000
 r4  0.4606  0.4606  0.4606  1.0000

Now, there are not really enough data here to reliably estimate the correlation without any constraints of structure, but here is what happens if we try:

. xtgee i_cnt inprog, f(pois) i(airline) t(time) eform off(lnpm) corr(unstr) nolog

GEE population-averaged model                     Number of obs      =      80
Group and time vars:          airline time       Number of groups   =      20
Link:                                  log       Obs per group: min =       4
Family:                            Poisson                      avg =     4.0
Correlation:                  unstructured                      max =       4
                                                 Wald chi2(1)       =    0.36
Scale parameter:                         1       Prob > chi2        =  0.5496

       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .9791082   .0345486   -0.60   0.550      .9136826    1.049219
        lnpm |   (offset)

. xtcorr

Estimated within-airline correlation matrix R:

         c1      c2      c3      c4
 r1  1.0000
 r2  0.5700  1.0000
 r3  0.7164  0.4192  1.0000
 r4  0.2383  0.3840  0.3521  1.0000
There is no sensible pattern to the correlations. We admitted previously that we created this dataset from a random-effects Poisson model. We reran our data-creation program and this time had it create 400 airlines rather than 20, still with 4 years of data each. Here is the equal-correlation model and estimated correlation structure:
. xtgee i_cnt inprog, f(pois) i(airline) eform off(lnpm) nolog

GEE population-averaged model                     Number of obs      =    1600
Group variable:                    airline        Number of groups   =     400
Link:                                  log        Obs per group: min =       4
Family:                            Poisson                       avg =     4.0
Correlation:                  exchangeable                       max =       4
                                                  Wald chi2(1)       =  111.80
Scale parameter:                         1        Prob > chi2        =  0.0000

       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .8915304   .0096807  -10.57   0.000      .8727571    .9107076
        lnpm |   (offset)

. xtcorr

Estimated within-airline correlation matrix R:

         c1      c2      c3      c4
 r1  1.0000
 r2  0.5292  1.0000
 r3  0.5292  0.5292  1.0000
 r4  0.5292  0.5292  0.5292  1.0000

and here are the estimation results assuming unstructured correlation:

. xtgee i_cnt inprog, f(pois) i(airline) corr(unstr) t(time) eform off(lnpm) nolog

GEE population-averaged model                     Number of obs      =    1600
Group and time vars:          airline time       Number of groups   =     400
Link:                                  log       Obs per group: min =       4
Family:                            Poisson                      avg =     4.0
Correlation:                  unstructured                      max =       4
                                                 Wald chi2(1)       =  113.43
Scale parameter:                         1       Prob > chi2        =  0.0000

       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .8914155   .0096208  -10.65   0.000      .8727572    .9104728
        lnpm |   (offset)

. xtcorr

Estimated within-airline correlation matrix R:

         c1      c2      c3      c4
 r1  1.0000
 r2  0.4733  1.0000
 r3  0.5241  0.5749  1.0000
 r4  0.5140  0.5049  0.5841  1.0000
The equal-correlation model estimated a fixed correlation of .5292, and above we have correlations ranging between .4733 and .5841 with little pattern in their structure.
Saved Results

xtgee saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       chi-squared
    e(df_pear)    degrees of freedom for Pearson chi-squared
    e(deviance)   deviance
    e(chi2_dev)   chi-squared test of deviance
    e(dispers)    deviance dispersion
    e(chi2_dis)   chi-squared test of deviance dispersion
    e(tol)        target tolerance
    e(dif)        achieved tolerance
    e(phi)        scale parameter

Macros
    e(cmd)        xtgee
    e(depvar)     name of dependent variable
    e(family)     distribution family
    e(link)       link function
    e(corr)       correlation structure
    e(scale)      x2, dev, phi, or #; scale parameter
    e(ivar)       variable denoting groups
    e(chi2type)   Wald; type of model chi-squared test
    e(disp)       deviance dispersion
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(R)          estimated working correlation matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas

xtgee is implemented as an ado-file.

xtgee estimates general linear models for panel data using the GEE approach described in Liang and Zeger (1986). Below we present the derivation of that estimator. A related method, referred to as GEE2, is described in Zhao and Prentice (1990) and Prentice and Zhao (1991). The GEE2 method attempts to gain efficiency in the estimation of beta by specifying a parametric model for alpha and then relies on assuming that the models for both the mean and the dependency parameters are correct. Thus, there is a tradeoff of robustness for efficiency. The preliminary work of Liang, Zeger, and Qaqish (1987), however, indicates that there is little efficiency gained with this alternate approach.

In the GLM approach (see McCullagh and Nelder 1989), we assume that

    Cov(y_i) = A_i    for independent observations

In the absence of a convenient likelihood function with which to work, one can rely on a multivariate analog of the quasi-score function introduced by Wedderburn (1974).
The correlation parameters alpha can be solved for by simultaneously solving the estimating equations for beta and alpha.

In the GEE approach to GLM, we let R_i(alpha) be a "working" correlation matrix depending on the parameters in alpha (see the Correlation structures section for the number of parameters), and we estimate beta by solving the generalized estimating equation

    U(beta) = sum over i = 1, ..., m of D_i' V_i^(-1) S_i = 0

where V_i(alpha) = A_i^(1/2) R_i(alpha) A_i^(1/2). To solve the above, we need only a crude approximation to the variance matrix, which we can obtain from a Taylor series expansion based on the squared residuals (y_ij - mu_ij)^2.
Calculation of GEE for GLM

Using the notation from Liang and Zeger (1986), let y_i = (y_i1, ..., y_in_i)' be the n_i x 1 vector of outcome values, and let X_i = (x_i1, ..., x_in_i)' be the n_i x p matrix of covariate values for the ith subject, i = 1, ..., m. We assume that the marginal density for y_ij may be written in exponential family notation as

    f(y_ij) = exp[{y_ij theta_ij - a(theta_ij) + b(y_ij)} phi]

where theta_ij = h(eta_ij) and eta_ij = x_ij beta. Under this formulation, the first two moments are given by

    E(y_ij) = a'(theta_ij),    Var(y_ij) = a''(theta_ij)/phi

We define the quantities (assuming that we have an n x n working correlation matrix R(alpha))
xtgoe — Estimate populatioin-averaged panel-data models using GEE Ai =
343
di&g(d9ij/dr)ij) n x n matrix
A, = diag{a"(#ij)}
n x n matrix
Si = yi - a'(9i}
n x 1 matrix
DJ = AiAiX.i
nxp matrix
V; = A t 1/2 R(a)A- /2
nxn matrix
such that the GEE becomes

    Σ_{i=1}^m D_i^T V_i^{-1} S_i = 0

We then have that

    β̂_{j+1} = β̂_j + { Σ_{i=1}^m D_i^T(β̂_j) V_i^{-1}(β̂_j) D_i(β̂_j) }^{-1} { Σ_{i=1}^m D_i^T(β̂_j) V_i^{-1}(β̂_j) S_i(β̂_j) }

where the term

    { Σ_{i=1}^m D_i^T(β̂_j) V_i^{-1}(β̂_j) D_i(β̂_j) }^{-1}

is what we call the IRLS variance estimate (iteratively reweighted least squares). It is used to calculate the standard errors if the robust option is not specified. See Liang and Zeger (1986) for the calculation of the robust variance estimator. Define

    V = nm × nm block-diagonal matrix with blocks V_i
    Z = Dβ + S

At a given iteration, the correlation parameters α and scale parameter φ can be estimated from the current Pearson residuals, defined by

    r̂_{i,j} = { y_{i,j} − a'(θ̂_{i,j}) } / { a''(θ̂_{i,j}) }^{1/2}

where θ̂_{i,j} depends on the current value for β̂. We can then estimate φ by

    φ̂^{-1} = Σ_{i=1}^m Σ_{j=1}^{n_i} r̂_{i,j}² / (N − p)

where N = Σ_{i=1}^m n_i is the total number of observations and p is the number of regression parameters.
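The updating step above can be made concrete with a small numerical sketch. The following NumPy code is a hypothetical illustration (not Stata code and not xtgee's actual implementation) of one update for the Gaussian identity-link special case, where D_i = X_i, S_i = y_i − X_iβ, and V_i = R(α):

```python
import numpy as np

def gee_update(beta, X_panels, y_panels, R):
    """One GEE/Fisher-scoring update for the Gaussian identity link.

    Following the definitions in the text: D_i = X_i, S_i = y_i - X_i @ beta,
    and V_i = R, the working correlation matrix.
    """
    p = len(beta)
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    Rinv = np.linalg.inv(R)
    for X_i, y_i in zip(X_panels, y_panels):
        S_i = y_i - X_i @ beta
        lhs += X_i.T @ Rinv @ X_i      # sum of D_i' V_i^{-1} D_i
        rhs += X_i.T @ Rinv @ S_i      # sum of D_i' V_i^{-1} S_i
    return beta + np.linalg.solve(lhs, rhs)

# With R = I (independent working correlation) and exactly linear data,
# a single update from beta = 0 lands on the pooled OLS solution.
rng = np.random.default_rng(0)
X_panels = [rng.normal(size=(4, 2)) for _ in range(3)]
y_panels = [X @ np.array([1.0, -2.0]) for X in X_panels]
beta1 = gee_update(np.zeros(2), X_panels, y_panels, np.eye(4))
```

Because the outcome here is an exact linear function of the covariates, one iteration already solves the estimating equation.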
As the above general derivation is complicated, let's follow the derivation of the Gaussian family with the identity link (regression) to illustrate the generalization. After making the appropriate substitutions, we will see a familiar updating equation. First, we rewrite the updating equation for β as

    β̂_{j+1} = { Σ_{i=1}^m D_i^T V_i^{-1} D_i }^{-1} { Σ_{i=1}^m D_i^T V_i^{-1} Z_i }
and then derive the expressions for Z_i and D_i. For the Gaussian family with the identity link, θ_{i,j} = η_{i,j} = x_{i,j}β, a(θ) = θ²/2, a'(θ) = θ, and a''(θ) = 1, so that Δ_i = I and A_i = I. Substituting,

    Z_i = D_i β̂_j + S_i = X_i β̂_j + (y_i − X_i β̂_j) = y_i
    D_i = A_i Δ_i X_i = X_i
    V_i = A_i^{1/2} R(α) A_i^{1/2} = R(α)

So, that means that we may write the update formula as

    β̂_{j+1} = { Σ_{i=1}^m X_i^T R^{-1}(α) X_i }^{-1} { Σ_{i=1}^m X_i^T R^{-1}(α) y_i }
which is the same formula as for IRLS in regression.

Correlation structures

The working correlation matrix R is a function of α and is more accurately written as R(α). Depending on the assumed correlation structure, α might be:

    Independent       no parameters to estimate
    Exchangeable      α is a scalar
    Autoregressive    α is a vector
    Stationary        α is a vector
    Nonstationary     α is a matrix
    Unstructured      α is a matrix
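To make the structures in the table concrete, here is a small hypothetical NumPy helper (not part of xtgee) that builds R(α) for a few of the structures, given an already-estimated α:

```python
import numpy as np

def working_corr(n, structure, alpha=None):
    """Build an n x n working correlation matrix R(alpha).

    A hypothetical helper mirroring the table above: 'alpha' is a scalar
    for the exchangeable structure and a vector of lag correlations
    (alpha[0] = 1) for the stationary structure.
    """
    if structure == "independent":
        return np.eye(n)
    if structure == "exchangeable":
        # alpha off the diagonal, 1 on the diagonal
        return np.full((n, n), float(alpha)) + (1.0 - alpha) * np.eye(n)
    if structure == "stationary":
        # R[s, t] = alpha[|s - t|] for |s - t| <= g, 0 otherwise
        R = np.zeros((n, n))
        g = len(alpha) - 1
        for s in range(n):
            for t in range(n):
                if abs(s - t) <= g:
                    R[s, t] = alpha[abs(s - t)]
        return R
    raise ValueError(structure)

R_exch = working_corr(3, "exchangeable", 0.4)
R_stat = working_corr(4, "stationary", [1.0, 0.5])
```

The stationary call above corresponds to a lag-1 structure (g = 1), so entries more than one period apart are zero.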
Also note that, throughout the estimation of a general unbalanced panel, it is more proper to discuss R_i, which is the upper left n_i × n_i submatrix of the ultimately saved matrix in e(R), which is max{n_i} × max{n_i}. The only panels that enter into the estimation for a lag-dependent correlation structure are those with n_i > g (assuming a lag of g). xtgee drops panels with too few observations (and mentions the fact when it does so).
Independent

The working correlation matrix R is an identity matrix.

Exchangeable

The scalar parameter α is estimated from the Pearson residuals as

    α̂ = φ̂ { Σ_{i=1}^m Σ_{j=1}^{n_i} Σ_{k≠j} r̂_{i,j} r̂_{i,k} } / { Σ_{i=1}^m n_i(n_i − 1) }

and the working correlation matrix is given by

    R_{s,t} = 1 if s = t;  R_{s,t} = α otherwise
Autoregressive and stationary

These two structures require g parameters to be estimated, so that α is a vector of length g + 1 (the first element of α is 1). The remaining elements are estimated from the lagged cross-products of the Pearson residuals,

    α̂_l = φ̂ { Σ_{i=1}^m Σ_{j=1}^{n_i−l} r̂_{i,j} r̂_{i,j+l} } / { Σ_{i=1}^m (n_i − l) }    for l = 1, ..., g

The working correlation matrix for the AR model is calculated as a function of Toeplitz matrices formed from the α vector; see Newton (1988) for a discussion of the algorithm. The working correlation matrix for the stationary model is given by

    R_{s,t} = α_{|s−t|} if |s − t| ≤ g;  R_{s,t} = 0 otherwise

Nonstationary and unstructured

These two correlation structures require a matrix of parameters. α is estimated (where we replace r̂_{i,j} = 0 whenever i > n_i or j > n_i) elementwise as

    α̂_{p,q} = φ̂ { Σ_{i=1}^m r̂_{i,p} r̂_{i,q} } / N_{p,q}

where N_{p,q} = Σ_{i=1}^m n_{i,p,q} and

    n_{i,p,q} = 1 if panel i has valid observations at times p and q; 0 otherwise

Here N_{i,j} = min(N_i, N_j), N_i = number of panels observed at time i, and n = max(n_1, n_2, ..., n_m).

The working correlation matrix for the nonstationary model is given by

    R_{s,t} = 1 if s = t;  R_{s,t} = α_{s,t} if 0 < |s − t| ≤ g;  R_{s,t} = 0 otherwise
The working correlation matrix for the unstructured model is given by

    R_{s,t} = 1 if s = t;  R_{s,t} = α_{s,t} otherwise

such that the unstructured model is equal to the nonstationary model at lag g = n − 1, where the panels are balanced with n_i = n for all i.
References

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Liang, K.-Y. 1987. Estimating functions and approximate conditional likelihood. Biometrika 74: 695-702.
Liang, K.-Y. and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.
Liang, K.-Y., S. L. Zeger, and B. Qaqish. 1987. Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B 54: 3-40.
McCullagh, P. and J. A. Nelder. 1989. Generalized Linear Models. 2d ed. London: Chapman & Hall.
Nelder, J. A. and R. W. M. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A 135: 370-384.
Newton, H. J. 1988. TIMESLAB: A Time Series Analysis Laboratory. Belmont, CA: Wadsworth & Brooks/Cole.
Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. International Statistical Review 64: 89-118.
Prentice, R. L. and L. P. Zhao. 1991. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47: 825-839.
Rabe-Hesketh, S., A. Pickles, and C. Taylor. 2000. sg129: Generalized linear latent and mixed models. Stata Technical Bulletin 53: 47-57. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 293-307.
Wedderburn, R. W. M. 1974. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61: 439-447.
Zeger, S. L. and K.-Y. Liang. 1986. Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42: 121-130.
Zeger, S. L., K.-Y. Liang, and P. S. Albert. 1988. Models for longitudinal data: a generalized estimating equation approach. Biometrics 44: 1049-1060.
Zhao, L. P. and R. L. Prentice. 1990. Correlated binary regression using a quadratic exponential model. Biometrika 77: 642-648.
Also See

Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:        [R] logistic, [R] prais, [R] regress, [R] svy estimators, [R] xtclog, [R] xtgls, [R] xtintreg, [R] xtlogit, [R] xtnbreg, [R] xtpcse, [R] xtpois, [R] xtprobit, [R] xtreg, [R] xtregar, [R] xttobit
Background:     [U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [U] 23.12 Obtaining scores, [R] xt
Title

xtgls — Estimate panel-data models using GLS
Syntax

xtgls depvar [varlist] [weight] [if exp] [in range] [, i(varname) t(varname)
    panels({iid|heteroskedastic|correlated}) corr({independent|ar1|psar1})
    force igls nmk noconstant level(#) tolerance(#) iterate(#) nolog
    rhotype({regress|dw|freg|nagar|theil|tscorr}) ]

by ...: may be used with xtgls; see [R] by.
aweights are allowed; see [U] 14.1.6 weight.
xtgls shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtgls estimates cross-sectional time-series linear models using feasible generalized least squares. This command allows estimation in the presence of AR(1) autocorrelation within panels and cross-sectional correlation and/or heteroskedasticity across panels.
Options

i(varname) specifies the variable that identifies the panel to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

t(varname) specifies the variable that contains the time at which the observation was made. You can specify the t() option the first time you estimate, or you can use the tis command to set t() beforehand. After that, Stata will remember the variable's identity. See [R] xt.
    xtgls does not need to know t() in all cases and, in those cases, whether you specify t() makes no difference. We note in the descriptions of the panels() and corr() options when t() is required. When t() is required, it is also required that the observations be spaced equally over time; however, see option force below.

force specifies that estimation is to be forced even though t() is not equally spaced. This is relevant only for correlation structures that require knowledge of t(). These correlation structures require that observations be equally spaced so that calculations based on lags correspond to a constant time change. If you specify a t() variable that indicates observations are not equally spaced, xtgls will refuse to estimate the (time-dependent) model. If you also specify force, xtgls will estimate the model and assume that the lags based on the data ordered by t() are appropriate.
igls requests an iterated GLS estimator instead of the two-step GLS estimator in the case of a nonautocorrelated model, or instead of the three-step GLS estimator in the case of an autocorrelated model. The iterated GLS estimator converges to the MLE for the corr(independent) models, but does not for the other corr() models.

nmk specifies that standard errors are to be normalized by n − k, where k is the number of parameters estimated, rather than n, the number of observations. The one used varies among authors. Greene (2000, 600) recommends n and remarks that whether you use n or n − k does not make the variance calculation unbiased in these models.

panels(pdist) specifies the error structure across panels.
    panels(iid) specifies a homoskedastic error structure with no cross-sectional correlation. This is the default.
    panels(heteroskedastic) (typically abbreviated p(h)) specifies a heteroskedastic error structure with no cross-sectional correlation.
    panels(correlated) (abbreviation p(c)) specifies a heteroskedastic error structure with cross-sectional correlation. If p(c) is specified, you must also specify t(). Note that the results will be based on a generalized inverse of a singular matrix unless T ≥ m (the number of time periods is greater than or equal to the number of panels).

corr(corr) specifies the assumed autocorrelation within panels.
    corr(independent) (abbreviation c(i)) specifies that there is no autocorrelation. This is the default.
    corr(ar1) (abbreviation c(a)) specifies that, within panels, there is AR(1) autocorrelation and that the coefficient of the AR(1) process is common to all the panels. If c(a) is specified, you must also specify t().
    corr(psar1) (abbreviation c(p)) specifies that, within panels, there is AR(1) autocorrelation and that the coefficient of the AR(1) process is specific to each panel. psar1 stands for panel-specific AR(1). If c(p) is specified, t() must also be specified.
noconstant suppresses the estimation of the constant term (intercept).

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

tolerance(#) specifies the convergence criterion for the maximum change in the estimated coefficient vector between iterations; tol(1e-6) is the default.

iterate(#) specifies the maximum number of iterations allowed in estimating the model; iterate(100) is the default. You should never have to specify this option.

nolog suppresses the iteration log.

rhotype(calc) specifies the method to be used to calculate the autocorrelation parameter. Allowed strings for calc are
    regress    regression using lags (the default)
    dw         Durbin-Watson calculation
    freg       regression using leads
    nagar      Nagar calculation
    theil      Theil calculation
    tscorr     time-series autocorrelation calculation
All the calculations are asymptotically equivalent and all are consistent; this is a rarely used option.
Options for predict

xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.
Remarks

Information on GLS can be found in Greene (2000), Maddala (1992), Davidson and MacKinnon (1993), and Judge et al. (1985).

If you have a large number of panels relative to time periods, see [R] xtreg and [R] xtgee. xtgee, in particular, provides similar capabilities as xtgls but does not allow cross-sectional correlation. On the other hand, xtgee will allow a richer description of the correlation within panels, subject to the constraint that the same correlations apply to all panels. That is, xtgls provides two unique features:

1. Cross-sectional correlation may be modeled (panels(correlated)).
2. Within panels, the AR(1) correlation coefficient may be unique (corr(psar1)).

It is also true that xtgls allows models with heteroskedasticity and no cross-sectional correlation, whereas xtgee does not, but xtgee with the robust option relaxes the assumption of equal variances, at least as far as the standard error calculation is concerned.

In addition, xtgls, panels(iid) corr(independent) nmk is equivalent to regress. The nmk option uses n − k rather than n to normalize the variance calculation.

To estimate a model with autocorrelated errors (corr(ar1) or corr(psar1)), the data must be equally spaced in time. To estimate a model with cross-sectional correlation (panels(correlated)), panels must have the same number of observations (be balanced).

The equation from which the models are developed is given by

    y_{it} = x_{it} β + ε_{it}
where i = 1, ..., m is the number of units (or panels) and t = 1, ..., T_i is the number of observations for panel i. This model can equally be written as

    | y_1 |   | X_1 |       | ε_1 |
    | y_2 | = | X_2 | β  +  | ε_2 |
    | ... |   | ... |       | ... |
    | y_m |   | X_m |       | ε_m |

The variance matrix of the disturbance terms can be written as

    E[εε'] = Ω = | σ_{1,1} Ω_{1,1}   σ_{1,2} Ω_{1,2}   ...   σ_{1,m} Ω_{1,m} |
                 | σ_{2,1} Ω_{2,1}   σ_{2,2} Ω_{2,2}   ...   σ_{2,m} Ω_{2,m} |
                 |   ...                                                     |
                 | σ_{m,1} Ω_{m,1}   σ_{m,2} Ω_{m,2}   ...   σ_{m,m} Ω_{m,m} |

In order for the Ω_{i,j} matrices to be parameterized to model cross-sectional correlation, they must be square (balanced panels). In these models, we assume that the coefficient vector β is the same for all panels and consider a variety of models by changing the assumptions on the structure of Ω.
For the classic OLS regression model, we have

    E[ε_{i,t}] = 0
    Var[ε_{i,t}] = σ²
    Cov[ε_{i,t}, ε_{j,s}] = 0   if t ≠ s or i ≠ j

This amounts to assuming that Ω has the structure given by

    Ω = | σ²I   0    ...   0   |
        |  0   σ²I   ...   0   |
        | ...                  |
        |  0    0    ...  σ²I  |

whether or not the panels are balanced (the 0 matrices may be rectangular). The classic OLS assumptions are the default panels(iid) and corr(independent) options for this command.
Heteroskedasticity across panels

In many cross-sectional datasets, the variance for each of the panels will differ. It is common to have data on countries, states, or other units that have variation of scale. The heteroskedastic model is specified by including the panels(heteroskedastic) option, which assumes that

    Ω = | σ_1²I   0     ...    0    |
        |  0     σ_2²I  ...    0    |
        | ...                       |
        |  0      0     ...  σ_m²I  |
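The heteroskedastic structure above leads to a simple two-step recipe: estimate each panel's variance from OLS residuals, then reweight. The following NumPy code is a hypothetical sketch of that idea (an illustration of panel-heteroskedastic FGLS, not xtgls's implementation):

```python
import numpy as np

def fgls_hetero(X_panels, y_panels):
    """Two-step FGLS with panel-specific variances (a sketch of the
    panels(heteroskedastic) idea, not xtgls itself)."""
    X = np.vstack(X_panels)
    y = np.concatenate(y_panels)
    # Step 1: OLS residuals give an estimate of each panel's variance.
    b_start, *_ = np.linalg.lstsq(X, y, rcond=None)
    sig2 = [np.mean((y_i - X_i @ b_start) ** 2)
            for X_i, y_i in zip(X_panels, y_panels)]
    # Step 2: weighted (GLS) regression using Omega^{-1} weights.
    w = np.concatenate([np.full(len(y_i), 1.0 / s2)
                        for y_i, s2 in zip(y_panels, sig2)])
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

# Demo: with two identical panels the estimated variances are equal,
# so the weights cancel and FGLS coincides with pooled OLS.
X1 = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y1 = np.array([0.1, 1.0, 2.1, 2.9])
b_fgls = fgls_hetero([X1, X1], [y1, y1])
b_ols, *_ = np.linalg.lstsq(np.vstack([X1, X1]),
                            np.concatenate([y1, y1]), rcond=None)
```

When the panel variances genuinely differ, the two estimators diverge, with FGLS downweighting the noisier panels.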
Example

Greene (2000, 592 and CD) reprints data from a classic study of investment demand by Grunfeld and Griliches (1960). Below we allow the variances to differ for each of the five companies.

. xtgls invest market stock, i(company) panels(hetero)

Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        heteroskedastic
Correlation:   no autocorrelation

Estimated covariances      =         5          Number of obs      =       100
Estimated autocorrelations =         0          Number of groups   =         5
Estimated coefficients     =         3          No. of time periods=        20
                                                Wald chi2(2)       =    865.38
Log likelihood             = -570.1305          Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
      market |   .0949905    .007409    12.82   0.000     .0804692    .1095118
       stock |   .3378129   .0302254    11.18   0.000     .2785722    .3970535
       _cons |   -36.2537   6.124363    -5.92   0.000    -48.25723   -24.25017
Correlation across panels (cross-sectional correlation)

We may wish to assume that the error terms of panels are correlated, in addition to having different scale variances. This variance structure is specified by including the panels(correlated) option and is given by

    Ω = | σ_{1,1}I   σ_{1,2}I   ...   σ_{1,m}I |
        | σ_{2,1}I   σ_{2,2}I   ...   σ_{2,m}I |
        |   ...                                |
        | σ_{m,1}I   σ_{m,2}I   ...   σ_{m,m}I |

Note that since we must estimate cross-sectional correlation in this model, the panels must be balanced (and T ≥ m for valid results). In addition, we must now specify the t() option so that xtgls knows how the observations within panels are ordered.
Example

. xtgls invest market stock, i(company) t(time) panels(correlated)

Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        heteroskedastic with cross-sectional correlation
Correlation:   no autocorrelation

Estimated covariances      =        15          Number of obs      =       100
Estimated autocorrelations =         0          Number of groups   =         5
Estimated coefficients     =         3          No. of time periods=        20
                                                Wald chi2(2)       =   1285.19
Log likelihood             = -537.8045          Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
      market |   .0961894   .0054752    17.57   0.000     .0854583    .1069206
       stock |   .3095321   .0179851    17.21   0.000     .2742819    .3447822
       _cons |  -38.36128   5.344871    -7.18   0.000    -48.83703   -27.88552

The estimated cross-sectional covariances are stored in e(Sigma).

. matrix list e(Sigma)

symmetric e(Sigma)[5,5]
             _ee        _ee2        _ee3        _ee4        _ee5
 _ee   9410.9061
_ee2  -168.04631   755.85077
_ee3  -1915.9538  -4163.3434    34288.49
_ee4  -1129.2896  -80.381742   2259.3242   633.42367
_ee5   258.50132    4035.872  -27898.235  -1170.6801   33455.511
Example

We can obtain the MLE results by specifying the igls option, which iterates the GLS estimation technique to convergence:

. xtgls invest market stock, i(company) t(time) panels(correlated) igls

Iteration 1:    tolerance = .2127384
Iteration 2:    tolerance = .22817
 (output omitted)
Iteration 1046: tolerance = 1.000e-07

Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        heteroskedastic with cross-sectional correlation
Correlation:   no autocorrelation

Estimated covariances      =        15          Number of obs      =       100
Estimated autocorrelations =         0          Number of groups   =         5
Estimated coefficients     =         3          No. of time periods=        20
                                                Wald chi2(2)       =    558.51
Log likelihood             = -515.4222          Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
      market |    .023631    .004291     5.51   0.000     .0152207    .0320413
       stock |   .1709472   .0152526    11.21   0.000     .1410526    .2008417
       _cons |  -2.216508   1.958845    -1.13   0.258    -6.055774    1.622759
Autocorrelation within panels

The individual identity matrices along the diagonal of Ω may be replaced with more general structures in order to allow for serial correlation. xtgls allows three options so that you may assume a structure with corr(independent) (no autocorrelation); corr(ar1) (serial correlation where the correlation parameter is common for all panels); or corr(psar1) (serial correlation where the correlation parameter is unique for each panel).

The restriction of a common autocorrelation parameter is reasonable when the individual correlations are nearly equal and the time series are short. If the restriction of a common autocorrelation parameter is reasonable, this allows us to use more information in estimating the autocorrelation parameter, and thus to produce a more reasonable estimate of the regression coefficients.
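For the common-coefficient case, the AR(1) parameter can be estimated by regressing residuals on their own lags, pooled across panels (roughly the idea behind the default rhotype(regress) calculation described under Options). The following sketch is a hypothetical illustration, not xtgls's actual code:

```python
import numpy as np

def common_rho(resid_panels):
    """Estimate a single AR(1) coefficient shared by all panels from
    regression residuals: the pooled lag-one regression slope."""
    num = sum(float(e[1:] @ e[:-1]) for e in resid_panels)
    den = sum(float(e[:-1] @ e[:-1]) for e in resid_panels)
    return num / den

# A residual series in which each value is exactly half the previous
# one gives back rho = 0.5.
e = np.array([1.0, 0.5, 0.25, 0.125])
rho = common_rho([e, e])
```

In a full FGLS pass, the estimated rho would then be used to quasi-difference the data before the weighted regression step.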
Example

If corr(ar1) is specified, each group is assumed to have errors that follow the same AR(1) process; that is, the autocorrelation parameter is the same for all groups.
. xtgls invest market stock, i(company) t(time) panels(hetero) corr(ar1)

Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        heteroskedastic
Correlation:   common AR(1) coefficient for all panels (0.8651)

Estimated covariances      =         5          Number of obs      =       100
Estimated autocorrelations =         1          Number of groups   =         5
Estimated coefficients     =         3          No. of time periods=        20
                                                Wald chi2(2)       =    119.69
Log likelihood             = -506.0909          Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
      market |   .0744315   .0097937     7.60   0.000     .0552362    .0936268
       stock |   .2874294   .0475391     6.05   0.000     .1942545    .3806043
       _cons |  -18.96238   17.64943    -1.07   0.283    -53.55464    15.62987
Example

If corr(psar1) is specified, each group is assumed to have errors that follow a different AR(1) process.

. xtgls invest market stock, i(company) t(time) panels(iid) corr(psar1)

Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        homoskedastic
Correlation:   panel-specific AR(1)

Estimated covariances      =         1          Number of obs      =       100
Estimated autocorrelations =         5          Number of groups   =         5
Estimated coefficients     =         3          No. of time periods=        20
                                                Wald chi2(2)       =    252.93
Log likelihood             = -543.1888          Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
      market |   .0934343   .0097783     9.56   0.000     .0742693    .1125993
       stock |   .3838814   .0416775     9.21   0.000      .302195    .4655677
       _cons |   -10.1246   34.06675    -0.30   0.766     -76.8942    56.64499
Saved Results

xtgls saves in e():

Scalars
    e(N)        number of observations              e(df)       degrees of freedom
    e(N_g)      number of groups                    e(ll)       log likelihood
    e(N_t)      number of time periods              e(g_max)    largest group size
    e(N_miss)   number of missing observations      e(g_min)    smallest group size
    e(n_cf)     number of estimated coefficients    e(g_avg)    average group size
    e(n_cv)     number of estimated covariances     e(chi2)     chi-squared
    e(n_cr)     number of estimated correlations    e(df_pear)  degrees of freedom for Pearson chi-squared

Macros
    e(cmd)      xtgls                               e(tvar)     variable denoting time
    e(depvar)   name of dependent variable          e(wtype)    weight type
    e(title)    title in estimation output          e(wexp)     weight expression
    e(corr)     correlation structure               e(chi2type) Wald; type of model chi-squared test
    e(vt)       panel option                        e(predict)  program used to implement predict
    e(rhotype)  type of estimated correlation       e(rho)      rho
    e(ivar)     variable denoting groups

Matrices
    e(b)        coefficient vector
    e(V)        variance-covariance matrix of the estimators
    e(Sigma)    estimated Sigma matrix

Functions
    e(sample)   marks estimation sample
Methods and Formulas

xtgls is implemented as an ado-file.

The GLS results are given by

    β̂_GLS = (X' Ω̂^{-1} X)^{-1} X' Ω̂^{-1} y
    Var(β̂_GLS) = (X' Ω̂^{-1} X)^{-1}

For all our models, the Ω matrix may be written in terms of the Kronecker product:

    Ω = Σ_{m×m} ⊗ I_{T_i×T_i}

The estimated variance matrix is then obtained by substituting the estimator Σ̂ for Σ, where the elements of Σ̂ are estimated from the cross-products of the panel residuals ε̂_i.

The residuals used in estimating Σ are first obtained from OLS regression. If the estimation is iterated, then residuals are obtained from the last estimated model. Maximum likelihood estimates may be obtained by iterating the FGLS estimates to convergence (for models with no autocorrelation, corr(independent)).
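The GLS formula above can be transcribed directly. Note that xtgls itself never forms Ω explicitly; the following NumPy sketch is only to make the algebra concrete, with hypothetical data:

```python
import numpy as np

def gls(X, y, Omega):
    """Textbook GLS: beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y,
    a direct transcription of the formula above."""
    Oi = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

# Omega written as the Kronecker product Sigma (m x m) kron I (T x T),
# as in the text, for m = 2 panels observed over T = 3 periods.
m, T = 2, 3
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Omega = np.kron(Sigma, np.eye(T))

X = np.arange(m * T * 2, dtype=float).reshape(m * T, 2)
X[:, 0] = 1.0                       # constant term
y = X @ np.array([0.5, 2.0])        # noiseless, so GLS recovers beta exactly
beta = gls(X, y, Omega)
```

With noiseless data, any positive-definite Ω returns the same coefficients; the weighting only matters (and improves efficiency) once the disturbances are present.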
Note that the GLS estimates and their associated standard errors are calculated using Σ̂^{-1}. As Beck and Katz (1995) point out, the Σ matrix is of rank at most min(T, m) when you use the panels(correlated) option, so for the GLS results to be valid (not based on a generalized inverse), T must be at least as large as m (you need at least as many time-period observations as there are panels).

Beck and Katz (1995) suggest using OLS parameter estimates with asymptotic standard errors that are correct for correlation between the panels. This estimation can be performed with the xtpcse command; see [R] xtpcse.
References

Beck, N. and J. N. Katz. 1995. What to do (and not to do) with time-series cross-section data. American Political Science Review 89: 634-647.
Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Grunfeld, Y. and Z. Griliches. 1960. Is aggregation necessarily bad? Review of Economics and Statistics 42: 1-13.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Maddala, G. S. 1992. Introduction to Econometrics. 2d ed. New York: Macmillan.
Also See

Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:        [R] newey, [R] prais, [R] regress, [R] svy estimators, [R] xtgee, [R] xtpcse, [R] xtreg, [R] xtregar
Background:     [U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [R] xt
Title

xtintreg — Random-effects interval data regression models
Syntax

Random-effects model

xtintreg depvar_lower depvar_upper [varlist] [weight] [if exp] [in range]
    [, i(varname) quad(#) noconstant noskip level(#) offset(varname) intreg nolog maximize_options ]

by ...: may be used with xtintreg; see [R] by.
iweights are allowed; see [U] 14.1.6 weight. Note that weights must be constant within panels.
xtintreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtintreg estimates random-effects interval regression models. There is no command for a conditional fixed-effects model, as there does not exist a sufficient statistic allowing the fixed effects to be conditioned out of the likelihood. Unconditional fixed-effects intreg models may be estimated with the intreg command with indicator variables for the panels. The appropriate indicator variables can be generated using tabulate or xi. However, unconditional fixed-effects estimates are biased.

Note: xtintreg is slow since it is calculated by quadrature; see Methods and Formulas. Computation time is roughly proportional to the number of points used for the quadrature. The default is quad(12). Simulations indicate that increasing it does not appreciably change the estimates for the coefficients or their standard errors. See [R] quadchk.
Options

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

quad(#) specifies the number of points to use in the quadrature approximation of the integral. The default is quad(12). The number specified must be an integer between 4 and 30, and must be no greater than the number of observations.
noconstant suppresses the constant term (intercept) in the model.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.

intreg specifies that a likelihood-ratio test comparing the random-effects model with the pooled (intreg) model should be included in the output.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict

xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname) for xtintreg. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_{it}b rather than x_{it}b + offset_{it}.
Remarks

Example

We begin with the dataset nlswork described in [R] xt and create two fictional dependent variables, where the wages are instead sometimes reported as ranges. The wages have been adjusted by a GNP deflator such that they are in terms of 1988 dollars, and have further been recoded such that some of the observations are known exactly, some are left-censored, some are right-censored, and some are known only in an interval.

We wish to estimate a random-effects interval regression model of adjusted (log) wages:
. xtintreg ln_wage1 ln_wage2 union age grade not_smsa south southXt occ_code,
> i(id) noskip intreg nolog

Random-effects interval regression       Number of obs      =     19095
Group variable (i) : idcode              Number of groups   =      4139
Random effects u_i ~ Gaussian            Obs per group: min =         1
                                                        avg =       4.6
                                                        max =        12
                                         LR chi2(7)         =   3549.46
Log likelihood  = -14856.934             Prob > chi2        =    0.0000

             |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
       union |   .1409746   .0068364    20.62   0.000     .1275755    .1543737
         age |    .012631   .0005148    24.53   0.000     .0116219      .01364
       grade |   .0783789   .0020912    37.48   0.000     .0742802    .0824777
    not_smsa |  -.1333091   .0089209   -14.94   0.000    -.1507938   -.1158243
       south |  -.1218994   .0121087   -10.07   0.000     -.145632   -.0981669
     southXt |   .0021033   .0008314     2.53   0.011     .0004738    .0037328
    occ_code |  -.0185603    .001033   -17.97   0.000     -.020585   -.0165355
       _cons |   .4567546    .032493    14.06   0.000     .3930695    .5204398
    /sigma_u |    .282881   .0038227    74.00   0.000     .2753886    .2903734
    /sigma_e |   .2696119   .0015957   168.96   0.000     .2664843    .2727394
         rho |    .524003   .0075625                      .5091676    .5388052

Likelihood-ratio test of sigma_u=0: chibar2(01)= 6629.90 Prob>=chibar2 = 0.000
Observation summary:  14372 uncensored observations
                        157 left-censored observations
                        718 right-censored observations
                       3848 interval observations

The output includes the overall and panel-level variance components (labeled sigma_e and sigma_u, respectively) together with ρ (labeled rho),
    ρ = σ_u² / (σ_u² + σ_e²)

which is the proportion of the total variance contributed by the panel-level variance component. When rho is zero, the panel-level variance component is unimportant and the panel estimator is not different from the pooled estimator. A likelihood-ratio test of this is included at the bottom of the output. This test formally compares the pooled estimator (intreg) with the panel estimator.
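The reported rho can be reproduced by hand from the sigma_u and sigma_e estimates in the output above:

```python
# Reproducing rho from the variance components reported in the output.
sigma_u = 0.282881
sigma_e = 0.2696119
rho = sigma_u**2 / (sigma_u**2 + sigma_e**2)   # approximately 0.524003
```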
Technical Note

The random-effects model is calculated using quadrature. As the panel sizes (or ρ) increase, the quadrature approximation becomes less accurate. We can use the quadchk command to see if changing the number of quadrature points affects the results. If the results do change, then the quadrature approximation is not accurate and the results of the model should not be interpreted. See [R] quadchk for details and [R] xtprobit for an example.
Saved Results

xtintreg saves in e():

Scalars
    e(N)        number of observations              e(g_min)    smallest group size
    e(N_g)      number of groups                    e(g_avg)    average group size
    e(N_unc)    number of uncensored observations   e(chi2)     chi-squared
    e(N_lc)     number of left-censored obs.        e(chi2_c)   chi-squared for comparison test
    e(N_rc)     number of right-censored obs.       e(rho)      rho
    e(N_int)    number of interval observations     e(sigma_u)  panel-level standard deviation
    e(df_m)     model degrees of freedom            e(sigma_e)  standard deviation of e_it
    e(ll)       log likelihood                      e(N_cd)     number of completely determined obs.
    e(ll_0)     log likelihood, constant-only model e(n_quad)   number of quadrature points
    e(g_max)    largest group size

Macros
    e(cmd)      xtintreg                            e(chi2type) Wald or LR; type of model chi-squared test
    e(depvar)   names of dependent variables        e(chi2_ct)  Wald or LR; type of model chi-squared test
    e(title)    title in estimation output                      corresponding to e(chi2_c)
    e(ivar)     variable denoting groups            e(distrib)  Gaussian; the distribution of the random effect
    e(wtype)    weight type                         e(predict)  program used to implement predict
    e(wexp)     weight expression
    e(offset1)  offset

Matrices
    e(b)        coefficient vector
    e(V)        variance-covariance matrix of the estimators

Functions
    e(sample)   marks estimation sample
Methods and Formulas

xtintreg is implemented as an ado-file.

Assuming a normal distribution, N(0, σ_ν²), for the random effects ν_i, we have the joint (unconditional of ν_i) density for the ith panel

    Pr(y_i | x_i) = ∫_{-∞}^{∞} [ e^{-ν_i²/(2σ_ν²)} / (√(2π) σ_ν) ] { ∏_{t=1}^{n_i} F(y_{1it}, y_{2it}, x_{it}β + ν_i) } dν_i

where, writing Δ_{it} = x_{it}β + ν_i,

    F(y_{1it}, y_{2it}, Δ_{it}) =
        (1/√(2πσ_ε²)) e^{-(y_{1it} − Δ_{it})²/(2σ_ε²)}                  if (y_{1it}, y_{2it}) ∈ C
        Φ( (y_{2it} − Δ_{it}) / σ_ε )                                   if (y_{1it}, y_{2it}) ∈ L
        1 − Φ( (y_{1it} − Δ_{it}) / σ_ε )                               if (y_{1it}, y_{2it}) ∈ R
        Φ( (y_{2it} − Δ_{it}) / σ_ε ) − Φ( (y_{1it} − Δ_{it}) / σ_ε )   if (y_{1it}, y_{2it}) ∈ I

where C is the set of uncensored observations (y_{1it} = y_{2it} ≠ .), L is the set of left-censored observations (y_{1it} = . and y_{2it} ≠ .), R is the set of right-censored observations (y_{1it} ≠ . and y_{2it} = .), I is the set of interval observations (y_{1it} < y_{2it}, y_{1it} ≠ ., y_{2it} ≠ .), and Φ() is the cumulative normal distribution. We can approximate the integral with M-point Gauss-Hermite quadrature.
Title

xtivreg — Instrumental variables and two-stage least squares for panel-data models
Syntax

GLS random-effects model
    xtivreg depvar [varlist1] (varlist2 = varlist_iv) [if exp] [in range]
        [, re ec2sls nosa regress i(varname) theta small first level(#) ]

Between-effects model
    xtivreg depvar [varlist1] (varlist2 = varlist_iv) [if exp] [in range]
        [, be regress i(varname) small level(#) ]

Fixed-effects model
    xtivreg depvar [varlist1] (varlist2 = varlist_iv) [if exp] [in range]
        [, fe regress i(varname) small first level(#) ]

First-differenced estimator
    xtivreg depvar [varlist1] (varlist2 = varlist_iv) [if exp] [in range]
        [, fd regress small noconstant first level(#) ]

You must tsset your data before using xtivreg, fd; see [R] tsset.
by ...: may be used with xtivreg; see [R] by.
depvar, varlist1, varlist2, and varlist_iv may contain time-series operators; see [U] 14.4.3 Time-series varlists.
xtivreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

For all but the first-differenced estimator

    predict [type] newvarname [if exp] [in range] [, statistic ]
where statistic is

      xb      Z_it δ, fitted values (the default)
      ue      μ_i + ν_it, the combined residual
    * xbu     Z_it δ + μ_i, prediction including effect
    * u       μ_i, the fixed or random error component
    * e       ν_it, the overall error component
Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.
First-differenced estimator

    predict [type] newvarname [if exp] [in range] [, statistic ]
where statistic is

      xb      X_it δ, fitted values for the first-differenced model (the default)
      e       e_it − e_{i,t−1}, the first-differenced overall error component
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtivreg has five different estimators for panel-data models in which some of the right-hand-side covariates are endogenous. All five of the estimators are two-stage least-squares generalizations of simple panel-data estimators for the case of exogenous variables. xtivreg with the be option requests the two-stage least-squares between estimator. xtivreg with the fe option requests the two-stage least-squares within estimator. xtivreg with the re option requests a two-stage least-squares random-effects estimator. There are two implementations: G2SLS due to Balestra and Varadharajan-Krishnakumar, and EC2SLS due to Baltagi. Since the Balestra and Varadharajan-Krishnakumar G2SLS is computationally less expensive, it is the default. Baltagi's EC2SLS can be obtained by specifying the additional ec2sls option. xtivreg with the fd option requests the two-stage least-squares first-differenced estimator.

See Baltagi (1995) for an introduction to panel-data models with endogenous covariates. Baltagi (1995) covers all of the estimators except for the first-differenced estimator. For the derivation and application of the first-differenced estimator, see Anderson and Hsiao (1981).
Options

re requests the GLS random-effects estimator. re is the default.

be requests the between regression estimator.

fe requests the fixed-effects (within) regression estimator.

fd requests the first-differenced regression estimator.

ec2sls requests Baltagi's EC2SLS random-effects estimator instead of the default Balestra and Varadharajan-Krishnakumar estimator.

nosa specifies that the Baltagi-Chang estimators of the variance components are to be used instead of the default adapted Swamy-Arora estimators.

regress specifies that all the covariates are to be treated as exogenous and that the instrument list is to be ignored. In other words, specifying regress causes xtivreg to estimate the requested panel-data regression model of depvar on varlist1 and varlist2, ignoring varlist_iv.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

theta, used with xtivreg, re only, specifies that the output should include the estimated value of θ used in combining the between and fixed estimators. For balanced data, this is a constant, and for unbalanced data a summary of the values is presented in the header of the output.

small specifies that t statistics should be reported instead of Z statistics, and F statistics instead of chi-squared statistics.

noconstant suppresses the constant term (intercept) in the regression.

first asks for the first-stage regressions to be displayed.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict

xb calculates the linear prediction; that is, Z_it δ. This is the default.

ue calculates the prediction of μ_i + ν_it. This is not available after the first-differenced model.

xbu calculates the prediction of Z_it δ + μ_i, the prediction including the fixed or random component. This is not available after the first-differenced model.

u calculates the prediction of μ_i, the estimated fixed or random effect. This is not available after the first-differenced model.

e calculates the prediction of ν_it.
Remarks

If you have not read [R] xt, please do so.

Consider an equation of the form

    y_it = Y_it γ + X_{1it} β + μ_i + ν_it = Z_it δ + μ_i + ν_it    (1)

where
y_it is the dependent variable;

Y_it is a 1 × g2 vector of observations on g2 endogenous variables included as covariates; these variables are assumed to be correlated with the ν_it;

X_{1it} is a 1 × k1 vector of observations on the exogenous variables included as covariates;

γ is a g2 × 1 vector of coefficients;

β is a k1 × 1 vector of coefficients;

δ is a K × 1 vector of coefficients, and K = g2 + k1.

Assume that there is a 1 × k2 vector of observations on the k2 instruments in X_{2it}. The order condition is satisfied if k2 ≥ g2. Let X_it = [X_{1it} X_{2it}]. xtivreg handles exogenously unbalanced panel data. Thus, define T_i to be the number of observations on panel i, n to be the number of panels, and N to be the total number of observations; i.e., N = Σ_{i=1}^n T_i.

xtivreg offers five different estimators which may be applied to models of the form in equation (1). The first-differenced estimator (FD2SLS) removes the μ_i by estimating the model in first differences. The within estimator (FE2SLS) estimates the model after sweeping out the μ_i by removing the panel-level means from each variable. The between estimator (BE2SLS) models the panel averages. The two random-effects estimators, G2SLS and EC2SLS, treat the μ_i as random variables that are independent and identically distributed over the panels. Except for FD2SLS, all of these estimators are generalizations of estimators in xtreg. See [R] xtreg for a discussion of these estimators in the case of exogenous covariates.

While the estimators allow for different assumptions about the μ_i, all of the estimators assume that the idiosyncratic error term ν_it has zero mean and is uncorrelated with the variables in X_it.

Just as in the case where there are no endogenous covariates, as discussed in [R] xtreg, there are varying perspectives on what assumptions should be placed on the μ_i. The μ_i may be assumed to be fixed or random. If they are assumed to be fixed, then the μ_i may be correlated with the variables in X_it, and the within estimator is efficient within a class of limited-information estimators. Alternatively, the μ_i may be assumed to be random. In this case, the μ_i are assumed to be independent and identically distributed over the panels.
If the μ_i are assumed to be uncorrelated with the variables in X_it, then the GLS random-effects estimators are more efficient than the within estimator. However, if the μ_i are correlated with the variables in X_it, then the random-effects estimators are inconsistent but the within estimator is consistent. The price of using the within estimator is that it is not possible to estimate coefficients on time-invariant variables, and all inference is conditional on the μ_i in the sample. See Mundlak (1978) and Hsiao (1986) for discussions of this interpretation of the within estimator.
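The role of the instruments X_{2it} can be illustrated with a small simulation, sketched here in Python (standard library only; the data-generating process and all variable names are invented for illustration). When a covariate is correlated with the error, the least-squares slope is inconsistent, while the just-identified instrumental-variables slope is not:

```python
import random

# Simulated cross-section with one endogenous regressor x and one
# instrument z. y = 1 + 2*x + u, with x = z + u, so x is correlated
# with the error u (OLS inconsistent); z is correlated with x but not u.
random.seed(12345)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui for zi, ui in zip(z, u)]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

beta_ols = cov(x, y) / cov(x, x)   # plim = 2 + 1/2 = 2.5: biased upward
beta_iv = cov(z, y) / cov(z, x)    # just-identified IV, plim = 2
print(beta_ols, beta_iv)
```

With this design the OLS slope converges to 2.5 while the IV slope converges to the true value of 2; the panel estimators below apply the same idea after transforming away (or modeling) the μ_i.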
Example

The two-stage least-squares first-differenced estimator (FD2SLS) has been used to estimate both "fixed-effects" and "random-effects" models. If the μ_i are truly fixed effects, then the FD2SLS estimator is not as efficient as the two-stage least-squares within estimator for finite T_i. Similarly, if none of the endogenous variables are lagged dependent variables, the exogenous variables are all strictly exogenous, and the random effects are i.i.d. and independent of the X_it, then the two-stage GLS estimators are more efficient than the FD2SLS estimator. However, the FD2SLS estimator has been used to obtain consistent estimates when one of these conditions fails. Perhaps most notably, Anderson and Hsiao (1981) used a version of the FD2SLS estimator to estimate a panel-data model with a lagged dependent variable.

Arellano and Bond (1991) developed new one-step and two-step GMM estimators for dynamic panel data. See [R] xtabond for a discussion of these estimators and Stata's implementation of them. In their
article, Arellano and Bond (1991) applied their new estimators to a model of dynamic labor demand that had previously been considered by Layard and Nickell (1986). They also compared the results from their estimators with those from the Anderson-Hsiao estimator. They used data from an unbalanced panel of firms from the United Kingdom. As is conventional, all variables are indexed over the firm i and time t. In this dataset, n_it is the log of employment in firm i inside the U.K. at time t, w_it is the natural log of the real product wage, k_it is the natural log of the gross capital stock, and ys_it is the natural log of industry output. The model also includes time dummies yr1980, yr1981, yr1982, yr1983, and yr1984.

In column (e) of Table 5 of Arellano and Bond (1991), the authors present the results they obtained from applying one version of the Anderson-Hsiao estimator to these data. This example reproduces their results for the coefficients. The standard errors are different because Arellano and Bond are using robust standard errors.

. xtivreg n l2.n l(0/1).w l(0/2).(k ys) yr1981-yr1984 (l.n = l3.n), i(id) fd

First-Differenced IV regression
Group variable: id                              Number of obs      =       471
                                                Number of groups   =       140
R-sq:  within  = 0.0141                         Obs per group: min =         3
       between = 0.9165                                        avg =       3.4
       overall = 0.9892                                        max =         5
                                                chi2(14)           =    122.53
corr(u_i, Xb)  = 0.9239                         Prob > chi2        =    0.0000

------------------------------------------------------------------------------
         D.n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           n |
         LD. |   1.422765   1.583053     0.90   0.369    -1.679962    4.525493
        L2D. |  -.1645517   .1647179    -1.00   0.318    -.4873928    .1582894
           w |
         D1. |  -.7524675   .1765733    -4.26   0.000    -1.098545   -.4063902
         LD. |   .9627611   1.086506     0.89   0.376    -1.166752    3.092275
           k |
         D1. |   .3221686   .1466086     2.20   0.028     .0348211    .6095161
         LD. |  -.3248778   .5800599    -0.56   0.578    -1.461774    .8120187
        L2D. |  -.0953947   .1960883    -0.49   0.627    -.4797207    .2889314
          ys |
         D1. |   .7660906    .369694     2.07   0.038     .0415037    1.490678
         LD. |  -1.361881   1.156835    -1.18   0.239    -3.629237    .9054744
        L2D. |   .3212993   .5440403     0.59   0.555    -.7450001    1.387599
      yr1981 |
         D1. |  -.0574197   .0430158    -1.33   0.182    -.1417291    .0268896
      yr1982 |
         D1. |  -.0882952   .0706214    -1.25   0.211    -.2267106    .0501203
      yr1983 |
         D1. |  -.1063153   .1086102    -0.98   0.328     -.319187    .1065563
      yr1984 |
         D1. |  -.1172108   .1519604    -0.77   0.441    -.4150468    .1806253
       _cons |   .0161204   .0336264     0.48   0.632    -.0497861     .082027
-------------+----------------------------------------------------------------
     sigma_u |  .31069213
     sigma_e |  .18855982
         rho |  .71384993   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   L.n
Instruments:    L2.n w L.w k L.k L2.k ys L.ys L2.ys yr1981 yr1982 yr1983
                yr1984 L3.n
Example

For the within estimator, let's consider another version of the wage equation discussed in [R] xtreg. The data for this example come from an extract of women from the National Longitudinal Survey of Youth that was described in detail in [R] xt. Restricting ourselves to only time-varying covariates, we might suppose that the log of the real wage is a function of the individual's age, age², her tenure in the observed place of employment, whether or not she belonged to a union, whether or not she lives in a metropolitan area, and whether or not she lives in the south. The variables for these are, respectively, age, age2, tenure, union, not_smsa, and south. If we treat all the variables as exogenous, we could use the one-stage within estimator from xtreg. This would yield

. xtreg ln_w age* tenure not_smsa union south, fe i(idcode)

Fixed-effects (within) regression                Number of obs      =     19007
Group variable (i): idcode                       Number of groups   =      4134
R-sq:  within  = 0.1333                          Obs per group: min =         1
       between = 0.2375                                         avg =       4.6
       overall = 0.2031                                         max =        12
                                                 F(6,14867)         =    381.19
corr(u_i, Xb)  = 0.2074                          Prob > F           =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0311984   .0033902     9.20   0.000     .0245533    .0378436
        age2 |  -.0003457   .0000543    -6.37   0.000    -.0004522   -.0002393
      tenure |   .0176205   .0008099    21.76   0.000     .0160331    .0192079
    not_smsa |  -.0972535   .0125377    -7.76   0.000    -.1218289    -.072678
       union |   .0975672   .0069844    13.97   0.000     .0838769    .1112576
       south |  -.0620932    .013327    -4.66   0.000    -.0882158   -.0359706
       _cons |   1.091612   .0523126    20.87   0.000     .9890729    1.194151
-------------+----------------------------------------------------------------
     sigma_u |   .3910683
     sigma_e |  .25545969
         rho |  .70091004   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(4133,14867) =     8.31          Prob > F = 0.0000
All the coefficients are statistically significant and have the expected signs. Now suppose that we wish to model tenure as a function of union and south, and we believe that the errors in the two equations are correlated. Since we are still interested in the within estimates, we now need a two-stage least-squares estimator. The following output shows the command and the results from estimating this model:
. xtivreg ln_w age* not_smsa (tenure = union south), fe i(idcode)

Fixed-effects (within) IV regression             Number of obs      =     19007
Group variable: idcode                           Number of groups   =      4134
R-sq:  within  =      .                          Obs per group: min =         1
       between = 0.1304                                         avg =       4.6
       overall = 0.0897                                         max =        12
                                                 Wald chi2(4)       = 147926.58
corr(u_i, Xb)  = -0.6843                         Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .2403531   .0373419     6.44   0.000     .1671643    .3135419
         age |   .0118437   .0090032     1.32   0.188    -.0058023    .0294897
        age2 |  -.0012145   .0001968    -6.17   0.000    -.0016003   -.0008286
    not_smsa |  -.0167178   .0339236    -0.49   0.622    -.0832069    .0497713
       _cons |   1.678287   .1626657    10.32   0.000     1.359468    1.997106
-------------+----------------------------------------------------------------
     sigma_u |  .70661941
     sigma_e |  .63029359
         rho |  .55690561   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(4133,14868) =     1.44          Prob > F = 0.0000
Instrumented:   tenure
Instruments:    age age2 not_smsa union south
Although all the coefficients still have the expected signs, the coefficients on age and not_smsa are no longer statistically significant. Given that these variables have been found to be important in many other studies, we might want to rethink our specification.

Now suppose that we are willing to assume that the μ_i are not correlated with the other covariates, so we can estimate a random-effects model. This model is frequently known as the variance-components or error-components model. xtivreg has estimators for two-stage least-squares one-way error-component models. In the one-way framework, there are two variance components to estimate, the variance of the μ_i and the variance of the ν_it. Since the variance components are unknown, consistent estimates are required to implement feasible GLS. xtivreg offers two choices: a Swamy-Arora method and simple consistent estimators due to Baltagi and Chang (2000).

Baltagi and Chang (1994) derived the Swamy-Arora estimators of the variance components for unbalanced panels. By default, xtivreg uses estimators which extend these unbalanced Swamy-Arora estimators to the case with instrumental variables. The default Swamy-Arora method contains a degree-of-freedom correction to improve its performance in small samples. Baltagi and Chang (2000) use variance-component estimators which are based on the ideas of Amemiya (1971) and Swamy and Arora (1972), but do not attempt to make small-sample adjustments. These consistent estimators of the variance components will be used if the nosa option is specified.

Using either estimator of the variance components, xtivreg contains two GLS estimators of the random-effects model. These two estimators differ only in how they construct the GLS instruments from the exogenous and instrumental variables contained in X_it = [X_{1it} X_{2it}]. The default method, G2SLS, which is due to Balestra and Varadharajan-Krishnakumar, uses the exogenous variables after they have been passed through the feasible GLS transform. In math, G2SLS uses X*_it for the GLS instruments, where X*_it is constructed by passing each variable in X_it through the GLS transform in equation (5) given in the Methods and Formulas section. If the ec2sls option is specified, then xtivreg performs Baltagi's EC2SLS. In EC2SLS, the instruments are X̃_it and X̄_it, where X̃_it is constructed by passing each of the variables in X_it through the within transform, and X̄_it is constructed by passing
each variable through the between transform. The within and between transforms are given in the Methods and Formulas section. Baltagi and Li (1992) showed that although the G2SLS instruments are a subset of those contained in EC2SLS, the extra instruments in EC2SLS are redundant in the sense of White (1984). Given the extra computational cost, G2SLS is the default.

Example

Here is the output from applying the G2SLS estimator to this model:

. xtivreg ln_w age* not_smsa black /*
> */ (tenure = union birth south black), re i(idcode)

G2SLS Random-effects regression                  Number of obs      =     19007
Group variable: idcode                           Number of groups   =      4134
R-sq:  within  = 0.0664                          Obs per group: min =         1
       between = 0.2098                                         avg =       4.6
       overall = 0.1463                                         max =        12
                                                 Wald chi2(5)       =   1446.37
corr(u_i, X)   = 0 (assumed)                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .1391798   .0078756    17.67   0.000      .123744    .1546157
         age |   .0279649   .0054182     5.16   0.000     .0173454    .0385843
        age2 |  -.0008357   .0000871    -9.60   0.000    -.0010063    -.000665
    not_smsa |  -.2235103   .0111371   -20.07   0.000    -.2453386   -.2016821
       black |  -.2078613   .0125803   -16.52   0.000    -.2325183   -.1832044
       _cons |   1.337684   .0844988    15.83   0.000     1.172069    1.503299
-------------+----------------------------------------------------------------
     sigma_u |  .36582493
     sigma_e |  .63031479
         rho |  .25197078   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   tenure
Instruments:    age age2 not_smsa black union birth_yr south
Note that we have included two time-invariant covariates, birth_yr and black. All the coefficients are statistically significant and of the expected sign. Applying the EC2SLS estimator yields similar results:
. xtivreg ln_w age* not_smsa black (tenure = union birth south black), re ec2sls
> i(idcode)

EC2SLS Random-effects regression                 Number of obs      =     19007
Group variable: idcode                           Number of groups   =      4134
R-sq:  within  = 0.0898                          Obs per group: min =         1
       between = 0.2608                                         avg =       4.6
       overall = 0.1926                                         max =        12
                                                 Wald chi2(5)       =   2721.92
corr(u_i, X)   = 0 (assumed)                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |    .064822   .0025647    25.27   0.000     .0597953    .0698486
         age |   .0380048   .0039549     9.61   0.000     .0302534    .0457562
        age2 |  -.0006676   .0000632   -10.56   0.000    -.0007915   -.0005438
    not_smsa |  -.2298961   .0082993   -27.70   0.000    -.2461625   -.2136297
       black |  -.1823627   .0092005   -19.82   0.000    -.2003954     -.16433
       _cons |   1.110564   .0606538    18.31   0.000     .9916849    1.229443
-------------+----------------------------------------------------------------
     sigma_u |  .36582493
     sigma_e |  .63031479
         rho |  .25197078   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   tenure
Instruments:    age age2 not_smsa black union birth_yr south
Estimating the same model as above with the G2SLS estimator and the consistent variance-components estimators yields

. xtivreg ln_w age* not_smsa black (tenure = union birth south black), re nosa
> i(idcode)

G2SLS Random-effects regression                  Number of obs      =     19007
Group variable: idcode                           Number of groups   =      4134
R-sq:  within  = 0.0664                          Obs per group: min =         1
       between = 0.2098                                         avg =       4.6
       overall = 0.1463                                         max =        12
                                                 Wald chi2(5)       =   1446.93
corr(u_i, X)   = 0 (assumed)                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .1391859    .007873    17.68   0.000     .1237552    .1546166
         age |   .0279697    .005419     5.16   0.000     .0173486    .0385909
        age2 |  -.0008357   .0000871    -9.60   0.000    -.0010064     -.000666
    not_smsa |  -.2235738   .0111344   -20.08   0.000    -.2453967   -.2017508
       black |  -.2078733   .0125751   -16.53   0.000    -.2325201   -.1832265
       _cons |   1.337522   .0845083    15.83   0.000     1.171889    1.503155
-------------+----------------------------------------------------------------
     sigma_u |  .36535633
     sigma_e |  .63020883
         rho |   .2515512   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   tenure
Instruments:    age age2 not_smsa black union birth_yr south
Acknowledgment

We thank Mead Over of the World Bank, who wrote an early implementation of xtivreg.
Saved Results

xtivreg, re saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       chi-squared
    e(rho)        rho
    e(Tbar)       harmonic mean of group sizes
    e(F)          model F (small only)
    e(df_r)       residual degrees of freedom
    e(r2_w)       R-squared for within model
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model
    e(sigma)      ancillary parameter (gamma, lnormal)
    e(sigma_u)    panel-level standard deviation
    e(sigma_e)    standard deviation of epsilon_it
    e(thta_min)   minimum theta
    e(thta_5)     theta, 5th percentile
    e(thta_50)    theta, 50th percentile
    e(thta_95)    theta, 95th percentile
    e(thta_max)   maximum theta
    e(m_p)        p-value from model test

Macros
    e(cmd)        xtivreg
    e(depvar)     name of dependent variable
    e(model)      g2sls or ec2sls
    e(ivar)       variable denoting groups
    e(chi2type)   Wald; type of model chi-squared test
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
xtivreg, be saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(rss)        residual sum of squares
    e(df_r)       residual degrees of freedom
    e(chi2)       chi-squared
    e(F)          model Wald F statistic (small only)
    e(rmse)       root mean square error
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(Tcon)       1 if T is constant
    e(r2)         R-squared
    e(r2_w)       R-squared for within model
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model

Macros
    e(cmd)        xtivreg
    e(depvar)     name of dependent variable
    e(model)      be
    e(ivar)       variable denoting groups
    e(predict)    program used to implement predict
    e(small)      small if specified

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
xtivreg, fe saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(mss)        model sum of squares
    e(tss)        total sum of squares
    e(df_m)       model degrees of freedom
    e(rss)        residual sum of squares
    e(df_r)       residual d.o.f. (small only)
    e(r2)         R-squared
    e(r2_a)       adjusted R-squared
    e(F)          F statistic (small only)
    e(rmse)       root mean square error
    e(chi2)       model Wald (not small)
    e(df_a)       degrees of freedom for absorbed effect
    e(F_f)        F for H0: u_i=0
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(rho)        rho
    e(Tbar)       harmonic mean of group sizes
    e(Tcon)       1 if T is constant
    e(r2_w)       R-squared for within model
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model
    e(sigma)      ancillary parameter (gamma, lnormal)
    e(corr)       corr(u_i, Xb)
    e(sigma_u)    panel-level standard deviation
    e(sigma_e)    standard deviation of epsilon_it

Macros
    e(cmd)        xtivreg
    e(depvar)     name of dependent variable
    e(model)      fe
    e(ivar)       variable denoting groups
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
xtivreg, fd saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(mss)        model sum of squares
    e(tss)        total sum of squares
    e(df_m)       model degrees of freedom
    e(rss)        residual sum of squares
    e(df_r)       residual d.o.f. (small only)
    e(r2)         R-squared
    e(r2_a)       adjusted R-squared
    e(F)          F statistic (small only)
    e(rmse)       root mean square error
    e(chi2)       model Wald (not small)
    e(df_a)       degrees of freedom for absorbed effect
    e(F_f)        F for H0: u_i=0
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(rho)        rho
    e(Tbar)       harmonic mean of group sizes
    e(Tcon)       1 if T is constant
    e(r2_w)       R-squared for within model
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model
    e(sigma)      ancillary parameter (gamma, lnormal)
    e(corr)       corr(u_i, Xb)
    e(sigma_u)    panel-level standard deviation
    e(sigma_e)    standard deviation of epsilon_it

Macros
    e(cmd)        xtivreg
    e(depvar)     name of dependent variable
    e(model)      fd
    e(ivar)       variable denoting groups
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

Consider an equation of the form

    y_it = Y_it γ + X_{1it} β + μ_i + ν_it = Z_it δ + μ_i + ν_it    (2)

where

y_it is the dependent variable;

Y_it is a 1 × g2 vector of observations on g2 endogenous variables included as covariates; these variables are assumed to be correlated with the ν_it;

X_{1it} is a 1 × k1 vector of observations on the exogenous variables included as covariates;

γ is a g2 × 1 vector of coefficients;

β is a k1 × 1 vector of coefficients;

δ is a K × 1 vector of coefficients, where K = g2 + k1.

Assume that there is a 1 × k2 vector of observations on the k2 instruments in X_{2it}. The order condition is satisfied if k2 ≥ g2. Let X_it = [X_{1it} X_{2it}]. xtivreg handles exogenously unbalanced panel data. Thus, define T_i to be the number of observations on panel i, n to be the number of panels, and N to be the total number of observations; i.e., N = Σ_{i=1}^n T_i.
xtivreg, fd

As the name implies, this estimator obtains its estimates from an instrumental variables regression on the first-differenced data. Specifically, first differencing the data yields

    y_it − y_{i,t−1} = (Z_it − Z_{i,t−1}) δ + ν_it − ν_{i,t−1}

With the μ_i removed by differencing, we can obtain the estimated coefficients and their estimated variance-covariance matrix from a standard two-stage least-squares regression of Δy_it on ΔZ_it with instruments ΔX_it.

Reported as R² within is {corr[(Z_it − Z̄_i)δ̂, y_it − ȳ_i]}².

Reported as R² between is {corr(Z̄_i δ̂, ȳ_i)}².

Reported as R² overall is {corr(Z_it δ̂, y_it)}².
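The first-difference logic can be sketched numerically (Python, standard library only; the simulated model and all names are illustrative, not from the manual): differencing removes the panel effect μ_i, after which instrumenting Δx with Δz yields a consistent slope even though Δx remains correlated with the differenced error.

```python
import random

# Simulated panel: y_it = 2*x_it + mu_i + u_it, x_it = z_it + u_it + mu_i.
# x is correlated with both the fixed effect mu_i and the idiosyncratic
# error u_it; z is exogenous. First differencing sweeps out mu_i, then a
# just-identified IV regression of dy on dx with instrument dz is consistent.
random.seed(98765)
n_panels, T = 2000, 5
dy, dx, dz = [], [], []
for i in range(n_panels):
    mu = random.gauss(0, 1)
    z = [random.gauss(0, 1) for _ in range(T)]
    u = [random.gauss(0, 1) for _ in range(T)]
    x = [z[t] + u[t] + mu for t in range(T)]
    y = [2.0 * x[t] + mu + u[t] for t in range(T)]
    for t in range(1, T):
        dy.append(y[t] - y[t - 1])
        dx.append(x[t] - x[t - 1])
        dz.append(z[t] - z[t - 1])

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

beta_fd_ols = cov(dx, dy) / cov(dx, dx)  # still biased: dx correlated with du
beta_fd_iv = cov(dz, dy) / cov(dz, dx)   # FD + IV: consistent (plim = 2)
print(beta_fd_ols, beta_fd_iv)
```

Differencing alone is not enough here — OLS on the differenced data still converges away from 2 — which is why FD2SLS combines the transform with instruments.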
xtivreg, fe

At the heart of this estimator is the within transformation. The within transform of a variable w is

    w̃_it = w_it − w̄_i. + w̄..    (3)

where

    w̄_i. = (1/T_i) Σ_{t=1}^{T_i} w_it
    w̄.. = (1/N) Σ_i Σ_t w_it

and n is the number of groups and N is the total number of observations on the variable.

The within transform of equation (2) is

    ỹ_it = Z̃_it δ + ν̃_it

Note that the within transform has removed the μ_i. With the μ_i gone, the within-2SLS estimator can be obtained from a two-stage least-squares regression of ỹ_it on Z̃_it with instruments X̃_it.

Suppose that there are K variables in Z_it, including the mandatory constant. Then there are K + n − 1 parameters estimated in the model, and the VCE for the within estimator is

    { (N − K) / (N − n − K + 1) } V_IV

where V_IV is the VCE from the above two-stage least-squares regression.

From the estimate of δ̂, estimates μ̂_i of μ_i are obtained as μ̂_i = ȳ_i − Z̄_i δ̂. Reported from the calculated μ̂_i are its standard deviation and its correlation with Z̄_i δ̂. Reported as the standard deviation of ν_it is the regression's estimated root mean square error, which is adjusted (as previously stated) for the n − 1 estimated means.

Reported as R² within is the R² from the mean-deviated regression.

Reported as R² between is {corr(Z̄_i δ̂, ȳ_i)}².

Reported as R² overall is {corr(Z_it δ̂, y_it)}².
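A minimal sketch of the within transform (pure Python; the data are illustrative): applied to a variable that varies only across panels — such as μ_i itself — the transform reduces every observation to the same constant (the grand mean), which is exactly why the μ_i drop out of the transformed equation.

```python
# Within transform: w_it - wbar_i. + wbar.. , applied panel by panel.
def within(panels):
    """panels maps a panel id to its list of observations."""
    N = sum(len(obs) for obs in panels.values())
    gbar = sum(v for obs in panels.values() for v in obs) / N  # grand mean
    out = {}
    for i, obs in panels.items():
        ibar = sum(obs) / len(obs)                             # panel mean
        out[i] = [v - ibar + gbar for v in obs]
    return out

# A purely group-level variable (mu_i repeated T_i = 4 times per panel)
# is annihilated up to the grand mean: i - i + 24.5 for every observation.
mu = {i: [float(i)] * 4 for i in range(50)}
vals = [v for obs in within(mu).values() for v in obs]
print(min(vals), max(vals))
```

Because every transformed value is identical, a group-level regressor (or effect) carries no information after the transform — the algebraic content of "sweeping out the μ_i".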
xtivreg, be

After passing equation (2) through the between transform, we are left with

    ȳ_i = Z̄_i δ + μ_i + ν̄_i    (4)

where

    w̄_i = (1/T_i) Σ_{t=1}^{T_i} w_it    for w ∈ {y, Z, ν}

Similarly, define X̄_i as the matrix of instruments X_it after they have been passed through the between transform. The BE2SLS estimator of equation (4) obtains its coefficient estimates and its VCE from a two-stage least-squares regression of ȳ_i on Z̄_i with instruments X̄_i in which each average appears T_i times.

Reported as R² between is the R² from the estimated regression.

Reported as R² within is {corr[(Z_it − Z̄_i)δ̂, y_it − ȳ_i]}².

Reported as R² overall is {corr(Z_it δ̂, y_it)}².

xtivreg, re
Following Baltagi and Chang (2000), let

    u_it = μ_i + ν_it

be the N × 1 vector of combined errors. Then, under the assumptions of the random-effects model,

    E(uu′) = diag[ σ_μ² ι_{T_i} ι′_{T_i} + σ_ν² I_{T_i} ]

where ι_{T_i} is a vector of ones of dimension T_i and I_{T_i} is the T_i × T_i identity matrix. Since the variance components are unknown, consistent estimates are required to implement feasible GLS. xtivreg offers two choices. The default is a simple extension of the Swamy-Arora method for unbalanced panels.

Let ũ_it = y_it − Z_it δ̂_w be the combined residuals from the within estimator, and let ũ̃_it be the within-transformed ũ_it. Then

    σ̂_ν² = Σ_{i=1}^n Σ_{t=1}^{T_i} ũ̃²_it / (N − n − K + 1)

Let

    ū_i = ȳ_i − Z̄_i δ̂_b

be the combined residuals from the between estimator. Passing the between residuals through the between transform then yields σ̂_μ²: the between sum of squares, less the part attributable to σ̂_ν², is divided by N − r, where r is a trace correction term defined in Baltagi and Chang (2000).

If the nosa option is specified, the consistent estimators described in Baltagi and Chang (2000) are used. These replace the degree-of-freedom-corrected denominators above with N − n and N, respectively, and do not attempt to make small-sample adjustments. Note that the default Swamy-Arora method contains a degree-of-freedom correction to improve its performance in small samples.

Given estimates of the variance components, σ̂_μ² and σ̂_ν², the feasible GLS transform of a variable w is

    w*_it = w_it − θ̂_i w̄_i    (5)

where

    w̄_i = (1/T_i) Σ_{t=1}^{T_i} w_it

and

    θ̂_i = 1 − { σ̂_ν² / (T_i σ̂_μ² + σ̂_ν²) }^{1/2}

Using either estimator of the variance components, xtivreg contains two GLS estimators of the random-effects model. These two estimators differ only in how they construct the GLS instruments from the exogenous and instrumental variables contained in X_it = [X_{1it} X_{2it}]. The default method, G2SLS, which is due to Balestra and Varadharajan-Krishnakumar, uses the exogenous variables after they have been passed through the feasible GLS transform. In math, G2SLS uses X*_it for the GLS instruments, where X*_it is constructed by passing each variable in X_it through the GLS transform in equation (5). The G2SLS estimator obtains its coefficient estimates and VCE from an instrumental variables regression of y*_it on Z*_it with instruments X*_it.
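The quasi-demeaning parameter θ̂_i can be sketched directly (Python, standard library only; the parameter values are illustrative, not from the manual). Its two limits connect the GLS estimator to the pooled and within estimators: θ = 0 when σ_μ² = 0, and θ → 1 as T_i σ_μ²/σ_ν² grows.

```python
import math

# theta_i = 1 - sigma_nu / sqrt(T_i * sigma_mu^2 + sigma_nu^2),
# the weight on the panel mean in the GLS transform w*_it = w_it - theta_i*wbar_i.
def theta(T_i, sigma_mu, sigma_nu):
    return 1.0 - sigma_nu / math.sqrt(T_i * sigma_mu ** 2 + sigma_nu ** 2)

def gls_transform(obs, th):
    """Quasi-demean one panel's observations with a given theta."""
    bar = sum(obs) / len(obs)
    return [v - th * bar for v in obs]

# Limiting behaviour:
t0 = theta(5, 0.0, 1.0)       # no panel-level variance -> theta = 0 (pooled OLS)
t1 = theta(10**8, 1.0, 1.0)   # T_i*sigma_mu^2 huge -> theta near 1 (within/FE)
print(t0, t1)
```

Between those extremes, each panel is partially demeaned, with longer panels (larger T_i) demeaned more heavily — the source of the per-panel θ̂_i summary that the theta option displays for unbalanced data.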
If the ec2sls option is specified, then xtivreg performs Baltagi's EC2SLS. In EC2SLS, the instruments are X̃_it and X̄_it, where X̃_it is constructed by passing each of the variables in X_it through the GLS transform in (5), and X̄_it is made up of the group means of each variable in X_it. The EC2SLS estimator obtains its coefficient estimates and its VCE from an instrumental variables regression of y*_it on Z*_it with instruments X̃_it and X̄_it. Baltagi and Li (1992) showed that although the G2SLS instruments are a subset of those contained in EC2SLS, the extra instruments in EC2SLS are redundant in the sense of White (1984). Given the extra computational cost, G2SLS is the default. The standard deviation of μ_i + ν_it is calculated as ( σ̂_μ² + σ̂_ν² )^{1/2}.
Reported as R² between is { corr(Z̄_i δ, ȳ_i) }². Reported as R² within is { corr( (Z_it − Z̄_i)δ, y_it − ȳ_i ) }². Reported as R² overall is { corr(Z_it δ, y_it) }².
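Each of the reported R² measures is simply the square of a Pearson correlation between a linear prediction and an outcome. A minimal sketch (our own helper, not Stata code):

```python
def r2_corr(pred, y):
    """Squared Pearson correlation between a prediction and an outcome."""
    n = len(pred)
    mp, my = sum(pred) / n, sum(y) / n
    sxy = sum((p - mp) * (v - my) for p, v in zip(pred, y))
    sxx = sum((p - mp) ** 2 for p in pred)
    syy = sum((v - my) ** 2 for v in y)
    return sxy * sxy / (sxx * syy)
```

For R² within and R² between, the same function is applied to the within-transformed data and the panel means, respectively.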
References

Amemiya, T. 1971. The estimation of the variances in a variance-components model. International Economic Review 12: 1-13.
Anderson, T. W. and C. Hsiao. 1981. Estimation of dynamic models with error components. Journal of the American Statistical Association 76: 598-606.
Baltagi, B. H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.
Baltagi, B. H. and Y. Chang. 1994. Incomplete panels: A comparative study of alternative estimators for the unbalanced one-way error component regression model. Journal of Econometrics 62: 67-89.
——. 2000. Simultaneous equations with incomplete panels. Econometric Theory 16: 269-279.
Baltagi, B. H. and Q. Li. 1992. A note on the estimation of simultaneous equations with error components. Econometric Theory 8: 113-119.
Drukker, D. 2001. A note on the F and LM tests for individual effects after two-stage within regressions. Stata Technical Bulletin, forthcoming.
Swamy, P. A. V. B. and S. S. Arora. 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40: 261-275.
White, H. 1984. Asymptotic Theory for Econometricians. New York: Academic Press.
Also See

Complementary:
[R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:
[R] xtreg, [R] xtregar; [R] xtabond, [R] xtgee, [R] xtintreg, [R] xttobit
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [R] xt
Title

xtlogit — Fixed-effects, random-effects, and population-averaged logit models

Syntax

Random-effects model

    xtlogit depvar [varlist] [weight] [if exp] [in range] [, re i(varname) or quad(#)
        noconstant noskip level(#) offset(varname) nolog maximize_options ]

Conditional fixed-effects model

    xtlogit depvar [varlist] [weight] [if exp] [in range] , fe [ i(varname) or noskip
        level(#) offset(varname) nolog maximize_options ]

Population-averaged model

    xtlogit depvar [varlist] [weight] [if exp] [in range] , pa [ i(varname) or robust
        noconstant level(#) offset(varname) nolog xtgee_options maximize_options ]

by ... : may be used with xtlogit; see [R] by.

iweights, fweights, and pweights are allowed for the population-averaged model and iweights are allowed for the fixed-effects and random-effects models; see [U] 14.1.6 weight. Note that weights must be constant within panels.

xtlogit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

Random-effects model

    predict [type] newvarname [if exp] [in range] [, { xb | pu0 | stdp } nooffset ]

Fixed-effects model

    predict [type] newvarname [if exp] [in range] [, { p | xb | stdp } nooffset ]

Population-averaged model

    predict [type] newvarname [if exp] [in range] [, { mu | rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Note that the predicted probability for the fixed-effects model is conditional on there being only one outcome per group. See [R] clogit for details.
xtlogit — Fixed-effects, random-effects, and population-averaged logit models
Description

xtlogit estimates random-effects, conditional fixed-effects, and population-averaged logit models. Whenever we refer to a fixed-effects model, we mean the conditional fixed-effects model.

Note: xtlogit, re is slow since it is calculated by quadrature; see Methods and Formulas. Computation time is roughly proportional to the number of points used for the quadrature. The default is quad(12). Simulations indicate that increasing it does not appreciably change the estimates for the coefficients or their standard errors. See [R] quadchk.

By default, the population-averaged model is an equal-correlation model; xtlogit assumes corr(exchangeable). See [R] xtgee for details on how to fit other population-averaged models.

See [R] logistic for a list of related estimation commands.
Options

re requests the random-effects estimator. re is the default if none of re, fe, and pa are specified.

fe requests the fixed-effects estimator.

pa requests the population-averaged estimator.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

or reports the estimated coefficients transformed to odds ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results.

quad(#) specifies the number of points to use in the quadrature approximation of the integral. The default is quad(12). See [R] quadchk.

noconstant suppresses the constant term (intercept) in the model.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the IRLS variance estimator; see [R] xtgee.
This alternative produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the resulting standard errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

nolog suppresses the iteration log.
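The transformation applied by the or option is simple enough to verify by hand. A sketch using the age coefficient and standard error from the union example in this entry (the numbers are copied from that output; the variable names are ours):

```python
import math

b = 0.0092401    # age coefficient from the union example output
se = 0.0044368   # its standard error

odds_ratio = math.exp(b)   # what `or` would display for age
# the confidence interval is transformed the same way:
ci = (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))
```

Because only the display changes, the z statistic and p-value are those of the untransformed coefficient.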
xtgee_options specifies any other options allowed by xtgee for family(binomial) link(logit), such as corr(); see [R] xtgee.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6.
Options for predict

xb calculates the linear prediction. This is the default for the random-effects model.

p calculates the predicted probability of a positive outcome conditional on one positive outcome within group. This is the default for the fixed-effects model.

mu and rate both calculate the predicted probability of depvar. mu takes into account the offset(), and rate ignores those adjustments. mu and rate are equivalent if you did not specify offset(). mu is the default for the population-averaged model.

pu0 calculates the probability of a positive outcome, assuming that the random effect for that observation's panel is zero (ν = 0). Note that this may not be similar to the proportion of observed outcomes in the group.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname) for xtlogit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_it b rather than x_it b + offset_it.
Remarks

xtlogit is a convenience command if you want the population-averaged model. Typing

    . xtlogit ..., pa

is equivalent to typing

    . xtgee ..., family(binomial) link(logit) corr(exchangeable)

It is also a convenience command if you want the fixed-effects model. Typing

    . xtlogit ..., fe i(varname)

is equivalent to typing

    . clogit ..., group(varname)

Thus, also see [R] xtgee and [R] clogit for information about xtlogit.

By default, or when re is specified, xtlogit estimates a maximum-likelihood random-effects model.
Example

You are studying unionization of women in the United States and are using the union dataset; see [R] xt. You wish to estimate a random-effects model of union membership:
. xtlogit union age grade not_smsa south southXt, i(id) nolog

Random-effects logit                          Number of obs      =     26200
Group variable (i) : idcode                   Number of groups   =      4434
Random effects u_i ~ Gaussian                 Obs per group: min =         1
                                                             avg =       5.9
                                                             max =        12
                                              Wald chi2(5)       =    221.95
Log likelihood  = -10556.294                  Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0092401   .0044368     2.08   0.037     .0005441    .0179361
       grade |   .0840066   .0181622     4.63   0.000     .0484094    .1196038
    not_smsa |  -.2574574   .0844771    -3.05   0.002    -.4230294   -.0918854
       south |  -1.162854   .1108294   -10.49   0.000    -1.370075   -.9356323
     southXt |   .0237933   .0078548     3.03   0.002     .0083982    .0391884
       _cons |   -3.25016   .2622898   -12.39   0.000    -3.764238   -2.736081
-------------+----------------------------------------------------------------
    /lnsig2u |   1.669888   .0430016                      1.585607     1.75417
-------------+----------------------------------------------------------------
     sigma_u |   2.304685   .0495526                      2.209582    2.403882
         rho |   .8415609   .0057337                      .8299971     .852478
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) =  5978.89 Prob >= chibar2 = 0.000
The output includes the additional panel-level variance component. This is parameterized as the log of the variance ln(σ_ν²) (labeled lnsig2u in the output). The standard deviation σ_ν is also included in the output, labeled sigma_u, together with ρ (labeled rho),

    ρ = σ_ν² / (σ_ν² + 1)

which is the proportion of the total variance contributed by the panel-level variance component. When rho is zero, the panel-level variance component is unimportant and the panel estimator is not different from the pooled estimator. A likelihood-ratio test of this is included at the bottom of the output. This test formally compares the pooled estimator (logit) with the panel estimator.

As an alternative to the random-effects specification, you might want to fit an equal-correlation logit model:
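The relationship between lnsig2u, sigma_u, and rho can be checked directly against the output above (the numbers are copied from that display):

```python
import math

lnsig2u = 1.669888                   # log of the variance, from the output
sigma_u = math.exp(lnsig2u / 2)      # standard deviation: 2.304685 in the output
rho = sigma_u**2 / (sigma_u**2 + 1)  # .8415609 in the output
```

Reproducing the displayed sigma_u and rho from lnsig2u confirms that lnsig2u is the log of the variance, not of the standard deviation.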
. xtlogit union age grade not_smsa south southXt, i(id) pa
Iteration 1: tolerance = .07495101
Iteration 2: tolerance = .00626455
Iteration 3: tolerance = .00030986
Iteration 4: tolerance = .00001432
Iteration 5: tolerance = 6.699e-07

GEE population-averaged model                 Number of obs      =     26200
Group variable:              idcode           Number of groups   =      4434
Link:                         logit           Obs per group: min =         1
Family:                    binomial                          avg =       5.9
Correlation:           exchangeable                          max =        12
                                              Wald chi2(5)       =    233.60
Scale parameter:                  1           Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0053241   .0024988     2.13   0.033     .0004265    .0102216
       grade |   .0595076   .0108311     5.49   0.000     .0382791    .0807361
    not_smsa |  -.1224955   .0483137    -2.54   0.011    -.2171887   -.0278024
       south |  -.7270863   .0676522   -10.75   0.000    -.8594861   -.5946865
     southXt |   .0151984   .0046586     3.26   0.001     .0062638     .024133
       _cons |   -2.01111     .15439   -13.03   0.000    -2.313709   -1.708512
------------------------------------------------------------------------------
Example

xtlogit with the pa option allows a robust option, so we can obtain the population-averaged logit estimator with the robust variance calculation by typing

. xtlogit union age grade not_smsa south southXt, i(id) pa robust nolog

GEE population-averaged model                 Number of obs      =     26200
Group variable:              idcode           Number of groups   =      4434
Link:                         logit           Obs per group: min =         1
Family:                    binomial                          avg =       5.9
Correlation:           exchangeable                          max =        12
                                              Wald chi2(5)       =    152.01
Scale parameter:                  1           Prob > chi2        =    0.0000

                            (standard errors adjusted for clustering on idcode)
------------------------------------------------------------------------------
             |             Semi-robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0053241   .0037494     1.42   0.156    -.0020246    .0126727
       grade |   .0595076   .0133482     4.46   0.000     .0333455    .0856697
    not_smsa |  -.1224955   .0613646    -2.00   0.046    -.2427678   -.0022232
       south |  -.7270863   .0870278    -8.35   0.000    -.8976577   -.5565149
     southXt |   .0151984    .006613     2.30   0.022     .0022371    .0281596
       _cons |   -2.01111   .2016405    -9.97   0.000    -2.406319   -1.615902
------------------------------------------------------------------------------
These standard errors are somewhat larger than those obtained without the robust option.
Finally, we can also fit a fixed-effects model to these data (see also [R] clogit for details):

. xtlogit union age grade not_smsa south southXt, i(id) fe
note: multiple positive outcomes within groups encountered.
note: 2744 groups (14165 obs) dropped due to all positive or all negative
      outcomes.
Iteration 0: log likelihood = -4541.9044
Iteration 1: log likelihood = -4511.1353
Iteration 2: log likelihood = -4511.1042

Conditional fixed-effects logit               Number of obs      =     12035
Group variable (i) : idcode                   Number of groups   =      1690
                                              Obs per group: min =         2
                                                             avg =       7.1
                                                             max =        12
                                              LR chi2(5)         =     78.16
Log likelihood  = -4511.1042                  Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0079706   .0050283     1.59   0.113    -.0018848    .0178259
       grade |   .0811808   .0419137     1.94   0.053    -.0009686    .1633302
    not_smsa |   .0210368    .113154     0.19   0.853    -.2007411    .2428146
       south |  -1.007318   .1500491    -6.71   0.000    -1.301409   -.7132271
     southXt |   .0263495   .0083244     3.17   0.002      .010034    .0426649
------------------------------------------------------------------------------
Saved Results

xtlogit, re saves in e():

Scalars
  e(N)        number of observations
  e(N_g)      number of groups
  e(df_m)     model degrees of freedom
  e(ll)       log likelihood
  e(ll_0)     log likelihood, constant-only model
  e(ll_c)     log likelihood, comparison model
  e(g_max)    largest group size
  e(g_min)    smallest group size
  e(g_avg)    average group size
  e(chi2)     χ²
  e(chi2_c)   χ² for comparison test
  e(rho)      ρ
  e(sigma_u)  panel-level standard deviation
  e(N_cd)     number of completely determined obs.
  e(n_quad)   number of quadrature points

Macros
  e(cmd)       xtlogit
  e(depvar)    name of dependent variable
  e(title)     title in estimation output
  e(ivar)      variable denoting groups
  e(wtype)     weight type
  e(wexp)      weight expression
  e(offset)    offset
  e(chi2type)  Wald or LR; type of model χ² test
  e(chi2_ct)   Wald or LR; type of model χ² test corresponding to e(chi2_c)
  e(distrib)   Gaussian; the distribution of the random effect
  e(predict)   program used to implement predict

Matrices
  e(b)  coefficient vector
  e(V)  variance-covariance matrix of the estimators

Functions
  e(sample)  marks estimation sample
xtlogit, pa saves in e():

Scalars
  e(N)         number of observations
  e(N_g)       number of groups
  e(df_m)      model degrees of freedom
  e(g_max)     largest group size
  e(g_min)     smallest group size
  e(g_avg)     average group size
  e(chi2)      χ²
  e(df_pear)   degrees of freedom for Pearson χ²
  e(deviance)  deviance
  e(chi2_dev)  χ² test of deviance
  e(dispers)   deviance dispersion
  e(chi2_dis)  χ² test of deviance dispersion
  e(tol)       target tolerance
  e(dif)       achieved tolerance
  e(phi)       scale parameter

Macros
  e(cmd)       xtgee
  e(cmd2)      xtlogit
  e(depvar)    name of dependent variable
  e(family)    binomial
  e(link)      logit; link function
  e(corr)      correlation structure
  e(scale)     x2, dev, phi, or #; scale parameter
  e(ivar)      variable denoting groups
  e(vcetype)   covariance estimation method
  e(chi2type)  Wald; type of model χ² test
  e(offset)    offset
  e(predict)   program used to implement predict

Matrices
  e(b)  coefficient vector
  e(V)  variance-covariance matrix of the estimators
  e(R)  estimated working correlation matrix

Functions
  e(sample)  marks estimation sample

xtlogit, fe saves in e():

Scalars
  e(N)      number of observations
  e(N_g)    number of groups
  e(df_m)   model degrees of freedom
  e(ll)     log likelihood
  e(ll_0)   log likelihood, constant-only model
  e(g_max)  largest group size
  e(g_min)  smallest group size
  e(g_avg)  average group size
  e(chi2)   χ²

Macros
  e(cmd)       clogit
  e(cmd2)      xtlogit
  e(depvar)    name of dependent variable
  e(title)     title in estimation output
  e(ivar)      variable denoting groups
  e(offset)    offset
  e(wtype)     weight type
  e(wexp)      weight expression
  e(chi2type)  LR; type of model χ² test
  e(predict)   program used to implement predict

Matrices
  e(b)  coefficient vector
  e(V)  variance-covariance matrix of the estimators

Functions
  e(sample)  marks estimation sample
Methods and Formulas

xtlogit is implemented as an ado-file.

xtlogit reports the population-averaged results obtained by using xtgee, family(binomial) link(logit) to obtain estimates. The fixed-effects results are obtained using clogit. See [R] xtgee and [R] clogit for details on the methods and formulas.

Assuming a normal distribution, N(0, σ_ν²), for the random effects ν_i, we have that

    Pr(y_i | x_i) = ∫_{−∞}^{∞} [ e^{−ν²/(2σ_ν²)} / (√(2π) σ_ν) ] { ∏_{t=1}^{n_i} F(y_it, x_it β + ν) } dν

where

    F(y, z) = 1/(1 + e^{−z})  if y ≠ 0
            = 1/(1 + e^{z})   otherwise

and we can approximate the integral with M-point Gauss-Hermite quadrature

    ∫_{−∞}^{∞} e^{−x²} g(x) dx ≈ Σ_{m=1}^M w*_m g(a*_m)

where the w*_m denote the quadrature weights and the a*_m denote the quadrature abscissas. The log likelihood L, where ρ = σ_ν²/(σ_ν² + 1), is then calculated using the quadrature

    L = Σ_{i=1}^n w_i log{ Pr(y_i | x_i) }
      ≈ Σ_{i=1}^n w_i log{ (1/√π) Σ_{m=1}^M w*_m ∏_{t=1}^{n_i} F( y_it, x_it β + a*_m √(2ρ/(1−ρ)) ) }

where w_i is the user-specified weight for panel i; if no weights are specified, w_i = 1.

The quadrature formula requires that the integrated function be well approximated by a polynomial. As the number of time periods becomes large (as panel size gets large),

    ∏_{t=1}^{n_i} F(y_it, x_it β + ν)

is no longer well approximated by a polynomial. As a general rule of thumb, you should use this quadrature approach only for small to moderate panel sizes (based on simulations, 50 is a reasonably safe upper bound). However, if the data really come from random-effects logit and rho is not too large (less than, say, .3), then the panel size could be 500 and the quadrature approximation would still be fine. If the data are not random-effects logit or rho is large (bigger than, say, .7), then the quadrature approximation may be poor for panel sizes larger than 10. The quadchk command should be used to investigate the applicability of the numeric technique used in this command.
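The Gauss-Hermite approximation itself can be sketched with an exact low-order rule. The snippet below hardcodes the 3-point rule (xtlogit's default is 12 points); the function name and the change of variables are ours, chosen to mirror the integral above:

```python
import math

# 3-point Gauss-Hermite rule: integral of exp(-x^2) g(x) dx ~ sum_m w_m g(a_m)
nodes = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
weights = [math.sqrt(math.pi) / 6,
           2 * math.sqrt(math.pi) / 3,
           math.sqrt(math.pi) / 6]

def normal_expectation(f, sigma):
    """Approximate E[f(v)] for v ~ N(0, sigma^2) via the substitution
    v = sqrt(2)*sigma*x, which absorbs the exp(-x^2) kernel."""
    total = sum(w * f(math.sqrt(2) * sigma * a) for w, a in zip(weights, nodes))
    return total / math.sqrt(math.pi)
```

The rule is exact for polynomial integrands up to degree 5, which is precisely why the approximation degrades when the product of logistic terms stops behaving like a low-order polynomial.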
References

Conway, M. R. 1990. A random effects model for binary data. Biometrics 46: 317-328.
Liang, K.-Y. and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.
Neuhaus, J. M. 1992. Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1: 249-273.
Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1991. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59: 25-35.
Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. International Statistical Review 64: 89-118.
Also See

Complementary:
[R] adjust, [R] lincom, [R] mfx, [R] predict, [R] quadchk, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:
[R] clogit, [R] logit, [R] xtclog, [R] xtgee, [R] xtprobit
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] xt
Title xtnbreg — Fixed-effects, random-effects, & population-averaged negative binomial models
Syntax

Random-effects and conditional fixed-effects overdispersion models

    xtnbreg depvar [varlist] [weight] [if exp] [in range] [, { re | fe } i(varname) irr
        noconstant constraints(numlist) noskip exposure(varname) offset(varname)
        level(#) nolog maximize_options ]

Population-averaged model

    xtnbreg depvar [varlist] [weight] [if exp] [in range] , pa [ i(varname) irr robust
        noconstant exposure(varname) offset(varname) level(#) nolog xtgee_options
        maximize_options ]

by ... : may be used with xtnbreg; see [R] by.

iweights, fweights, and pweights are allowed for the population-averaged model and iweights are allowed in the random-effects and fixed-effects models; see [U] 14.1.6 weight. Note that weights must be constant within panels.

xtnbreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

Random-effects and conditional fixed-effects overdispersion models

    predict [type] newvarname [if exp] [in range] [, { xb | stdp | nu0 | iru0 } nooffset ]

Population-averaged model

    predict [type] newvarname [if exp] [in range] [, { mu | rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
xtnbreg — Fixed-effects, random-effects, & population-averaged negative binomial models
Description

xtnbreg estimates random-effects overdispersion models, conditional fixed-effects overdispersion models, and population-averaged negative binomial models. Here "random-effects" and "fixed-effects" apply to the distribution of the dispersion parameter, and not to the xβ term in the model. In the random-effects and fixed-effects overdispersion models, the dispersion is the same for all elements in the same group (i.e., elements with the same value of the i() variable). In the random-effects model, the dispersion varies randomly from group to group such that the inverse of the dispersion has a Beta(r, s) distribution. In the fixed-effects model, the dispersion parameter in a group can take on any value, since a conditional likelihood is used in which the dispersion parameter drops out of the estimation.

By default, the population-averaged model is an equal-correlation model; xtnbreg assumes corr(exchangeable). See [R] xtgee for details on how to fit other population-averaged models.
Options

re requests the random-effects estimator. re is the default if none of re, fe, and pa are specified.

fe requests the conditional fixed-effects estimator.

pa requests the population-averaged estimator.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

irr reports exponentiated coefficients e^b rather than coefficients b. For the negative binomial model, exponentiated coefficients have the interpretation of incidence rate ratios.

noconstant suppresses the constant term (intercept) in the model.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. constraints(numlist) may not be specified with pa.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This constant-only model is used as the base model to compute a likelihood-ratio χ² statistic for the model test. By default, the model test uses an asymptotically equivalent Wald χ² statistic. For many models, this option can substantially increase estimation time.

exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with its coefficient constrained to be 1 is entered into the regression equation. offset() specifies a variable that is to be entered directly into the regression equation with its coefficient constrained to be 1; thus, exposure is assumed to be e^varname.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

robust (pa only) specifies that the Huber/White/sandwich estimator of variance is to be used in place of the IRLS variance estimator; see [R] xtgee. This alternative produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the resulting standard
errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

nolog suppresses the iteration log.

xtgee_options specifies any other options allowed by xtgee for family(nbinom) link(log); see [R] xtgee.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence.
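The equivalence of exposure() and offset() described above follows from the log link: entering ln(exposure) with coefficient 1 is the same as multiplying the predicted count by the exposure. A quick numerical check (xb and exposure are hypothetical illustrative values):

```python
import math

xb = 0.4          # illustrative linear prediction x*b (hypothetical value)
exposure = 250.0  # e.g., passenger-miles (hypothetical value)

# offset form: ln(exposure) enters the equation with coefficient 1
mu_offset = math.exp(xb + math.log(exposure))
# equivalent rate form: exposure scales the predicted rate
mu_scaled = exposure * math.exp(xb)
```

Both forms give the same predicted number of events, which is why offset() on a log-scale variable and exposure() on the raw variable are interchangeable.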
Options for predict

xb calculates the linear prediction. This is the default for the random-effects and fixed-effects models.

stdp calculates the standard error of the linear prediction.

nu0 calculates the predicted number of events, assuming u_i = 0.

iru0 calculates the predicted incidence rate, assuming u_i = 0.

mu and rate both calculate the predicted probability of depvar. mu takes into account the offset(), and rate ignores those adjustments. mu and rate are equivalent if you did not specify offset(). mu is the default for the population-averaged model.

nooffset is relevant only if you specified offset(varname) for xtnbreg. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_it b rather than x_it b + offset_it.
Remarks

xtnbreg is a convenience command if you want the population-averaged model. Typing

    . xtnbreg ..., ... pa exposure(time)

is equivalent to typing

    . xtgee ..., ... family(nbinom) link(log) corr(exchangeable) exposure(time)

Thus, also see [R] xtgee for information about xtnbreg.

By default, or when re is specified, xtnbreg estimates a maximum-likelihood random-effects overdispersion model.
Example

You have (fictional) data on injury "incidents" incurred among 20 airlines in each of 4 years. (Incidents range from major injuries to exceedingly minor ones.) The government agency in charge of regulating airlines has run an experimental safety training program and, in each of the years, some airlines have participated and some have not. You now wish to analyze whether the "incident" rate is affected by the program. You choose to estimate using random-effects negative binomial regression because the dispersion might vary across the airlines because of unidentified airline-specific reasons. Your measure of exposure is passenger miles for each airline in each year.
. xtnbreg i_cnt inprog, i(airline) exposure(pmiles) irr nolog

Random-effects negative binomial              Number of obs      =        80
Group variable (i) : airline                  Number of groups   =        20
Random effects u_i ~ Beta                     Obs per group: min =         4
                                                             avg =       4.0
                                                             max =         4
                                              Wald chi2(1)       =      2.04
Log likelihood  = -265.38202                  Prob > chi2        =    0.1532

------------------------------------------------------------------------------
       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |    .911673   .0590277    -1.43   0.153     .8030206    1.035027
      pmiles |  (exposure)
-------------+----------------------------------------------------------------
       /ln_r |   4.794991    .951781                      2.929535    6.660448
       /ln_s |   3.268052   .4709033                      2.345098    4.191005
-------------+----------------------------------------------------------------
           r |   120.9033   115.0735                      18.71892    780.9007
           s |   26.26013   12.36598                       10.4343    66.08918
------------------------------------------------------------------------------
Likelihood ratio test vs. pooled: chibar2(01) = 19.03 Prob>=chibar2 = 0.000
In the output above, the /ln_r and /ln_s lines refer to ln(r) and ln(s), where the inverse of the dispersion is assumed to follow a Beta(r, s) distribution. The output also includes a likelihood-ratio test which compares the panel estimator with the pooled estimator (i.e., a negative binomial estimator with constant dispersion).

You find that the incidence rate for accidents is not significantly different for participation in the program and that the panel estimator is significantly different from the pooled estimator.

We may alternatively estimate a fixed-effects overdispersion model:

. xtnbreg i_cnt inprog, i(airline) exposure(pmiles) irr fe nolog

Conditional fixed-effects negative binomial   Number of obs      =        80
Group variable (i) : airline                  Number of groups   =        20
                                              Obs per group: min =         4
                                                             avg =       4.0
                                                             max =         4
                                              Wald chi2(1)       =      2.11
Log likelihood  = -174.25143                  Prob > chi2        =    0.1463

------------------------------------------------------------------------------
       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .9062669   .0613917    -1.45   0.146      .793587    1.034946
      pmiles |  (exposure)
------------------------------------------------------------------------------
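Because irr only changes the display, the reported z statistic still belongs to the underlying coefficient. It can be recovered from the displayed IRR and its delta-method standard error (numbers copied from the fixed-effects output; the intermediate variable names are ours):

```python
import math

irr, se_irr = 0.9062669, 0.0613917  # inprog row of the fixed-effects output

b = math.log(irr)       # coefficient on the log scale
se_b = se_irr / irr     # delta method: se(IRR) = IRR * se(b)
z = b / se_b            # reproduces the reported z of -1.45
```

This is also why the test of IRR = 1 is the test of b = 0: exponentiation is monotone and the inference is done on the coefficient scale.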
Example

Rerunning our previous example in order to fit a robust equal-correlation population-averaged model,

. xtnbreg i_cnt inprog, i(airline) exposure(pmiles) eform robust pa nolog

GEE population-averaged model                 Number of obs      =        80
Group variable:             airline           Number of groups   =        20
Link:                           log           Obs per group: min =         4
Family:       negative binomial(k=1)                         avg =       4.0
Correlation:           exchangeable                          max =         4
                                              Wald chi2(1)       =      1.28
Scale parameter:                  1           Prob > chi2        =    0.2571

                           (standard errors adjusted for clustering on airline)
------------------------------------------------------------------------------
             |             Semi-robust
       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |    .927275   .0617857    -1.13   0.257     .8137513    1.056636
      pmiles |  (exposure)
------------------------------------------------------------------------------
We may compare this with a pooled estimator with cluster robust variance estimates:

. nbreg i_cnt inprog, exposure(pmiles) robust cluster(airline) irr nolog

Negative binomial regression                  Number of obs      =        80
                                              Wald chi2(1)       =      0.60
Log likelihood = -274.55077                   Prob > chi2        =    0.4369

                           (standard errors adjusted for clustering on airline)
------------------------------------------------------------------------------
             |               Robust
       i_cnt |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      inprog |   .9429015   .0713091    -0.78   0.437     .8130032    1.093555
      pmiles |  (exposure)
-------------+----------------------------------------------------------------
    /lnalpha |  -2.835089   .3351784                     -3.492027   -2.178151
-------------+----------------------------------------------------------------
       alpha |   .0587133   .0196794                      .0304391    .1132507
------------------------------------------------------------------------------
Saved Results

xtnbreg, re saves in e():

Scalars
  e(N)       number of observations
  e(k)       number of estimated parameters
  e(k_eq)    number of equations
  e(k_dv)    number of dependent variables
  e(N_g)     number of groups
  e(g_min)   smallest group size
  e(g_avg)   average group size
  e(g_max)   largest group size
  e(ll)      log likelihood
  e(ll_0)    log likelihood, constant-only model
  e(ll_c)    log likelihood, comparison model
  e(df_m)    model degrees of freedom
  e(chi2)    model χ²
  e(p)       model significance
  e(chi2_c)  χ² for comparison test
  e(r)       value of r in Beta(r,s)
  e(s)       value of s in Beta(r,s)
  e(ic)      number of iterations
  e(rc)      return code

Macros
  e(cmd)       xtnbreg
  e(cmd2)      xtn_re
  e(depvar)    name of dependent variable
  e(title)     title in estimation output
  e(ivar)      variable denoting groups
  e(wtype)     weight type
  e(wexp)      weight expression
  e(method)    estimation method
  e(user)      name of likelihood-evaluation program
  e(opt)       type of optimization
  e(chi2type)  Wald or LR; type of model χ² test
  e(chi2_ct)   Wald or LR; type of model χ² test corresponding to e(chi2_c)
  e(offset)    offset
  e(distrib)   Beta; the distribution of the random effect
  e(predict)   program used to implement predict
  e(cnslist)   constraint numbers

Matrices
  e(b)  coefficient vector
  e(V)  variance-covariance matrix of the estimators

Functions
  e(sample)  marks estimation sample
xtnbreg, fe saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of estimated parameters
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(N_g)        number of groups
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(g_max)      largest group size
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(df_m)       model degrees of freedom
    e(chi2)       model χ2
    e(p)          model significance
    e(ic)         number of iterations
    e(rc)         return code

Macros
    e(cmd)        xtnbreg
    e(cmd2)       xtn_fe
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(ivar)       variable denoting groups
    e(wtype)      weight type
    e(wexp)       weight expression
    e(method)     requested estimation method
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   Wald or LR; type of model χ2 test
    e(offset)     offset
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
xtnbreg, pa saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(g_max)      largest group size
    e(df_m)       model degrees of freedom
    e(chi2)       model χ2
    e(df_pear)    degrees of freedom for Pearson χ2
    e(deviance)   deviance
    e(chi2_dev)   χ2 test of deviance
    e(dispers)    deviance dispersion
    e(chi2_dis)   χ2 test of deviance dispersion
    e(tol)        target tolerance
    e(dif)        achieved tolerance
    e(phi)        scale parameter

Macros
    e(cmd)        xtgee
    e(cmd2)       xtnbreg
    e(depvar)     name of dependent variable
    e(family)     negative binomial(k=1)
    e(link)       log; link function
    e(corr)       correlation structure
    e(scale)      x2, dev, phi, or #; scale parameter
    e(ivar)       variable denoting groups
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald; type of model χ2 test
    e(offset)     offset
    e(nbalpha)    α
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(R)          estimated working correlation matrix
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

xtnbreg is implemented as an ado-file.

xtnbreg, pa reports the population-averaged results obtained by using xtgee, family(nbreg) link(log) to obtain estimates. See [R] xtgee for details on the methods and formulas.

For the random-effects and fixed-effects overdispersion models, we let y_it be the count for the tth observation in the ith group. We begin with the model y_it | γ_it ~ Poisson(γ_it), where γ_it | δ_i ~ Gamma(λ_it, δ_i) with λ_it = exp(x_it β + offset_it) and δ_i is the dispersion parameter. This yields the model

    \Pr(Y_{it} = y_{it} \mid x_{it}, \delta_i)
        = \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it} + 1)}
          \left(\frac{1}{1+\delta_i}\right)^{\lambda_{it}}
          \left(\frac{\delta_i}{1+\delta_i}\right)^{y_{it}}

Looking at within-group effects only, this specification yields a negative binomial model for the ith group with dispersion (variance divided by the mean) equal to 1 + δ_i; i.e., constant dispersion within group. Note that this parameterization of the negative binomial model differs from the default parameterization of nbreg, which has dispersion equal to 1 + α exp(xβ + offset); see [R] nbreg.

For a random-effects overdispersion model, we allow δ_i to vary randomly across groups; namely, we assume that 1/(1 + δ_i) ~ Beta(r, s). The joint probability of the counts for the ith group is

    \Pr(Y_{i1} = y_{i1}, \dots, Y_{in_i} = y_{in_i} \mid X_i)
        = \frac{\Gamma(r+s)\,\Gamma\bigl(r + \sum_{k=1}^{n_i}\lambda_{ik}\bigr)\,
                \Gamma\bigl(s + \sum_{k=1}^{n_i} y_{ik}\bigr)}
               {\Gamma(r)\,\Gamma(s)\,
                \Gamma\bigl(r + s + \sum_{k=1}^{n_i}\lambda_{ik} + \sum_{k=1}^{n_i} y_{ik}\bigr)}
          \prod_{t=1}^{n_i}
          \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it}+1)}

The resulting log likelihood is

    \ln L = \sum_{i=1}^{n} w_i \Bigl[
          \ln\Gamma(r+s)
        + \ln\Gamma\bigl(r + {\textstyle\sum_{k=1}^{n_i}}\lambda_{ik}\bigr)
        + \ln\Gamma\bigl(s + {\textstyle\sum_{k=1}^{n_i}} y_{ik}\bigr)
        - \ln\Gamma(r) - \ln\Gamma(s)
        - \ln\Gamma\bigl(r + s + {\textstyle\sum_{k=1}^{n_i}}\lambda_{ik}
                               + {\textstyle\sum_{k=1}^{n_i}} y_{ik}\bigr)
        + \sum_{t=1}^{n_i}\bigl\{ \ln\Gamma(\lambda_{it} + y_{it})
                                 - \ln\Gamma(\lambda_{it})
                                 - \ln\Gamma(y_{it} + 1) \bigr\} \Bigr]

where λ_it = exp(x_it β + offset_it) and w_i is the weight for the ith group.

For the fixed-effects overdispersion model, we condition the joint probability of the counts for each group on the sum of the counts for the group (i.e., the observed Σ_t y_it). The conditional log likelihood is

    \ln L = \sum_{i=1}^{n} \Bigl[
          \ln\Gamma\bigl({\textstyle\sum_{t=1}^{n_i}}\lambda_{it}\bigr)
        + \ln\Gamma\bigl({\textstyle\sum_{t=1}^{n_i}} y_{it} + 1\bigr)
        - \ln\Gamma\bigl({\textstyle\sum_{t=1}^{n_i}}\lambda_{it}
                        + {\textstyle\sum_{t=1}^{n_i}} y_{it}\bigr)
        + \sum_{t=1}^{n_i}\bigl\{ \ln\Gamma(\lambda_{it} + y_{it})
                                 - \ln\Gamma(\lambda_{it})
                                 - \ln\Gamma(y_{it} + 1) \bigr\} \Bigr]
See Hausman et al. (1984) for a more thorough development of the random-effects and fixed-effects models. Note that Hausman et al. (1984) use a δ that is the inverse of the δ we have used here. Also, see Cameron and Trivedi (1998) for a good textbook treatment of this model.
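To make the random-effects log likelihood above concrete, here is a minimal Python sketch (our own illustration, not Stata code; the function name and the (λ, y)-pair data layout are assumptions) that evaluates ln L using only the standard library:

```python
from math import lgamma, exp

def re_overdispersion_loglik(groups, r, s, weights=None):
    """Log likelihood of the random-effects overdispersion model.

    groups  : list of panels; each panel is a list of (lam, y) pairs, where
              lam = exp(x_it*beta + offset_it) and y is the observed count.
    r, s    : parameters of the Beta(r, s) distribution of 1/(1 + delta_i).
    weights : optional per-group weights w_i (default 1).
    """
    if weights is None:
        weights = [1.0] * len(groups)
    ll = 0.0
    for w_i, panel in zip(weights, groups):
        lam_sum = sum(lam for lam, _ in panel)
        y_sum = sum(y for _, y in panel)
        # Beta-function ratio from integrating out 1/(1+delta_i) ~ Beta(r, s)
        term = (lgamma(r + s) + lgamma(r + lam_sum) + lgamma(s + y_sum)
                - lgamma(r) - lgamma(s) - lgamma(r + s + lam_sum + y_sum))
        # per-observation negative binomial kernel
        term += sum(lgamma(lam + y) - lgamma(lam) - lgamma(y + 1)
                    for lam, y in panel)
        ll += w_i * term
    return ll
```

Because integrating out 1/(1 + δ_i) ~ Beta(r, s) yields a beta negative binomial distribution, the single-observation probabilities exp(ln L) sum to 1 over y = 0, 1, 2, ..., which is a useful sanity check on the implementation.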
References

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.

Hausman, J., B. H. Hall, and Z. Griliches. 1984. Econometric models for count data with an application to the patents-R&D relationship. Econometrica 52: 909-938.

Liang, K.-Y. and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.
Also See Complementary:
[R] adjust, [R] constraint, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:
[R] nbreg, [R] xtgee, [R] xtpois
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] xt
Title  xtpcse — OLS or Prais-Winsten models with panel-corrected standard errors
Syntax

xtpcse depvar [varlist] [weight] [if exp] [in range] [, correlation(corr)
    hetonly independent noconstant casewise pairwise rhotype(calc) nmk
    np1 detail level(#) ]

where corr is one of { independent | ar1 | psar1 }

and calc is one of { regress | freg | tscorr | dw }

xtpcse is for use with time-series data; see [R] tsset. You must tsset your data before using xtpcse.

by ... : may be used with xtpcse; see [R] by.

depvar and varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

aweights and iweights are allowed; see [U] 14.1.6 weight. Note that weights must be constant within panel when option correlation(ar1) or option correlation(psar1) is specified; weights must be constant within time periods when disturbances are assumed correlated between panels, that is, when neither option independent nor option hetonly is specified.

xtpcse shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtpcse produces panel-corrected standard error (PCSE) estimates for linear cross-sectional time-series models, where the parameters are estimated by OLS or Prais-Winsten regression. When computing the standard errors and the variance-covariance estimates, the disturbances are, by default, assumed to be heteroskedastic and contemporaneously correlated across panels.

See [R] xtgls for the generalized least squares estimator for these models.
Options

correlation(corr) specifies the form of assumed autocorrelation within panels.

    correlation(independent), the default, specifies that there is no autocorrelation.

    correlation(ar1) specifies that within panels, there is first-order autocorrelation AR(1) and that the coefficient of the AR(1) process is common to all the panels.
    correlation(psar1) specifies that within panels, there is first-order autocorrelation and that the coefficient of the AR(1) process is specific to each panel. psar1 stands for panel-specific AR(1).

hetonly and independent specify alternate forms for the assumed covariance of the disturbances across the panels. If neither is specified, the default is to assume that the disturbances are heteroskedastic (each panel has its own variance) and that the disturbances are contemporaneously correlated across the panels (each pair of panels has its own covariance). This is the standard model for PCSEs.

    hetonly specifies that the disturbances are assumed to be panel-level heteroskedastic only and that there is no assumed contemporaneous correlation across panels.

    independent specifies that the disturbances are assumed to be independent across panels; that is, there is a single disturbance variance common to all observations.

noconstant suppresses the constant term (intercept) in the model.

casewise and pairwise specify how missing observations in unbalanced panels are to be treated when estimating the interpanel covariance matrix of the disturbances. The default is casewise selection.

    casewise specifies that the entire covariance matrix be computed only on the observations (time periods) that are available for all panels. If an observation has missing data, all observations of that time period are excluded when estimating the covariance matrix of disturbances. Specifying casewise ensures that the estimated covariance matrix will be of full rank and positive definite.

    pairwise specifies that, for each element in the covariance matrix, all available observations (time periods) that are common to the two panels contributing to the covariance be used to compute the covariance.

    Options casewise and pairwise have an effect only when the panels are unbalanced and neither hetonly nor independent is specified.
rhotype(calc) specifies the method to be used to calculate the autocorrelation parameter. Allowed strings for calc are

    regress    regression using lags (the default)
    freg       regression using leads
    dw         Durbin-Watson calculation
    tscorr     time-series autocorrelation calculation

All the calculations are asymptotically equivalent and all are consistent; this is a rarely used option.

nmk specifies that standard errors are to be normalized by N − k, where k is the number of parameters estimated, rather than N, the number of observations. Different authors have used one or the other normalization. Greene (2000, 600) recommends N and notes that whether you use N or N − k does not make the variance calculation unbiased in these models.

np1 specifies that the panel-specific autocorrelations be weighted by T_i − 1 rather than the default T_i when estimating a common ρ for all panels, where T_i is the number of observations in panel i. This option has an effect only when panels are unbalanced and option correlation(ar1) is specified.

detail specifies that a detailed list of any gaps in the series be reported.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
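To illustrate two of the rhotype() calculations, here is a small Python sketch (our own hypothetical helpers, not Stata source; they implement the standard textbook versions of the two formulas) applied to a single panel's residual series:

```python
def rho_regress(e):
    """AR(1) estimate from regressing e_t on e_{t-1}, as in rhotype(regress)."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(e[t - 1] ** 2 for t in range(1, len(e)))
    return num / den

def rho_tscorr(e):
    """Time-series autocorrelation: the same cross-product, but normalized by
    the sum of squared residuals over ALL periods, as in rhotype(tscorr)."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den
```

The two denominators differ only in edge terms, so the estimates agree as T grows, which is the asymptotic equivalence noted above.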
Options for predict

xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.
Remarks

xtpcse is an alternative to feasible generalized least squares (FGLS)—see [R] xtgls—for estimating linear cross-sectional time-series models when the disturbances are not assumed to be independent and identically distributed (iid). Instead, the disturbances are assumed to be either heteroskedastic across panels or heteroskedastic and contemporaneously correlated across panels. The disturbances may also be assumed to be autocorrelated within panel, and the autocorrelation parameter may be either constant across panels or different for each panel.

We can write such models as

    y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + \epsilon_{it}

where i = 1, ..., m is the number of units (or panels); t = 1, ..., T_i; T_i is the number of time periods in panel i; and ε_it is a disturbance that may be autocorrelated along t or contemporaneously correlated across i.

This model can also be written panel-by-panel as

    \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_m \end{bmatrix}
    = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_m \end{bmatrix}
      \boldsymbol{\beta}
    + \begin{bmatrix} \boldsymbol{\epsilon}_1 \\ \boldsymbol{\epsilon}_2 \\ \vdots \\
      \boldsymbol{\epsilon}_m \end{bmatrix}

For a model with heteroskedastic disturbances and contemporaneous correlation but with no autocorrelation, the disturbance covariance matrix is assumed to be

    E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}'] = \boldsymbol{\Omega}
    = \begin{bmatrix}
        \sigma_{11}\mathbf{I} & \sigma_{12}\mathbf{I} & \cdots & \sigma_{1m}\mathbf{I} \\
        \sigma_{21}\mathbf{I} & \sigma_{22}\mathbf{I} & \cdots & \sigma_{2m}\mathbf{I} \\
        \vdots & \vdots & \ddots & \vdots \\
        \sigma_{m1}\mathbf{I} & \sigma_{m2}\mathbf{I} & \cdots & \sigma_{mm}\mathbf{I}
      \end{bmatrix}

where σ_ii is the variance of the disturbances for panel i, σ_ij is the covariance of the disturbances between panels i and j, and I is an identity matrix whose dimension is the (common) panel length when the panels are balanced.
xtgls produces full FGLS parameter and variance-covariance estimates. These estimates are conditional on the estimates of the disturbance covariance matrix and on any autocorrelation parameters that are estimated; see Kmenta (1997), Greene (2000), Davidson and MacKinnon (1993), or Judge et al. (1985).

Both estimators are consistent as long as the conditional mean (x_it β) is correctly specified. If the assumed covariance structure is correct, FGLS estimates produced by xtgls are more efficient. Beck and Katz (1995) have shown, however, that the full FGLS variance-covariance estimates are typically unacceptably optimistic (anti-conservative) when used with the type of data analyzed by most social scientists—10 to 20 panels and 10 to 40 time periods per panel. They show that the OLS or Prais-Winsten estimates with PCSEs have coverage probabilities that are closer to nominal.

Since the covariance matrix elements σ_ij are estimated from panels i and j using observations that have common time periods, estimators for this model achieve their asymptotic behavior as the T_i's approach infinity. This is in contrast to the random- and fixed-effects estimators, which assume a different model and are asymptotic in the number of panels m; see [R] xtreg for the random- and fixed-effects estimators.

Although xtpcse will allow other disturbance covariance structures, the term PCSE, as used in the literature, refers specifically to models that are both heteroskedastic and contemporaneously correlated across panels, with or without autocorrelation.
> Example

Grunfeld and Griliches (1960) performed an analysis of a company's current-year gross investment (invest) as determined by the company's prior-year market value (mvalue) and the prior year's value of the company's plant and equipment (kstock). The dataset includes 10 companies over 20 years, from 1935 through 1954, and is a classic dataset for demonstrating cross-sectional time-series analysis. Greene (2000) reproduces the dataset.

To use xtpcse, the data must be organized in "long form"; that is, each observation must represent a record for a specific company at a specific time; see [R] reshape. In the Grunfeld data, company is a categorical variable identifying the company and year is a variable recording the year. Here are the first few records:

. list in 1/5

       company    year    invest    mvalue    kstock    time
  1.         1    1935     317.6    3078.5       2.8       1
  2.         1    1936     391.8    4661.7      52.6       2
  3.         1    1937     410.6    5387.1     156.9       3
  4.         1    1938     257.7    2792.2     209.2       4
  5.         1    1939     330.8    4313.2     203.4       5

To compute PCSEs, Stata must be able to identify the panel to which each observation belongs and be able to match the time periods across the panels. We tell Stata how to do this matching by specifying the time and panel variables using tsset; see [R] tsset. Since the data are annual, we specify the yearly option.

. tsset company year, yearly
        panel variable:  company, 1 to 10
         time variable:  year, 1935 to 1954
We can obtain OLS parameter estimates for a linear model of invest on mvalue and kstock while allowing the standard errors (and variance-covariance matrix of the estimates) to be consistent when the disturbances from each observation are not independent. Specifically, we want the standard errors to be robust to each company having a different variance of the disturbances and to each company's observations being correlated with those of the other companies through time.
This model is estimated in Stata by typing

. xtpcse invest mvalue kstock

Linear regression, correlated panels corrected standard errors (PCSEs)
Group variable:    company              Number of obs      =       200
Time variable:     year                 Number of groups   =        10
Panels:            correlated (balanced)    Obs per group: min =    20
Autocorrelation:   no autocorrelation                      avg =    20
                                                           max =    20
Estimated covariances      =  55        R-squared          =    0.8124
Estimated autocorrelations =   0        Wald chi2(2)       =    637.41
Estimated coefficients     =   3        Prob > chi2        =    0.0000

                 Panel-corrected
      invest        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

      mvalue     .1155622    .0072124   16.02   0.000      .101426    .1296983
      kstock     .2306785    .0278862    8.27   0.000     .1760225    .2853345
       _cons    -42.71437    6.780965   -6.30   0.000    -56.00482   -29.42392
> Example

xtgls will produce more efficient FGLS estimates of the model's parameters, but with the disadvantage that the standard error estimates are conditional on the estimated disturbance covariance. Beck and Katz (1995) argue that the improvement in power using FGLS with such data is small and that the standard error estimates from FGLS are unacceptably optimistic (anti-conservative).

The FGLS model is estimated by typing

. xtgls invest mvalue kstock, panels(correlated)
Cross-sectional time-series FGLS regression
Coefficients:  generalized least squares
Panels:        heteroskedastic with cross-sectional correlation
Correlation:   no autocorrelation
Estimated covariances      =  55          Number of obs        =       200
Estimated autocorrelations =   0          Number of groups     =        10
Estimated coefficients     =   3          No. of time periods  =        20
                                          Wald chi2(2)         =   3738.07
Log likelihood             = -879.4274    Prob > chi2          =    0.0000

      invest        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

      mvalue     .1127515    .0022364   50.42   0.000     .1083683    .1171347
      kstock     .2231176    .0057363   38.90   0.000     .2118746    .2343605
       _cons    -39.84382    1.717563  -23.20   0.000    -43.21018   -36.47746

The coefficients between the two models are very close; the constants differ substantially, but we are generally not interested in the constant. As Beck and Katz observed, the standard errors for the FGLS model are 50 to 100% smaller than those for the OLS model with PCSEs.

If we were also concerned about autocorrelation of the disturbances, we could specify correlation(ar1) to obtain a model with a common AR(1) parameter along with our other concerns about the disturbances.
. xtpcse invest mvalue kstock, correlation(ar1)
(note: estimates of rho outside {-1,1} bounded to be in the range {-1,1})

Prais-Winsten regression, correlated panels corrected standard errors (PCSEs)
Group variable:    company              Number of obs      =       200
Time variable:     year                 Number of groups   =        10
Panels:            correlated (balanced)    Obs per group: min =    20
Autocorrelation:   common AR(1)                            avg =    20
                                                           max =    20
Estimated covariances      =  55        R-squared          =    0.6051
Estimated autocorrelations =   1        Wald chi2(2)       =     93.71
Estimated coefficients     =   3        Prob > chi2        =    0.0000

                 Panel-corrected
      invest        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

      mvalue     .0950157    .0129934    7.31   0.000     .0695492    .1204822
      kstock      .306005    .0603718    5.07   0.000     .1876784    .4243317
       _cons    -39.12569    30.50355   -1.28   0.200    -98.91154    20.66016

         rho     .9059774

The estimate of the autocorrelation parameter is high, .906, and the standard errors are larger than for the model without autocorrelation, which is to be expected if there is autocorrelation.
> Example

Let's estimate panel-specific autocorrelation parameters and change the method of estimating the autocorrelation parameter to the one typically used to estimate autocorrelations in time-series analysis.

. xtpcse invest mvalue kstock, correlation(psar1) rhotype(tscorr)

Prais-Winsten regression, correlated panels corrected standard errors (PCSEs)
Group variable:    company              Number of obs      =       200
Time variable:     year                 Number of groups   =        10
Panels:            correlated (balanced)    Obs per group: min =    20
Autocorrelation:   panel-specific AR(1)                    avg =    20
                                                           max =    20
Estimated covariances      =  55        R-squared          =    0.8895
Estimated autocorrelations =  10        Wald chi2(2)       =    444.53
Estimated coefficients     =   3        Prob > chi2        =    0.0000

                 Panel-corrected
      invest        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

      mvalue     .1052613    .0086018   12.24   0.000     .0884021    .1221205
      kstock     .3386743    .0367568    9.21   0.000     .2666322    .4107163
       _cons    -58.18714    12.63687   -4.60   0.000    -82.95496   -33.41933

        rhos =  .5135627   .87017   .9023497   .63368   .8571502 ...  .8752707
Beck and Katz (1995, 121) make a case against estimating panel-specific AR parameters as opposed to a single AR parameter for all panels.
> Example

We can also diverge from PCSEs to estimate standard errors that are panel corrected, but only for panel-level heteroskedasticity; that is, each company has a different variance of the disturbances. Allowing also for autocorrelation, we would type

. xtpcse invest mvalue kstock, correlation(ar1) hetonly
(note: estimates of rho outside {-1,1} bounded to be in the range {-1,1})

Prais-Winsten regression, heteroskedastic panels corrected standard errors
Group variable:    company              Number of obs      =       200
Time variable:     year                 Number of groups   =        10
Panels:            heteroskedastic (balanced)  Obs per group: min = 20
Autocorrelation:   common AR(1)                             avg =    20
                                                            max =    20
Estimated covariances      =  10        R-squared          =    0.6051
Estimated autocorrelations =   1        Wald chi2(2)       =     91.72
Estimated coefficients     =   3        Prob > chi2        =    0.0000

                  Het-corrected
      invest        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

      mvalue     .0950157    .0130872    7.26   0.000     .0693653    .1206661
      kstock      .306005     .061432    4.98   0.000     .1856006    .4264095
       _cons    -39.12569    26.16935   -1.50   0.135    -90.41666    12.16529

         rho     .9059774

With this specification, we are NOT obtaining what are referred to in the literature as PCSEs. These standard errors are in the same spirit as PCSEs, but are from the asymptotic covariance estimates of OLS without allowing for contemporaneous correlation.
Saved Results

xtpcse saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(n_cf)       number of estimated coefficients
    e(n_cv)       number of estimated covariances
    e(n_cr)       number of estimated correlations
    e(mss)        model sum of squares
    e(df)         degrees of freedom
    e(df_m)       model degrees of freedom
    e(rss)        residual sum of squares
    e(r2)         R-squared
    e(rmse)       root mean square error
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(rc)         return code
    e(chi2)       χ2
    e(p)          significance
    e(N_gaps)     number of gaps
    e(n_sigma)    obs used to estimate elements of Sigma

Macros
    e(cmd)        xtpcse
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(panels)     contemporaneous covariance structure
    e(corr)       correlation structure
    e(rhotype)    type of estimated correlation
    e(ivar)       variable denoting groups
    e(tvar)       variable denoting time
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald; type of model χ2 test
    e(predict)    program used to implement predict
    e(rho)        ρ
    e(cons)       noconstant or ""
    e(missmeth)   casewise or pairwise
    e(balance)    balanced or unbalanced

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(Sigma)      Σ matrix
    e(rhomat)     vector of autocorrelation parameter estimates

Functions
    e(sample)     marks estimation sample
Methods and Formulas

If no autocorrelation is specified, the parameters β are estimated by OLS; see [R] regress. If autocorrelation is specified, the parameters β are estimated by Prais-Winsten; see [R] prais.

When autocorrelation with panel-specific coefficients of correlation is specified (option correlation(psar1)), each panel-level ρ_i is computed from the residuals of an OLS regression across all panels; see [R] prais. When autocorrelation with a common coefficient of correlation is specified (option correlation(ar1)), the common correlation coefficient is computed as

    \rho = \frac{\rho_1 + \rho_2 + \cdots + \rho_m}{m}

where ρ_i is the estimated autocorrelation coefficient for panel i and m is the number of panels.

The covariance of the OLS or Prais-Winsten coefficients is

    \mathrm{Var}(\hat{\boldsymbol{\beta}})
    = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}\,
      (\mathbf{X}'\mathbf{X})^{-1}

where Ω is the full covariance matrix of the disturbances.
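The covariance formula above is a sandwich estimator and can be sketched numerically. The following Python toy (our own illustration, not Stata's implementation; it handles only a single regressor so that (X'X)⁻¹ is a scalar, and it assumes an estimate of Ω is already in hand) computes Var(β̂):

```python
def pcse_variance(x, omega):
    """(X'X)^-1 X' Omega X (X'X)^-1 for a model with one regressor.

    x     : stacked regressor values for all panels and periods
    omega : full n x n disturbance covariance matrix (list of rows)
    """
    n = len(x)
    xtx = sum(v * v for v in x)              # X'X (a scalar here)
    meat = sum(x[i] * omega[i][j] * x[j]     # X' Omega X
               for i in range(n) for j in range(n))
    return meat / (xtx * xtx)
```

With Ω = σ²I the expression collapses to the familiar OLS variance σ²/(X'X), which gives a quick sanity check.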
xtpcse — OLS er Prais-Winsten models with panel-corrected standard errors
403
When the panels are balanced, we can write Ω as

    \boldsymbol{\Omega} = \boldsymbol{\Sigma} \otimes \mathbf{I}

where Σ is the m by m panel-by-panel covariance matrix of the disturbances; see Remarks.

xtpcse estimates the elements of Σ as

    \hat{\Sigma}_{ij} = \frac{\hat{\boldsymbol{\epsilon}}_i'\,\hat{\boldsymbol{\epsilon}}_j}{T_{ij}}

where ε̂_i and ε̂_j are the residuals for panels i and j, respectively, that can be matched by time period, and where T_ij is the number of residuals between the panels i and j that can be matched by time period.

When the panels are balanced (each panel has the same number of observations and all time periods are common to all panels), T_ij = T, where T is the number of observations per panel.

When panels are unbalanced, xtpcse by default uses casewise selection, where only those residuals from time periods that are common to all panels are used to compute Σ̂_ij. In this case, T_ij = T*, where T* is the number of time periods common to all panels. When pairwise is specified, each Σ̂_ij is computed using all observations that can be matched by time period between the panels i and j.
Acknowledgment

We would like to thank Nathaniel Beck, Department of Political Science, University of California at San Diego, for his many helpful comments.
References

Beck, N. and J. N. Katz. 1995. What to do (and not to do) with time-series cross-section data. American Political Science Review 89: 634-647.

Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Grunfeld, Y. and Z. Griliches. 1960. Is aggregation necessarily bad? Review of Economics and Statistics 42: 1-13.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.
Also See Complementary:
[R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab
Related:
[R] newey, [R] prais, [R] regress, [R] svy estimators, [R] xtgee, [R] xtgls, [R] xtreg, [R] xtregar
Background:
[U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [R] xt
Title  xtpois — Fixed-effects, random-effects, and population-averaged Poisson models

Syntax

Random-effects model

xtpois depvar [varlist] [weight] [if exp] [in range] [, re i(varname) irr
    quad(#) constraints(numlist) noskip noconstant level(#) normal
    offset(varname) nolog maximize_options ]

Conditional fixed-effects model

xtpois depvar [varlist] [weight] [if exp] [in range] , fe [ i(varname) irr
    constraints(numlist) noskip level(#) offset(varname) nolog
    maximize_options ]

Population-averaged model

xtpois depvar [varlist] [weight] [if exp] [in range] , pa [ i(varname) irr
    noconstant level(#) robust offset(varname) nolog xtgee_options
    maximize_options ]

by ... : may be used with xtpois; see [R] by.

iweights, fweights, and pweights are allowed for the population-averaged model, and iweights are allowed in the random-effects and fixed-effects models; see [U] 14.1.6 weight. Note that weights must be constant within panels.

xtpois shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

Random-effects and fixed-effects models

predict [type] newvarname [if exp] [in range] [, { xb | stdp | nu0 | iru0 } nooffset ]

Population-averaged model

predict [type] newvarname [if exp] [in range] [, { mu | rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtpois estimates random-effects, conditional fixed-effects, and population-averaged Poisson models. Whenever we refer to a fixed-effects model, we mean the conditional fixed-effects model.

Note: xtpois, re normal is slow since it is calculated by quadrature; see Methods and Formulas. Computation time is roughly proportional to the number of points used for the quadrature. The default is quad(12). Simulations indicate that increasing it does not appreciably change the estimates for the coefficients or their standard errors. See [R] quadchk.

By default, the population-averaged model is an equal-correlation model; xtpois assumes corr(exchangeable). See [R] xtgee for information on how to fit other population-averaged models.
Options

re, the default, requests the random-effects estimator.

fe requests the fixed-effects estimator.

pa requests the population-averaged estimator.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

irr reports exponentiated coefficients e^b rather than coefficients b. For the Poisson model, exponentiated coefficients have the interpretation of incidence rate ratios.

quad(#) specifies the number of points to use in the quadrature approximation of the integral. This option is relevant only if you are estimating a random-effects model; if you specify quad(#) with pa, the quad(#) is ignored. The default is quad(12). The number specified must be an integer between 4 and 30, and must be no greater than the number of observations.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. constraints(numlist) may not be specified with pa.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

noconstant suppresses the constant term (intercept) in the model.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

normal specifies that the random effects follow a normal distribution instead of a gamma distribution.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the IRLS variance estimator; see [R] xtgee. This alternative produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the resulting standard errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

nolog suppresses the iteration log.

xtgee_options specifies any other options allowed by xtgee for family(poisson) link(log); see [R] xtgee.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict

xb calculates the linear prediction. This is the default for the random-effects and fixed-effects models.

mu and rate both calculate the predicted probability of depvar. mu takes into account the offset(), and rate ignores those adjustments. mu and rate are equivalent if you did not specify offset(). mu is the default for the population-averaged model.

stdp calculates the standard error of the linear prediction.

nu0 calculates the predicted number of events assuming u_i = 0.

iru0 calculates the predicted incidence rate assuming u_i = 0.

nooffset is relevant only if you specified offset(varname) for xtpois. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_i b rather than x_i b + offset_i.
Remarks

xtpois is a convenience command if you want the population-averaged model. Typing

    . xtpois ..., ... pa exposure(time)

is equivalent to typing

    . xtgee ..., ... family(poisson) link(log) corr(exchangeable) exposure(time)
Thus, also see [R] xtgee for information about xtpois. By default, or when re is specified, xtpois estimates a maximum-likelihood random-effects model.
> Example

You have data on the number of ship accidents for 5 different types of ships (McCullagh and Nelder 1989, 205). You wish to analyze whether the "incident" rate is affected by the period in which the ship was constructed and operated. Your measure of exposure is months of service for the ship, and in this model we assume that the random effects are distributed as Gamma(θ, θ).
. xtpois accident op_75_79 co_65_69 co_70_74 co_75_79, i(ship) ex(service) irr
> nolog

Random-effects Poisson                          Number of obs      =        34
Group variable (i) : ship                       Number of groups   =         5

Random effects u_i ~ Gamma                      Obs per group: min =         6
                                                               avg =       6.8
                                                               max =         7

                                                Wald chi2(4)       =     50.90
Log likelihood  = -71.811217                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
    accident |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    op_75_79 |   1.466305   .1734005     3.24   0.001     1.162957    1.848777
    co_65_69 |   2.032643    .304083     4.74   0.000     1.515982     2.72512
    co_70_74 |   2.356853   .3999259     5.05   0.000     1.690033    3.286774
    co_75_79 |   1.641913   .3811398     2.14   0.033      1.04174     2.58786
     service | (exposure)
-------------+----------------------------------------------------------------
    /lnalpha |  -2.368406   .8474597                     -4.029397   -.7074155
-------------+----------------------------------------------------------------
       alpha |   .0936298   .0793475                      .0177851    .4929165
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0: chibar2(01) =   10.61  Prob>=chibar2 = 0.001
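The chibar2(01) statistic is a boundary test: under the null hypothesis, alpha sits on the edge of its parameter space, so the p-value is half the upper tail of a χ²(1) distribution. A Python sketch of that calculation (illustrative, not Stata code):

```python
import math

# For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x/2)); the boundary test
# chibar2(01) halves that tail probability.
def chibar2_01_pvalue(stat):
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))

p = chibar2_01_pvalue(10.61)   # test statistic from the output above
assert round(p, 3) == 0.001
```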
In the output above, the alpha output line refers to α = 1/θ. The output also includes a likelihood-ratio test of α = 0, which compares the panel estimator with the pooled (Poisson) estimator. You find that the incidence rate for accidents is significantly different for the periods of construction and operation of the ships, and that the random-effects model is significantly different from the pooled model.

We may alternatively fit a fixed-effects specification instead of a random-effects specification:

. xtpois accident op_75_79 co_65_69 co_70_74 co_75_79, i(ship) ex(service) irr fe
Iteration 0:   log likelihood = -80.738973
Iteration 1:   log likelihood = -54.857546
Iteration 2:   log likelihood = -54.641897
Iteration 3:   log likelihood = -54.641859
Iteration 4:   log likelihood = -54.641859
Conditional fixed-effects Poisson               Number of obs      =        34
Group variable (i) : ship                       Number of groups   =         5

                                                Obs per group: min =         6
                                                               avg =       6.8
                                                               max =         7

                                                Wald chi2(4)       =     48.44
Log likelihood  = -54.641859                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
    accident |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    op_75_79 |   1.468831   .1737218     3.25   0.001     1.164926    1.852019
    co_65_69 |   2.008003   .3004803     4.66   0.000     1.497577    2.692398
    co_70_74 |    2.26693    .384865     4.82   0.000     1.625274    3.161912
    co_75_79 |   1.573695   .3669393     1.94   0.052     .9964273    2.485397
     service | (exposure)
------------------------------------------------------------------------------
Both of these models estimate the same thing. The difference will be in efficiency, depending on whether the assumptions of the random-effects model are true. Note that we could have assumed that the random effects followed a normal distribution, N(0, σ_ν²), instead of a gamma distribution, and obtained

. xtpois accident op_75_79 co_65_69 co_70_74 co_75_79, i(ship) ex(service) irr
> normal nolog

Random-effects poisson                          Number of obs      =        34
Group variable (i) : ship                       Number of groups   =         5

Random effects u_i ~ Gaussian                   Obs per group: min =         6
                                                               avg =       6.8
                                                               max =         7

                                                LR chi2(4)         =     55.93
Log likelihood  = -74.225924                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
    accident |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    op_75_79 |   1.470182   .1737941     3.26   0.001     1.166134    1.853505
    co_65_69 |   2.025867   .3030042     4.72   0.000     1.511119    2.715958
    co_70_74 |   2.336483   .3960786     5.01   0.000     1.675975    3.257299
    co_75_79 |   1.640625   .3777041     2.15   0.032     1.044831    2.576159
     service | (exposure)
-------------+----------------------------------------------------------------
    /lnsig2u |   -1.42662   .5613872    -2.54   0.011    -2.526919   -.3263217
-------------+----------------------------------------------------------------
     sigma_u |   .4900195   .1375453                      .2826744    .8494545
         rho |   .1936258   .0876621                      .0739925    .4191359
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) =   11.78  Prob>=chibar2 = 0.000
The output includes the additional panel-level variance component, parameterized as the log of the variance, ln(σ_ν²) (labeled lnsig2u in the output). The standard deviation σ_ν is also included in the output, labeled sigma_u, together with ρ (labeled rho),

    ρ = σ_ν² / (σ_ν² + 1)

which is the proportion of the total variance contributed by the panel-level variance component.

When rho is zero, the panel-level variance component is unimportant and the panel estimator is not different from the pooled estimator. A likelihood-ratio test of this is included at the bottom of the output. This test formally compares the pooled estimator (poisson) with the panel estimator. In this case, rho is not zero, so a panel estimator is indicated.
> Example

Rerunning our previous example in order to fit a robust equal-correlation population-averaged model,

. xtpois accident op_75_79 co_65_69 co_70_74 co_75_79, i(ship) ex(service)
> pa robust eform
Iteration 1: tolerance = .04083192
Iteration 2: tolerance = .00270188
Iteration 3: tolerance = .00030663
Iteration 4: tolerance = .00003466
Iteration 5: tolerance = 3.891e-06
Iteration 6: tolerance = 4.359e-07

GEE population-averaged model                   Number of obs      =        34
Group variable:                     ship        Number of groups   =         5
Link:                                log        Obs per group: min =         6
Family:                          Poisson                       avg =       6.8
Correlation:                exchangeable                       max =         7
                                                Wald chi2(3)       =    181.55
Scale parameter:                       1        Prob > chi2        =    0.0000

                             (standard errors adjusted for clustering on ship)
------------------------------------------------------------------------------
             |               Semi-robust
    accident |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    op_75_79 |   1.483299   .1197901     4.88   0.000     1.266153    1.737685
    co_65_69 |   2.038477   .1809524     8.02   0.000     1.712955    2.425859
    co_70_74 |   2.643467   .4093947     6.28   0.000     1.951407    3.580962
    co_75_79 |   1.876656     .33075     3.57   0.000     1.328511    2.650966
     service | (exposure)
------------------------------------------------------------------------------
We may compare this with a pooled estimator with cluster-robust variance estimates:

. poisson accident op_75_79 co_65_69 co_70_74 co_75_79, ex(service) robust
> cluster(ship) irr
Iteration 0:   log likelihood = -147.37993
Iteration 1:   log likelihood = -80.372714
Iteration 2:   log likelihood = -80.116093
Iteration 3:   log likelihood = -80.115916
Iteration 4:   log likelihood = -80.115916

Poisson regression                              Number of obs      =        34
                                                Wald chi2(3)       =     32.59
Log likelihood = -80.115916                     Prob > chi2        =    0.0000

                             (standard errors adjusted for clustering on ship)
------------------------------------------------------------------------------
             |                 Robust
    accident |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    op_75_79 |    1.47324   .1287036     4.44   0.000       1.2414    1.748377
    co_65_69 |   2.125914   .2850531     5.62   0.000     1.634603    2.764897
    co_70_74 |   2.860138   .6213563     4.84   0.000     1.868384    4.378325
    co_75_79 |   2.021926   .4265285     3.34   0.001     1.337221    3.057227
     service | (exposure)
------------------------------------------------------------------------------
Saved Results

xtpois, re saves in e():

Scalars
    e(N)         number of observations
    e(k)         number of variables
    e(k_eq)      number of equations
    e(k_dv)      number of dependent variables
    e(N_g)       number of groups
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(ll_c)      log likelihood, comparison model
    e(g_max)     largest group size
    e(g_min)     smallest group size
    e(g_avg)     average group size
    e(rc)        return code
    e(chi2)      χ²
    e(chi2_c)    χ² for comparison test
    e(p)         significance
    e(ic)        number of iterations
    e(n_quad)    number of quadrature points
    e(alpha)     alpha

Macros
    e(cmd)       xtpois
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(ivar)      variable denoting groups
    e(wtype)     weight type
    e(wexp)      weight expression
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  Wald or LR; type of model χ² test
    e(chi2_ct)   Wald or LR; type of model χ² test corresponding to e(chi2_c)
    e(offset1)   offset
    e(distrib)   Gamma; the distribution of the random effect
    e(predict)   program used to implement predict
    e(cnslist)   constraint numbers

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample
xtpois, re normal saves in e():

Scalars
    e(N)         number of observations
    e(N_g)       number of groups
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(ll_c)      log likelihood, comparison model
    e(g_max)     largest group size
    e(g_min)     smallest group size
    e(g_avg)     average group size
    e(chi2)      χ²
    e(chi2_c)    χ² for comparison test
    e(rho)       ρ
    e(sigma_u)   panel-level standard deviation
    e(N_cd)      number of completely determined obs.
    e(n_quad)    number of quadrature points

Macros
    e(cmd)       xtpois
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(ivar)      variable denoting groups
    e(wtype)     weight type
    e(wexp)      weight expression
    e(offset1)   offset
    e(chi2type)  Wald or LR; type of model χ² test
    e(chi2_ct)   Wald or LR; type of model χ² test corresponding to e(chi2_c)
    e(distrib)   Gaussian; the distribution of the random effect
    e(predict)   program used to implement predict
    e(cnslist)   constraint numbers

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample

xtpois, fe saves in e():

Scalars
    e(N)         number of observations
    e(k)         number of variables
    e(k_eq)      number of equations
    e(k_dv)      number of dependent variables
    e(N_g)       number of groups
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(g_max)     largest group size
    e(g_min)     smallest group size
    e(g_avg)     average group size
    e(rc)        return code
    e(chi2)      χ²
    e(p)         significance
    e(ic)        number of iterations

Macros
    e(cmd)       xtpois
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(ivar)      variable denoting groups
    e(wtype)     weight type
    e(wexp)      weight expression
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  LR; type of model χ² test
    e(offset1)   offset
    e(predict)   program used to implement predict
    e(cnslist)   constraint numbers

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample
xtpois, pa saves in e():

Scalars
    e(N)         number of observations
    e(N_g)       number of groups
    e(df_m)      model degrees of freedom
    e(g_max)     largest group size
    e(g_min)     smallest group size
    e(g_avg)     average group size
    e(chi2)      χ²
    e(df_pear)   degrees of freedom for Pearson χ²
    e(deviance)  deviance
    e(chi2_dev)  χ² test of deviance
    e(dispers)   deviance dispersion
    e(chi2_dis)  χ² test of deviance dispersion
    e(tol)       target tolerance
    e(dif)       achieved tolerance
    e(phi)       scale parameter

Macros
    e(cmd)       xtgee
    e(cmd2)      xtpois
    e(depvar)    name of dependent variable
    e(family)    Poisson
    e(link)      log; link function
    e(corr)      correlation structure
    e(scale)     x2, dev, phi, or #; scale parameter
    e(ivar)      variable denoting groups
    e(vcetype)   covariance estimation method
    e(chi2type)  Wald; type of model χ² test
    e(offset)    offset
    e(predict)   program used to implement predict

Matrices
    e(b)         coefficient vector
    e(R)         estimated working correlation matrix
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample
Methods and Formulas

xtpois is implemented as an ado-file.

xtpois, pa reports the population-averaged results obtained by using xtgee, family(poisson) link(log) to obtain estimates. See [R] xtgee for details on the methods and formulas.

While Hausman, Hall, and Griliches (1984) is the seminal article on the random-effects and fixed-effects models, Cameron and Trivedi (1998) has a good textbook treatment.

For a random-effects specification, we know that

    \Pr(Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i} \mid \nu_i)
      = \prod_{t=1}^{n_i} \frac{e^{-\lambda_{it}\nu_i} (\lambda_{it}\nu_i)^{y_{it}}}{y_{it}!}

where \lambda_{it} = \exp(x_{it}\beta). We may rewrite the above as

    \Pr(Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i} \mid \nu_i)
      = \left( \prod_{t=1}^{n_i} \frac{\lambda_{it}^{y_{it}}}{y_{it}!} \right)
        \nu_i^{\sum_{t=1}^{n_i} y_{it}}
        \exp\!\left( -\nu_i \sum_{t=1}^{n_i} \lambda_{it} \right)

We now assume that \nu_i follows a gamma distribution with expected value equal to 1 such that
    \Pr(Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i})
      = \int_0^\infty
        \left( \prod_{t=1}^{n_i} \frac{\lambda_{it}^{y_{it}}}{y_{it}!} \right)
        \nu_i^{\sum_t y_{it}}
        \exp\!\left( -\nu_i \sum_{t=1}^{n_i} \lambda_{it} \right)
        \frac{\theta^\theta}{\Gamma(\theta)} \, \nu_i^{\theta-1} e^{-\theta\nu_i} \, d\nu_i
      = \left( \prod_{t=1}^{n_i} \frac{\lambda_{it}^{y_{it}}}{y_{it}!} \right)
        \frac{\theta^\theta}{\Gamma(\theta)}
        \frac{\Gamma\!\left(\theta + \sum_{t=1}^{n_i} y_{it}\right)}
             {\left(\theta + \sum_{t=1}^{n_i} \lambda_{it}\right)^{\theta + \sum_{t=1}^{n_i} y_{it}}}
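This gamma mixture can be checked numerically; the following Python sketch (illustrative only; θ, the λ values, and the counts y are made up) integrates the Poisson panel likelihood against a Gamma(θ, θ) density and compares the result with the closed form above:

```python
import math

# Numerical check of the gamma-mixture result: integrating the Poisson panel
# likelihood against a Gamma(theta, theta) density (mean 1) reproduces the
# closed form with Gamma functions. All parameter values are hypothetical.
theta = 2.0
lam = [0.5, 1.2, 0.8]        # hypothetical lambda_it = exp(x_it * b)
y = [1, 0, 2]                # hypothetical counts for one panel

def integrand(nu):
    poisson = math.prod(math.exp(-l * nu) * (l * nu) ** k / math.factorial(k)
                        for l, k in zip(lam, y))
    gamma_pdf = (theta ** theta / math.gamma(theta)
                 * nu ** (theta - 1) * math.exp(-theta * nu))
    return poisson * gamma_pdf

# crude midpoint-rule integration over nu in (0, 40)
n = 20000
h = 40.0 / n
numeric = sum(integrand((j + 0.5) * h) for j in range(n)) * h

S_y, S_lam = sum(y), sum(lam)
closed = (math.prod(l ** k / math.factorial(k) for l, k in zip(lam, y))
          * theta ** theta * math.gamma(theta + S_y)
          / (math.gamma(theta) * (theta + S_lam) ** (theta + S_y)))

assert abs(numeric - closed) < 1e-6
```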
The log likelihood (assuming Gamma(θ, θ) heterogeneity) is then derived using

    u_i = \frac{\theta}{\theta + \sum_{t=1}^{n_i} \lambda_{it}},
    \qquad \lambda_{it} = \exp(x_{it}\beta)

such that the log likelihood may be written as

    L = \sum_{i=1}^{n} w_i \Biggl[
          \log\Gamma\Bigl(\theta + \sum_{t=1}^{n_i} y_{it}\Bigr)
          - \log\Gamma(\theta)
          - \sum_{t=1}^{n_i} \log\Gamma(y_{it}+1)
          + \theta \log u_i
          + \log(1-u_i) \sum_{t=1}^{n_i} y_{it}
          - \Bigl(\sum_{t=1}^{n_i} y_{it}\Bigr) \log\Bigl(\sum_{t=1}^{n_i} \lambda_{it}\Bigr)
          + \sum_{t=1}^{n_i} y_{it} (x_{it}\beta)
        \Biggr]
where w_i is the user-specified weight for panel i; if no weights are specified, w_i = 1.

Alternatively, if we assume a normal distribution, N(0, σ_ν²), for the random effects ν_i, we have

    \Pr(Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i})
      = \int_{-\infty}^{\infty}
        \frac{e^{-\nu_i^2/(2\sigma_\nu^2)}}{\sqrt{2\pi}\,\sigma_\nu}
        \prod_{t=1}^{n_i} F(y_{it}, x_{it}\beta + \nu_i) \, d\nu_i

where

    F(y, z) = \exp\{ -\exp(z) + yz - \log(y!) \}

and we can approximate the integral with M-point Gauss–Hermite quadrature

    \int_{-\infty}^{\infty} e^{-x^2} g(x) \, dx
      \approx \sum_{m=1}^{M} w_m^* \, g(a_m^*)

where the w_m^* denote the quadrature weights and the a_m^* denote the quadrature abscissas. The log likelihood L, where ρ = σ_ν² / (σ_ν² + 1), is then calculated using the quadrature

    L = \sum_{i=1}^{n} w_i \log\Biggl\{
          \frac{1}{\sqrt{\pi}} \sum_{m=1}^{M} w_m^*
          \prod_{t=1}^{n_i} F\bigl(y_{it},\, x_{it}\beta + \sqrt{2}\,\sigma_\nu a_m^*\bigr)
        \Biggr\}
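A minimal Python sketch of the M-point Gauss–Hermite rule used here (with M = 3, whose nodes and weights are known in closed form; the rule is exact for polynomial integrands up to degree 5):

```python
import math

# 3-point Gauss-Hermite quadrature for integrals of exp(-x^2) * g(x):
# known nodes a_m* and weights w_m* for M = 3.
nodes = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
weights = [math.sqrt(math.pi) / 6,
           2 * math.sqrt(math.pi) / 3,
           math.sqrt(math.pi) / 6]

def quad(g):
    # sum_m w_m* g(a_m*), approximating the integral of exp(-x^2) g(x)
    return sum(w * g(a) for w, a in zip(weights, nodes))

# checks against exact moments of the weight exp(-x^2):
assert math.isclose(quad(lambda x: 1.0), math.sqrt(math.pi))
assert math.isclose(quad(lambda x: x ** 4), 0.75 * math.sqrt(math.pi))
```

In practice M is larger (Stata's default here is 12 points), but the structure of the approximation is the same.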
The quadrature formula requires that the integrated function be well approximated by a polynomial. As the number of time periods becomes large (as panel size gets large), the integrand is no longer well approximated by a polynomial. As a general rule of thumb, you should use this quadrature approach only for small to moderate panel sizes (based on simulations, 50 is a reasonably safe upper bound). However, if the data really come from random-effects Poisson and rho is not too large (less than, say, .3), then the panel size could be 500 and the quadrature approximation would still be fine. If the data are not random-effects Poisson or rho is large (bigger than, say, .7), then the quadrature approximation may be poor for panel sizes larger than 10. The quadchk command should be used to investigate the applicability of the numeric technique used in this command.

For a fixed-effects specification, we know that

    \Pr(Y_{it} = y_{it} \mid x_{it})
      = \frac{\exp\{-\exp(\alpha_i + x_{it}\beta)\} \, \exp(\alpha_i + x_{it}\beta)^{y_{it}}}{y_{it}!}

Since we know that the observations are independent, we may write the joint probability for the observations within a panel as

    \Pr(Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i})
      = \prod_{t=1}^{n_i}
        \frac{\exp\{-\exp(\alpha_i + x_{it}\beta)\} \, \exp(\alpha_i + x_{it}\beta)^{y_{it}}}{y_{it}!}
      = \left( \prod_{t=1}^{n_i} \frac{\exp(x_{it}\beta)^{y_{it}}}{y_{it}!} \right)
        \exp\!\left\{ -\exp(\alpha_i) \sum_{t=1}^{n_i} \exp(x_{it}\beta)
                      + \alpha_i \sum_{t=1}^{n_i} y_{it} \right\}

and we also know that the sum of n_i independent Poisson random variables, each with parameter λ, is distributed as Poisson with parameter n_i λ, so we have

    \Pr\!\left( \sum_{t=1}^{n_i} Y_{it} = \sum_{t=1}^{n_i} y_{it} \right)
      = \frac{\exp\!\left\{ -\sum_{t=1}^{n_i} \exp(\alpha_i + x_{it}\beta) \right\}
              \left( \sum_{t=1}^{n_i} \exp(\alpha_i + x_{it}\beta) \right)^{\sum_{t=1}^{n_i} y_{it}}}
             {\left( \sum_{t=1}^{n_i} y_{it} \right)!}
So, the conditional likelihood is conditioned on the sum of the outcomes in the set (panel). The appropriate function is given by

    \Pr\!\left( Y_{i1}=y_{i1},\dots,Y_{in_i}=y_{in_i}
                \,\Bigm|\, \sum_{t=1}^{n_i} Y_{it} = \sum_{t=1}^{n_i} y_{it} \right)
      = \left( \sum_{t=1}^{n_i} y_{it} \right)!
        \prod_{t=1}^{n_i}
        \frac{\exp(x_{it}\beta)^{y_{it}} / y_{it}!}
             {\left( \sum_{s=1}^{n_i} \exp(x_{is}\beta) \right)^{y_{it}}}

which is free of α_i. So, the conditional log likelihood is given by

    L = \sum_{i=1}^{n} w_i \Biggl[
          \log\Gamma\Bigl( \sum_{t=1}^{n_i} y_{it} + 1 \Bigr)
          - \sum_{t=1}^{n_i} \log\Gamma(y_{it} + 1)
          + \sum_{t=1}^{n_i} y_{it} (x_{it}\beta)
          - \Bigl( \sum_{t=1}^{n_i} y_{it} \Bigr)
            \log\Bigl( \sum_{t=1}^{n_i} \exp(x_{it}\beta) \Bigr)
        \Biggr]
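The key step, that conditioning on the panel sum removes α_i, can be seen numerically; a Python sketch (the x_it β values below are made up):

```python
import math

# Within a panel, the conditional probabilities are multinomial with
# p_it = exp(x_it*b) / sum_s exp(x_is*b); any fixed effect alpha_i multiplies
# numerator and denominator equally and so cancels.
xb = [0.2, -0.1, 0.5]        # hypothetical x_it * b for one panel

def cond_probs(alpha):
    lam = [math.exp(alpha + v) for v in xb]
    total = sum(lam)
    return [l / total for l in lam]

p0 = cond_probs(0.0)
p1 = cond_probs(7.3)          # an arbitrary fixed effect
assert all(math.isclose(a, b) for a, b in zip(p0, p1))
```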
References

Cameron, A. C., and P. K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Hausman, J., B. H. Hall, and Z. Griliches. 1984. Econometric models for count data with an application to the patents-R&D relationship. Econometrica 52: 909-938.

Liang, K.-Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.

McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2d ed. London: Chapman & Hall.
xtprobit — Random-effects and population-averaged probit models
Note: xtprobit, re, the default, is slow since it is calculated by quadrature; see Methods and Formulas. Computation time is roughly proportional to the number of points used for the quadrature. The default is quad(12). Simulations indicate that increasing it does not appreciably change the estimates for the coefficients or their standard errors. See [R] quadchk.

By default, the population-averaged model is an equal-correlation model; xtprobit assumes the within-group correlation structure corr(exchangeable). See [R] xtgee for information on how to fit other population-averaged models.

See [R] logistic for a list of related estimation commands.
Options

re requests the random-effects estimator. re is the default if neither re nor pa is specified.

pa requests the population-averaged estimator.

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

quad(#) specifies the number of points to use in the quadrature approximation of the integral. This option is relevant only if you are estimating a random-effects model; if you specify quad(#) with pa, the quad(#) is ignored. The default is quad(12). The number specified must be an integer between 4 and 30, and must be no greater than the number of observations.

noconstant suppresses the constant term (intercept) in the model.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the IRLS variance estimator; see [R] xtgee. This alternative produces valid standard errors even if the correlations within group are not as hypothesized by the specified correlation structure. It does, however, require that the model correctly specifies the mean. As such, the resulting standard errors are labeled "semi-robust" instead of "robust". Note that although there is no cluster() option, results are as if there were a cluster() option and you specified clustering on i().

nolog suppresses the iteration log.

xtgee_options specifies any other options allowed by xtgee for family(binomial) link(probit), such as corr(); see [R] xtgee.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict

xb calculates the linear prediction. This is the default for the random-effects model.

pu0 calculates the probability of a positive outcome, assuming that the random effect for that observation's panel is zero (ν = 0). Note that this may not be similar to the proportion of observed outcomes in the group.

stdp calculates the standard error of the linear prediction.

mu and rate both calculate the predicted probability of depvar. mu takes into account the offset(), and rate ignores those adjustments. mu and rate are equivalent if you did not specify offset(). mu is the default for the population-averaged model.

nooffset is relevant only if you specified offset(varname) for xtprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_it b rather than x_it b + offset_it.
Remarks

xtprobit is a convenience command if you want the population-averaged model. Typing

    . xtprobit ..., ... pa ...

is equivalent to typing

    . xtgee ..., ... family(binomial) link(probit) corr(exchangeable)

Thus, also see [R] xtgee for information about xtprobit.

By default, or when re is specified, xtprobit estimates a maximum-likelihood random-effects model.
> Example

You are studying unionization of women in the United States and are using the union dataset; see [R] xt. You wish to estimate a random-effects model of union membership:
. xtprobit union age grade not_smsa south southXt, i(id) nolog

Random-effects probit                           Number of obs      =     26200
Group variable (i) : idcode                     Number of groups   =      4434

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       5.9
                                                               max =        12

                                                Wald chi2(5)       =    218.90
Log likelihood  = -10561.065                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0044483   .0025027     1.78   0.076     -.000457    .0093535
       grade |   .0482482   .0100413     4.80   0.000     .0285677    .0679287
    not_smsa |  -.1370699   .0462961    -2.96   0.003    -.2278087   -.0463312
       south |  -.6305824   .0614827   -10.26   0.000    -.7510863   -.5100785
     southXt |   .0131853   .0043819     3.01   0.003      .004597    .0217737
       _cons |  -1.846838   .1458222   -12.67   0.000    -2.132644   -1.561032
-------------+----------------------------------------------------------------
    /lnsig2u |   .5612193   .0431875                      .4765733    .6458653
-------------+----------------------------------------------------------------
     sigma_u |   1.323937   .0285888                      1.269073    1.381172
         rho |   .6367346   .0099894                      .6169384    .6560781
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) =  5972.49 Prob >= chibar2 = 0.000
The output includes the additional panel-level variance component, parameterized as the log of the variance, ln(σ_ν²) (labeled lnsig2u in the output). The standard deviation σ_ν is also included in the output, labeled sigma_u, together with ρ (labeled rho),

    ρ = σ_ν² / (σ_ν² + 1)

which is the proportion of the total variance contributed by the panel-level variance component.

When rho is zero, the panel-level variance component is unimportant and the panel estimator is not different from the pooled estimator. A likelihood-ratio test of this is included at the bottom of the output. This test formally compares the pooled estimator (probit) with the panel estimator.
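The relationship between the reported lnsig2u, sigma_u, and rho can be verified with a short Python sketch (illustrative, not Stata code), using the point estimate from the output above:

```python
import math

# Recovering sigma_u and rho from the lnsig2u estimate reported above, where
# lnsig2u = ln(sigma_u^2) and rho = sigma_u^2 / (sigma_u^2 + 1).
lnsig2u = 0.5612193            # point estimate from the xtprobit output
sigma_u = math.exp(lnsig2u / 2)
rho = sigma_u ** 2 / (sigma_u ** 2 + 1)

assert round(sigma_u, 4) == 1.3239   # matches sigma_u in the output
assert round(rho, 4) == 0.6367       # matches rho in the output
```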
Technical Note

The random-effects model is calculated using quadrature. As the panel sizes (or ρ) increase, the quadrature approximation becomes less accurate. We can use the quadchk command to see if changing the number of quadrature points affects the results. If the results do change, then the quadrature approximation is not accurate and the results of the model should not be interpreted.
. quadchk, nooutput
Refitting model quad() = 8
Refitting model quad() = 16

                         Quadrature check

               Fitted        Comparison      Comparison
               quadrature    quadrature      quadrature
               12 points     8 points        16 points
----------------------------------------------------------------------
Log            -10561.065    -10574.78       -10555.853
likelihood                   -13.714764       5.2126898   Difference
                              .00129862      -.00049358   Relative difference
----------------------------------------------------------------------
union:          .00444829     .00478943       .00451117
age                           .00034115       .00006288   Difference
                              .07669143       .01413662   Relative difference
----------------------------------------------------------------------
union:          .04824822     .05629525       .04411081
grade                         .00804704      -.00413741   Difference
                              .16678412      -.0857525    Relative difference
----------------------------------------------------------------------
union:         -.13706993    -.1314541       -.14109796
not_smsa                      .00561584      -.00402803   Difference
                             -.04097061       .02938665   Relative difference
----------------------------------------------------------------------
union:         -.63058241    -.62309654      -.64546968
south                         .00748587      -.01488727   Difference
                             -.01187136       .02360876   Relative difference
----------------------------------------------------------------------
union:          .01318534     .01194434       .01341723
southXt                      -.001241         .00023189   Difference
                             -.09411977       .01758658   Relative difference
----------------------------------------------------------------------
union:         -1.8468379    -1.9306422      -1.8066853
_cons                        -.08380426       .0401526    Difference
                              .04537716      -.02174127   Relative difference
----------------------------------------------------------------------
lnsig2u:        .56121927     .49078989       .58080961
_cons                        -.07042938       .01959034   Difference
                             -.12649352       .03490674   Relative difference
----------------------------------------------------------------------
Note that the results obtained for 12 quadrature points were closer to the results using 16 points than to the results using 8 points. However, since the convergence point seems to be sensitive to the number of quadrature points, we should not use this output. Since there is no alternative method for calculating the random-effects model in Stata, we may either fit a different model or use a different command. We should not use the output of a random-effects specification when there is evidence that the numeric technique for calculating the model is not stable (as shown by quadchk). A subjective rule of thumb is that the relative differences in the coefficients should not change by more than 1% if the quadrature technique is stable. See [R] quadchk for details. The important point to remember is that when the quadrature technique is not stable, you cannot merely increase the number of quadrature points to fix the problem.
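The differences and relative differences that quadchk reports can be reproduced from the fitted coefficients; a Python sketch using the union: age row above (the displayed values are rounded, so the last digit can differ slightly from the table):

```python
# Reproducing one row of the quadchk comparison from displayed coefficients.
fitted = 0.00444829   # union: age, fitted with 12 quadrature points
refit8 = 0.00478943   # refitted with 8 quadrature points

difference = refit8 - fitted
relative = difference / fitted

assert abs(difference - 0.00034114) < 1e-10
assert round(relative, 4) == 0.0767   # a 7.7% relative difference
```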
1> Example As an alternative to th^ random-effects specification, we can fit an equal-correlation probit model:
. xtprobit union age grade not_smsa south southXt, i(id) pa
Iteration 1: tolerance = .04796083
Iteration 2: tolerance = .00352657
Iteration 3: tolerance = .00017886
Iteration 4: tolerance = 8.654e-06
Iteration 5: tolerance = 4.150e-07

GEE population-averaged model                   Number of obs      =     26200
Group variable:                   idcode        Number of groups   =      4434
Link:                             probit        Obs per group: min =         1
Family:                         binomial                       avg =       5.9
Correlation:                exchangeable                       max =        12
                                                Wald chi2(5)       =    241.66
Scale parameter:                       1        Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0031597   .0014678     2.15   0.031     .0002829    .0060366
       grade |   .0329992   .0062334     5.29   0.000      .020782    .0452163
    not_smsa |  -.0721799   .0275189    -2.62   0.009    -.1261159   -.0182439
       south |   -.409029   .0372213   -10.99   0.000    -.4819815   -.3360765
     southXt |   .0081828    .002545     3.22   0.001     .0031946    .0131709
       _cons |  -1.184799   .0890117   -13.31   0.000    -1.359259     -1.01034
------------------------------------------------------------------------------
> Example

In [R] probit, we showed the above results and compared them with probit, robust cluster(). xtprobit with the pa option allows a robust option (the random-effects estimator does not allow the robust specification), so we can obtain the population-averaged probit estimator with the robust variance calculation by typing

. xtprobit union age grade not_smsa south southXt, i(id) pa robust nolog

GEE population-averaged model                   Number of obs      =     26200
Group variable:                   idcode        Number of groups   =      4434
Link:                             probit        Obs per group: min =         1
Family:                         binomial                       avg =       5.9
Correlation:                exchangeable                       max =        12
                                                Wald chi2(5)       =    154.00
Scale parameter:                       1        Prob > chi2        =    0.0000

                           (standard errors adjusted for clustering on idcode)
------------------------------------------------------------------------------
             |               Semi-robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0031597   .0022027     1.43   0.151    -.0011575     .007477
       grade |   .0329992   .0076631     4.31   0.000     .0179797    .0480186
    not_smsa |  -.0721799   .0348772    -2.07   0.038     -.140538   -.0038218
       south |   -.409029   .0482545    -8.48   0.000    -.5036061   -.3144519
     southXt |   .0081828   .0037108     2.21   0.027     .0009097    .0154558
       _cons |  -1.184799    .116457   -10.17   0.000    -1.413051   -.9565479
------------------------------------------------------------------------------

These standard errors are similar to those shown for probit, robust cluster() in [R] probit.
> Example

In a previous example, we showed how quadchk indicated that the quadrature technique was numerically unstable. Here we present an example where the quadrature is stable.

In this example, we have (synthetic) data on whether workers complain to managers at a fast-food restaurant. The covariates are age (in years, of the worker), grade (years of schooling completed by the worker), south (equal to 1 if the restaurant is located in the South), tenure (the number of years spent on the job by the worker), gender (of the worker), race (of the worker), income (in thousands of dollars, of the restaurant), genderm (gender of the manager), burger (equal to 1 if the restaurant specializes in hamburgers), and chicken (equal to 1 if the restaurant specializes in chicken). The model is given by

. xtprobit complain age grade south tenure gender race income genderm
> burger chicken, i(person) nolog

Random-effects probit                           Number of obs      =      5952
Group variable (i) : person                     Number of groups   =      1076

Random effects u_i ~ Gaussian                   Obs per group: min =         3
                                                               avg =       5.5
                                                               max =         8

                                                Wald chi2(10)      =     65.03
Log likelihood  = -2574.115                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
    complain |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0003157   .0762518    -0.00   0.997    -.1497664    .1491351
       grade |  -.0411126   .0647727    -0.63   0.526    -.1680648    .0858396
       south |  -.0346366   .0723753    -0.48   0.632    -.1764895    .1072163
      tenure |  -.3836063   .0550447    -6.97   0.000    -.4914919   -.2757206
      gender |   .0667994   .0734595     0.91   0.363    -.0771786    .2107775
        race |   .0834963   .0557719     1.50   0.134    -.0258168    .1928094
      income |  -.2111629   .0730126    -2.89   0.004     -.354265   -.0680607
     genderm |   .1306497   .0557133     2.35   0.019     .0214535    .2398458
      burger |  -.0616544   .0729739    -0.84   0.398    -.2046806    .0813718
     chicken |   .0635842   .0557645     1.14   0.254    -.0457122    .1728806
       _cons |  -1.123845   .0330159   -34.04   0.000    -1.188555   -1.059136
-------------+----------------------------------------------------------------
    /lnsig2u |  -1.030313   .1292422                     -1.283623   -.7770027
-------------+----------------------------------------------------------------
     sigma_u |   .5974072   .0386051                      .5263382    .6780723
         rho |   .2630235   .0250526                      .2169342    .3149662
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) =   166.88 Prob >= chibar2 = 0.000
Again, we would like to check the stability of the quadrature technique of the model before interpreting the results. Given the estimate of ρ and the small size of the panels (between 3 and 8), we should find that the quadrature technique is numerically stable.
. quadchk, nooutput
Refitting model quad() = 8
Refitting model quad() = 16

                         Quadrature check

               Fitted        Comparison      Comparison
               quadrature    quadrature      quadrature
               12 points     8 points        16 points
----------------------------------------------------------------------
Log            -2574.115     -2574.1293      -2574.1164
likelihood                   -.01424246      -.00132578   Difference
                              5.533e-06       5.150e-07   Relative difference
----------------------------------------------------------------------
complain:      -.00031569    -.00013111      -.00031858
age                           .00018459      -2.891e-06   Difference
                             -.58470325       .00915831   Relative difference
----------------------------------------------------------------------
complain:      -.04111263    -.04100666      -.04111079
grade                         .00010597       1.839e-06   Difference
                             -.0025776      -.00004474    Relative difference
----------------------------------------------------------------------
complain:      -.03463663    -.03469524      -.03462929
south                        -.00005861       7.341e-06   Difference
                              .0016922      -.00021193    Relative difference
----------------------------------------------------------------------
complain:      -.38360629    -.38351811      -.38360047
tenure                        .00008818       5.820e-06   Difference
                             -.00022987      -.00001517   Relative difference
----------------------------------------------------------------------
complain:       .06679944     .06655282       .0668029
gender                       -.00024662       3.455e-06   Difference
                             -.003692         .00005172   Relative difference
----------------------------------------------------------------------
complain:       .0834963      .08340258       .08349099
race                         -.00009373      -5.311e-06   Difference
                             -.00112252      -.00006361   Relative difference
----------------------------------------------------------------------
complain:      -.21116286    -.21100203      -.21115631
income                        .00016083       6.556e-06   Difference
                             -.00076164      -.00003105   Relative difference
----------------------------------------------------------------------
complain:       .13064966     .1305605        .13064386
genderm                      -.00008916      -5.804e-06   Difference
                             -.00068243      -.00004442   Relative difference
----------------------------------------------------------------------
complain:      -.06165444    -.06168062      -.06164754
burger                       -.00002618       6.903e-06   Difference
                              .0004246      -.00011195    Relative difference
----------------------------------------------------------------------
complain:       .0635842      .06359848       .06358665
chicken                       .00001428       2.452e-06   Difference
                              .00022456       .00003856   Relative difference
----------------------------------------------------------------------
complain:      -1.1238455    -1.1237932      -1.1238278
_cons                         .00005231       .00001769   Difference
                             -.00004654      -.00001574   Relative difference
----------------------------------------------------------------------
lnsig2u:       -1.0303127    -1.0317132      -1.0304345
_cons                        -.00140056      -.0001218    Difference
                              .00135936       .00011821   Relative difference
----------------------------------------------------------------------
The relative differences are all very small between the default 12 quadrature points and the results with 16 points. We have only one coefficient that has a large relative difference between the default 12 quadrature points and 8 quadrature points. Looking again at the absolute differences, we see that the absolute differences between 12 and 16 quadrature points were also small. We conclude that the quadrature technique is stable. We may wish to rerun the above model with quad(16) or even higher (but we do not have to, since the results will not significantly differ) and interpret those results for our presentation.
Saved Results

xtprobit, re saves in e():

Scalars
    e(N)         number of observations
    e(N_g)       number of groups
    e(df_m)      model degrees of freedom
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(ll_c)      log likelihood, comparison model
    e(g_max)     largest group size
    e(g_min)     smallest group size
    e(g_avg)     average group size
    e(chi2)      χ²
    e(chi2_c)    χ² for comparison test
    e(rho)       ρ
    e(sigma_u)   panel-level standard deviation
    e(N_cd)      number of completely determined obs.
    e(n_quad)    number of quadrature points

Macros
    e(cmd)       xtprobit
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(ivar)      variable denoting groups
    e(wtype)     weight type
    e(wexp)      weight expression
    e(offset)    offset
    e(chi2type)  Wald or LR; type of model χ² test
    e(chi2_ct)   Wald or LR; type of model χ² test corresponding to e(chi2_c)
    e(distrib)   Gaussian; the distribution of the random effect
    e(predict)   program used to implement predict

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators

Functions
    e(sample)    marks estimation sample
xtprobit, pa saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       chi-squared
    e(df_pear)    degrees of freedom for Pearson chi-squared
    e(deviance)   deviance
    e(chi2_dev)   chi-squared test of deviance
    e(dispers)    deviance dispersion
    e(chi2_dis)   chi-squared test of deviance dispersion
    e(tol)        target tolerance
    e(dif)        achieved tolerance
    e(phi)        scale parameter

Macros
    e(cmd)        xtgee
    e(cmd2)       xtprobit
    e(depvar)     name of dependent variable
    e(family)     binomial
    e(link)       probit; link function
    e(corr)       correlation structure
    e(scale)      x2, dev, phi, or #; scale parameter
    e(ivar)       variable denoting groups
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald; type of model chi-squared test
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(R)          estimated working correlation matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas

xtprobit is implemented as an ado-file.

xtprobit with the pa option reports the population-averaged results obtained by using xtgee, family(binomial) link(probit) to obtain estimates.

For the random-effects model, assuming a normal distribution, N(0, σ_ν²), for the random effects ν_i, we have

    Pr(y_i1, ..., y_in_i | x_i1, ..., x_in_i)
        = ∫ [ exp(−ν_i²/(2σ_ν²)) / (√(2π) σ_ν) ] ∏_{t=1}^{n_i} F(y_it, x_it β + ν_i) dν_i

where

    F(y_it, x_it β + ν_i) = Φ(x_it β + ν_i)       if y_it ≠ 0
                          = 1 − Φ(x_it β + ν_i)   otherwise

(where Φ is the cumulative normal distribution), and we can approximate the integral with M-point Gauss-Hermite quadrature

    ∫ exp(−x²) f(x) dx  ≈  Σ_{m=1}^{M} w_m* f(a_m*)

where the w_m* denote the quadrature weights and the a_m* denote the quadrature abscissas. The log likelihood L, where ρ = σ_ν²/(σ_ν² + 1), is then approximated by

    L ≈ Σ_{i=1}^{n} W_i log { (1/√π) Σ_{m=1}^{M} w_m* ∏_{t=1}^{n_i} F(y_it, x_it β + √2 σ_ν a_m*) }
where W_i is the user-specified weight for panel i; if no weights are specified, W_i = 1.

The quadrature formula requires that the integrated function be well approximated by a polynomial. As the number of time periods becomes large (as panel size gets large), the product of probabilities in the panel-level likelihood is no longer well approximated by a polynomial. As a general rule of thumb, you should use this quadrature approach only for small to moderate panel sizes (based on simulations, 50 is a reasonably safe upper bound). However, if the data really come from random-effects probit and rho is not too large (less than, say, .3), then the panel size could be 500 and the quadrature approximation would still be fine. If the data are not random-effects probit or rho is large (bigger than, say, .7), then the quadrature approximation may be poor for panel sizes larger than 10. The quadchk command should be used to investigate the applicability of the numeric technique used in this command.
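The M-point Gauss-Hermite rule above can be illustrated outside Stata. The following Python sketch (an illustration only, not part of the command; the function name is mine) hardcodes the standard published 5-point abscissas and weights and checks the rule against an integral with a known closed form:

```python
import math

# 5-point Gauss-Hermite rule for integrals of the form
#   int exp(-x^2) f(x) dx  ~=  sum_m w_m f(a_m)
# (standard published abscissas/weights; more points give higher accuracy)
ABSCISSAS = [-2.020182870456086, -0.958572464613819, 0.0,
             0.958572464613819, 2.020182870456086]
WEIGHTS   = [0.019953242059046, 0.393619323152241, 0.945308720482942,
             0.393619323152241, 0.019953242059046]

def gauss_hermite(f):
    r"""Approximate int_{-inf}^{inf} exp(-x^2) f(x) dx."""
    return sum(w * f(a) for w, a in zip(WEIGHTS, ABSCISSAS))

# Sanity check against a closed form:
#   int exp(-x^2) cos(x) dx = sqrt(pi) * exp(-1/4)
approx = gauss_hermite(math.cos)
exact = math.sqrt(math.pi) * math.exp(-0.25)
print(approx, exact)  # even 5 points agree to about 3 decimal places
```

In xtprobit the function f is the panel-level product of normal probabilities, which is exactly the piece that stops looking polynomial as panel size grows, hence the quadchk advice above.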
References

Conway, M. R. 1990. A random effects model for binary data. Biometrics 46: 317-328.

Guilkey, D. K. and J. L. Murphy. 1993. Estimation and testing in the random effects probit model. Journal of Econometrics 59: 301-317.

Liang, K.-Y. and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.

Neuhaus, J. M. 1992. Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1: 249-273.

Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1991. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59: 25-35.

Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. International Statistical Review 64: 89-118.
Also See

Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] quadchk,
                [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes,
                [R] xtsum, [R] xttab

Related:        [R] probit, [R] xtclog, [R] xtgee, [R] xtlogit

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates
Title

xtrchh — Hildreth-Houck random coefficients model

Syntax

xtrchh depvar varlist [if exp] [in range] [, i(varname) t(varname) level(#)
    offset(varname) noconstant maximize_options ]

by ... : may be used with xtrchh; see [R] by.

xtrchh shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description xtrchh estimates the Hildreth-Houck random coefficients linear regression model.
Options

i(varname) specifies the variable that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

t(varname) specifies the variable that contains the time at which the observation was made. You can specify the t() option the first time you estimate, or you can use the tis command to set t() beforehand. After that, Stata will remember the variable's identity.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with its coefficient constrained to be 1.

noconstant suppresses the constant term (intercept) in the regression.

maximize_options control the maximization process; see [R] maximize. Use the trace option to view parameter convergence. Use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches.
Options for predict

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname) for xtrchh. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_it b rather than x_it b + offset_it.
Remarks

In random coefficients models, we wish to treat the parameter vector as a realization (in each panel) of a stochastic process. Interested readers should see Greene (2000) for information on this and other panel-data models.
> Example

Greene (2000, 592 and CD) reprints data from a classic study of investment demand by Grunfeld and Griliches (1960). In [R] xtgls, we use this dataset to illustrate many of the possible models that may be estimated with the xtgls command. While the models included in the xtgls command offer considerable flexibility, they all assume that there is no parameter variation across firms (the cross-sectional units).
To take a first look at the assumption of parameter constancy, we should reshape our data so that we may estimate a simultaneous-equation model using sureg; see [R] sureg. Since there are only 5 panels here, it is not too difficult.

. reshape wide invest market stock, i(time) j(company)
(note: j = 1 2 3 4 5)

Data                                long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       100   ->   20
Number of variables                    5   ->   16
j variable (5 values)            company   ->   (dropped)
xij variables:
                                  invest   ->   invest1 invest2 ... invest5
                                  market   ->   market1 market2 ... market5
                                   stock   ->   stock1 stock2 ... stock5
-----------------------------------------------------------------------------
. sureg (invest1 market1 stock1) (invest2 market2 stock2) (invest3 market3 stock3)
>       (invest4 market4 stock4) (invest5 market5 stock5)

Seemingly unrelated regression
----------------------------------------------------------------------
Equation             Obs  Parms        RMSE    "R-sq"       chi2      P
----------------------------------------------------------------------
invest1               20      2    84.94729    0.9207   261.3219  0.0000
invest2               20      2    12.36322    0.9119   207.2128  0.0000
invest3               20      2    26.46612    0.6876   46.88498  0.0000
invest4               20      2    9.742303    0.7264   59.14585  0.0000
invest5               20      2    95.85484    0.4220    14.9687  0.0006
----------------------------------------------------------------------
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
invest1      |
     market1 |    .120493   .0216291     5.57   0.000     .0781007    .1628853
      stock1 |   .3827462    .032768    11.68   0.000      .318522    .4469703
       _cons |  -162.3641   89.45922    -1.81   0.070    -337.7009    12.97279
-------------+----------------------------------------------------------------
invest2      |
     market2 |   .0695456   .0168975     4.12   0.000     .0364271    .1026641
      stock2 |   .3085445   .0258635    11.93   0.000     .2578529    .3592362
       _cons |   .5043112   11.51283     0.04   0.965    -22.06042    23.06904
-------------+----------------------------------------------------------------
invest3      |
     market3 |   .0372914   .0122631     3.04   0.002     .0132561    .0613268
      stock3 |    .130783   .0220497     5.93   0.000     .0875663    .1739997
       _cons |  -22.43892   25.51859    -0.88   0.379    -72.45443    27.57659
-------------+----------------------------------------------------------------
invest4      |
     market4 |   .0570091   .0113623     5.02   0.000     .0347395    .0792788
      stock4 |   .0415065   .0412016     1.01   0.314    -.0392472    .1222602
       _cons |   1.088878   6.258805     0.17   0.862    -11.17815    13.35591
-------------+----------------------------------------------------------------
invest5      |
     market5 |   .1014782   .0547837     1.85   0.064    -.0058958    .2088523
      stock5 |   .3999914   .1277946     3.13   0.002     .1495186    .6504642
       _cons |   85.42324   111.8774     0.76   0.445    -133.8525    304.6989
------------------------------------------------------------------------------
Here we instead estimate a random-coefficients model:

. use invest, clear
. xtrchh invest market stock, i(company) t(time)

Hildreth-Houck random-coefficients regression   Number of obs      =       100
Group variable (i) : company                    Number of groups   =         5
                                                Obs per group: min =        20
                                                               avg =      20.0
                                                               max =        20
                                                Wald chi2(2)       =     17.55
                                                Prob > chi2        =    0.0002

      invest |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      market |   .0807646   .0250829     3.22   0.001     .0316031    .1299261
       stock |   .2839885   .0677899     4.19   0.000     .1511229    .4168542
       _cons |  -23.58361   34.55547    -0.68   0.495    -91.31108    44.14386
-------------------------------------------------------------------------------
Test of parameter constancy:  chi2(12) =   603.99      Prob > chi2 = 0.0000
Just as a subjective examination of the results of our simultaneous-equation model does not support the assumption of parameter constancy, the test included with the random-coefficients model also indicates that the assumption is not valid for these data. With large panel datasets, obviously we would not want to take the time to look at a simultaneous-equations model (aside from the fact that our doing so was very subjective).
Saved Results

xtrchh saves in e():

Scalars
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(df_chi2)    degrees of freedom for model

Macros
    e(cmd)        xtrchh
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(ivar)       variable denoting groups
    e(chi2type)   Wald; type of model chi-squared test
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(Sigma)      Sigma matrix
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

xtrchh is implemented as an ado-file.

In a random-coefficients model, the parameter heterogeneity is treated as stochastic variation. Assume that we write

    y_i = X_i β_i + ε_i

where i = 1, ..., m, and β_i is the coefficient vector (k x 1) for the ith cross-sectional unit, such that

    β_i = β + ν_i,     E(ν_i) = 0,     E(ν_i ν_i') = Γ

Our goal is to find estimates of β and Γ.

The derivation of the estimator assumes that the cross-sectional specific coefficient vector β_i is the outcome of a random process with mean vector β and covariance matrix Γ,

    y_i = X_i β_i + ε_i = X_i (β + ν_i) + ε_i = X_i β + (X_i ν_i + ε_i) = X_i β + ω_i

where E(ω_i) = 0 and

    E(ω_i ω_i') = σ_i² I + X_i Γ X_i' = Π_i

The covariance matrix for the panel-specific coefficient estimator β_i can then be written as

    V_i + Γ = (X_i'X_i)^(-1) X_i' Π_i X_i (X_i'X_i)^(-1)

where

    V_i = σ_i² (X_i'X_i)^(-1)

We may then compute a weighted average of the panel-specific coefficient estimates as

    β̂ = Σ_{i=1}^{m} W_i β̂_i      where      W_i = { Σ_{i=1}^{m} (Γ + V_i)^(-1) }^(-1) (Γ + V_i)^(-1)
such that the resulting GLS estimator is a matrix-weighted average of the panel-specific (OLS) estimators.

To calculate the above estimator β̂ for the unknown Γ and V_i parameters, we use the two-step approach suggested by Swamy (1971):

    β̂_i = OLS panel-specific estimator
    σ̂_i² = ε̂_i'ε̂_i / (T_i − k),  where ε̂_i = y_i − X_i β̂_i
    V̂_i = σ̂_i² (X_i'X_i)^(-1)
    β̄ = (1/m) Σ_{i=1}^{m} β̂_i
    Γ̂ = [1/(m−1)] Σ_{i=1}^{m} (β̂_i − β̄)(β̂_i − β̄)' − (1/m) Σ_{i=1}^{m} V̂_i

The two-step procedure begins with the usual OLS estimate of β. With an estimate of β, we may proceed by obtaining estimates of V_i and Γ (and thus W_i), and then obtain an updated estimate of β.

Swamy (1971) further points out that the matrix Γ̂ may not be positive definite and that, since the second term is of order 1/(mT), it is negligible in large samples. A simple and asymptotically expedient solution is to simply drop this second term and instead use

    Γ̂ = [1/(m−1)] Σ_{i=1}^{m} (β̂_i − β̄)(β̂_i − β̄)'
As a test of the model, we may look at the difference between the OLS estimate of β, ignoring the panel structure of the data, and the matrix-weighted average of the panel-specific OLS estimators. The test statistic suggested by Swamy (1971) is given by

    χ²(k(m−1)) = Σ_{i=1}^{m} (β̂_i − β̄*)' V̂_i^(-1) (β̂_i − β̄*)

where

    β̄* = ( Σ_{i=1}^{m} V̂_i^(-1) )^(-1) Σ_{i=1}^{m} V̂_i^(-1) β̂_i
Johnston (1984) has shown that the test is algebraically equivalent to testing

    H0: β_1 = β_2 = ... = β_m

in the generalized (groupwise heteroskedastic) xtgls model, where V is block diagonal with ith diagonal element Π_i.
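Swamy's two-step approach can be illustrated in miniature. The following Python sketch (an illustration with simulated data, not Stata code; all names and the simulation setup are mine) works the scalar case (k = 1, no constant), where every matrix above collapses to a number:

```python
import random

random.seed(1)

# Simulate m panels of T observations each from y = x * beta_i + e,
# with beta_i drawn around a common mean (the random-coefficients setup).
m, T, beta_mean = 5, 20, 2.0
panels = []
for i in range(m):
    beta_i = beta_mean + random.gauss(0, 0.3)
    xs = [random.uniform(1.0, 3.0) for _ in range(T)]
    ys = [beta_i * x + random.gauss(0, 0.5) for x in xs]
    panels.append((xs, ys))

# Step 1: panel-specific OLS and its variance V_i = sigma_i^2 (X'X)^{-1}.
b, V = [], []
for xs, ys in panels:
    sxx = sum(x * x for x in xs)
    bi = sum(x * y for x, y in zip(xs, ys)) / sxx
    s2 = sum((y - bi * x) ** 2 for x, y in zip(xs, ys)) / (T - 1)
    b.append(bi)
    V.append(s2 / sxx)

# Step 2: estimate Gamma (using Swamy's large-sample simplification that
# drops the (1/m) sum V_i term), then combine the panel estimates with
# weights W_i proportional to (Gamma + V_i)^{-1}.
bbar = sum(b) / m
Gamma = sum((bi - bbar) ** 2 for bi in b) / (m - 1)
prec = [1.0 / (Gamma + Vi) for Vi in V]
W = [p / sum(prec) for p in prec]
beta_gls = sum(wi * bi for wi, bi in zip(W, b))
print(round(beta_gls, 3))  # a (scalar-)weighted average of the b_i
```

Because the weights are positive and sum to one, the combined estimate always lies between the smallest and largest panel-specific estimates, which is the scalar version of "matrix-weighted average".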
References

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Grunfeld, Y. and Z. Griliches. 1960. Is aggregation necessarily bad? Review of Economics and Statistics 42: 1-13.

Hardin, J. W. 1996. sg62: Hildreth-Houck random coefficients model. Stata Technical Bulletin 33: 21-23. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 158-162.

Hildreth, C. and C. Houck. 1968. Some estimators for a linear model with random coefficients. Journal of the American Statistical Association 63: 584-595.

Johnston, J. 1984. Econometric Methods. New York: McGraw-Hill.

Swamy, P. 1970. Efficient inference in a random coefficient regression model. Econometrica 38: 311-323.

Swamy, P. 1971. Statistical Inference in Random Coefficient Regression Models. New York: Springer-Verlag.
Also See

Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test,
                [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum,
                [R] xttab

Related:        [R] xtgee, [R] xtgls, [R] xtpcse, [R] xtreg, [R] xtregar

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [R] xt
Title

xtreg — Fixed-, between-, and random-effects, and population-averaged linear models

Syntax

GLS random-effects model
    xtreg depvar [varlist] [if exp] [, re i(varname) sa theta level(#) ]
    xttest0
    xthausman

Between-effects model
    xtreg depvar [varlist] [if exp] , be [ i(varname) wls level(#) ]

Fixed-effects model
    xtreg depvar [varlist] [if exp] , fe [ i(varname) level(#) ]

ML random-effects model
    xtreg depvar [varlist] [weight] [if exp] , mle [ i(varname) noconstant level(#) ]

Population-averaged model
    xtreg depvar [varlist] [weight] [if exp] , pa [ i(varname) noconstant level(#)
        offset(varname) xtgee_options ]

by ... : may be used with xtreg; see [R] by.

iweights, fweights, and pweights are allowed for the population-averaged model and iweights are allowed for the maximum likelihood (ML) random-effects model; see [U] 14.1.6 weight. Note that weights must be constant within panels.

xtreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

For all but the population-averaged model

    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

      xb      xb, fitted values (the default)
      stdp    standard error of the fitted values
      ue      u_i + e_it, the combined residual
    * xbu     xb + u_i, prediction including effect
    * u       u_i, the fixed or random error component
    * e       e_it, the overall error component

Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.

Population-averaged model

    predict [type] newvarname [if exp] [in range] [, { mu rate | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

xtreg estimates cross-sectional time-series regression models. In particular, xtreg with the be option estimates random-effects models using the between regression estimator; with the fe option, fixed-effects models (using the within regression estimator); and with the re option, random-effects models using the GLS estimator (producing a matrix-weighted average of the between and within results). (Also see [R] xtdata for a faster way to estimate fixed- and random-effects models.)

xttest0, for use after xtreg, re, presents the Breusch and Pagan (1980) Lagrange multiplier test for random effects, a test that Var(ν_i) = 0.

xthausman, for use after xtreg, re, presents Hausman's (1978) specification test. If one believes the model is correctly specified and the test returns a significant result, then this can be interpreted as evidence that the random effects, ν_i, and the regressors, x_it, are correlated. Thus, under the assumption of correct specification, xthausman tests the appropriateness of the random-effects estimator xtreg, re applied to these data.
Options

re requests the GLS random-effects estimator. re is the default.

be requests the between regression estimator.

fe requests the fixed-effects (within) regression estimator.
More is required to justify the between estimator of equation (2), but the conditioning on the sample is not assumed since ν_i + ε̄_i is treated as a residual. Newly required is that we assume ν_i and x̄_i are uncorrelated. This follows from the assumptions of the OLS estimator but is also transparent: Were ν_i and x̄_i correlated, the estimator could not determine how much of the change in ȳ_i associated with an increase in x̄_i to assign to β versus how much to attribute to the unknown correlation. (This, of course, suggests the use of an instrumental-variable estimator, z̄_i, that is correlated with x̄_i but uncorrelated with ν_i, but that approach is not implemented here.)

The random-effects estimator of equation (4) requires the same no-correlation assumption. In comparison with the between estimator, the random-effects estimator produces more efficient results, albeit ones with unknown small-sample properties. The between estimator is less efficient because it discards the over-time information in the data in favor of simple means; the random-effects estimator uses both the within and the between information.

All of this would seem to leave the between estimator of equation (2) with no role (except for a minor, technical part it plays in helping to estimate σ_ν² and σ_ε², which are used in the calculation of θ, on which the random-effects estimates depend). Let us, however, consider a variation on equation (1):

    y_it = α + x̄_i β_1 + (x_it − x̄_i) β_2 + ν_i + ε_it                     (1')

In this model, we postulate that changes in the average value of x for an individual have a different effect from temporary departures from the average. In an economic situation, y might be purchases of some item and x income; a change in average income should have more effect than a transitory change. In a clinical situation, y might be a physical response and x the level of a chemical in the brain; the model allows a different response to permanent rather than transitory changes.

The variations of equations (2) and (3) corresponding to equation (1') are

    ȳ_i = α + x̄_i β_1 + ν_i + ε̄_i                                          (2')

    (y_it − ȳ_i) = (x_it − x̄_i) β_2 + (ε_it − ε̄_i)                         (3')

That is, the between estimator estimates β_1 and the within estimator β_2, and neither estimates the other. Thus, even when estimating equations like (1), it is worth comparing the within and between estimators. Differences in results can suggest models like (1') or, at the least, some other specification error.

Finally, it is worth understanding the role of the between and within estimators with regressors that are constant over time or constant over units. Consider the model

    y_it = α + x_it β_1 + s_i β_2 + z_t β_3 + ν_i + ε_it                    (1'')

This model is the same as (1) except that we explicitly identify the variables that vary over both time and i (x_it, such as output or FEV); variables that are constant over time (s_i, such as race or sex); and variables that vary solely over time (z_t, such as the consumer price index or age in a cohort study). The corresponding between and within equations are

    ȳ_i = α + x̄_i β_1 + s_i β_2 + z̄ β_3 + ν_i + ε̄_i                       (2'')

    (y_it − ȳ_i) = (x_it − x̄_i) β_1 + (z_t − z̄) β_3 + (ε_it − ε̄_i)        (3'')

In the between estimator of equation (2''), no estimate of β_3 is possible because z̄ is a constant across the i observations; the regression-estimated intercept will be an estimate of α + z̄ β_3. On the other hand, it is able to provide estimates of β_1 and β_2. It is able to estimate effects of factors that are constant over time, such as race and sex, but to do so, it must assume that ν_i is uncorrelated with those factors.
The within estimator of equation (3''), like the between estimator, provides an estimate of β_1, but provides no estimate of β_2 for the time-invariant factors. Instead, it provides an estimate of β_3, the effects of the time-varying factors. The within estimator can also provide estimates u_i for ν_i. More correctly, u_i is an estimator of ν_i + s_i β_2. Thus, u_i is an estimator of ν_i only if there are no time-invariant variables in the model. If there are time-invariant variables, u_i is an estimate of ν_i plus the effects of the time-invariant variables.

Assessing goodness of fit

R² is a popular measure of goodness of fit in ordinary regression. In our case, given α̂ and β̂ estimates of α and β, we can assess the goodness of fit with respect to equation (1), (2), or (3). The prediction equations are, respectively, the overall prediction ŷ_it = α̂ + x_it β̂ (1'''); the between prediction ŷ_i = α̂ + x̄_i β̂ (2'''); and the within prediction ỹ_it = (x_it − x̄_i) β̂ (3''').

xtreg reports "R-squareds" corresponding to these three equations. "R-squareds" is in quotes because the R-squareds reported do not have all the properties of the OLS R². The ordinary properties of R² include being equal to the squared correlation between y and ŷ and being equal to the fraction of the variation in y explained by ŷ, formally defined as Var(ŷ)/Var(y). The identity of the definitions is due to a special property of the OLS estimates; in general, given a prediction ŷ for y, the squared correlation is not equal to the ratio of the variances, and the ratio of the variances is not required to be less than 1.

xtreg reports R² values calculated as correlations squared, calling them R² overall, corresponding to equation (1'''); R² between, corresponding to equation (2'''); and R² within, corresponding to equation (3'''). In fact, you can think of each of these three numbers as having all the properties of ordinary R²s if you bear in mind that the prediction being judged is not the prediction itself, but the fitted value from a second-round regression of each actual value on the corresponding prediction.

In particular, xtreg, be obtains its estimates by performing OLS on equation (2), and therefore its reported R² between is an ordinary R². The other two reported R²s are merely correlations squared or, if you prefer, R²s from the second-round regressions of the within-transformed and overall actual values on their predictions.

xtreg, fe obtains its estimates by performing OLS on equation (3), so its reported R² within is an ordinary R². As with be, the other R²s are correlations squared or, if you prefer, R²s from the corresponding second-round regressions.

xtreg, re obtains its estimates by performing OLS on equation (4); none of the R²s corresponding to equations (1'''), (2'''), or (3''') correspond directly to this estimator (the "relevant" R² is the one corresponding to equation (4)). All three reported R²s are correlations squared or, if you prefer, R²s from second-round regressions.
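The "correlation squared" computation behind these reported R²s can be sketched in Python (an illustration only; the function name and toy data are mine):

```python
def r_squared(y, yhat):
    """Squared correlation between y and a prediction yhat -- the
    quantity xtreg reports as its "R-squareds"."""
    n = len(y)
    my, mp = sum(y) / n, sum(yhat) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    vy = sum((a - my) ** 2 for a in y)
    vp = sum((b - mp) ** 2 for b in yhat)
    return cov * cov / (vy * vp)

y    = [1.0, 2.0, 3.0, 5.0]
yhat = [1.2, 1.9, 3.4, 4.6]   # some prediction of y
print(r_squared(y, yhat))

# Note the squared correlation is unchanged by any linear rescaling of
# yhat, whereas Var(yhat)/Var(y) is not -- this is precisely why the two
# definitions agree only for OLS fitted values.
```

For an arbitrary (non-OLS) prediction, the squared correlation and the variance ratio can differ, which is the manual's warning in concrete form.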
xtreg and associated commands

> Example

Using the nlswork dataset described in [R] xt, we will model ln_wage in terms of completed years of schooling (grade), current age and age squared, current years worked (experience) and experience squared, current years of tenure on the current job and tenure squared, whether black, whether resides in an area not designated an SMSA (standard metropolitan statistical area), and whether resides in the South. Most of these variables are in the data, but we need to construct a few:

. gen age2 = age^2
(24 missing values generated)
. gen ttl_exp2 = ttl_exp^2
. gen tenure2 = tenure^2
(433 missing values generated)
. gen byte black = race==2
To obtain the between-effects estimates, we use xtreg, be:

. xtreg ln_w grade age* ttl_exp* tenure* black not_smsa south, be

Between regression (regression on group means)  Number of obs      =     28091
Group variable (i) : idcode                     Number of groups   =      4697
R-sq:  within  = 0.1591                         Obs per group: min =         1
       between = 0.4900                                        avg =       6.0
       overall = 0.3695                                        max =        15
                                                F(10,4686)         =    450.23
sd(u_i + avg(e_i.)) = .3036114                  Prob > F           =    0.0000

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .0607602   .0020006    30.37   0.000     .0568382    .0646822
         age |   .0323158   .0087251     3.70   0.000     .0152105    .0494211
        age2 |  -.0005997   .0001429    -4.20   0.000    -.0008799   -.0003194
     ttl_exp |   .0138853   .0056749     2.45   0.014     .0027598    .0250108
    ttl_exp2 |   .0007342   .0003267     2.25   0.025     .0000936    .0013747
      tenure |   .0698419   .0060729    11.50   0.000     .0579361    .0817476
     tenure2 |  -.0028756   .0004098    -7.02   0.000    -.0036789   -.0020722
       black |  -.0564167   .0105131    -5.37   0.000    -.0770272   -.0358061
    not_smsa |  -.1860406   .0112495   -16.54   0.000    -.2080949   -.1639862
       south |  -.0993378    .010136    -9.80   0.000    -.1192091   -.0794665
       _cons |   .3339113   .1210434     2.76   0.006     .0966093    .5712133
The between-effects regression is estimated on person averages, so it is the "n = 4697" that is relevant. xtreg, be reports the "number of observations" and group-size information: describe in [R] xt showed that we have 28,534 "observations" (person-years, really) of data. Taking the subsample that has no missing values in ln_wage, grade, ..., south leaves us with 28,091 observations on person-years, reflecting 4,697 persons, each observed for an average of 5.98 years.

In terms of goodness of fit, it is the R² between that is directly relevant; our R² is .4900. If, however, we use these estimates to predict the within model, we have an R² of .1591. If we use these estimates to fit the overall data, our R² is .3695.

The F statistic is a test that the coefficients on the regressors grade, age, ..., south are all jointly zero. Our model is significant.
The root mean squared error of the estimated regression, which is an estimate of the standard deviation of ν_i + ε̄_i, is .3036.

In terms of our coefficients, we find that each year of schooling increases hourly wages by 6.1%; that age increases wages up to age 26.9 and thereafter decreases them (because the quadratic ax² + bx + c turns over at x = −b/2a, which for our age and age2 coefficients is .0323158/(2 × .0005997) ≈ 26.9); that total experience increases wages at an increasing rate (which is surprising and bothersome); that tenure on the current job increases wages up to a tenure of 12.1 years and thereafter decreases them; that wages of blacks are, other things held constant, (approximately) 5.6% below those of nonblacks (approximately because black is an indicator variable); that residing in a non-SMSA (rural area) reduces wages by 18.6%; and that residing in the South reduces wages by 9.9%.
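The turning-point arithmetic is easy to verify directly; a small Python computation (illustration only, using the coefficients from the between-effects output above):

```python
# Turning point of a quadratic a*x^2 + b*x + c is at x = -b/(2a).
# Age: b = .0323158 (age), a = -.0005997 (age2).
b_age, a_age2 = 0.0323158, -0.0005997
turn_age = -b_age / (2 * a_age2)
print(round(turn_age, 1))  # 26.9 years

# Same computation for tenure: b = .0698419, a = -.0028756.
turn_tenure = -0.0698419 / (2 * -0.0028756)
print(round(turn_tenure, 1))  # 12.1 years
```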
> Example

To estimate the same model with the fixed-effects estimator, we specify the fe option.

. xtreg ln_w grade age* ttl_exp* tenure* black not_smsa south, fe

Fixed-effects (within) regression               Number of obs      =     28091
Group variable (i) : idcode                     Number of groups   =      4697
R-sq:  within  = 0.1727                         Obs per group: min =         1
       between = 0.3505                                        avg =       6.0
       overall = 0.2625                                        max =        15
                                                F(8,23386)         =    610.12
corr(u_i, Xb) = 0.1936                          Prob > F           =    0.0000

     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |  (dropped)
         age |   .0359987   .0033864    10.63   0.000     .0293611    .0426362
        age2 |   -.000723   .0000533   -13.58   0.000    -.0008274   -.0006186
     ttl_exp |   .0334668   .0029653    11.29   0.000     .0276545     .039279
    ttl_exp2 |   .0002163   .0001277     1.69   0.090    -.0000341    .0004666
      tenure |   .0357539   .0018487    19.34   0.000     .0321303    .0393775
     tenure2 |  -.0019701    .000125   -15.76   0.000    -.0022151   -.0017251
       black |  (dropped)
    not_smsa |  -.0890108   .0095316    -9.34   0.000    -.1076933   -.0703282
       south |  -.0606309   .0109319    -5.55   0.000    -.0820582   -.0392036
       _cons |    1.03732   .0485546    21.36   0.000     .9421497     1.13249
-------------+----------------------------------------------------------------
     sigma_u |  .35562203
     sigma_e |  .29068923
         rho |  .59946283   (fraction of variance due to u_i)
-------------------------------------------------------------------------------
F test that all u_i=0:     F(4696,23386) =     5.13      Prob > F = 0.0000
The observation summary at the top is the same as for the between-effects model, although this time it is the "Number of obs" that is relevant.

Our three R²s are not too different from those reported previously; the R² within is slightly higher (.1727 vs. .1591) and the R² between a little lower (.3505 vs. .4900), which is as expected, since the between estimator maximizes R² between and the within estimator R² within. In terms of overall fit, these estimates are somewhat worse (.2625 vs. .3695).

xtreg, fe is able to provide estimates of σ_ν and σ_ε, although how you interpret these estimates depends on whether you are using xtreg to estimate a fixed-effects model or a random-effects model. To clarify this fine point, in the fixed-effects model, ν_i are formally fixed; they have no distribution.
If you subscribe to this view, think of the reported σ_ν as merely an arithmetic way to describe the range of the estimated but fixed ν_i. If, however, you are employing the fixed-effects estimator of the random-effects model, as is likely, then .355622 is an estimate of σ_ν, or it would be if there were no dropped variables in the estimation.

In our case, note that both grade and black were dropped from the model. They were dropped because they do not vary over time. Since grade and race are time-invariant, our estimate u_i is an estimate of ν_i plus the effects of grade and race, and so our estimate of the standard deviation is based on the variation in ν_i, grade, and race. On the other hand, had race and grade been dropped merely because they were collinear with the other regressors in our model, u_i would be an estimate of ν_i and .3556 would be an estimate of σ_ν. (xtsum and xttab allow determining whether a variable is time-invariant; see [R] xtsum and [R] xttab.)

Regardless of the status of our estimator u_i, our estimate of the standard deviation of ε_it is valid (and, in fact, is the estimate that would be used by the random-effects estimator in producing its results).

Our estimate of the correlation of u_i with x_it suffers from the problem of what u_i measures. We find correlation, but cannot say whether this is correlation of ν_i with x_it or merely correlation of grade and race. In any case, the fixed-effects estimator is robust to such correlation and the other estimates it produces are unbiased.
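The reported rho is simply the fraction of variance implied by sigma_u and sigma_e; checking it in Python (an illustration only, using the values from the fixed-effects output above):

```python
# rho reported by xtreg is the fraction of variance due to u_i:
#   rho = sigma_u^2 / (sigma_u^2 + sigma_e^2)
sigma_u, sigma_e = 0.35562203, 0.29068923   # from the fe output above
rho = sigma_u**2 / (sigma_u**2 + sigma_e**2)
print(round(rho, 4))  # 0.5995, matching the reported .59946283
```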
So, while this estimator produces no estimates of the effects of grade and race, it does predict that age has a positive effect on wages up to age 24.9 years (as compared with 26.9 years estimated by the between estimator); that total experience still increases wages at an increasing rate (which is still bothersome); that tenure increases wages up to 9.1 years (as compared with 12.1); that living in a nonSMSA reduces wages by 8.9% (as compared with a more drastic 18.6%); and that living in the South reduces wages by 6.1% (as compared with 9.9%).
Example

Reestimating our log-wage model with the random-effects estimator, we obtain
xtreg — Fixed-, between-, and random-effects, and population-averaged linear models
. xtreg ln_w grade age* ttl_exp* tenure* black not_smsa south, re

Random-effects GLS regression                   Number of obs      =     28091
Group variable (i) : idcode                     Number of groups   =      4697

R-sq:  within  = 0.1715                         Obs per group: min =         1
       between = 0.4784                                        avg =       6.0
       overall = 0.3708                                        max =        15

Random effects u_i ~ Gaussian                   Wald chi2(10)      =   9244.74
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

     ln_wage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       grade |   .0646499   .0017812    36.30   0.000      .0611589     .0681409
         age |   .0368059   .0031195    11.80   0.000      .0306918     .0429201
        age2 |  -.0007133     .00005   -14.27   0.000     -.0008113    -.0006153
     ttl_exp |   .0290208    .002422    11.98   0.000      .0242739     .0337678
    ttl_exp2 |   .0003049   .0001162     2.62   0.009       .000077     .0005327
      tenure |   .0392519   .0017554    22.36   0.000      .0358113     .0426925
     tenure2 |  -.0020035   .0001193   -16.80   0.000     -.0022373    -.0017697
       black |   -.053053   .0099926    -5.31   0.000     -.0726381    -.0334679
    not_smsa |  -.1308252   .0071751   -18.23   0.000     -.1448881    -.1167622
       south |  -.0868922   .0073032   -11.90   0.000     -.1012062    -.0725781
       _cons |   .2387207    .049469     4.83   0.000      .1417633     .3356781
-------------+-----------------------------------------------------------------
     sigma_u |  .25790526
     sigma_e |  .29068923
         rho |  .44045273   (fraction of variance due to u_i)
According to the R²s, this estimator performs worse within than the within/fixed-effects estimator and worse between than the between estimator, as it must, and slightly better overall. We estimate that σ_ν is .2579 and σ_ε is .2907 and, by assertion, assume that the correlation of ν and x is zero.
All that is known about the random-effects estimator is its asymptotic properties, so rather than reporting an F statistic for overall significance, xtreg, re reports a χ². Taken jointly, our coefficients are significant. Also reported is a summary of the distribution of θ_i, an ingredient in the estimation of equation (4). θ is not a constant in this case because we observe women for unequal periods of time. In terms of interpretation, we estimate that schooling has a rate of return of 6.5% (compared with 6.1% between and no estimate within); that the increase of wages with age turns around at 25.8 years (compared with 26.9 between and 24.9 within); that total experience yet again increases wages increasingly; that the effect of job tenure turns around at 9.8 years (compared with 12.1 between and 9.1 within); that being black reduces wages by 5.3% (compared with 5.6% between and no estimate within); that living in a nonSMSA reduces wages 13.1% (compared with 18.6% between and 8.9% within); and that living in the South reduces wages 8.7% (compared with 9.9% between and 6.1% within).
Example

Alternatively, we could have estimated this random-effects model using the maximum likelihood estimator:
. xtreg ln_w grade age* ttl_exp* tenure* black not_smsa south, mle

Fitting constant-only model:
Iteration 0:   log likelihood = -13690.161
Iteration 1:   log likelihood = -12819.317
Iteration 2:   log likelihood = -12662.039
Iteration 3:   log likelihood = -12649.744
Iteration 4:   log likelihood = -12649.614
Fitting full model:
Iteration 0:   log likelihood =  -8922.145
Iteration 1:   log likelihood = -8853.6409
Iteration 2:   log likelihood = -8853.4255
Iteration 3:   log likelihood = -8853.4254

Random-effects ML regression                    Number of obs      =     28091
Group variable (i) : idcode                     Number of groups   =      4697

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       6.0
                                                               max =        15

                                                LR chi2(10)        =   7592.38
Log likelihood  = -8853.4254                    Prob > chi2        =    0.0000

     ln_wage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       grade |   .0646093   .0017372    37.19   0.000      .0612044     .0680142
         age |   .0368531   .0031226    11.80   0.000       .030733     .0429732
        age2 |  -.0007132   .0000501   -14.24   0.000     -.0008113     -.000615
     ttl_exp |   .0288196   .0024143    11.94   0.000      .0240877     .0335515
    ttl_exp2 |    .000309   .0001163     2.66   0.008      .0000811     .0005369
      tenure |   .0394371   .0017604    22.40   0.000      .0359868     .0428875
     tenure2 |  -.0020052   .0001195   -16.77   0.000     -.0022395    -.0017709
       black |  -.0533394   .0097338    -5.48   0.000     -.0724172    -.0342615
    not_smsa |  -.1323433   .0071322   -18.56   0.000     -.1463221    -.1183644
       south |  -.0875599   .0072143   -12.14   0.000     -.1016998    -.0734201
       _cons |   .2390837   .0491902     4.86   0.000      .1426727     .3354947
-------------+-----------------------------------------------------------------
    /sigma_u |   .2485556   .0035017    70.98   0.000      .2416925     .2554187
    /sigma_e |   .2918458    .001352   215.87   0.000       .289196     .2944956
-------------+-----------------------------------------------------------------
         rho |   .4204033   .0074828                       .4057959     .4351212

Likelihood ratio test of sigma_u=0: chibar2(01) = 7339.84  Prob >= chibar2 = 0.000
The estimates are very nearly the same as those produced by xtreg, re, the GLS estimator. For instance, xtreg, re estimated the coefficient on grade to be .0646499; xtreg, mle estimated .0646093; and the ratio is .0646499/.0646093 = 1.001 to three decimal places. Similarly, the standard errors are nearly equal: .0017812/.0017372 = 1.025. Below we compare all 11 coefficients:
                          Coefficient ratio              SE ratio
Estimator               mean    min.    max.      mean    min.    max.
xtreg, mle (ML)           1.      1.      1.        1.      1.      1.
xtreg, re (GLS)         .997    .987   1.007     1.006    .997   1.027
Example

We could also have estimated this model using the population-averaged estimator:

. xtreg ln_w grade age* ttl_exp* tenure* black not_smsa south, i(idcode) pa

Iteration 1: tolerance = .0310561
Iteration 2: tolerance = .00074898
Iteration 3: tolerance = .0000147
Iteration 4: tolerance = 2.880e-07

GEE population-averaged model                   Number of obs      =     28091
Group variable:             idcode              Number of groups   =      4697
Link:                     identity              Obs per group: min =         1
Family:                   Gaussian                             avg =       6.0
Correlation:          exchangeable                             max =        15
                                                Wald chi2(10)      =   9598.89
Scale parameter:          .1436709              Prob > chi2        =    0.0000

     ln_wage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       grade |   .0645427   .0016829    38.35   0.000      .0612442     .0678412
         age |    .036932   .0031509    11.72   0.000      .0307564     .0431076
        age2 |  -.0007129   .0000506   -14.10   0.000     -.0008121    -.0006138
     ttl_exp |   .0284878   .0024169    11.79   0.000      .0237508     .0332248
    ttl_exp2 |   .0003158   .0001172     2.69   0.007       .000086     .0005456
      tenure |   .0397468   .0017779    22.36   0.000      .0362621     .0432315
     tenure2 |   -.002008   .0001209   -16.61   0.000     -.0022449    -.0017711
       black |  -.0538314   .0094086    -5.72   0.000      -.072272    -.0353909
    not_smsa |  -.1347788   .0070543   -19.11   0.000     -.1486049    -.1209526
       south |  -.0885969   .0071132   -12.46   0.000     -.1025386    -.0746552
       _cons |   .2396286   .0491465     4.88   0.000      .1433034     .3359539
These results differ from those produced by xtreg, re and xtreg, mle. Coefficients are larger and standard errors smaller. xtreg, pa is simply another way to run the xtgee command. That is, we would have obtained the same output had we typed

. xtgee ln_w grade age* ttl_exp* tenure* black not_smsa south, i(idcode)
(output omitted because it is the same as above)
See [R] xtgee. In the language of xtgee, the random-effects model corresponds to an exchangeable correlation structure and identity link, and xtgee has the advantage that it will allow other correlation structures as well. Let us stay with the random-effects model, however. xtgee will also produce robust estimates of variance, and we re-estimated this model that way by typing

. xtgee ln_w grade age* ttl_exp* tenure* black not_smsa south, i(idcode) robust
(output omitted, coefficients the same, standard errors different)
In the previous example, we presented a table comparing xtreg, re with xtreg, mle. Below we add the results from the estimates shown above and the ones we did with xtgee, robust:
                          Coefficient ratio              SE ratio
Estimator               mean    min.    max.      mean    min.    max.
xtreg, mle (ML)           1.      1.      1.        1.      1.      1.
xtreg, re (GLS)         .997    .987   1.007     1.006    .997   1.027
xtreg, pa (PA)         1.060    .847   1.317      .853    .626    .986
xtgee, robust (PA)     1.060    .847   1.317     1.306    .957   1.545
So which are right? This is a real dataset and we do not know. However, in the example after the next, we will present evidence that the assumptions underlying the xtreg, re and xtreg, mle results are not met. Our suspicionion aside, the xtgee, robust results are likely more correct because those standard errors do not hinge on the assumptions.
Example

After xtreg, re estimation, xttest0 will report a test of ν_i = 0 in case we had any doubts:

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects:
        ln_wage[idcode,t] = Xb + u[idcode] + e[idcode,t]

        Estimated results:
                             Var      sd = sqrt(Var)
                ln_wage  .2283326        .4778416
                      e  .0845002        .2906892
                      u  .0665151        .2579053

        Test:   Var(u) = 0
                chi2(1)     =  14779.98
                Prob > chi2 =    0.0000
Example

More importantly, after xtreg, re estimation, xthausman will perform the Hausman specification test. If our model is correctly specified and if ν_i is uncorrelated with x_it, then the (subset of) coefficients that are estimated by the fixed-effects estimator and the same coefficients that are estimated here should not statistically differ:

. xthausman

Hausman specification test

                    ---- Coefficients ----
                       Fixed       Random
     ln_wage         Effects      Effects     Difference
--------------------------------------------------------
         age        .0359987     .0368059      -.0008073
        age2        -.000723    -.0007133      -9.68e-06
     ttl_exp        .0334668     .0290208       .0044459
    ttl_exp2        .0002163     .0003049      -.0000886
      tenure        .0357539     .0392519       -.003498
     tenure2       -.0019701    -.0020035       .0000334
    not_smsa       -.0890108    -.1308252       .0418144
       south       -.0606309    -.0868922       .0262613

    Test:  Ho:  difference in coefficients not systematic
               chi2(8) = (b-B)'[S^(-1)](b-B), S = (S_fe - S_re)
                       = 149.43
               Prob>chi2 = 0.0000
We can reject the hypothesis that the coefficients are the same. Before turning to what this means, note that xthausman listed the coefficients estimated by the two models. It did not, however, list grade and race. xthausman did not make a mistake; in the Hausman test, one compares only the coefficients estimated by both techniques.
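The statistic reported above, chi2(8) = (b-B)'[S^(-1)](b-B) with S = S_fe - S_re, can be computed directly from two coefficient vectors and their covariance matrices. Below is a hedged Python sketch, not Stata's implementation; the input numbers are invented for illustration and are not taken from the example:

```python
# Hedged sketch of the Hausman statistic: W = d' (S_fe - S_re)^{-1} d,
# d = b_fe - b_re.  All numbers below are made up for illustration.
import numpy as np

def hausman(b_fe, b_re, S_fe, S_re):
    d = b_fe - b_re
    return float(d @ np.linalg.solve(S_fe - S_re, d))  # ~ chi2(k) under H0

b_fe = np.array([0.0360, -0.00072])      # hypothetical fixed-effects estimates
b_re = np.array([0.0368, -0.00071])      # hypothetical random-effects estimates
S_fe = np.diag([4e-6, 1e-9])             # hypothetical covariance matrices
S_re = np.diag([3e-6, 0.8e-9])
W = hausman(b_fe, b_re, S_fe, S_re)
print(W > 0)                             # the quadratic form is positive here
```

A large W relative to a chi-squared critical value is what leads xthausman to reject equality of the coefficient sets.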
What does this mean? We have an unpleasant choice: we can admit that our model is misspecified (that we have not parameterized it correctly), or we can hold to our specification being correct, in which case the observed differences must be due to the assumption of zero correlation between ν_i and the x_it.
Technical Note
We can also mechanically explore the underpinnings of the test's dissatisfaction. In the comparison table, note that it is the coefficients on not_smsa and south that exhibit the largest differences. In equation (1'), we showed how to decompose a model into within and between effects. Let us do that with these two variables, assuming that changes in the average have one effect while transitional changes have another:

. egen avgnsmsa = mean(not_smsa), by(id)
. gen devnsma = not_smsa - avgnsmsa
(8 missing values generated)
. egen avgsouth = mean(south), by(id)
. gen devsouth = south - avgsouth
(8 missing values generated)
. xtreg ln_w grade age* ttl_exp* tenure* black avgnsm devnsm avgsou devsou

Random-effects GLS regression                   Number of obs      =     28091
Group variable (i) : idcode                     Number of groups   =      4697

R-sq:  within  = 0.1723                         Obs per group: min =         1
       between = 0.4809                                        avg =       6.0
       overall = 0.3737                                        max =        15

Random effects u_i ~ Gaussian                   Wald chi2(12)      =   9319.56
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

     ln_wage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       grade |   .0631716   .0017903    35.29   0.000      .0596627     .0666805
         age |   .0375196   .0031186    12.03   0.000      .0314072      .043632
        age2 |  -.0007248     .00005   -14.50   0.000     -.0008228    -.0006269
     ttl_exp |   .0286543   .0024207    11.84   0.000      .0239098     .0333989
    ttl_exp2 |   .0003222   .0001162     2.77   0.006      .0000945     .0005499
      tenure |   .0394423    .001754    22.49   0.000      .0360044     .0428801
     tenure2 |  -.0020081   .0001192   -16.85   0.000     -.0022417    -.0017746
       black |  -.0545936   .0102101    -5.35   0.000      -.074605    -.0345821
    avgnsmsa |  -.1833237   .0109339   -16.77   0.000     -.2047537    -.1618937
     devnsma |  -.0887596   .0095071    -9.34   0.000     -.1073931     -.070126
    avgsouth |  -.1011235   .0088789   -11.39   0.000     -.1204858    -.0817611
    devsouth |  -.0598538   .0109054    -5.49   0.000      -.081228    -.0384797
       _cons |   .2682987   .0495778     5.41   0.000       .171128     .3654694
-------------+-----------------------------------------------------------------
     sigma_u |   .2579182
     sigma_e |  .29068923
         rho |  .44047745   (fraction of variance due to u_i)
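The egen/gen decomposition used above can be mimicked outside Stata. Here is a hedged numpy sketch (not Stata's egen) on a tiny invented panel, splitting not_smsa into its per-person mean and the within-person deviation:

```python
# Hedged sketch: decompose a regressor into its panel mean and deviation,
# mirroring "egen avgnsmsa = mean(not_smsa), by(id)" and the gen step.
# The five-observation panel below is invented for illustration.
import numpy as np

idcode   = np.array([1, 1, 1, 2, 2])
not_smsa = np.array([0., 0., 1., 1., 1.])

avgnsmsa = np.empty_like(not_smsa)
for g in np.unique(idcode):
    m = idcode == g
    avgnsmsa[m] = not_smsa[m].mean()     # per-person mean (between part)
devnsma = not_smsa - avgnsmsa            # within-person deviation

print(avgnsmsa)   # person 1 gets 1/3 each period; person 2 gets 1.0
```

By construction, the deviations sum to zero within each person, so the mean and deviation columns carry the between and within variation separately, which is what lets the regression above estimate the two effects with different coefficients.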
We will leave the reinterpretation of this model to you, except to note that if we were really going to sell this model, we would have to explain why the between and within effects are different. Focusing on residence in a nonSMSA, we might tell a story about rural folk being paid less and continuing to get paid less when they move to the SMSA. As such, however, it is just a story. Given our cross-sectional time-series data, we could create variables to measure this (an indicator for moved from nonSMSA to
SMSA) and to measure the effects. In our assessment of this model, we should think about women in the cities moving to the country and their relative productivity in a bucolic setting. In any case, the Hausman test now is

. xthausman

Hausman specification test

                    ---- Coefficients ----
                       Fixed       Random
     ln_wage         Effects      Effects     Difference
--------------------------------------------------------
         age        .0359987     .0375196      -.0015209
        age2        -.000723    -.0007248       1.84e-06
     ttl_exp        .0334668     .0286543       .0048124
    ttl_exp2        .0002163     .0003222      -.0001059
      tenure        .0357539     .0394423      -.0036884
     tenure2       -.0019701    -.0020081        .000038
     devnsma       -.0890108    -.0887596      -.0002512
    devsouth       -.0606309    -.0598538      -.0007771

    Test:  Ho:  difference in coefficients not systematic
               chi2(8) = (b-B)'[S^(-1)](b-B), S = (S_fe - S_re)
                       = 92.52
               Prob>chi2 = 0.0000
We have mechanically succeeded in greatly reducing the χ², but not by enough. The major differences now are in the age, experience, and tenure effects. We already knew this problem existed because of the ever-increasing effect of experience. More careful parameterization work than simply including squares needs to be done.
Acknowledgments We thank Richard Goldstein, who wrote the first draft of the routine that estimates random-effects regressions, and Badi Baltagi and Manuelita Ureta of Texas A&M University, who assisted us in working our way through the literature.
Saved Results

xtreg, re saves in e():

Scalars
  e(N)          number of observations
  e(N_g)        number of groups
  e(df_m)       model degrees of freedom
  e(g_max)      largest group size
  e(g_min)      smallest group size
  e(g_avg)      average group size
  e(chi2)       chi-squared
  e(rho)        rho
  e(Tbar)       harmonic mean of group sizes
  e(Tcon)       1 if T is constant
  e(r2_w)       R-squared for within model
  e(r2_o)       R-squared for overall model
  e(r2_b)       R-squared for between model
  e(sigma)      ancillary parameter (gamma, lnormal)
  e(sigma_u)    panel-level standard deviation
  e(sigma_e)    standard deviation of e_it
  e(thta_min)   minimum theta
  e(thta_5)     theta, 5th percentile
  e(thta_50)    theta, 50th percentile
  e(thta_95)    theta, 95th percentile
  e(thta_max)   maximum theta

Macros
  e(cmd)        xtreg
  e(depvar)     name of dependent variable
  e(model)      re
  e(ivar)       variable denoting groups
  e(chi2type)   Wald; type of model chi-squared test
  e(sa)         Swamy-Arora estimator of the variance components (sa only)
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(theta)      theta
  e(V)          variance-covariance matrix of the estimators
  e(Vf)         VCE for fixed-effects model
  e(bf)         coefficient vector for fixed-effects model

Functions
  e(sample)     marks estimation sample

xtreg, be saves in e():

Scalars
  e(N)          number of observations
  e(N_g)        number of groups
  e(mss)        model sum of squares
  e(df_m)       model degrees of freedom
  e(rss)        residual sum of squares
  e(df_r)       residual degrees of freedom
  e(r2)         R-squared
  e(r2_a)       adjusted R-squared
  e(F)          F statistic
  e(rmse)       root mean square error
  e(ll)         log likelihood
  e(ll_0)       log likelihood, constant-only model
  e(g_max)      largest group size
  e(g_min)      smallest group size
  e(g_avg)      average group size
  e(Tbar)       harmonic mean of group sizes
  e(Tcon)       1 if T is constant
  e(r2_w)       R-squared for within model
  e(r2_o)       R-squared for overall model
  e(r2_b)       R-squared for between model

Macros
  e(cmd)        xtreg
  e(depvar)     name of dependent variable
  e(model)      be
  e(ivar)       variable denoting groups
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
xtreg, fe saves in e():

Scalars
  e(N)          number of observations
  e(N_g)        number of groups
  e(mss)        model sum of squares
  e(tss)        total sum of squares
  e(df_m)       model degrees of freedom
  e(rss)        residual sum of squares
  e(df_r)       residual degrees of freedom
  e(r2)         R-squared
  e(r2_a)       adjusted R-squared
  e(F)          F statistic
  e(rmse)       root mean square error
  e(ll)         log likelihood
  e(ll_0)       log likelihood, constant-only model
  e(df_a)       degrees of freedom for absorbed effect
  e(g_max)      largest group size
  e(g_min)      smallest group size
  e(g_avg)      average group size
  e(rho)        rho
  e(Tbar)       harmonic mean of group sizes
  e(Tcon)       1 if T is constant
  e(r2_w)       R-squared for within model
  e(r2_o)       R-squared for overall model
  e(r2_b)       R-squared for between model
  e(sigma)      ancillary parameter (gamma, lnormal)
  e(corr)       corr(u_i, Xb)
  e(sigma_u)    panel-level standard deviation
  e(sigma_e)    standard deviation of e_it
  e(F_f)        F for u_i=0

Macros
  e(cmd)        xtreg
  e(depvar)     name of dependent variable
  e(model)      fe
  e(ivar)       variable denoting groups
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
xtreg, mle saves in e():

Scalars
  e(N)          number of observations
  e(N_g)        number of groups
  e(df_m)       model degrees of freedom
  e(ll)         log likelihood
  e(ll_0)       log likelihood, constant-only model
  e(ll_c)       log likelihood, comparison model
  e(g_max)      largest group size
  e(g_min)      smallest group size
  e(g_avg)      average group size
  e(chi2)       chi-squared
  e(chi2_c)     chi-squared for comparison test
  e(rho)        rho
  e(sigma_u)    panel-level standard deviation
  e(sigma_e)    standard deviation of e_it

Macros
  e(cmd)        xtreg
  e(depvar)     name of dependent variable
  e(title)      title in estimation output
  e(model)      ml
  e(ivar)       variable denoting groups
  e(wtype)      weight type
  e(wexp)       weight expression
  e(chi2type)   Wald or LR; type of model chi-squared test
  e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
  e(distrib)    Gaussian; the distribution of the random effect
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
xtreg, pa saves in e():

Scalars
  e(N)          number of observations
  e(N_g)        number of groups
  e(df_m)       model degrees of freedom
  e(g_max)      largest group size
  e(g_min)      smallest group size
  e(g_avg)      average group size
  e(chi2)       chi-squared
  e(df_pear)    degrees of freedom for Pearson chi-squared
  e(deviance)   deviance
  e(chi2_dev)   chi-squared test of deviance
  e(dispers)    deviance dispersion
  e(chi2_dis)   chi-squared test of deviance dispersion
  e(tol)        target tolerance
  e(dif)        achieved tolerance
  e(phi)        scale parameter

Macros
  e(cmd)        xtgee
  e(cmd2)       xtreg
  e(depvar)     name of dependent variable
  e(model)      pa
  e(family)     Gaussian
  e(link)       identity; link function
  e(corr)       correlation structure
  e(scale)      x2, dev, phi, or #; scale parameter
  e(ivar)       variable denoting groups
  e(vcetype)    covariance estimation method
  e(chi2type)   Wald; type of model chi-squared test
  e(disp)       deviance dispersion
  e(offset)     offset
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(R)          estimated working correlation matrix
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
Methods and Formulas

The model to be estimated is

$$y_{it} = \alpha + x_{it}\beta + \nu_i + \epsilon_{it}$$

for $i = 1, \ldots, n$ and, for each $i$, $t = 1, \ldots, T$, of which $T_i$ periods are actually observed.

xtreg, fe

xtreg, fe produces estimates by running OLS on

$$(y_{it} - \bar{y}_i + \bar{\bar{y}}) = \alpha + (x_{it} - \bar{x}_i + \bar{\bar{x}})\beta + (\epsilon_{it} - \bar{\epsilon}_i + \bar{\bar{\epsilon}})$$

where $\bar{y}_i = \sum_{t=1}^{T_i} y_{it}/T_i$, and similarly, $\bar{\bar{y}} = \sum_i \sum_t y_{it}/(\sum_i T_i)$; $\bar{x}_i$, $\bar{\bar{x}}$, $\bar{\epsilon}_i$, and $\bar{\bar{\epsilon}}$ are defined analogously. The covariance matrix of the estimators is adjusted for the extra $n-1$ estimated means, so results are the same as using OLS on equation (1) to estimate $\nu_i$ directly.

From the estimates $\hat{\alpha}$ and $\hat{\beta}$, estimates $u_i$ of $\nu_i$ are obtained as $u_i = \bar{y}_i - \hat{\alpha} - \bar{x}_i\hat{\beta}$. Reported from the calculated $u_i$ are its standard deviation and its correlation with $\bar{x}_i\hat{\beta}$. Reported as the standard deviation of $\epsilon_{it}$ is the regression's estimated root mean square error, $s$, which is adjusted (as previously stated) for the $n-1$ estimated means.

Reported as R² within is the R² from the mean-deviated regression.
Reported as R² between is $\mathrm{corr}(\bar{x}_i\hat{\beta},\ \bar{y}_i)^2$. Reported as R² overall is $\mathrm{corr}(x_{it}\hat{\beta},\ y_{it})^2$.

xtreg, be

xtreg, be estimates the following model:

$$\bar{y}_i = \alpha + \bar{x}_i\beta + \nu_i + \bar{\epsilon}_i$$

Estimation is via OLS unless $T_i$ is not constant and the wls option is specified, in which case the estimation is performed via WLS. The estimation is performed by regress for both cases, but in the case of WLS, [aweight=$T_i$] is specified.

Reported as R² between is the R² from the estimated regression. Reported as R² within is $\mathrm{corr}\{(x_{it}-\bar{x}_i)\hat{\beta},\ y_{it}-\bar{y}_i\}^2$. Reported as R² overall is $\mathrm{corr}(x_{it}\hat{\beta},\ y_{it})^2$.

xtreg, re

The key to the random-effects estimator is the GLS transform. Given estimates of the idiosyncratic component, $\hat{\sigma}_e^2$, and the individual component, $\hat{\sigma}_u^2$, the transform is

$$z^*_{it} = z_{it} - \hat{\theta}_i \bar{z}_i$$

where $\bar{z}_i = \frac{1}{T_i}\sum_t z_{it}$ and

$$\hat{\theta}_i = 1 - \sqrt{\frac{\hat{\sigma}_e^2}{T_i\hat{\sigma}_u^2 + \hat{\sigma}_e^2}}$$

Given an estimate of $\hat{\theta}_i$, one transforms the dependent and independent variables, and then the coefficient estimates and the variance-covariance matrix come from an OLS regression of $y^*_{it}$ on $x^*_{it}$ and the transformed constant $1-\hat{\theta}_i$.

Stata has two implementations of the Swamy-Arora method for estimating the variance components. They produce exactly the same results in balanced panels and share the same estimator of $\sigma_e^2$. However, the two methods differ in their estimator of $\sigma_u^2$ in unbalanced panels. We call the first $\hat{\sigma}_u^2$ and the second $\hat{\sigma}_{u\mathrm{SA}}^2$. Both estimators are consistent; however, $\hat{\sigma}_{u\mathrm{SA}}^2$ has a more elaborate adjustment for small samples. (See Baltagi (1995), Baltagi and Chang (1994), and Swamy and Arora (1972) for derivations of these methods.)

Both methods use the same function of within residuals to estimate the idiosyncratic error component $\sigma_e$. Specifically,

$$\hat{\sigma}_e^2 = \frac{\sum_i \sum_t \hat{\epsilon}_{it}^2}{N - n - K + 1}$$

where $\hat{\epsilon}_{it} = (y_{it}-\bar{y}_i+\bar{\bar{y}}) - \hat{\alpha}_w - (x_{it}-\bar{x}_i+\bar{\bar{x}})\hat{\beta}_w$, $\hat{\alpha}_w$ and $\hat{\beta}_w$ are the within estimates of the coefficients, and $N = \sum_i T_i$. Here the intuition is straightforward: after passing the within residuals through the within transform, only the idiosyncratic errors are left.

The default method for estimating $\sigma_u^2$ is

$$\hat{\sigma}_u^2 = \max\left\{0,\ \frac{\mathrm{SSR}_b}{n-K} - \frac{\hat{\sigma}_e^2}{\bar{T}}\right\}$$

where $\mathrm{SSR}_b$ is the sum of squared residuals from the between regression, $\hat{\alpha}_b$ and $\hat{\beta}_b$ are coefficient estimates from that regression, and $\bar{T}$ is the harmonic mean of $T_i$, i.e.,

$$\bar{T} = \frac{n}{\sum_i \frac{1}{T_i}}$$

This estimator is consistent for $\sigma_u^2$ and is computationally less expensive than the second method. The intuition here is that the sum of squared residuals from the between model estimates a function of both the idiosyncratic component and the individual component; using our estimator of $\sigma_e^2$, we can remove the idiosyncratic component, leaving only the desired individual component.

The second method is the Swamy-Arora method for unbalanced panels derived by Baltagi and Chang (1994). It is based on the same intuition but has a more precise small-sample adjustment. Using this method,

$$\hat{\sigma}_{u\mathrm{SA}}^2 = \max\left\{0,\ \frac{\mathrm{SSR}_b - (n-K)\hat{\sigma}_e^2}{N - \mathrm{tr}}\right\}$$

where

$$\mathrm{tr} = \mathrm{trace}\left\{(X'PX)^{-1}X'ZZ'X\right\},\qquad P = \mathrm{diag}\left\{\frac{1}{T_i}\iota_{T_i}\iota'_{T_i}\right\},\qquad Z = \mathrm{diag}\{\iota_{T_i}\}$$

$X$ is the $N \times K$ matrix of covariates, including the constant, and $\iota_{T_i}$ is a $T_i \times 1$ vector of ones.

The estimated coefficients $(\hat{\alpha}_r, \hat{\beta}_r)$ and their covariance matrix $V_r$ are reported together with the previously calculated quantities $\hat{\sigma}_e$ and $\hat{\sigma}_u$. The standard deviation of $\nu_i + \epsilon_{it}$ is calculated as $\sqrt{\hat{\sigma}_e^2 + \hat{\sigma}_u^2}$. Reported as R² between is $\mathrm{corr}(\bar{x}_i\hat{\beta},\ \bar{y}_i)^2$. Reported as R² within is $\mathrm{corr}\{(x_{it}-\bar{x}_i)\hat{\beta},\ y_{it}-\bar{y}_i\}^2$. Reported as R² overall is $\mathrm{corr}(x_{it}\hat{\beta},\ y_{it})^2$.
xtreg, mle

The log likelihood for the ith unit is

$$l_i = -\frac{1}{2}\left[\frac{1}{\sigma_e^2}\left\{\sum_{t=1}^{T_i}(y_{it}-x_{it}\beta)^2 - \frac{\sigma_u^2}{T_i\sigma_u^2+\sigma_e^2}\left(\sum_{t=1}^{T_i}(y_{it}-x_{it}\beta)\right)^2\right\} + \ln\left(T_i\frac{\sigma_u^2}{\sigma_e^2}+1\right) + T_i\ln(2\pi\sigma_e^2)\right]$$

The mle and re options yield essentially the same results, except when total $N = \sum_i T_i$ is small (200 or less) and the data are unbalanced.

xtreg, pa
See [R] xtgee for details on the methods and formulas used to calculate the population-averaged model using a generalized estimating equations approach.

xttest0

xttest0 reports the Lagrange multiplier test for random effects developed by Breusch and Pagan (1980), as modified by Baltagi and Li (1990). The model

$$y_{it} = \alpha + x_{it}\beta + v_{it}$$

is estimated via OLS, and then the quantity

$$\lambda_{LM} = \frac{(n\bar{T})^2}{2}\left\{\frac{A_1^2}{\sum_i T_i^2 - n\bar{T}}\right\}$$

is calculated, where

$$A_1 = 1 - \frac{\sum_i\left(\sum_t v_{it}\right)^2}{\sum_i\sum_t v_{it}^2}$$

The Baltagi and Li modification allows for unbalanced data and reduces to the standard formula

$$\lambda_{LM} = \frac{nT}{2(T-1)}\left\{\frac{\sum_i\left(\sum_t v_{it}\right)^2}{\sum_i\sum_t v_{it}^2} - 1\right\}^2$$
when $T_i = T$ (balanced data). Under the null hypothesis, $\lambda_{LM}$ is distributed $\chi^2(1)$.

xthausman

xthausman reports Hausman's (1978) specification test. This test is formally a test of the equality of the coefficients estimated by the fixed- and random-effects estimators. If the coefficients differ significantly, either the model is misspecified or the assumption that the random effects $\nu_i$ are uncorrelated with the regressors $x_{it}$ is incorrect. Thus, under the assumption of a correctly specified model, Hausman's test examines the appropriateness of the random-effects estimator; see Greene (2000, 383-387) and Judge et al. (1985, 527). The test statistic is

$$W = (\hat{\beta}_f - \hat{\beta}_r)'(\hat{S}_f - \hat{S}_r)^{-1}(\hat{\beta}_f - \hat{\beta}_r)$$

where $\hat{S}_f$ and $\hat{S}_r$ are the estimated covariance matrices of the fixed- and random-effects coefficient estimates, and where all vectors and matrices have the row and column corresponding to the intercept removed, along with any row and column corresponding to a parameter that cannot be estimated by the fixed-effects estimator (regressors that are constant over time). Under the null hypothesis, $W$ is distributed $\chi^2(k)$, where $k$ is the number of estimated coefficients in $\beta$ excluding the intercept and time-invariant regressors.
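Returning to xttest0: the balanced-data LM formula above is easy to check numerically. The following is a hedged Python sketch, not Stata's code; it draws residual-like values with a deliberately strong individual component (all numbers invented) and confirms the statistic is far out in the right tail of $\chi^2(1)$:

```python
# Hedged sketch of the balanced Breusch-Pagan LM statistic:
# lambda_LM = nT/(2(T-1)) * (sum_i (sum_t v_it)^2 / sum_i sum_t v_it^2 - 1)^2
import numpy as np

def bp_lm(v, n, T):
    """v has shape (n, T); rows are panels, columns are periods."""
    ratio = (v.sum(axis=1) ** 2).sum() / (v ** 2).sum()
    return n * T / (2 * (T - 1)) * (ratio - 1) ** 2

rng = np.random.default_rng(2)
n, T = 300, 5
nu = rng.normal(scale=1.0, size=(n, 1))   # strong individual component
v = nu + rng.normal(size=(n, T))          # residual-like draws
print(bp_lm(v, n, T) > 3.84)              # well past the chi2(1) 5% cutoff
```

With the individual component present, the per-panel residual sums are inflated relative to the pure-noise case, driving the ratio above 1 and the statistic far beyond the critical value, which is exactly the behavior xttest0 exploited in the example earlier.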
References

Baltagi, B. H. 1985. Pooling cross-sections with unequal time-series lengths. Economics Letters 18: 133-136.
——. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.
Baltagi, B. H. and Y. Chang. 1994. Incomplete panels: A comparative study of alternative estimators for the unbalanced one-way error component regression model. Journal of Econometrics 62: 67-89.
Baltagi, B. H. and Q. Li. 1990. A Lagrange multiplier test for the error components model with incomplete panels. Econometric Reviews 9(1): 103-107.
Breusch, T. and A. Pagan. 1980. The Lagrange multiplier test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-253.
Dwyer, J. and M. Feinleib. 1992. Introduction to statistical models for longitudinal observation. In Statistical Models for Longitudinal Studies of Health, ed. J. Dwyer, M. Feinleib, P. Lippert, and H. Hoffmeister, 3-48. New York: Oxford University Press.
Greene, W. H. 1983. Simultaneous estimation of factor substitution, economies of scale, and non-neutral technical change. In Econometric Analyses of Productivity, ed. A. Dogramaci. Boston: Kluwer-Nijhoff.
——. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251-1271.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Lee, L. and W. Griffiths. 1979. The prior likelihood and best linear unbiased prediction in stochastic coefficient linear models. University of New England Working Papers in Econometrics and Applied Statistics No. 1. Armidale, Australia.
Rabe-Hesketh, S., A. Pickles, and C. Taylor. 2000. sg129: Generalized linear latent and mixed models. Stata Technical Bulletin 53: 47-57. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 293-307.
Swamy, P. A. V. B. and S. S. Arora. 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40: 643-657.
Taub, A. J. 1979. Prediction in the context of the variance-components model. Journal of Econometrics 10: 103-108.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce,
                 [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab

Related:         [R] xtgee, [R] xtintreg, [R] xtivreg, [R] xtregar, [R] xttobit

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [R] xt
Title

xtregar — Fixed- and random-effects linear models with an AR(1) disturbance
Syntax

Random-effects model

    xtregar depvar [varlist] [if exp] [in range] [, re rhotype(rhomethod) twostep
        rhof(#) lbi level(#) ]

Fixed-effects model

    xtregar depvar [varlist] [if exp] [in range] , fe [ rhotype(rhomethod) twostep
        rhof(#) lbi level(#) ]

You must tsset your data before using xtregar; see [R] tsset.
by ... : may be used with xtregar; see [R] by.
varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.
xtregar shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

    xb      x_it b, fitted values (the default)
    ue      u_i + e_it, the combined residual
  * u       u_i, the fixed component
  * e       e_it, the overall error component

u and e are available only for the fixed-effects estimator. Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified.
Description

xtregar estimates cross-sectional time-series regression models when the disturbance term is first-order autoregressive. xtregar offers a within estimator for fixed-effects models and a GLS estimator for random-effects models. Consider the model

$$y_{it} = \alpha + x_{it}\beta + \nu_i + \epsilon_{it} \qquad i = 1, \ldots, N; \quad t = 1, \ldots, T_i \qquad (1)$$

where

$$\epsilon_{it} = \rho\,\epsilon_{i,t-1} + \eta_{it} \qquad (2)$$
and where $|\rho| < 1$ and $\eta_{it}$ is independent and identically distributed (iid) with zero mean and variance $\sigma_\eta^2$. If the $\nu_i$ are assumed to be fixed parameters, then the model is a fixed-effects model. If the $\nu_i$ are assumed to be realizations of an iid process with zero mean and variance $\sigma_\nu^2$, then it is a random-effects model. In the fixed-effects model, the $\nu_i$ may be correlated with the covariates $x_{it}$. However, the random-effects model maintains the assumption that the $\nu_i$ are independent of the $x_{it}$. On the other hand, any $x_{it}$ that do not vary over $t$ are collinear with the $\nu_i$ and will be dropped from the fixed-effects model. In contrast, the random-effects model can accommodate covariates that are constant over time.

xtregar can accommodate unbalanced panels whose observations are unequally spaced over time. xtregar implements the methods derived in Baltagi and Wu (1999). Since xtregar uses time-series methods, you must tsset your data before using xtregar. See [R] tsset for details.
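The error structure in (1)-(2) is easy to simulate, which can help build intuition for what xtregar assumes. The following is a hedged Python sketch (not part of xtregar) on an invented balanced panel; it starts each panel's AR(1) from its stationary distribution and then verifies that the lag-1 autocorrelation of the disturbances is near ρ:

```python
# Hedged sketch: simulate e_it = rho*e_{i,t-1} + eta_it for a toy panel,
# then check the lag-1 autocorrelation.  All parameters are invented.
import numpy as np

def simulate_panel(n, T, rho, beta=1.0, sd_nu=1.0, sd_eta=0.5, seed=3):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, T))
    nu = rng.normal(scale=sd_nu, size=(n, 1))
    e = np.zeros((n, T))
    # start the AR(1) from its stationary distribution
    e[:, 0] = rng.normal(scale=sd_eta / np.sqrt(1 - rho**2), size=n)
    for t in range(1, T):
        e[:, t] = rho * e[:, t - 1] + rng.normal(scale=sd_eta, size=n)
    y = 0.5 + beta * x + nu + e
    return y, x, e

y, x, e = simulate_panel(n=2000, T=8, rho=0.6)
r = np.corrcoef(e[:, :-1].ravel(), e[:, 1:].ravel())[0, 1]
print(round(r, 1))   # lag-1 autocorrelation near rho = 0.6
```

Ignoring this serial correlation and running a plain panel regression would leave the point estimates consistent but the conventional standard errors wrong, which is the problem xtregar's transforms address.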
Options re requests the GLS estimator of the random-effects: model, re is the default, f e requests the within estin ator of the fixed-effects \model, Thotyps(rhomethod) allow: the user to specify any of the following estimators of p: dw regress freg tscorr
- dw/2, where dw is tjie Durbin-Watson d statistic Preg — ? from the residual regression et — fat-i Pfreg = 3 from the residual regression et — fct+i e'et-i/e'e. where € isi the vector of residuals and e ( _i is the vector Ptscorr Of lagged residuals j
theil nagar onestep
Ptheil
Pdw =
:
Pnagar Ponestep = —e'e t _i/e'e, where| e is the vector of residuals, n is the number Of obseifvations, and mc is the njumber of consecutive pairs of residuals
dw is the default method Except for onestep, thf details of these methods are given in [R] prais. prais handles unequall spaced data, onestep ss the onestep method proposed by Baltagi and Wu (3999). Further deta Is on this method are available below in Methods and Formulas. twostep requests that a two-step implementation of the rhomethod estimator of p be used. Unless a fixed value of p is sp dfied, p is estimated' by running prais (sic) on the de-meaned data, When twostep is specifi ed, prais will stop on the first iteration after the equation is transformed by p—the two-step effi :ient estimator. Although it is customary to iterate these estimators to convergence, they are eificient at each step. When twostep is not specified, the FGLS process iterates to convergence a 5 described in [R] prais. rhof (#) specifies mat the g ven number is to be usdd for p and that p is not to be estimated. Ibi requests that the Balta n-Wu (1999) locally bept invariant (LBI) test statistic that p — 0 and a modified version of the Ehargava et al. (1982) Dujtin-Watson statistic be calculated and reported, The default is not to report them. One can requeit and obtain these statistics using Stata's replay facility. level (#) specifies the confidence level, in percent, fcjir confidence inten'als. The default is level (95) or as set by set level see [U] 23.5 Specifying the width of confidence inten'als.
xtregar — Fixed- and random-effects linear models with an AR(1) disturbance
Options for predict

xb, the default, calculates the linear prediction, x_it b.

ue calculates the prediction of ν_i + ε_it.

u calculates the prediction of ν_i, the estimated fixed effect.

e calculates the prediction of ε_it.
Remarks

If you have not read [R] xt, please do so.

Consider a linear panel-data model with an AR(1) disturbance term

    y_it = α + x_it β + ν_i + ε_it                                   (1)

where

    ε_it = ρ ε_{i,t-1} + η_it                                        (2)

|ρ| < 1, and η_it is independent and identically distributed (iid) with zero mean and variance σ²_η. In the fixed-effects model, the ν_i are a set of fixed parameters to be estimated. Alternatively, the ν_i may be random and correlated with the other covariates, with inference conditional on the ν_i in the sample. See Mundlak (1978) and Hsiao (1986) for a discussion of this interpretation. In the random-effects model, also known as the variance-components model, the ν_i are assumed to be realizations of an iid process with mean 0 and variance σ²_ν.

xtregar offers a within estimator for the fixed-effects model and the Baltagi-Wu (1999) GLS estimator of the random-effects model. The Baltagi-Wu (1999) GLS estimator extends the balanced panel estimator in Baltagi and Li (1991) to a case of exogenously unbalanced panels with unequally spaced observations. Both of these estimators offer several estimators of ρ.

The data can be unbalanced and unequally spaced. Specifically, the dataset contains observations on individual i at times t_ij for j = 1, ..., n_i. The difference t_ij - t_{i,j-1} plays an integral role in the estimation techniques employed by xtregar. For this reason, you must tsset your data before using xtregar. For instance, if you have quarterly data, the "time" difference between the third and fourth quarters must be 1, not 3.

Let's examine the fixed-effects model first. The basic approach is common to all fixed-effects models. The ν_i are treated as nuisance parameters. We use a transformation of the model that removes the nuisance parameters and leaves behind the parameters of interest in an estimable form. Subtracting the group means from equation (1) removes the ν_i from the model:

    y_it - ȳ_i = (x_it - x̄_i) β + ε_it - ε̄_i                        (3)

where ȳ_i, x̄_i, and ε̄_i are the means of y, x, and ε within panel i.
After the transformation, equation (3) is a linear AR(1) model, potentially with unequally spaced observations, and it can be used to estimate ρ. Given an estimate of ρ, one must perform a Cochrane-Orcutt transformation on each panel and then remove the within-panel means and add back the overall mean for each variable. OLS on the transformed data will produce the within estimates of α and β.
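The group-mean subtraction that removes ν_i can be illustrated outside Stata. A small NumPy sketch (names are illustrative, and ids is assumed to hold 0-based panel codes):

```python
import numpy as np

rng = np.random.default_rng(1)
ids = np.repeat(np.arange(3), 4)            # 3 panels, 4 observations each
nu = np.array([10.0, -5.0, 2.0])[ids]       # fixed effects nu_i
x = rng.normal(size=ids.size)
y = 1.5 * x + nu                            # no noise, so the point is exact

def demean(v, ids):
    # subtract each panel's mean from its own observations
    means = np.array([v[ids == i].mean() for i in np.unique(ids)])
    return v - means[ids]

# After demeaning, nu_i is gone and the slope on x is recovered exactly.
beta = np.polyfit(demean(x, ids), demean(y, ids), 1)[0]
```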
Example

Let's use the Grunfeld investment dataset to illustrate some aspects of how xtregar can be used to estimate the fixed-effects model. This dataset contains information on 10 firms' investment, market value, and the value of their capital stocks. The data were collected annually between 1935 and 1954. The following output shows that we have tsset our data and gives the results of running a fixed-effects model with investment as a function of market value and the capital stock.

. use grunfeld
. tsset
        panel variable:  company, 1 to 10
         time variable:  year, 1935 to 1954
. xtregar invest mvalue kstock, fe

Fixed-effects (within) regression               Number of obs      =       190
Group variable (i) : company                    Number of groups   =        10
R-sq:  within  = 0.5927                         Obs per group: min =        19
       between = 0.7989                                        avg =      19.0
       overall = 0.7904                                        max =        19
                                                F(2,178)           =    129.49
corr(u_i, Xb)  = -0.0454                        Prob > F           =    0.0000

      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      mvalue |   .0949999   .0091377    10.40   0.000     .0769677     .113032
      kstock |    .350161   .0293747    11.92   0.000     .2921935    .4081286
       _cons |  -20.72952   5.648271    -3.67   0.000    -31.87571   -9.583334
      rho_ar |  .67210608
     sigma_u |  91.507609
     sigma_e |  40.992469
     rho_fov |   .8328647   (fraction of variance due to u_i)

F test that all u_i=0:     F(9,178) =    11.53             Prob > F = 0.0000
Note that since there are 10 groups, the panel-by-panel Cochrane-Orcutt method decreases the number of available observations from 200 to 190. The above example used the default dw estimator of ρ. Using the tscorr estimator of ρ yields

. xtregar invest mvalue kstock, fe rhotype(tscorr)

Fixed-effects (within) regression               Number of obs      =       190
Group variable (i) : company                    Number of groups   =        10
R-sq:  within  = 0.6583                         Obs per group: min =        19
       between = 0.8524                                        avg =      19.0
       overall = 0.7933                                        max =        19
                                                F(2,178)           =    171.47
corr(u_i, Xb)  = -0.0709                        Prob > F           =    0.0000

      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      mvalue |   .0978364   .0096786    10.11   0.000     .0787369    .1169359
      kstock |    .346097   .0242248    14.29   0.000     .2982922    .3939018
       _cons |   -28.3671   6.621354    -4.28   0.000    -41.43355   -15.30064
      rho_ar |  .54131231
     sigma_u |  90.893572
     sigma_e |  41.592151
     rho_fov |  .82086297   (fraction of variance due to u_i)

F test that all u_i=0:     F(9,178) =    15.73             Prob > F = 0.0000
Technical Note

The tscorr estimator of ρ is bounded in [-1, 1]. The other estimators of ρ are not. In samples with very short panels, the estimates of ρ produced by the other estimators may be outside of [-1, 1]. If this happens, use the tscorr estimator. However, simulations have shown that the tscorr estimator is biased toward zero. dw is the default because it performed well in Monte Carlo simulations. Note that in the example above, the estimate of ρ produced by tscorr is much smaller than the one produced by dw.
Example

The following example shows the error that you get when you try to run xtregar on a dataset that has not been tsset.

. tsset, clear
. xtregar invest mvalue kstock, fe
must tsset data and specify panelvar
r(459);
xtregar requires you to tsset your data to ensure that xtregar understands the nature of your time variable. For instance, suppose that our observations were taken quarterly instead of annually. The following example shows that we will get exactly the same results with the quarterly variable t2 that we did with the annual variable year.

. list year t2 in 1/5

        year      t2
  1.    1935  1935q1
  2.    1936  1935q2
  3.    1937  1935q3
  4.    1938  1935q4
  5.    1939  1936q1

. tsset company t2
        panel variable:  company, 1 to 10
         time variable:  t2, 1935q1 to 1939q4
. xtregar invest mvalue kstock, fe

Fixed-effects (within) regression               Number of obs      =       190
Group variable (i) : company                    Number of groups   =        10
R-sq:  within  = 0.5927                         Obs per group: min =        19
       between = 0.7989                                        avg =      19.0
       overall = 0.7904                                        max =        19
                                                F(2,178)           =    129.49
corr(u_i, Xb)  = -0.0454                        Prob > F           =    0.0000

      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      mvalue |   .0949999   .0091377    10.40   0.000     .0769677     .113032
      kstock |    .350161   .0293747    11.92   0.000     .2921935    .4081286
       _cons |  -20.72952   5.648271    -3.67   0.000    -31.87571   -9.583334
      rho_ar |  .67210608
     sigma_u |  91.507609
     sigma_e |  40.992469
     rho_fov |   .8328647   (fraction of variance due to u_i)

F test that all u_i=0:     F(9,178) =    11.53             Prob > F = 0.0000
In all the examples thus far, we have assumed that ε_it is first-order autoregressive. Testing the hypothesis of ρ = 0 in a first-order autoregressive process has a long history of producing test statistics with extremely complicated distributions, and the extensions of these tests to panel data have continued this tradition. Bhargava et al. (1982) extended the Durbin-Watson statistic to the case of balanced, equally spaced panel datasets. Baltagi and Wu (1999) modified their statistic to account for unbalanced panels with unequally spaced data. In the same article, Baltagi and Wu (1999) derived the locally best invariant test statistic of ρ = 0. Both of these test statistics have extremely complicated distributions, although Bhargava et al. (1982) did publish some cutoffs in their article. Specifying the lbi option on xtregar will cause Stata to calculate and report the modified Bhargava et al. Durbin-Watson and the Baltagi-Wu LBI.
Example

This example illustrates how to calculate the modified Bhargava et al. Durbin-Watson statistic and the Baltagi-Wu LBI. In this example, we exclude time periods 9 and 10 from the sample, thereby reproducing the results of Baltagi and Wu (1999, 822).

. xtregar invest mvalue kstock if year!=1943 & year!=1944, fe lbi

Fixed-effects (within) regression               Number of obs      =       170
Group variable (i) : company                    Number of groups   =        10
R-sq:  within  = 0.597                          Obs per group: min =        17
       between = 0.798                                         avg =      17.0
       overall = 0.789                                         max =        17
                                                F(2,158)           =    114.00
corr(u_i, Xb)  = -0.039                         Prob > F           =    0.0000

      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      mvalue |   .0922066   .0090362    10.20   0.000     .0743593    .1100539
      kstock |   .3509339   .0320278    10.94   0.000      .287676    .4141919
       _cons |  -20.05932   6.192364    -3.24   0.001    -32.28981   -7.828832
      rho_ar |  .67483913
     sigma_u |   94.56243
     sigma_e |   42.60124
     rho_fov |   .8313847   (fraction of variance due to u_i)

F test that all u_i=0:     F(9,158) =    10.46             Prob > F = 0.0000
modified Bhargava et al. Durbin-Watson = .70578896
Baltagi-Wu LBI = 1.0218978
The random-effects model

In the random-effects model, the ν_i are assumed to be realizations of an iid process with zero mean and variance σ²_ν. Furthermore, the ν_i are assumed to be independent of both the ε_it and the covariates x_it. The latter of these assumptions can be very strong. However, inference is not conditional on the particular realizations of the ν_i in the sample. See Mundlak (1978) for a discussion of this point.

Example

By specifying the re option, one obtains the Baltagi-Wu GLS estimator of the random-effects model. This estimator can accommodate unbalanced panels and unequally spaced data. Running this model on the Grunfeld dataset produces the following results:
. xtregar invest mvalue kstock if year!=1943 & year!=1944, re lbi

Random-effects GLS regression                   Number of obs      =       180
Group variable (i) : company                    Number of groups   =        10
R-sq:  within  = 0.7718                         Obs per group: min =        18
       between = 0.8036                                        avg =      18.0
       overall = 0.7956                                        max =        18
Random effects u_i ~ Gaussian                   Wald chi2(3)       =    335.41
corr(u_i, Xb)      = 0 (assumed)                Prob > chi2        =    0.0000

      invest |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      mvalue |   .0948541   .0085443    11.10   0.000     .0781075    .1116007
      kstock |    .322599   .0271626    11.88   0.000     .2693613    .3758368
       _cons |  -44.82233   27.24889    -1.64   0.100    -98.22918    8.584515
      rho_ar |  .67483913   (estimated autocorrelation coefficient)
     sigma_u |  74.332091
     sigma_e |  43.199999
     rho_fov |  .74751539   (fraction of variance due to u_i)
       theta |  .65649837

modified Bhargava et al. Durbin-Watson = .70578896
Baltagi-Wu LBI = 1.0218978
Note that the modified Bhargava et al. Durbin-Watson and the Baltagi-Wu LBI are exactly the same as those reported for the fixed-effects model: the formulas for these statistics do not depend on which model is being estimated.
Saved Results

xtregar, re saves in e():

Scalars
    e(d1)         Bhargava et al. Durbin-Watson
    e(ds)         centered Baltagi-Wu LBI
    e(N)          number of observations
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(chi2)       chi-squared
    e(rho_fov)    u_i fraction of variance
    e(Tbar)       harmonic mean of group sizes
    e(Tcon)       1 if T is constant
    e(r2_w)       R-squared for within model
    e(LBI)        Baltagi-Wu LBI statistic
    e(N_LBI)      number of obs used in e(LBI)
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model
    e(rho_ar)     autocorrelation coefficient
    e(sigma_u)    panel-level standard deviation
    e(sigma_e)    standard deviation of ε_it
    e(thta_min)   minimum θ
    e(thta_5)     θ, 5th percentile
    e(thta_50)    θ, 50th percentile
    e(thta_95)    θ, 95th percentile
    e(thta_max)   maximum θ

Macros
    e(cmd)        xtregar
    e(depvar)     name of dependent variable
    e(model)      re
    e(rhotype)    method of estimating ρ_ar
    e(dw)         LBI, if requested
    e(ivar)       variable denoting groups
    e(chi2type)   Wald; type of model chi-squared test
    e(predict)    program used to implement predict
    e(tvar)       time variable

Matrices
    e(b)          coefficient vector
    e(V)          VCE for random-effects model

Functions
    e(sample)     marks estimation sample
xtregar, fe saves in e():

Scalars
    e(d1)         Bhargava et al. Durbin-Watson
    e(ds)         centered Baltagi-Wu LBI
    e(N)          number of observations
    e(N_g)        number of groups
    e(mss)        model sum of squares
    e(tss)        total sum of squares
    e(df_m)       model degrees of freedom
    e(rss)        residual sum of squares
    e(df_r)       residual degrees of freedom
    e(r2)         R-squared
    e(r2_a)       adjusted R-squared
    e(F)          F statistic
    e(rmse)       root mean square error
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(df_a)       degrees of freedom for absorbed effect
    e(LBI)        Baltagi-Wu LBI statistic
    e(N_LBI)      number of obs used in e(LBI)
    e(g_max)      largest group size
    e(g_min)      smallest group size
    e(g_avg)      average group size
    e(rho_fov)    u_i fraction of variance
    e(Tbar)       harmonic mean of group sizes
    e(Tcon)       1 if T is constant
    e(r2_w)       R-squared for within model
    e(r2_o)       R-squared for overall model
    e(r2_b)       R-squared for between model
    e(rho_ar)     autocorrelation coefficient
    e(corr)       corr(u_i, Xb)
    e(sigma_u)    panel-level standard deviation
    e(sigma_e)    standard deviation of ε_it
    e(F_f)        F for u_i=0

Macros
    e(cmd)        xtregar
    e(depvar)     name of dependent variable
    e(model)      fe
    e(rhotype)    method of estimating ρ_ar
    e(ivar)       variable denoting groups
    e(predict)    program used to implement predict
    e(tvar)       time variable
    e(dw)         LBI, if requested

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

Consider a linear panel-data model with an AR(1) disturbance term

    y_it = α + x_it β + ν_i + ε_it                                   (1)

where

    ε_it = ρ ε_{i,t-1} + η_it                                        (2)

|ρ| < 1 and η_it is independent and identically distributed (iid) with zero mean and variance σ²_η. The data can be unbalanced and unequally spaced. Specifically, the dataset contains observations on individual i at times t_ij for j = 1, ..., n_i.
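For intuition, data satisfying (1) and (2) can be simulated directly. A hedged NumPy sketch with arbitrary illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 200, 20                          # panels and periods (arbitrary)
a, b, rho, sigma_eta = 1.0, 0.5, 0.6, 1.0

nu = rng.normal(size=n)                 # nu_i, the panel effects
x = rng.normal(size=(n, T))
eps = np.zeros((n, T))
# draw eps_{i0} from the stationary distribution of the AR(1)
eps[:, 0] = rng.normal(size=n) * sigma_eta / np.sqrt(1 - rho ** 2)
for t in range(1, T):
    eps[:, t] = rho * eps[:, t - 1] + sigma_eta * rng.normal(size=n)
y = a + b * x + nu[:, None] + eps       # equation (1)
```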
Estimating ρ

The estimate of ρ is always obtained after removing the group means. Let ỹ_it = y_it - ȳ_i, let x̃_it = x_it - x̄_i, and let ε̃_it = ε_it - ε̄_i. Then, except for the onestep method, all the estimates of ρ are obtained by running Stata's prais on

    ỹ_it = x̃_it β + ε̃_it

See [R] prais for the formulas for each of the methods.

When onestep is specified, a regression is run on the above equation and the residuals are obtained. Let e_{it_ij} be the residual used to estimate the error ε̃_{it_ij}. If t_ij - t_{i,j-1} > 1, e_{it_ij} is set to zero. Given this series of residuals,

    ρ_onestep = (n/mc) e'e_{-1}/e'e

where n is the number of nonzero elements in e, mc is the number of consecutive pairs of nonzero residuals, and e_{-1} is the vector of lagged residuals.
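A sketch of ρ_onestep as described above, in NumPy rather than Stata; the function name and the mechanical treatment of gaps are my own reading of the definition:

```python
import numpy as np

def rho_onestep(e, t):
    # e: residuals for one panel; t: integer observation times.
    # Residuals that follow a gap of more than one period are zeroed,
    # then rho = (n/mc) e'e_{-1}/e'e with n = nonzero residuals and
    # mc = consecutive pairs of nonzero residuals.
    e = np.asarray(e, dtype=float).copy()
    t = np.asarray(t)
    gap = np.diff(t, prepend=t[0] - 1)
    e[gap > 1] = 0.0
    n = np.count_nonzero(e)
    mc = np.count_nonzero((e[:-1] != 0) & (e[1:] != 0))
    return (n / mc) * np.dot(e[:-1], e[1:]) / np.dot(e, e)
```

With no gaps, n/mc is essentially 1 and the estimator collapses to the tscorr ratio.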
Transforming the data to remove the AR(1) component

After estimating ρ, Baltagi and Wu (1999) derive a transformation of the data that removes the AR(1) component. Their C_i(ρ) transformation of y can be written as

    y*_{it_ij} = (1 - ρ²)^{1/2} y_{it_i1}                                  if j = 1

    y*_{it_ij} = (1 - ρ²)^{1/2} (1 - ρ^{2(t_ij - t_{i,j-1})})^{-1/2}
                 (y_{it_ij} - ρ^{t_ij - t_{i,j-1}} y_{it_{i,j-1}})          if j > 1

Using the analogous transform on the independent variables generates transformed data that are cleansed of the AR(1) process. Performing simple OLS on the transformed data leaves behind the residuals μ*.
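Assuming my reading of the transform above, here is a NumPy sketch for a single panel with integer observation times; with no gaps it reduces to the usual Prais quasi-differencing:

```python
import numpy as np

def ar1_transform(z, t, rho):
    # Quasi-difference one panel's series z, observed at integer times t,
    # so that AR(1) errors with parameter rho become white noise with the
    # innovation variance. Gaps between times must be >= 1.
    z = np.asarray(z, dtype=float)
    gaps = np.diff(np.asarray(t))
    out = np.empty_like(z)
    out[0] = np.sqrt(1 - rho ** 2) * z[0]
    out[1:] = (np.sqrt((1 - rho ** 2) / (1 - rho ** (2 * gaps)))
               * (z[1:] - rho ** gaps * z[:-1]))
    return out
```

Applied to an AR(1) error series itself, the output should be serially uncorrelated with the innovation variance, whether or not the times are equally spaced.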
The within estimator of the fixed-effects model

To obtain the within estimator, we need to transform the data that come out of the AR(1) transform. In order for the within transform to remove the fixed effects, the first observation of each panel must be dropped. Specifically, for j > 1 let

    ỹ*_{it_ij} = y*_{it_ij} - ȳ*_i + ȳ*
    x̃*_{it_ij} = x*_{it_ij} - x̄*_i + x̄*
    ε̃*_{it_ij} = ε*_{it_ij} - ε̄*_i + ε̄*

where ȳ*_i, x̄*_i, and ε̄*_i are the panel means of the transformed variables computed over j = 2, ..., n_i, and ȳ*, x̄*, and ε̄* are the corresponding overall means.

The within estimator of the fixed-effects model is then obtained by running OLS on

    ỹ*_it = α + x̃*_it β + ε̃*_it

Reported as R² within is the R² from the above regression.

Reported as R² between is {corr(x̄_i β̂, ȳ_i)}².

Reported as R² overall is {corr(x_it β̂, y_it)}².
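The three R² measures are squared correlations and are easy to sketch for a single regressor; function and variable names are illustrative, and ids is assumed to hold 0-based consecutive panel codes:

```python
import numpy as np

def r2_measures(y, x, beta, ids):
    # Within, between, and overall R^2 as squared correlations, for a
    # single regressor; ids must hold 0-based consecutive panel codes.
    uid = np.unique(ids)
    ybar = np.array([y[ids == i].mean() for i in uid])
    xbar = np.array([x[ids == i].mean() for i in uid])
    within = np.corrcoef((x - xbar[ids]) * beta, y - ybar[ids])[0, 1] ** 2
    between = np.corrcoef(xbar * beta, ybar)[0, 1] ** 2
    overall = np.corrcoef(x * beta, y)[0, 1] ** 2
    return within, between, overall
```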
The Baltagi-Wu GLS estimator

The residuals μ* can be used to estimate the variance components. Translating the matrix formulas given in Baltagi and Wu (1999) into summations yields variance-component estimators of σ²_η and σ²_ν, in which g_i is an n_i × 1 vector of weights constructed from ρ and the gaps t_ij - t_{i,j-1}, and μ*_i is the n_i × 1 vector of residuals from μ* that correspond to person i; see Baltagi and Wu (1999) for the explicit expressions.

With these estimates in hand, one can transform the data by quasi-demeaning the AR(1)-transformed data using the g_i weights, for z ∈ {y, x}. Running OLS on the transformed data y**, x** yields the feasible GLS estimator of α and β. The reported R² within, R² between, and R² overall are defined as for the within estimator above.
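In the balanced, equally spaced special case, this kind of GLS transform reduces to the familiar random-effects quasi-demeaning z** = z* - θ z̄*_i. A sketch of that special case only (this is the textbook random-effects transform, not Baltagi and Wu's general unbalanced formula):

```python
import numpy as np

def quasi_demean(z, ids, theta):
    # z** = z* - theta * (panel mean of z*); ids are 0-based panel codes.
    # theta = 0 leaves the data untouched (pooled OLS); theta = 1 is full
    # within demeaning, so GLS sits between the two extremes.
    means = np.array([z[ids == i].mean() for i in np.unique(ids)])
    return z - theta * means[ids]
```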
The test statistics

The Baltagi-Wu LBI is the sum of four terms,

    d* = d1 + d2 + d3 + d4

whose explicit expressions are given in Baltagi and Wu (1999).
I() is the indicator function that takes the value 1 if the condition is true and 0 otherwise. The z̃_{i,t_{i,j-1}} are residuals from the within estimator. Baltagi and Wu (1999) also show that d1 is the Bhargava et al. Durbin-Watson statistic modified to handle cases of unbalanced panels and unequally spaced data.
Acknowledgment

We would like to thank Badi Baltagi, Department of Economics, Texas A&M University, for helpful comments.
References

Baltagi, B. H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.

Baltagi, B. H. and Q. Li. 1991. A transformation that will circumvent the problem of autocorrelation in an error component model. Journal of Econometrics 48: 385-393.

Baltagi, B. H. and P. X. Wu. 1999. Unequally spaced panel data regressions with AR(1) disturbances. Econometric Theory 15: 814-823.

Bhargava, A., L. Franzini, and W. Narendranathan. 1982. Serial correlation and the fixed effects model. The Review of Economic Studies 49: 533-549.

Hsiao, C. 1986. Analysis of Panel Data. New York: Cambridge University Press.

Mundlak, Y. 1978. On the pooling of time series and cross section data. Econometrica 46: 69-85.
Also See

Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xtdata, [R] xtdes, [R] xtsum, [R] xttab

Related:        [R] xtgee, [R] xtintreg, [R] xtivreg, [R] xtreg, [R] xttobit

Background:     [U] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and post-estimation commands, [R] xt
Title

xtsum — Summarize xt data

Syntax

    xtsum [varlist] [if exp] [, i(varname) ]

by ... : may be used with xtsum; see [R] by.

Description

xtsum, a generalization of summarize, reports means and standard deviations for cross-sectional time-series (xt) data; it differs from summarize in that it decomposes the standard deviation into between and within components.

Options

i(varname) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.
Remarks

If you have not read [R] xt, please do so.

xtsum provides an alternative to summarize. For instance, in the nlswork dataset described in [R] xt, hours contains the number of hours worked last week:

. summarize hours

    Variable |     Obs        Mean   Std. Dev.       Min        Max
       hours |   28467    36.55956   9.869623          1        168

. xtsum hours

Variable         |      Mean   Std. Dev.        Min        Max |  Observations
hours    overall |  36.55956   9.869623          1        168  |  N =     28467
         between |              7.846585          1       83.5 |  n =      4710
         within  |              7.520712  -2.154726   130.0596 |  T-bar = 6.04395

xtsum provides the same information as summarize and more. It decomposes the variable x_it into a between (x̄_i) and within (x_it - x̄_i + x̄; the global mean x̄ being added back in to make results comparable) component. The overall and within are calculated over 28,467 person-years of data. The between is calculated over 4,710 persons. And, for your information, the average number of years a person was observed in the hours data is 6.
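The decomposition xtsum reports can be sketched as follows (illustrative NumPy, not Stata's implementation; ids is assumed to hold 0-based panel codes):

```python
import numpy as np

def xtsum_like(x, ids):
    # Overall, between, and within standard deviations in the spirit of
    # xtsum; ids are 0-based panel codes. The within series adds the
    # global mean back in, as described above.
    means = np.array([x[ids == i].mean() for i in np.unique(ids)])
    within = x - means[ids] + x.mean()
    return x.std(ddof=1), means.std(ddof=1), within.std(ddof=1)
```

A variable that never changes within a panel gets a within standard deviation of exactly zero under this construction.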
xtsum also reports minimums and maximums: Hours worked last week varied between 1 and (unbelievably) 168. Average hours worked last week for each woman varied between 1 and 83.5. "Hours worked within" varied between -2.15 and 130.1, which is not to say any woman actually worked negative hours. The within number refers to the deviation from each individual's average, and naturally, some of those deviations must be negative. In that case, it is not the negative value that is disturbing but the positive value. Did some woman really deviate from her average by +130.1 hours? No; in our definition of within, we add back in the global average of 36.6 hours. Some woman did deviate from her average by 130.1 - 36.6 = 93.5 hours, which is still quite large.

The reported standard deviations tell us something that may surprise you. They say that the variation in hours worked last week across women is very nearly equal to that observed within a woman over time. That is, if you were to draw two women randomly from our data, the difference in hours worked is expected to be nearly equal to the difference for the same woman in two randomly selected years.

If a variable does not vary over time, its within standard deviation will be zero:

. xtsum birth_yr

Variable          |      Mean   Std. Dev.        Min        Max |  Observations
birth_yr overall  |  48.08509   3.012837         41         54  |  N =     28534
         between  |             3.051795         41         54  |  n =      4711
         within   |                    0   48.08509   48.08509  |  T-bar = 6.05689

Also See

Related:       [R] xtdes, [R] xttab

Background:    [R] xt
Title

xttab — Tabulate xt data

Syntax

    xttab varname [if exp] [, i(varname_i) ]

    xttrans varname [if exp] [, i(varname_i) t(varname_t) freq ]

by ... : may be used with xttab and xttrans; see [R] by.

Description

xttab, a generalization of tabulate, performs one-way tabulations and decomposes counts into between and within components in cross-sectional time-series (xt) data.

xttrans, another generalization of tabulate, reports transition probabilities (the change in a single categorical variable over time).

Options

i(varname_i) specifies the variable name that contains the unit to which the observation belongs. You can specify the i() option the first time you estimate, or you can use the iis command to set i() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

t(varname_t) specifies the variable that contains the time at which the observation was made. You can specify the t() option the first time you estimate, or you can use the tis command to set t() beforehand. After that, Stata will remember the variable's identity. See [R] xt.

freq, allowed with xttrans only, specifies that frequencies as well as transition probabilities are to be displayed.
Remarks

If you have not read [R] xt, please do so.

Example

Using the nlswork dataset described in [R] xt, variable msp is 1 if a woman is married and her spouse resides with her and 0 otherwise:

. xttab msp

          |      Overall        |      Between        |  Within
      msp |    Freq.   Percent  |    Freq.   Percent  |  Percent
        0 |    11324     39.71  |     3113     66.08  |    55.06
        1 |    17194     60.29  |     3643     77.33  |    71.90
    Total |    28518    100.00  |     6756    143.41  |    64.14
                                  (n = 4711)
The overall part of the table summarizes results in terms of person-years. We have 11,324 person-years of data in which msp is 0 and 17,194 in which it is 1; in 60.3% of our data the woman is married with her spouse present. Between repeats the breakdown, but this time in terms of women rather than woman-years; 3,113 of our women ever had msp 0 and 3,643 ever had msp 1, for a grand total of 6,756 ever having either. We have in our data, however, only 4,711 women. This means that there are women who sometimes have msp 0 and at other times have msp 1.

The within percent tells us the fraction of the time a woman has the specified value of msp. Taking the first line, conditional on a woman ever having msp 0, 55.1% of her observations have msp 0. Similarly, conditional on a woman ever having msp 1, 71.9% of her observations have msp 1. These two numbers are a measure of the stability of the msp values and, in fact, msp 1 is more stable among these younger women than msp 0, meaning that they tend to marry more than they divorce.

The total within of 64.14 percent is the normalized between-weighted average of the within percents, to wit: (3113 × 55.06 + 3643 × 71.90)/6756. It is a measure of the overall stability of the msp variable.

A time-invariant variable will have a tabulation with within percents of 100:

. xttab race

          |      Overall        |      Between        |  Within
     race |    Freq.   Percent  |    Freq.   Percent  |  Percent
        1 |    20180     70.72  |     3329     70.66  |   100.00
        2 |     8051     28.22  |     1325     28.13  |   100.00
        3 |      303      1.06  |       57      1.21  |   100.00
    Total |    28534    100.00  |     4711    100.00  |   100.00
                                  (n = 4711)
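One plausible reading of the overall, between, and within percentages, sketched in NumPy for a single value of the variable (this is my reconstruction, not Stata's code):

```python
import numpy as np

def xttab_like(v, ids, value):
    # Overall, between, and within percentages for observations where
    # v == value; ids are panel identifiers.
    overall = 100.0 * np.mean(v == value)
    panels = np.unique(ids)
    ever = np.array([np.any(v[ids == p] == value) for p in panels])
    between = 100.0 * ever.mean()
    # within: among panels that ever take the value, the average share of
    # their observations equal to it
    shares = [np.mean(v[ids == p] == value) for p in panels[ever]]
    within = 100.0 * np.mean(shares)
    return overall, between, within
```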
Example

xttrans shows the transition probabilities. In cross-sectional time-series data, one can estimate the probability that x_{i,t+1} = v2 given that x_it = v1 by counting transitions. For instance,

. xttrans msp

  1 if      |  1 if married, spouse
  married,  |        present
  spouse    |
  present   |        0          1  |     Total
          0 |    80.49      19.51  |    100.00
          1 |     7.96      92.04  |    100.00
      Total |    37.11      62.89  |    100.00
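The transition counting described above can be sketched as follows (illustrative NumPy; assumes every initial value is observed in at least one transition):

```python
import numpy as np

def transitions(v, ids, t):
    # Row-stochastic transition matrix over the sorted values of v,
    # counting consecutive observations within each panel.
    vals = np.unique(v)
    idx = {val: i for i, val in enumerate(vals)}
    order = np.lexsort((t, ids))            # sort by panel, then by time
    v, ids = v[order], ids[order]
    counts = np.zeros((vals.size, vals.size))
    for a, b, same in zip(v[:-1], v[1:], ids[:-1] == ids[1:]):
        if same:                            # skip pairs that cross panels
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```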
100.00
The rows reflect the initial values and the columns reflect the final values. Each year, some 80% of the msp 0 persons in the data remained msp 0 in the next year; the remaining 20% became msp 1. While msp 0 had a 20% chance of becoming msp 1 in each year, the msp 1 had only an 8% chance of becoming (or returning to) msp 0. The f req option displays the frequencies that go into the calculation: