Stata Reference Manual: Release 7
Volume 2, H-P

Stata Press
College Station, Texas

Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Copyright (c) 1985-2001 by Stata Corporation
All rights reserved
Version 7.0
Typeset in TeX
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 1-881228-47-9 (volumes 1-4)
ISBN 1-881228-48-7 (volume 1)
ISBN 1-881228-49-5 (volume 2)
ISBN 1-881228-50-9 (volume 3)
ISBN 1-881228-51-7 (volume 4)
This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means--electronic, mechanical, photocopying, recording, or otherwise--without the prior written permission of Stata Corporation (StataCorp).

StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice.

The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright (c) 1979, 1993 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979, April 1993.

The Stata for Windows installation software was produced using Wise Installation System, which is Copyright (c) 1994-2000 Wise Solutions, Inc. Portions of the Macintosh installation software are Copyright (c) 1990-2000 Aladdin Systems, Inc., and Raymond Lau.

Stata is a registered trademark and NetCourse is a trademark of Stata Corporation. Alpha and DEC are trademarks of Compaq Computer Corporation. AT&T is a registered trademark of American Telephone and Telegraph Company. HP-UX and HP LaserJet are registered trademarks of Hewlett-Packard Company. IBM and OS/2 are registered trademarks and AIX, PowerPC, and RISC System/6000 are trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. Macintosh is a registered trademark and Power Macintosh is a trademark of Apple Computer, Inc. MS-DOS, Microsoft, and Windows are registered trademarks of Microsoft Corporation. Pentium is a trademark of Intel Corporation. PostScript and Display PostScript are registered trademarks of Adobe Systems, Inc. SPARC is a registered trademark of SPARC International, Inc. Stat/Transfer is a trademark of Circle Systems. Sun, SunOS, Sunview, Solaris, and NFS are trademarks or registered trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. UNIX and OPEN LOOK are registered trademarks and X Window System is a trademark of The Open Group Limited. WordPerfect is a registered trademark of Corel Corporation.
The suggested citation for this software is

StataCorp. 2001. Stata Statistical Software: Release 7.0. College Station, TX: Stata Corporation.
Title

hadimvo -- Identify multivariate outliers
Syntax

    hadimvo varlist [if exp] [in range], generate(newvar1 [newvar2]) [p(#)]
Description

hadimvo identifies multiple outliers in multivariate data using the method of Hadi (1992, 1994), creating newvar1 equal to 1 if an observation is an "outlier" and 0 otherwise. Optionally, newvar2 can also be created containing the distances from the basic subset.
Options

generate(newvar1 [newvar2]) is not optional; it identifies the new variable(s) to be created. Whether you specify two variables or one, however, is optional. newvar1--which is required--will contain 1 if the observation is an outlier in the Hadi sense and 0 otherwise. Specifying gen(odd) would call this variable odd. newvar2, if specified, will contain the distances (not the distances squared) from the basic subset. Specifying gen(odd dist) creates odd and also creates dist containing the Hadi distances.

p(#) specifies the significance level for the outlier cutoff; 0 < # < 1. The default is p(.05). Larger numbers identify a larger proportion of the sample as outliers. If # is specified greater than 1, it is interpreted as a percent. Thus, p(5) is the same as p(.05).
Remarks

Multivariate analysis techniques are commonly used to analyze data from many fields of study. The data often contain outliers. The search for subsets of the data which, if deleted, would change results markedly is known as the search for outliers. hadimvo provides one computer-intensive but practical method for identifying such observations.

Classical outlier detection methods (e.g., Mahalanobis distance and Wilks' test) are powerful when the data contain only one outlier, but the power of these methods decreases drastically when more than one outlying observation is present. The loss of power is usually due to what are known as masking and swamping problems (false negative and false positive decisions), but in addition, these methods often fail simply because they are affected by the very observations they are supposed to identify.

Solutions to these problems often involve an unreasonable amount of calculation and therefore computer time. (Solutions involving hundreds of millions of calculations for samples as small as 30 have been suggested.) The method developed by Hadi (1992, 1994) attempts to surmount these problems and produce an answer, albeit second best, in finite time.

A basic outline of the procedure is as follows: A measure of distance from an observation to a cluster of points is defined. A base cluster of r points is selected and then that cluster is continually redefined by taking the r + 1 points "closest" to the cluster as the new base cluster. This continues until some rule stops the redefinition of the cluster.
Ignoring many of the fine details, given k variables, the initial base cluster is defined as r = k + 1 points. The distance that is minimized in selecting these k + 1 points is a covariance-matrix distance on the variables with their medians removed. (We will use the language loosely; if we were being more precise, we would have said the distance is based on a matrix of second moments, but remember, the medians of the variables have been removed. We would also discuss how the k + 1 points must be of full column rank and how they would be expanded to include additional points if they are not.) Given the base cluster, a more standard mean-based center of the r-observation cluster is defined and the r + 1 observations closest in the covariance-matrix sense are chosen as a new base cluster. This is then repeated until the base cluster has r = int{(n + k + 1)/2} points. At that point, the method continues in much the same way, except a stopping rule based on the distance of the additional point and the user-specified p() is introduced. Simulation results are presented in Hadi (1994).
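In symbols, the distance being minimized is the usual Mahalanobis-type distance computed from the basic subset (a sketch in our notation, with Hadi's small-sample correction factors omitted): for a basic subset b with mean vector $\bar{X}_b$ and covariance matrix $S_b$,

$$D_i = \sqrt{(X_i - \bar{X}_b)'\, S_b^{-1}\, (X_i - \bar{X}_b)}$$

Observations are ranked by $D_i$ at each step, and the p() option enters through the cutoff against which the distances of candidate points are compared.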
Examples

    . hadimvo price weight, gen(odd)            /* id the outliers */
    . list make price weight if odd
    . summ price weight if ~odd                 /* summary stats for clean data */
    . drop odd
    . hadimvo price weight, gen(odd D)          /* make D */
    . gen id=_n                                 /* make an index variable */
    . graph D id                                /* index plot of D */
    . graph price weight [w=D]                  /* 2-way scatter, outliers big */
    . graph price weight [w=1/D]                /* same, outliers small */
    . summarize D, detail
    . sort D
    . list make D odd
    . hadimvo price weight mpg, gen(odd2 D2) p(.01)
    . regress price weight ... if ~odd2
Identifying outliers

You have a theory about x1, x2, ..., xk, which we will write as F(x1, x2, ..., xk). Your theory might be that x1, x2, ..., xk are jointly distributed normally, perhaps with a particular mean and covariance matrix; or your theory might be that

$$x_1 = \beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \qquad u \sim N(0, \sigma^2)$$

or your theory might be

$$x_1 = \beta_{10} + \beta_{12} x_2 + \beta_{14} x_4 + u_1$$
$$x_2 = \beta_{20} + \beta_{21} x_1 + \beta_{23} x_3 + u_2$$

or your theory might be anything else; it does not matter. You have some data on x1, x2, ..., xk, which you will assume are generated by F(.), and from that data you plan to estimate the parameters (if any) of your theory and then test your theory in the sense of how well it explains the observed data.
What if, however, some of your data are generated not by F(.) but by G(.), a different process? For example, you have a theory on how wages are assigned to employees in a firm and, for the bulk of employees, that theory is correct. There are, however, six employees at the top of the hierarchy for whom wages are set by a completely different process. Or, you have a theory on how individuals select different health insurance options except that, for a handful of individuals already diagnosed with serious illness, a different process controls the selection. Or, you are testing a drug that reduces trauma after surgery except that, for a few patients with a high level of a particular protein, the drug has no effect. Or, in another drug experiment, some of the historical data are simply misrecorded.

The data values generated by G(.) rather than F(.) are called contaminant observations. Of course, the analysis should be based only on the observations generated by F(.), but in practice we do not know which observations those are. In addition, if it happened by chance that some of the contaminants are within a reasonable distance from the center of F(.), it becomes impossible to determine whether they are contaminants. Accordingly, we adopt the following operational definition: Outliers are observations that do not conform to the pattern suggested by the majority of the observations in a dataset. Thus, observations generated by F(.) but located at the tail of F(.) are considered outliers. On the other hand, contaminants that are within a statistically reasonable distance from the center of F(.) are not considered outliers.

It is well worth noting that outliership is strongly related to the completeness of the theory--a grand unified theory would have no outliers because it would explain all processes (including, one supposes, errors in recording the data). Grand unified theories, however, are difficult to come by and are most often developed by synthesizing the results of many special theories.

Theoretical work has tended to focus on one special case: data containing only one outlier. As mentioned above, the single-outlier techniques often fail to identify multiple outliers, even if applied recursively. One of the classic examples is the star cluster data (a.k.a. the Hertzsprung-Russell diagram) shown in the figure below (Rousseeuw and Leroy 1987, 27). For 47 stars, the data contain the (log) light intensity and the (log) effective temperature at the star's surface. (For the sake of illustration, we treat the data here as bivariate data--not as regression data--i.e., the two variables are treated similarly with no distinction between which variable is dependent and which is independent.)

This graph presents a scatter of the data along with two ellipses expected to contain 95% of the data. The larger ellipse is based on the mean and covariance matrix of the full data. All 47 stars are inside the larger ellipse, indicating that classical single-case analysis fails to identify any outliers. The smaller ellipse is based on the mean and covariance matrix of the data without the five stars identified by hadimvo as outliers. These observations are located outside the smaller ellipse. The dramatic effects of the outliers can be seen by comparing the two ellipses. The volume of the larger ellipse is much greater than that of the smaller one, and the two ellipses have completely different orientations. In fact, their major axes are nearly orthogonal to each other; the larger ellipse indicates a negative correlation (r = -0.2) whereas the smaller ellipse indicates a positive correlation (r = 0.7). (Theory would suggest a positive correlation: hot things glow.)
(Figure omitted: scatter of log light intensity versus log effective temperature for the 47 stars, with the two 95% ellipses described above.)
The single-outlier techniques make calculations for each observation under the assumption that it is the only outlier and the remaining n - 1 observations are generated by F(.), producing a statistic for each of the n observations. Thinking about multiple outliers is no more difficult. In the case of two outliers, consider all possible pairs of observations (there are n(n-1)/2 of them) and, for each pair, make a calculation assuming the remaining n - 2 observations are generated by F(.). For the three-outlier case, consider all possible triples of observations (there are n(n-1)(n-2)/(3 x 2) of them) and, for each triple, make a calculation assuming the remaining n - 3 observations are generated by F(.). Conceptually, this is easy; practically, it is difficult because of the rapidly increasing number of calculations required (there are also theoretical problems in determining how many outliers to test simultaneously). Techniques designed for detecting multiple outliers, therefore, make various simplifying assumptions to reduce the calculation burden and, along the way, lose some of the theoretical foundation. This loss, however, is no reason for ignoring the problem and the (admittedly second best) solutions available today. It is unreasonable to assume that outliers do not occur in real data. If outliers exist in the data, they can distort parameter estimation, invalidate test statistics, and lead to incorrect statistical inference. The search for outliers is not merely to improve the estimates of the current model but also to provide valuable insight into the shortcomings of the current model. In addition, outliers themselves can sometimes provide valuable clues as to where more effort should be expended. In a drug experiment, for example, the patients excluded as outliers might well be further researched to understand why they do not fit the theory.
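To make the growth in the calculation burden concrete, the number of cases to examine is a binomial coefficient:

$$\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{n}{3} = \frac{n(n-1)(n-2)}{6}$$

so, for example, n = 100 observations yield 4,950 pairs, 161,700 triples, and 3,921,225 quadruples.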
Multivariate, multiple outliers

hadimvo is an example of a multivariate, multiple outlier technique. The multivariate aspect deserves some attention. In the single-equation regression techniques for identifying outliers, such as residuals and leverage, an important distinction is drawn between the dependent and independent variables--the y and the x's in y = xb + u. The notion that y is a linear function of x can be exploited and, moreover, the fact that some point (y_i, x_i) is "far" from the bulk of the other points has different meanings if that "farness" is due to y_i or x_i. A point that is far due to x_i but, despite that, still close in the y_i-given-x_i metric adds precision to the measurements of the coefficients and may not indicate a problem at all. In fact, if we have the luxury of designing the experiment, which means choosing the values of x a priori, we attempt to maximize the distance between the x's (within the bounds of x we believe are covered by our linear model) to maximize that precision. In that extreme case, the distance of x_i carries no information as we set it prior to running the experiment. More recently, Hadi and Simonoff (1993) exploit the structure of the linear model and suggest two methods for identifying multiple outliers when the model is fitted to the data (also see [R] regression diagnostics).

In the multivariate case, we do not know the structure of the model, so (y_i, x_i) is just a point and the y is treated no differently from any of the x's--a fact which we emphasize by writing the point as (x_1i, x_2i) or simply (X_i). The technique does assume, however, that the X's are multivariate normal or at least elliptically symmetric. This leads to a problem if some of the X's are functionally related to the other X's, such as the inclusion of x and x^2, interactions such as x_1 x_2, or even dummy variables for multiple categories (in which one of the dummies being 1 means the other dummies must be 0). There is no good solution to this problem. One idea, however, is to perform the analysis with and without the functionally related variables and to subject all observations identified to further study (see What to do with outliers below).
An implication of hadimvo being a multivariate technique is that it would be inappropriate to apply it to (y, x) when x is the result of experimental design. The technique would know nothing of our design of x and would inappropriately treat "distance" in the x-metric the same as distance in the y-metric. Even when x is multivariate normal, unless y and x are treated similarly, it may still be inappropriate to apply hadimvo to (y, x) because of the different roles that y and x play in regression. However, one may apply hadimvo to x alone to identify outliers which, in this case, are called leverage points. (We should also mention here that if hadimvo is applied to x when it contains constants or any collinear variables, those variables will be correctly ignored, allowing the analysis to continue.)

It is also inappropriate to use hadimvo (and other outlier detection techniques) when the sample size is too small. hadimvo uses a small-sample correction factor to adjust the covariance matrix of the "clean" subset. Because the quantity n - (3k + 1) appears in the denominator of the correction factor, the sample size must be larger than 3k + 1. Some authors would require the sample size to be at least 5k, i.e., at least five observations per variable.

With these warnings, it is difficult to misapply this tool assuming that you do not take the results as more than suggestive. hadimvo has a p() option that is a "significance level" for the outliers that are chosen. We quote the term significance level because, although great effort has been expended to really make it a significance level, approximations are involved and it will not have that interpretation in all cases. It can be thought of as an index between 0 and 1, with increasing values resulting in the labeling of more observations as outliers, and with the suggestion that you select a number much as you would a significance level--it is roughly the probability of identifying any given point as an outlier if the data truly were multivariate normal. Nevertheless, the terms significance level or critical values should be taken with a grain of salt. It is suggested that one examine a graphical display (e.g., an index plot) of the distances, perhaps for different values of p(). The graphs give more information than a simple yes/no answer. For example, a graph may indicate that some of the observations (inliers or outliers) are only marginally so.
What to do with outliers

After a reading of the literature on outlier detection, many people are left with the incorrect impression that once outliers are identified, they should be deleted from the data and analysis should be continued. Automatic deletion (or even automatic down-weighting) of outliers is not always correct because outliers are not necessarily bad observations. On the contrary, if they are correct, they may be the most informative points in the data. For example, they may indicate that the data do not come from a normally distributed population, as is commonly assumed by almost all multivariate techniques.
The proper use of this tool is to label the outliers and then subject them to further study, not simply to discard them and continue the analysis with the rest of the data. After further study, it may indeed turn out to be reasonable to discard the outliers, but some mention of them must certainly be made in the presentation of the final results. Other corrective actions may include correction of errors in the data, deletion or down-weighting of outliers, redesigning the experiment or sample survey, collecting more data, etc.
Saved Results

hadimvo saves in r():

Scalars
    r(N)    number of outliers
Methods and Formulas

hadimvo is implemented as an ado-file. Formulas are given in Hadi (1992, 1994).
Acknowledgment

We would like to thank Ali S. Hadi of Cornell University for his assistance in writing hadimvo.
References

Gould, W. W. and A. S. Hadi. 1993. smv6: Identifying multivariate outliers. Stata Technical Bulletin 11: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 163-168.

Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54: 761-771.

------. 1994. A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B 56: 393-396.

Hadi, A. S. and J. S. Simonoff. 1993. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88: 1264-1272.

Rousseeuw, P. J. and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: John Wiley & Sons.
Also See

Related:  [R] mvreg, [R] regression diagnostics, [R] sureg
Title

hausman -- Hausman specification test
Syntax

    hausman, save

    hausman [, {more | less} constant alleqs skipeqs(eqlist) sigmamore
        prior(string) current(string) equations(matchlist) ]

    hausman, clear

where matchlist in equations() is

    #:# [, #:# [, ...]]

For instance, equations(1:1), equations(1:1, 2:2), or equations(1:2).
Description

hausman performs Hausman's (1978) specification test.
Options

save requests that Stata save the current estimation results. hausman will later compare these results with the estimation results from another model. A model must be saved in this fashion before a test against other models can be performed.

more specifies that the most recently estimated model is the more efficient estimate. This is the default.

less specifies that the most recently estimated model is the less efficient estimate.

constant specifies that the estimated intercept(s) are to be included in the model comparison; by default, they are excluded. The default behavior is appropriate for models where the constant does not have a common interpretation across the two models.

alleqs specifies that all the equations in the model be used to perform the Hausman test; by default, only the first equation is used.

skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation numbers are not allowed in this context, as it is the equation names, along with the variable names, that are used to identify common coefficients.

sigmamore allows you to specify that the two covariance matrices used in the test be based on a common estimate of disturbance variance (sigma^2)--the variance from the fully efficient estimator. This option provides a proper estimate of the contrast variance for so-called tests of exogeneity and over-identification in instrumental variables regression; see Baltagi (1998, 291). Note that this option can only be specified when both estimators save e(sigma) or e(rmse).

prior(string) and current(string) are formatting options that allow you to specify alternate wording for the "Prior" and "Current" default labels used to identify the columns of coefficients.
equations(matchlist) specifies, by number, the pairs of equations that are to be compared. If equations() is not specified, then equations are matched on equation names. equations() handles the situation where one estimator uses equation names and the other does not. For instance, equations(1:2) means equation 1 is to be tested against equation 2. equations(1:1, 2:2) means equation 1 is to be tested against equation 1 and equation 2 is to be tested against equation 2. If equations() is specified, options alleqs and skipeqs are ignored.

clear discards the previously saved estimation results and frees some memory; it is not necessary to specify hausman, clear before specifying hausman, save.
Remarks

hausman is a general implementation of Hausman's (1978) specification test that compares an estimator that is known to be consistent with an estimator that is efficient under the assumption being tested. The null hypothesis is that the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and a comparison estimator that is known to be consistent for the true parameters. If the two models display a systematic difference in the estimated coefficients, then we have reason to doubt the assumptions on which the efficient estimator is based.

To use hausman, you

    . (estimate the less efficient model)
    . hausman, save
    . (estimate the fully efficient model)
    . hausman

Alternatively, you can turn this around:

    . (estimate the fully efficient model)
    . hausman, save
    . (estimate the less efficient model)
    . hausman, less
> Example

We are studying the factors that affect the wages of young women in the United States between 1970 and 1988 and have a panel-data sample of individual women over that time span.
    . describe

    Contains data from nlswork.dta
      obs:        28,534                      National Longitudinal Survey.
                                                Young Women 14-26 years of age
                                                in 1968
     vars:             6                      1 Aug 2000 09:48
     size:       485,078 (88.4% of memory free)

                  storage  display    value
    variable name   type   format     label      variable label
    idcode          int    %8.0g                 NLS id
    year            byte   %8.0g                 interview year
    age             byte   %8.0g                 age in current year
    msp             byte   %8.0g                 1 if married, spouse present
    ttl_exp         float  %9.0g                 total work experience
    ln_wage         float  %9.0g                 ln(wage/GNP deflator)

    Sorted by:  idcode  year
    Note:  dataset has changed since last saved
We believe that a random-effects specification is appropriate for individual-level effects in our model. We estimate a fixed-effects model that will capture all temporally constant individual-level effects.

    . xtreg ln_wage age msp ttl_exp, fe

    Fixed-effects (within) regression            Number of obs      =     28494
    Group variable (i) : idcode                  Number of groups   =      4710

    R-sq:  within  = 0.1373                      Obs per group: min =         1
           between = 0.2571                                     avg =       6.0
           overall = 0.1800                                     max =        15

                                                 F(3,23781)         =   1262.01
    corr(u_i, Xb)  = 0.1476                      Prob > F           =    0.0000

         ln_wage |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
             age |  -.005485    .000837    -6.55   0.000   -.0071256   -.0038443
             msp |  .0033427   .0054868     0.61   0.542   -.0074118    .0140971
         ttl_exp |  .0383604   .0012416    30.90   0.000    .0359268    .0407941
           _cons |  1.593953   .0177538    89.78   0.000    1.559154    1.628752
    -------------+------------------------------------------------------------
         sigma_u |  .37674223
         sigma_e |  .29751014
             rho |  .61591044   (fraction of variance due to u_i)
    -------------+------------------------------------------------------------
    F test that all u_i=0:    F(4709,23781) =     7.76     Prob > F = 0.0000
We assume that this model is consistent for the true parameters and save the results by typing

    . hausman, save
Now, we estimate a random-effects model as a fully efficient specification of the individual effects under the assumption that they follow a random-normal distribution. These estimates are then compared to the previously saved results using the hausman command.
    . xtreg ln_wage age msp ttl_exp, re

    Random-effects GLS regression                Number of obs      =     28494
    Group variable (i) : idcode                  Number of groups   =      4710

    R-sq:  within  = 0.1373                      Obs per group: min =         1
           between = 0.2552                                     avg =       6.0
           overall = 0.1797                                     max =        15

    Random effects u_i ~ Gaussian                Wald chi2(3)       =   5100.33
    corr(u_i, X)       = 0 (assumed)             Prob > chi2        =    0.0000

         ln_wage |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
             age | -.0069749   .0006882   -10.13   0.000   -.0083238   -.0056259
             msp |  .0046594   .0051012     0.91   0.361   -.0053387    .0146575
         ttl_exp |  .0429635   .0010169    42.25   0.000    .0409704    .0449567
           _cons |  1.609916   .0159176   101.14   0.000    1.578718    1.641114
    -------------+------------------------------------------------------------
         sigma_u |  .32648519
         sigma_e |  .29751014
             rho |  .54633481   (fraction of variance due to u_i)
    . hausman

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |    Prior       Current     Difference        S.E.
    -------------+----------------------------------------------------------
             age |   -.005485    -.0069749     .0014899       .0004764
             msp |   .0033427     .0046594    -.0013167       .0020206
         ttl_exp |   .0383604     .0429635    -.0046031       .0007124

               b = less efficient estimates obtained previously from xtreg
               B = fully efficient estimates obtained from xtreg

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =   275.44
                    Prob>chi2 =    0.0000
Using the current specification, our initial hypothesis that the individual-level effects are adequately modeled by a random-effects model is resoundingly rejected. We realize, of course, that this result is based on the rest of our model specification and that it is entirely possible that random effects might be appropriate for some alternate model of wages.

hausman is a generic implementation of the Hausman test and assumes that the user knows exactly what they want tested. The test between random and fixed effects is so common that Stata provides a special command for use after xtreg. We could have obtained the above test in a slightly different format by typing

    . xthausman
    Hausman specification test

                     ---- Coefficients ----
                 |    Fixed        Random
         ln_wage |   Effects       Effects     Difference
    -------------+-------------------------------------------
             age |   -.005485    -.0069749      .0014899
             msp |   .0033427     .0046594     -.0013167
         ttl_exp |   .0383604     .0429635     -.0046031

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[S^(-1)](b-B),  S = (S_fe - S_re)
                             =   275.44
                    Prob>chi2 =    0.0000
> Example

A stringent assumption of multinomial and conditional logit models is that outcome categories for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this assumption requires that the inclusion or exclusion of categories does not affect the relative risks associated with the regressors in the remaining categories.

One classic example of a situation where this assumption would be violated involves the choice of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and drives the Chevrolet to work. Clearly, "drives the Ford" is a closer substitute to "drives the Chevrolet" than it is to "rides a train" (at least for most people). This means that excluding "drives the Ford" from the model could be expected to affect the relative risks of the remaining options, and the model would not obey the IIA assumption.

Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. Choice of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and gender. The indemnity category is allowed to be the base category, and the model including all three outcomes is estimated.
    . mlogit insure age male

    Iteration 0:   log likelihood = -555.85446
    Iteration 1:   log likelihood = -551.32913
    Iteration 2:   log likelihood = -551.32802

    Multinomial regression                        Number of obs   =        615
                                                  LR chi2(4)      =       9.05
                                                  Prob > chi2     =     0.0598
    Log likelihood = -551.32802                   Pseudo R2       =     0.0081

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Prepaid      |
             age | -.0100251   .0060181    -1.67   0.096   -.0218204    .0017702
            male |  .5095747   .1977893     2.58   0.010    .1219148    .8972345
           _cons |  .2633838   .2787574     0.94   0.345   -.2829708    .8097383
    -------------+------------------------------------------------------------
    Uninsure     |
             age | -.0051925   .0113821    -0.46   0.648    -.027501     .017116
            male |  .4748547   .3618446     1.31   0.189   -.2343477    1.184057
           _cons | -1.756843   .5309591    -3.31   0.001   -2.797504   -.7161824
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)

    . hausman, save
Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden 1984.) We re-estimate the model, excluding the uninsured outcome, and perform a Hausman test against the fully efficient full model.

    . mlogit insure age male if insure ~= "Uninsure":insure
    Iteration 0:   log likelihood =  -394.8693
    Iteration 1:   log likelihood =  -390.4871
    Iteration 2:   log likelihood = -390.48643

    Multinomial regression                        Number of obs   =        570
                                                  LR chi2(2)      =       8.77
                                                  Prob > chi2     =     0.0125
    Log likelihood = -390.48643                   Pseudo R2       =     0.0111

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Prepaid      |
             age | -.0101521   .0060049    -1.69   0.091   -.0219214    .0016173
            male |  .5144003   .1981735     2.60   0.009    .1259875    .9028132
           _cons |  .2678043   .2775562     0.96   0.335   -.2761959    .8118046
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)

    . hausman, alleqs less constant

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |   Current       Prior      Difference        S.E.
    -------------+----------------------------------------------------------
             age |  -.0101521    -.0100251    -.0001269            .
            male |   .5144003     .5095747     .0048256      .012334
           _cons |   .2678043     .2633838     .0044205            .

               b = less efficient estimates obtained from mlogit
               B = fully efficient estimates obtained previously from mlogit

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =     0.08
                    Prob>chi2 =    0.9944
First, note that the somewhat subtle syntax of the if condition on the mlogit command was simply used to identify the "Uninsured" category using the insure value label; see [U] 15.6.3 Value labels.

Second, on examining the output from hausman, we see that there is no evidence that the IIA assumption has been violated. Since the Hausman test is a standardized comparison of model coefficients, using it with mlogit requires that the base category be the same in both competing models. In particular, if the most frequent category (the default base category) is being removed to test for IIA, then you must use the basecategory() option in mlogit to manually set the base category to something else.
The missing values for the square root of the diagonal of the covariance matrix of the differences are not comforting, but they are also not surprising. This covariance matrix is guaranteed to be positive definite only asymptotically, and assurances are not made about the diagonal elements. Negative values along the diagonal are possible, and the fourth column of the table is provided mainly for descriptive use.

We can also perform the Hausman IIA test against the remaining alternative in the model.

    . mlogit insure age male if insure ~= "Prepaid":insure
    Iteration 0:   log likelihood = -132.59915
    Iteration 1:   log likelihood = -131.78009
    Iteration 2:   log likelihood = -131.76808
    Iteration 3:   log likelihood = -131.76807

    Multinomial regression                        Number of obs   =        338
                                                  LR chi2(2)      =       1.66
                                                  Prob > chi2     =     0.4356
    Log likelihood = -131.76807                   Pseudo R2       =     0.0063

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Uninsure     |
             age | -.0041055   .0115807    -0.35   0.723   -.0268033    .0185923
            male |  .4591072   .3595663     1.28   0.202   -.2456298    1.163844
           _cons | -1.801774   .5474476    -3.29   0.001   -2.874752   -.7287968
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)
    . hausman, alleqs less constant

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |   Current       Prior      Difference        S.E.
    -------------+----------------------------------------------------------
             age |  -.0041055    -.0051925      .001087      .0021357
            male |   .4591072     .4748547    -.0157475      .1333464
           _cons |  -1.801774    -1.756843    -.0449311            .

               b = less efficient estimates obtained from mlogit
               B = fully efficient estimates obtained previously from mlogit

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =    -0.18    chi2 < 0 ==> model estimated on these
                                           data fails to meet the asymptotic
                                           assumptions of the Hausman test
In this case, the chi-squared statistic is actually negative. We might interpret this as strong evidence that we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test, particularly when the sample is relatively small--there are only 45 uninsured individuals in this dataset.
Are we surprised by the results of the Hausman test in this example? Not really. Judging from the z statistics on the original multinomial logit model, we were struggling to identify any structure in the data with the current specification. Even when we were willing to make the assumption of IIA and estimate the most efficient model under this assumption, few of the effects could be identified as statistically different from those on the base category. Trying to base a Hausman test on a contrast (difference) between two poor estimates is just asking too much of the existing data.

For an example applying the Hausman test to the endogeneity of variables in a simultaneous system, see [R] ivreg.
Saved Results

hausman saves in r():

Scalars
    r(chi2)    chi-squared statistic
    r(df)      degrees of freedom for the statistic
    r(p)       p-value for the chi-squared
Acknowledgment

Portions of hausman are based on an early implementation by Jeroen Weesie, Utrecht University, Netherlands.
Methods and Formulas

hausman is implemented as an ado-file.

The Hausman statistic is distributed as chi-squared and is computed as

$$H = (\beta_c - \beta_e)'(V_c - V_e)^{-1}(\beta_c - \beta_e)$$

where

    beta_c is the coefficient vector from the consistent estimator
    beta_e is the coefficient vector from the efficient estimator
    V_c is the covariance matrix of the coefficients from the consistent estimator
    V_e is the covariance matrix of the coefficients from the efficient estimator
In cases where the difference in the variance matrices is not positive definite, a Moore-Penrose generalized inverse is used. As noted in Gourieroux and Monfort (1989, 125-128), the choice of generalized inverse is not important asymptotically. The degrees of freedom for the statistic are the rank of the difference in the variance matrices. When the difference is positive definite, this is the number of common coefficients in the models being compared.
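To make the formula concrete, the statistic can be reproduced by hand from saved estimation results (a minimal sketch based on the earlier xtreg example; it drops _cons, as hausman does by default, and assumes the coefficients are stored in the order age, msp, ttl_exp, _cons):

    . quietly xtreg ln_wage age msp ttl_exp, fe
    . matrix bc = e(b)                          /* consistent (fixed-effects) estimates */
    . matrix Vc = e(V)
    . quietly xtreg ln_wage age msp ttl_exp, re
    . matrix be = e(b)                          /* efficient (random-effects) estimates */
    . matrix Ve = e(V)
    . matrix d = bc[1,1..3] - be[1,1..3]        /* contrast, constant dropped */
    . matrix V = Vc[1..3,1..3] - Ve[1..3,1..3]  /* difference of covariance matrices */
    . matrix H = d*syminv(V)*d'                 /* 1x1 matrix holding the statistic */
    . matrix list H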
References

Baltagi, B. H. 1998. Econometrics. New York: Springer-Verlag.

Gourieroux, C. and A. Monfort. 1989. Statistics and Econometric Models, Vol. 2. New York: Springer-Verlag.

Hausman, J. 1978. Specification tests in econometrics. Econometrica 46: 1251-1271.

Hausman, J. and D. McFadden. 1984. Specification tests in econometrics. Econometrica 52: 1219-1240.

McFadden, D. 1974. Measurement of urban travel demand. Journal of Public Economics 3: 303-328.
Also See

Related:  [R] lrtest, [R] test, [R] xtreg, [R] xtregar
Title

heckman -- Heckman selection model
Syntax

Basic syntax

    heckman depvar [varlist], select(varlist_s) [ twostep ]

or

    heckman depvar [varlist], select(depvar_s = varlist_s) [ twostep ]

Full syntax for maximum likelihood estimates only

    heckman depvar [varlist] [weight] [if exp] [in range],
        select([depvar_s =] varlist_s [, offset(varname) noconstant])
        [ robust cluster(varname) score(newvarlist) nshazard(newvarname)
          mills(newvarname) offset(varname) noconstant constraints(numlist)
          first noskip level(#) iterate(0) nolog maximize_options ]

Full syntax for Heckman's two-step consistent estimates only

    heckman depvar [varlist] [if exp] [in range], twostep
        select([depvar_s =] varlist_s [, noconstant])
        [ noconstant rhosigma | rhotrunc | rholimited | rhoforce
          nshazard(newvarname) mills(newvarname) first level(#) ]

by ... : may be used with heckman; see [R] by.

pweights, aweights, fweights, and iweights are allowed with maximum likelihood estimation; see [U] 14.1.6 weight. No weights are allowed if twostep is specified.

heckman shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    xb                  x_j b, fitted values (the default)
    ycond               E(y_j | y_j observed)
    yexpected           E(y_j*), y_j taken to be 0 where unobserved
    nshazard or mills   nonselection hazard (also called inverse of Mills' ratio)
    psel                Pr(y_j observed)
    xbsel               linear prediction for selection equation
    stdpsel             standard error of the linear prediction for selection equation
    pr(a,b)             Pr(a < y_j < b)
    e(a,b)              E(y_j | a < y_j < b)
    ystar(a,b)          E(y_j*), y_j* = max{a, min(y_j, b)}
    stdp                standard error of the prediction
    stdf                standard error of the forecast

where a and b may be numbers or variables; a equal to "." means minus infinity; b equal to "." means plus infinity.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

heckman estimates regression models with selection using either Heckman's two-step consistent estimator or full maximum likelihood.
Options

select(...) specifies the variables and options for the selection equation. It is an integral part of specifying a Heckman model and is not optional.

twostep specifies that Heckman's (1979) two-step efficient estimates of the parameters, standard errors, and covariance matrix are to be produced.
robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.11 Obtaining robust variance estimates.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily independent within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() implies robust; that is, specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and ancillary parameter in the model.
The first new variable specified will contain $u_{1j} = \partial \ln L_j / \partial(x_j\beta)$ for each observation j in the sample, where $\ln L_j$ is the jth observation's contribution to the log likelihood.

The second new variable: $u_{2j} = \partial \ln L_j / \partial(z_j\gamma)$
The third: $u_{3j} = \partial \ln L_j / \partial(\operatorname{atanh}\rho)$
The fourth: $u_{4j} = \partial \ln L_j / \partial(\ln\sigma)$

If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on.

The jth observation's contribution to the score vector is

$$\big(\,\partial \ln L_j/\partial\beta,\ \partial \ln L_j/\partial\gamma,\ \partial \ln L_j/\partial(\operatorname{atanh}\rho),\ \partial \ln L_j/\partial(\ln\sigma)\,\big) = \big(\,u_{1j}x_j,\ u_{2j}z_j,\ u_{3j},\ u_{4j}\,\big)$$

The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores.
nshazard(newvarname) and mills(newvarname) are synonyms; either will create a new variable containing the nonselection hazard--what Heckman (1979) referred to as the inverse of the Mills' ratio--from the selection equation. The nonselection hazard is computed from the estimated parameters of the selection equation.

offset(varname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.

noconstant omits the constant term from the equations. This option may be specified on the regression equation, the selection equation, or both.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. constraints(numlist) may not be specified with twostep.
first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.
level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

iterate(0) produces Heckman's (1979) two-step parameter estimates with standard errors computed from the inverse Hessian of the full information matrix at the two-step solution for the parameters. As an alternative, the twostep option computes Heckman's two-step consistent estimates of the standard errors. iterate(#) can also be used to restrict the maximum number of iterations during optimization; see [R] maximize.

rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options that specify how the two-step estimator, option twostep, handles unusual cases where the two-step estimate of rho is outside the admissible range for a correlation, [-1, 1]. When abs(rho) > 1, the two-step estimate of the coefficient variance-covariance matrix may not be positive definite and thus may be unusable for testing. The default is rhosigma.

rhotrunc specifies that rho be truncated to lie in the range [-1, 1]. If the two-step estimate is less than -1, rho is set to -1; if the two-step estimate is greater than 1, rho is set to 1. This truncated value of rho is used in all computations to estimate the two-step covariance matrix.
rhosigma specifies that rho be truncated, as with option rhotrunc, and that the estimate of sigma be made consistent with the truncated estimate of rho; so sigma-hat = beta_m divided by the truncated rho (see Methods and Formulas for the definition of beta_m). Both the truncated rho and the new estimate of sigma are used in all computations to estimate the two-step covariance matrix.
rholimited specifies that rho be truncated only in the computation of the diagonal matrix D as it enters Vtwostep and Q; see Methods and Formulas. In all other computations, the untruncated estimate of rho is used.

rhoforce specifies that the two-step estimate of rho be retained even if it is outside the admissible range for a correlation. This may, in rare cases, lead to a nonpositive-definite covariance matrix.

These options have no effect when estimation is by maximum likelihood, the default. They also have no effect when the two-step estimate of rho is in the range [-1, 1].

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You will likely never need to specify any of these options except for iterate(0) and possibly difficult. If the iteration log shows many "not concave" messages and it is taking many iterations to converge, you may want to try using the difficult option and see if that helps it converge in fewer steps.
Options for predict

xb, the default, calculates the linear prediction x_j b.

ycond calculates the expected value of the dependent variable conditional on the dependent variable being observed, i.e., selected; E(y_j | y_j observed).

yexpected calculates the expected value of the dependent variable (y_j*), where that value is taken to be 0 when it is expected to be unobserved; y_j* = Pr(y_j observed) E(y_j | y_j observed). The assumption of 0 is valid for many cases where nonselection implies nonparticipation (e.g., unobserved wage levels, insurance claims from those who are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved disease incidence).

nshazard and mills are synonyms; either calculates the nonselection hazard--what Heckman (1979) referred to as the inverse of the Mills' ratio--from the selection equation.

psel calculates the probability of selection (or being observed): Pr(y_j observed) = Pr(z_j gamma + u_2j > 0).

xbsel calculates the linear prediction for the selection equation.

stdpsel calculates the standard error of the linear prediction for the selection equation.

pr(a,b) calculates Pr(a < x_j b + u_1 < b), the probability that y_j|x_j would be observed in the interval (a, b).

    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_1 < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_1 < ub);
    and pr(20,ub) calculates Pr(20 < x_j b + u_1 < ub).

    a = . means minus infinity; pr(.,30) calculates Pr(x_j b + u_1 < 30);
    pr(lb,30) calculates Pr(x_j b + u_1 < 30) in observations for which lb = .
    (and calculates Pr(lb < x_j b + u_1 < 30) elsewhere).

    b = . means plus infinity; pr(20,.) calculates Pr(x_j b + u_1 > 20);
    pr(20,ub) calculates Pr(x_j b + u_1 > 20) in observations for which ub = .
    (and calculates Pr(20 < x_j b + u_1 < ub) elsewhere).

e(a,b) calculates E(x_j b + u_1 | a < x_j b + u_1 < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a, b), which is to say, y_j|x_j is censored. a and b are specified as they are for pr().

ystar(a,b) calculates E(y_j*), where y_j* = a if x_j b + u_j < a, y_j* = b if x_j b + u_j > b, and y_j* = x_j b + u_j otherwise, which is to say, y_j* is truncated. a and b are specified as they are for pr().

stdp calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

stdf calculates the standard error of the forecast. This is the standard error of the point prediction for a single observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those by stdp; see [R] regress Methods and Formulas.

nooffset is relevant when you specify offset(varname) for heckman. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
Remarks

The Heckman selection model (Gronau 1974, Lewis 1974, Heckman 1976) assumes that there exists an underlying regression relationship

    y_j = x_j beta + u_1j            (regression equation)

The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if

    z_j gamma + u_2j > 0             (selection equation)

where

    u_1 ~ N(0, sigma)
    u_2 ~ N(0, 1)
    corr(u_1, u_2) = rho

When rho is not 0, standard regression techniques applied to the first equation yield biased results. heckman provides consistent, asymptotically efficient estimates for all the parameters in such models.

In one classic example, the first equation describes the wages of women. Women choose whether to work and thus, from our point of view as researchers, whether we observe their wages in our data. If women made this decision randomly, we could ignore the fact that not all wages are observed and use ordinary regression to estimate a wage model. Such a random-participation-in-the-labor-force assumption, however, is unlikely to be true; women who would have low wages may be unlikely to choose to work, and thus the sample of observed wages is biased upward.

In the jargon of economics, women choose not to work when their personal reservation wage is greater than the wage offered by employers. Thus, women who choose not to work might have even higher offer wages than those who do work--they may have high offer wages, but they have even higher reservation wages. One could tell a story that competency is related to wages, but competency is rewarded more at home than in the labor force.
In any case, in this problem--which is the paradigm for most such problems--a solution can be found if there are some variables that strongly affect the chances for observation (the reservation wage) but not the outcome under study (the offer wage). Such a variable might be the number of children in the home. (Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult for anyone to take such results seriously since the functional-form assumptions have no firm basis in theory.)
> Example

In the syntax for heckman, depvar and varlist are the dependent variable and regressors for the underlying regression model to be estimated (y = X beta), and varlist_s are the variables (Z) thought to determine whether depvar is observed or unobserved (selected or not selected). In our female wage example, the number of children at home would be included in the second list. By default, heckman will assume that missing values (see [U] 15.2.1 Missing values) of depvar imply that the dependent variable is unobserved (not selected). With some datasets, it is more convenient to specify a binary variable (depvar_s) that identifies the observations for which the dependent variable is observed/selected (depvar_s != 0) or not observed (depvar_s = 0); heckman will accommodate either type of data.

We have a (fictional) dataset on 2,000 women, of whom 1,343 work:

    . summarize age educ married children wage

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |    2000      36.208    8.28656         20         59
       education |    2000      13.084   3.045912         10         20
         married |    2000       .6705   .4701492          0          1
        children |    2000      1.6445   1.398963          0          5
            wage |    1343    23.69217   6.305374    5.88497   45.80979
We will assume that the hourly wage is a function of education and age, whereas the likelihood of working (the likelihood of the wage being observed) is a function of marital status, the number of children at home, and (implicitly) the wage (via the inclusion of age and education, which we think determine the wage):
    . heckman wage educ age, select(married children educ age)

    Iteration 0:   log likelihood = -5178.7009
    Iteration 1:   log likelihood = -5178.3049
    Iteration 2:   log likelihood = -5178.3045

    Heckman selection model                      Number of obs      =      2000
    (regression model with sample selection)     Censored obs       =       657
                                                 Uncensored obs     =      1343

                                                 Wald chi2(2)       =    508.44
    Log likelihood = -5178.304                   Prob > chi2        =    0.0000

                 |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    wage         |
       education |  .9899537   .0532565    18.59   0.000    .8855729    1.094334
             age |  .2131294   .0206031    10.34   0.000    .1727481    .2535108
           _cons |  .4857752   1.077037     0.45   0.652   -1.625179     2.59673
    -------------+------------------------------------------------------------
    select       |
         married |  .4451721   .0673954     6.61   0.000    .3130794    .5772647
        children |  .4387068   .0277828    15.79   0.000    .3842534    .4931601
       education |  .0557318   .0107349     5.19   0.000    .0346917    .0767718
             age |  .0365098   .0041533     8.79   0.000    .0283694    .0446502
           _cons | -2.491015   .1893402   -13.16   0.000   -2.862115   -2.119915
    -------------+------------------------------------------------------------
         /athrho |  .8742086   .1014225     8.62   0.000    .6754241    1.072993
        /lnsigma |  1.792559    .027598    64.95   0.000    1.738468     1.84665
    -------------+------------------------------------------------------------
             rho |  .7035061   .0512264                     .5885365    .7905862
           sigma |  6.004797   .1657202                      5.68862    6.338548
          lambda |  4.224412   .3992265                     3.441942    5.006881
    -------------+------------------------------------------------------------
    LR test of indep. eqns. (rho = 0):   chi2(1) =    61.20   Prob > chi2 = 0.0000
heckman assumes that wage is the dependent variable and that the first variable list (educ and age) are the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus, we estimated the model

    wage = beta_0 + beta_1 educ + beta_2 age + u_1

and we assumed that wage is observed if

    gamma_0 + gamma_1 married + gamma_2 children + gamma_3 educ + gamma_4 age + u_2 > 0

where u_1 and u_2 have correlation rho.
The reported results for the wage equation are interpreted exactly as though we observed wage data for all women in the sample; the coefficients on age and education level represent the estimated marginal effects of the regressors in the underlying regression equation. The results for the two ancillary parameters require some explanation. heckman does not directly estimate rho; to constrain rho within its valid limits, and for numerical stability during optimization, it estimates the inverse hyperbolic tangent of rho:

$$\operatorname{atanh}\rho = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$$
This estimate is reported as /athrho. In the bottom panel of the output, heckman undoes this transformation for you; the estimated value of rho is .7035061. The standard error for rho is computed using the delta method, and its confidence intervals are the transformed intervals of /athrho.

Similarly, sigma, the standard error of the residual in the wage equation, is not directly estimated; for numerical stability, heckman instead estimates ln sigma. The untransformed sigma is reported at the end of the output: 6.004797.

Finally, some researchers--especially economists--are used to the selectivity effect summarized not by rho but by lambda = rho sigma. heckman reports this, too, along with an estimate of the standard error and confidence interval.
Technical Note

If each of the equations in the model had contained many regressors, the heckman command could become quite long. An alternate way of specifying our wage model would make use of Stata's global macros. The following lines are an equivalent way of estimating our model:

    . global wageeq "wage educ age"
    . global seleq "married children educ age"
    . heckman $wageeq, select($seleq)
o Technical Note The reported model X _ test is a Wald test of all coefficients in the regression model (except the constant) being 0. heckman is an estimation command, so you can use test, testnl, or lrtest to perform tests against whatever nested alternate model you choose; see [R] test, [R] testnl, and [R] lrtest. The estimation of P and cr in the form atanh p and In cr extends the range of these parameters to infinity in both directions, thus avoiding boundary problems during the maximization. Tests of p must be made in the transformed units. However, since atanh(0) - 0, the reported test for atanh p = 0 is equivalent to the test for ,o = O. The likelihood-ratio test reported at the bottom of the output is an equivalent test for p = 0 and is computationally the comparison of the joint likelihood of an independent probit model for the selection equation and a regression model on the observed wage data against the heckman model likelihood. The z -- 8.619 and X _ of 61.20, both significantly different from zero. clearly justify the Heckman selection equation with this data. []
Example heckman supports the HuberAVhite/sandwich estimator of variance under the robust and cluster() options, or when the pweights are used for population weighted data: see IU] 23.11 Obtaining robust variance estimates. We can obtain robust standard errors for our wage model by specifying clustering on county of residence (the county variable).
h$ckman-- Heckmanselection model
23
i
• heckman
wage
educ
Iteration
O:
log likelihood
= -5178.7009
Iteration
I:
log likelihood
= -5178.3049
Iteration
2:
log likelihood
= -5178,3045
Heckman
selection
(regression
select(married
children
model
model
Log likelihood
age,
with
educ age)
Number sample
selection)
= -5178.304
Coef.
cluster(county)
of obs
=
2000
Censored obs Uncensored obs
= =
657 1343
Wald
=
272.17
=
0.0000
chi2(1)
Prob > chi2 (standard
errorsiladjusted :!
Robust
i!
Std,
Err.
for clustering
P>lzl
[957 Conf.
on county)
Interval]
! wage education age _cons
.9899537
.0600061
16.$0
0.000
.8723438
1.107564
.2131294 .4857752
.020995 1.302103
I0,i5 0._7
0.000 0.709
.17198 -2.066299
.2542789 3.03785
.4451721 .4387068
.0731472 .0312386
6.09 14.04
0.000 0.000
.3018062 .3774802
.5885379 .4999333
5._6
0.000
.0341645
.0772991
0.000 0.000
.0285954 -2,717059
.0444242 -2.264972
0.000
.5991596
1.149258
0.000
1.741902
1.843216
select married children education age _cons
.0110039 •004038 .1153305
9.$4 -21._0
.8742086
.1403337
6._3
1.792559
.0258458
69._6
rho
•7035061
,0708796
.5364513
.817508
sigma lambda
6.004797 4.224412
.155199 .5186709
5.708189 3.207835
6.316818 5.240988
Prob
= 0.0000
/athrho /Insigma
Wald
.0557318 •0365098 -2.491015
test
of indep,
eqns.
(rho = 0): chi2(l!
=
38.81
> chi2
The robust standard errors tend to be a bit larger, bit we do not notice any systematic differences. This is not surprising since the data were not constructed to have any county-specific correlations or other characteristics that would deviate from the assumptions of the Heckman model. q
Example The default statistic produced by predict after tieckman is the expected value of the dependenl variable from the underlying distribution of the regression model. In our wage model, this is the expected wage rate among all women, regardless of whether they were observed to participate in the labor force. • predict heckwage (option xb assumed;
fitted
values) }
It is instructive to compare these predicted wage v_lues from the Heckman model with an ordinary regression model--a model without the selection adiustment:
;_r;
24
heckman-Heckmanselectionmodel wage educ age
. regress
]
Source
13524.0337
wage
age _cons education
(option
MS
2
39830.8609
Total
• predict
df
i
Model Residual
)
SS
1340
53354,8948
1342
I
Coef.
Std.
I
.1465739 6.084875 .8965829
Number of obs F( 2, 1340)
= =
1343 227.49
6762.01687
Prob
=
0.0000
29.7245231
R-squared
=
0.2535
39.7577456
Adj R-squared Root MSE
= =
0.2524 5.452
Err.
t
.0187135 .8896182 .0498061
7.83 6.84 18.00
> F
P> It J
[95Y, Conf.
0.000 O. 000 0.000
.109863 4.339679 .7988765
Interval]
.1832848 7. 830071 .9942893
re.age xb assumed;
. summarize
fitted
heckwage
values)
re.age
'
Variable
0bs
Mean
Std.
Dev.
Min
Max
(
heckwage
2000
21. 15532
3. 83965
14.6479
32. 85949
regwage
2000
23. 12291
3. 241911
17. 98218
32. 66439
Since this dataset was concocted, we know the true coefficients of the wage regression equation to be 1, 0.2, and 1, respectively. We can compute the true mean wage for our sample. • gen
truewage
• sum
truewage
= i +
,2*age
+ l*educ
Variable
I
0bs
Mean
truewage
I
2000
21. 3256
Std.
Dev.
3.797904
Min
Max
15
32.8
Whereas the mean of the predictions from heckmanis within 18 cents of the true mean wage, ordinary regression yields predictions that are on average about $1.80 per hour too high due to the selection effect. The regression predictions also show somewhat less variation than the true wages. The coefficients from heckman are so close to the true values that they are not worth testing. Conversely, the regression equation is significantly off, but seems to give the right sense. Would we be led far astray if we relied on the OLS coefficients? The effect of age is off by over 5 cents per year of age and the coefficient on education level is off by about 10%. We can test the OLS coefficient on education level against the true value using test. • test (I)
educ
= 1
education F(
= 1.0
1, 1340) = Prob > F =
4.31 0.0380
Not only is the OL$ coefficient on education substantially lower than the true parameter, the difference from the true parameter is statistically significant beyond the 5% level. We can perform a similar test for the OLS age coefficient: • test (1)
age
=
.2
age
=
.2
F(
1, 1340) = Prob > F =
8.15 0.0044
We find even stronger evidence that the OLS regression results are biased away from the true parameters. <1 !
!
tw,_..
heckman-- Heckmanselection model
25
> Example Several other interesting aspects of the Heckmah model can be explored with predict;. Continuing with our wage model, the expected wages for wOmen conditional on participating in the labor force can be obtained with the ycond option. Let's gdt these predictions and compare them with actual wages for women participating in the labor forcel • predict
hcndwage,
• stmm_lize
wage
ycond
hcndwage
Variable wage hcndwage
if wage
-=
Obs
Mean
Std ! Dev.
1343
23.69217
6.3_5374,
1343
23.68239
3.355087 i
Min
Max
5.88497
45.80979
16. 18337
33.7567
We see that the average predictions from beckman are very close to the observed levels but do not have exactly the same mean. These conditional w'age predictions are available for all observations in the dataset, but can only be directly compared with observed wages where individuals are participating in the labor force. What if we were interested in making predictions about mean wages for all women? In this case, the expected wage is 0 for those who are not exp_ted to participate in the labor force, with expected participation determined by the selection equation.: These values can be obtained with the yexpected option of predict. For comparison, a variable can be generated where the wage is set to 0 for nonparticipants. . predict
hexpwage,
yexpected
• gen wageO=
wage
(657 missing
values
generated)
. replace
wageO=
0 if wage
(657 real
changes
made)
• summarize Variable hexpwage wageO
hexpwage
== .
wageO
fibs
Mean
Stdi Dev.
2000
15. 92511
5.949336
2000
15,90929
12. _7081
Min 2.492469
Max 32.45858
0
45.80979
i
i
Again, we note that the predictions from heckman are very' close to the observed mean hourly wage rate for all women. Why aren't the predictions uging ycond and yexpected exactly equal to their observed sample equivalents? For the Heckman _odel, unlike linear regression, the sample moments implied by the optimal solution to the model likelihood do not require that these predictions exactly match observed data. Properly accounting for thh additional variation from the selection equation ,-quires that the model use more information thar_just the sample moments of the observed wao_es. q
Example Stata wilt also produce Heckman's (1979) two-step efficient estimator of the model with the twostep option. Maximum likelihood estinaation of the parameters can be time-consuming with large datasets and the two-step estimates may provide a_ood alternative in such cases. Continuing with the women's wage model, we can obtain the two-step estimates with Heckman's consistent covariance estimates by typing
!
I
',,
26
heckman m Heckman selection model
• heckman wage educ age, select(married children ednc age) twostep Heckman selection model -- two-step estimates (regression model with sample selection)
Coef, wage education age _cons
Std, Err.
z
Number of obs Censored obs Uncensored obs
= = =
2000 657 1343
Wald chi2(4) Prob > chi2
= =
551.37 0.0000
P>Izl
[95_ Conf. Interval]
.9825259 .2118695 .7340391
.0538821 .0220511 1.248331
18.23 9.61 0.59
0.000 0.000 0.557
.8789189 .1686502 -1.712645
1.088133 .2550888 3.180723
.4308575 .4473249 .0583645 .0347211 -2.467365
.074208 .0287417 .0109742 .0042293 .1925635
5.81 15.56 5.32 8.21 -12.81
0.000 0.000 0.000 0.000 0.000
.2854125 .3909922 .0368555 .0264318 -2.844782
.5763025 .5036576 .0798735 .0430105 -2.089948
select married children education age _cons mills lambda
4.001615
rho sigma lambda
0.67284 5.9473529 4.0016155
.6065388
6.60
0.000
2.812821
5.19041
.6065388
q
t] Technical Note The Heckman selection mode] depends strongly on the model being correct; much more so than ordinary regression. Running a separate probit or ]ogit for sample inclusion followed by a regression, referred to in the literature as the two-part model (Manning, Duan, and Rogers 1987) not to be confused with Heckman's two-step procedure--is an especially attractive alternative if the regression part of the model arose because of taking a logarithm of zero values. When the goal is to analyze an underlying regression model or predict the value of the dependent variable that would be observed in the absence of selection, however, the Heckman model is more appropriate. When the goal is to predict an actual response, the two-part model is usually the better choice. The Heckman selection model can be unstable when the model is not properly specified, or if a specific dataset simply does not support the model's assumptions. For example, let's examine the solution to another simulated problem.
(Continued
on next page)
heckman-- _man
• heckman
yt xl x2 x3,
Iteration Iteration Iteration Iteration Iteration
O: i: 2: 3: 4:
log log log log log
selec¢(zl
likelihood likelihood likelihood likelihood likelihood
selection model
27
z2) = = = = =
-t11.94996 -110.82258 -II0.17707 -107.70663 (not concave) -107.07729 (not concave)
(outputo_.ed ) Iteration 31: Iteration 32:
log likelihood = -104.08268 log likelihood = -104.08267 (backed up)
Heckman selection model
Number of obs
=
150
(regression model with sample selection)
Censored obs Uncensored obs
= =
87 63
Wald chi2(3)
=
8.64e+07
Prob
=
0.0000
Log likelihood
= -104.0827 Coef.
Std. Err.
z
> chi2
P>Izl
[957,Conf, Interval]
yt xl x2
.8974192 -2,525302
.0006338 1415._ .0003934 -6418.57
O.000 O.000
.896177 -2.526074
.8986615 -2.524531 2. 856651
x3 _cons
2.855786 .6971255
.0004417 6465.84 .0851518 8.I_
O.000 O.000
2.85492 ,5302311
zl
-.6830377
.0824049
-8.29
O.000
-.8445484
-.521527
z2 _cons
1.004249 -.361413
.1211501 .1165081
8. _ -3,ID
O. 000 O.002
.7667993 -.589"/647
1. 241699 -. 1330613
/athrho /insigma
15.12596 -.5402571
151.3627 ,1206355
0.10 -4._
0.920 O.000
-281.5395 -.7766984
311.7914 -.3038158
.8640198
select
i
rho sigma lambda
1 .5825984 .5825984
4.40e-
LR test ol indep, eqns. (rho = 0):
I
I !
11
-1
.0702821 .0702821
.459922 .4448481 chi2(i) =
25.67
1 .7379968 .7203488
Prob > chi2 = 0,0000
the form of the likelihood for the Heckman selectioh model, this implies a division by zero and it is surprising that the model solution turns out as will as it does. Reparameterizing p has allowed The model has converged to a value of p that is 1.0--within machine rounding tolerances. Given the estimation to converge, but we clearly have problems with the estimates. Moreover, if this had occurred in a large dataset, waiting over 32 iteration_ for convergence might take considerable time. This dataset was not intentionally developed to cause problems. It is actually generated by a "Heckman process" and when generated starting fromidifferent random values can be easily estimated. The luck of the draw in this case merely led to daia that, despite its source, did not support the assumptions of the Heckman model. The two-step model is generally more stable in chses where the data are problematic. It is even tolerant of estimates of p less than -1 and greater !than t. For these reasons, the two-step model may be preferred when exploring a large dataset. Still, if the maximum likelihood estimates cannot converge, or converge to a value of p that is at the bouhdary of acceptable values, there is scant support for estimating a Heckman selection model on the d_ta. Heckman (1979) discusses the implications of p being exactly t or 0, together with the implica!ions of other possible covariance relationships among the model's determinants.
_
l_,T
28
heckman --
Saved Results heckman
saves
Heckman selection model
in e():
Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
x2
e(k_eq)
number of equations
e(chi2_c)
X2 for comparison test
e(k_dv) e (N_eens)
number of dependent variables number of censored observations
e(p_c) e (p)
p-value for comparison test significance of comparison test
e (dr_m)
model degrees of freedom
e (rho)
p
e(11)
log likelihood
e(£c)
number of iterations
e(ll_O) e(N_clust)
log likelihood, constant-only model number of clusters
e(rank) e(rankO)
rank of e(V) rank of e(V) for constant-only
e(lambda)
A
model
e(selambda) standard errorof A
e(sigma)
sigma
typeof optimization
Macros e(cmd)
heckman
e(opt)
e(depv_')
name(s)of dependent variable(s)
e(chi2type) Wald or Lit; typeof modcl x2 test
e(title) e(title2)
title in estimation output secondary title in estimation output
e(chi2_ct)
Wald or LR; type of model )c2 test corresponding to e(chi2_c)
e(utype)
weight type
e(offset#)
offset for equation #
e (wexp) e(clustvar)
weight expression name of cluster variable
e (mills)
variable containing nonselection hazard (inverse of Mills')
e (method)
requested estimation method
e(predict)
program used to implement predict
e(vcetype) e (user)
covanance estimation method name of tikelihood-evaluator program
e(cnslist)
constraint numbers
e(b)
coefficient vector
e(V)
variance-covariance
e(ilog)
iteration log (up to 20 iterations)
Matrices matrix of
the estimators
Functions e(sample)
marks estimation sample
Methods and Formulas heckma_n is implemented 446-450)
provide
as an ado-file.
an introduction
Greene
Regression estimates using the nonselection maximum likelihood estimation. The regression
equation
(2000,
928-933)
to the Heckmm-a selection hazard
(Heckman
is
yj = xjO + ulj The selection
equation
is zj'7 + u2j
> 0
where
ul _ N(O, a) u2 _ N(0,
1)
1
I_'i''1 :-
cor_(_l,u_)= p
or Johnston
and DiNardo
(1997,
model. t979')
provide
starting
values
for
|
!
--
i necKman- Heclcmanselection model
2g
The log likelihood for observation j is
observed lj =
V/1-- "_
-_
a
/
-- Wj
ln(
wjln @(-zjT)
yj
yj not observed i
where _0
is the standard cumulative normal and wj is an optional weight for observation j.
In the maximum likelihood estimation, o-and p are not directly estimated. Directly estimated are In a and atanh p:
(
l+p
_
i i i
atanh p = _ ln\ _] The standard error of ,_ = #a is approximated through the propagation of error (delta) method: that is, Var(A) _ D Var{(atanh
p lncr)} D'
where D is the Jacobian of )_ with respect to at_h p and In a. The two-step estimates are computed using H_ckman's (1979) procedure. Probit estimates of the selection equation Pr(yj
observed I zj)-
_(zj")')
are obtained. From these estimates the nonselection hazard, what Heckman (t979) referred to as the inverse of the Mills' ratio, m j, for each observa¢ion 3 is computed as
¢(zjS) mj where ¢ is the normal density. We also define
Following Heckman, the two-step parameter estimates of /3 are obtained by augmenting the regression equation with the nonselection hazard m. Thus, the regressors become [X m] and we obtain the additional parameter estimate/3,a on the variable containing the nonselection hazard. A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the nonselecfion hazard. e'e +/3_ _--]j=l N N
5j
The two-step estimate of p is then _ = /3r,L c3 Heckman derived consistent estimates of the coefficient covariance matrix based on the augmented regression.
]
.......
--.o_..
.,vv,-..,t,ua|
.O_I¢_I_.I.I1JII
IIIUUI_I
Let W = [X m] and D be a square diagonal matrix of rank N with (1 _ P^2o_ j) on the diagonal elements.
Vtwostep - 2(W'W)-I(W'DW + Q)(W'W)-1 where q = _2(W'DW)Vp(W'DW) and Vp is the variance-covariance
estimate from the probit estimation of the selection equation.
References Greene. W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River. NJ: Prentice-Hall. Gronau. R. 1974. Wage comparisons: A selectivity bias. Joumat of Political Economy 82: 1119-1155. Heckman, J. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. The Annals of Economic and Social Measurement 5: 475--492. 1979. Sample selection bias as a specificationerror. Econometrica47: 153-16t. Johnston. J, and J. DiNardo. 1997. EconometricMethods. 4th ed. New York: McGraw-Hill. Lewis, H. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1119-1155. Manning, W. G., N. Duan. and W. H. Rogers. 1987, Monte Carlo evidence on the choice between sample selection and two-part models. Journal of Econometrics 35: 59-82.
Also See Complementary:
[R] adjust, [R] constraint, JR] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, JR] vce, [R] xi
Related:
[R] heckprob,
Background:
[U] [u] [U] [U]
[R] regress,
[R] tobit, [R] treatreg
16.5 Accessing coefficients and standard errors. 23 Estimation and post-estimation commands, 23.11 Obtaining robust variance estimates. 23.12 Obtaining scores
---Title heckprob -- Maximumd_kelihood probit estimation with selection
Syntax heckprob
dewar
[vartist] [,,eight]
[if
exp] [in
range],
select( [ depva,'s = ] varlists [, ,gffse_(varname) } [ robust
] )
cl__uster (vamame)
constraints by .,.
noconstant
(numlist)
s qcore(newv_rlist) first noconstant i noskip level(#) _ffset (varname) maximize_options
: may be used with heckprob; and i_eights
see [R] by.
_eights,
f_eights,
are allowed; see [U] 1_1.6
weight.
heckprob
shares the featuresof all estimationcommands;see [U] 23 Estimationand post-estimationcommands.
Syntaxforpredict predict
[type] newvarname
[if exp] [in range]
[, statistic nooffset
]
where statistic is /
pmargin
q'(xjb),
success probability (th_ default)
pll
_2(xjb,
_/.probit = 1, yj _ select zjg, p), predicted prolJability P'tyj
plO
_52(xjb,-z/g,-p),
predicted ,robability Pr(_/pr°bit = 1,_/;elect = O)
pO1
_52(-x3b,
predicted _robability P_yj
pO0
_2(-xjb,--zjg,
psel pcond
_(zjg), selection probability _52(xjb, zig, p)/_5(zjg), prob_ility
xb stdp
xyb, fitted values standard error of fitted values
xbsel
linear prediction for selection equation
stdpsel
standard error of the linear prediction for selection equation
zjg,-p),
_/_ probit
p),
predicted )robability
Pr(y
= l)
= O, yj
_ select
p.r°bit = O, y;elect
= 1) = O)
of success conditional on selection
q)() is the standard normal distribution function and q52() is the bivariate normal distribution function. These statistics are available both in and out of sample; type predict the estimation
...
if
e(sample)
...
sample.
Description heckprob
estimates maximum-likelihood probit models with sample selection. 31
if wanted only for
Options select(...) specifiesthe va_ables and optionsfor the selectionequation. It is an integral part of specifying a selection model and is not optional. robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator, robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robustis implied; see [U] 23.11 Obtaining robust variance estimates. clust;er (varname) specifies that the observations are independent across groups (clusters) but are not necessarily independent within groups, varname specifies to which group each observation belongs, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE),but not the estimated coefficients, cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() cluster()
implies robust; by itself.
that is, specifying robust
cluster()
is equivalent to typing
score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and the ancillary parameter in the model. The first new variable specified will contain ul_ -- OtnLj/O(xj/3) for each observation j in the sample, where lnLj is the 3th observation's contribution to the log likelihood. The second new variable: u2j = OlnLj/O(zj_') The third: u3j = OlnLj / O(atanh p) If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on. The jth observation's contribution to the score vector is { OtnLj/Ol30lnLj/O("/)
OlnLj/O(atanhp)}
= (UljXj
u2jzj
u3j)
The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores. first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation. noconstant omits the constant term from the equation. This option may be specified on the regression equation, the selection equation, or both. constraints (numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. noskS.p specifies that a full maximum likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time. level (#) specifies the confidence level, in percent, for confidence intervals The default is level (95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. ogfset (w_rname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.
hect(prob--:Maximum-likelih_od_pmbit estimationwith selection
33
_,
mca:imize_.options control the maximization process; see [R] maximize. With the possible exception of iterate (0) and trace, you should never ha_e to specify them.
Optionsfor predict pma.rgin, the default, calculates the univariate (ma@nal) predicted probability of success _e, probit 1). ptty j _. calculates the bivariate predicted probability P_yj
probit
=
__ probit plO calculates the bivariate predicted probability P,_y)
--
pll
i
1
- 1, _yjselect
_}_,probit
p01 calculates the bivariate predicted probability
P_yj
._. probit pO0 calculates the bivariate predicted probability P_(yj
psel
_ select , yj
=
l).
___ 0).
o select
= O, yj
-----1).
. select = O, yj
:
0).
calculates the univariate (marginal) predicted probability of selection Pr(y_ elect = l).
pcond calculates the conditional (on selection) predicted probability of success P_tYj-'" probit = l, yj-select = t)/Pr(y_ elect = 1). xb calculates the probit linear prediction xjb. stdp calculates the standard error of the prediction, it can be thought of as the standard error of the predicted expected value or mean for the obsel_'_tion's covariate patfern. This is also referred to as the standard error of the fitted value. xbsel
calculates the linear prediction for the selectibn equation.
stdpsel
calculates the standard error of the linear irediction for the selection equation.
nooffset is relevant only if you specified offset (varname) for heckprob. It modifies the calculations made by predict so that they ignore th_ offset variable; the linear prediction is treated as xj b rather than xj b + offsetj.
Remarks The probit model with sample selection (Van d_ Yen and Van Pragg 1981) assumes that there exists an underlying relationship y_ = xjj3 + ulj
latent
equation
probit
equation
such that we observe only the binary outcome , probit
yj
= (y_ > O)
The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if ffselect j
I =-
(Zjy-4:-U2j
>
0)
selection
where ul _ 2?{0, 1)
Nio,1) corr(ul,
=p
equation
i
F
F
o,_
necKproo
m MaxlmumqlKellhood
probit estimation
with
selection
When p _ 0, standard probit techniques applied to the first equation yield biased results, heckprob provides consistent, asymptotically efficient estimates for _t the parameters in such models.
> Example We use the data from Pindyck and Rubinfeld (1998). In this dataset, the variables are whether children a_end private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether one voted for an increase in prope_y taxes (vote). In this example, we alter the meaning of the data. Here we assume that we observe whether children attend private school only if the family votes for increasing the property taxes. This is not true in the dataset and we make this untrue assumption only to illustrate the use of this command. We observe whether children attend private school only if the head of household voted for an increase in property taxes. We assume that the vote is affected by the number of years in residence, the current prope_y taxes p_d. and the household income. We wish to model whether children are sent to private school based on the number of years spent in the current residence and the cu_ent prope_y taxes paid. . heckprob Fitting
private
probit
years
logptax,
sel(vote=years
Iteration
O:
log
likelihood
Iteration
I:
log
likelihood
= -18.407192
Iteration
2:
log
likelihood
= -16.1412S4
Iteration
3:
log
likelihood
= -15.953354
Iteration
4:
log
likelihood
= -15.887269
Iteration
5:
log
likelihood
= -15.883886
Iteration
6:
log
likelihood
= -15.883655
Fitting
selection
model:
O:
log likelihood
= -63.036914
Iteration
I:
log likelihood
= -58.581911
Iteration
2:
log
likelihood
= -58.497419
Iteration
3:
log
likelihood
= -58.497288
log
likelihood
= -74.380943
Comparlson: starting
values:
Iteration
O:
log
likelihood
Iteration
I:
log
likelihood
= -17.920826
Iteration
2:
log
likelihood
= -18.375362
= -40.895884
Iteration
3:
log likelihood
= -16.067451
Iteration
4:
log
likelihood
=
Iteration
5:
log
likelihood
= -15.760354
Iteration
6:
log
likelihood
= -15.753805
Iteration
7:
log
likelihood
= -15.753785
full
model
Iteration
0:
log
likelihood
= -75.010619
Iteration
I:
log
likelihood
= -74.287753
Fitting
-15.84787
Iteration
2:
log
likelihood
= -74.250148
Iteration
3:
log
likelihood
= -74.245088
Iteration
4:
log
likelihood
= -74.244973
Iteration
5:
log
likelihood
= -74.244973
Probit
Log
model
likelihood
logptax)
= -IZ.122381
Iteration
Fitting
loginc
model:
with
sample
= -74.24497
selection
(not
concave)
Number
=
95
Censored obs Uncensored obs
of obs
= =
36 59
Wald
chii(2)
=
Prob
> chi2
=
1.04 0.5935
heckprob- Maximum-likelihoodprobitestimationwith selection
Coef.
Std.
Err.
P> lzl
!z
[95_.
Conf.
35
Interval]
)
private years logptax _cons
-. 1142597
.1461711
-0 i78
0.434
-.4007498
.1722304
.3516095 -2,780664
1.01648 6.905814
0 i 35 -0_40
O. 729 0.687
-1.640655 -16.31581
2.343874 10.75448
-,0167511
.0147735
-li13
0.257
-.0457067
vote years
.0122045
loginc
.9923025
.4430003
2 i24
O.025
,1240378
I.860567
logptax _cons
-1. 278783 -.5458214
.5717544 4.070415
-2 _24 -0.13
O.025 O.893
-2.399401 -8,523689
-. 1581649 7.432046
/athrho
-.8663147
1.449995
-0.60
O.550
-3. 708253
1.975623
-. 6994969
.7405184
-. 9987982
.9622642
rho LR test
of indep,
eqns.
(rho = 0):
chi2(_)
=
0.27
Prob
> chi2
= 0.6020
J
The output shows several iteration logs. The first iteration log corresponds to running the probit model for those observations in the sample where we hav< observed the outcome. The second iteration log corresponds to running the selection probit model, _hich models whether we observe our outcome of interest. If p = 0, then the sum of the log likelihoods from these two models will equal the log likelihood of the probit model with sample selectioh; this sum is printed in the iteration log as the comparison log likelihood. The third iteration log shows starting values 'for the iterations. The finn iteration log is for estimating the full probit model with sample selection. A likelihoodratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the robust option, then this test would be presented as a Wald test instead of a likelihood-ratio test.
q
Example In the previous example, we could obtain robust standard errors by specifying the robust We also eIiminate the iteration logs using the nolog option. • heckprob Probit
private
model
Log likelihood
with
years
lo_ptax,
sample
selection
eel(vote=years
= -74.24497
loginc
lo_q_tax) nolog
robust
Number of obs Censored obs
= =
95 36
Uncensored
=
59
obs
Wald
chi2(2)
=
Prob
> chi2
=
2,55 0.2798
Kobust Coef.
Std. Err.
2
P>Iz[
[95_ Conf.
Interval]
i
private years
-.1142597
.1113949
i -1.03
0.305
-.3325896
•1040702
logptax _cons
.3516095 -2.780664
.735811 4.786602
0.48 -0.58
0.633 0.561
-I.090553 -12.16223
1.793773 6.600903
)
Vote
! i
years
-.0167511
.0173344
-0.97
0.334
-.0507258
.0172237
loginc
.9923025
.4228035
2.35
0.019
.1636229
1.820982
lo_ptax _cons
-1.2T8783 .5458214
.5095157 4.543884
-2._1 -0._2
0.012 0.904
-_.277416 -9.45167
-.280t505 8.360027
option.
!L
t_ir_
_
,=©_Rp, uu --
/athrho rho
maA..um-.Ke.nooa
-.8663147
1.630569
-. 6994969
.8327381
prODlI estimation
-0,53
Wald test of indep, eqns, (rho = 0): chi2(1) =
with
O.595
0.28
selection
-4,062171
2.329541
-, 9994077
.9812276
Prob > chi2 = 0.5952
Regardless of whether we specify the robustoption, it is clear that the outcome is not significantly different from the outcome obtained by estimating the probit and selection models separately. This is not surprising since the selection mechanism estimated was invented for the example rather than born from any economic theory.
> Example It is instructive to compare the marginal predicted probabilities with the predicted probabilities we would obtain ignoring the selection mechanism. To compare the two approaches, we will synthesize data so that we know the "true" predicted probabilities. First, we need to generate correlated error terms, which we can do using a standard Cholesky decomposition approach. For oar example, we will clear any data from memory and then generate errors that have correlation of .5 using the following commands. Note that we set the seed so that interested readers might type in these same commands and obtain the same results. clear set
seed
set
obs 5000
gen ci
12309
= invnorm(uniform())
gen c2 = invnorm(uniform()) matrix P = (1,.5\.5,1) matrix A = cholesky(P) local facl = A[2,1] local fac2 = A[2,2] gen ul = cl gen u2 = "facl"*cl + "fac2"*c2
We can check that the errors have the correct correlation using the corr command. We will also normalize the errors such that they have a standard deviation of one so that we can generate a bivariate probit model with known coefficients. We do that with the following commands. summarize ul replace ul = ul/sqrt(r(Var)) summarize u2 replace u2 = u2/sqrt(r(Var)) drop cl c2 gen xl = u_uifomn{)-.5 gen x2 = uniform()+i/3 gen yls = 0.5 gen
y2s
gen yl gen
+ 4.x1
= 3 - 3.x2 = (yls>0)
y2 = (y2s>0)
+ ul + .5*xl
+ u2
heckpmb -- Maximum-likelihood, , probit estimationwith selection
37
We have now created two dependent variables yl aM y2 that are defined by our specified coefficients. We also included error terms for each equation and ]he error terms are correlated. We run heckprob to verify that the data have been correctly _oenerate(]according to the model Yl -- .5 -}-4Xl _ ul Y2 = 3 + .5xl _ 3x2 + u2 where we assume that Yl is observed only if Y2 = J. • heckprob yl xl, sel(y2 = xl x2) nolog Probit model with sample selection
Log likelihood = -3600.854 Coef.
Std. Err.
Number of obs Censored obs Uncemsored obs
= = =
5000 1790 3210
Wald chi2(1) Prob > chi2
= =
941.68 0.0000
P>Iz]
[95_ Conf. Interval]
xl _cons
3.985923 .4852946
.1298904 .0464037
30._9 i0•_6
0.000 0.000
3.73i342 •3943449
4.240503 .5762442
xl x2 _cons
.5998148 -3.004937 3.0_1587
.0716655 .0829469 .0782817
8.37 -36•23 38.47 i
0.000 0.000 0.000
.4593531 -3.1/6751 2.858157
.7402765 -2.842364 3.165016
0.000
.4053964
.7427295
.3845569
.6307914
y2
/athrho rho
,574063
.0860559
,5183369
.062935
LR test of indep, eqns. (rho = 0):
6._7
chi2(I) =
46.58
Prob > chi2 = 0.0000
Now that we have verified that we generated data according to a known model, we can obtain and then compare predicted probabilities from the pi'obit model with sample selection and a (usual) probit model. predict pmarg (option pmargin assumed; Pr(yl=l)) probit yl xl if y2==l
(outputomitted) predict phat (option p assumed;
Pr(yl))
Using the (marginal) predicted probabilities from the probit model with sample selection (pmarg) and the predicted probabilities from the (usual) prob!t model (phat), we can also generate the "true" predicted probabilities from the synthesized yls variOble and then compare the predicted probabilities: • gen ptrue
= norm(yls)
• summarize pmarg ptrue phat Variable Obs
Mean
Std. Dev. i
Min
Max
.0658356 1.02e-06
.9933942 1
pmarg ptrue
5000 5000
.619436 .6090161
.32094_4 .34990_5
phat
5000
.6723897
.30632_)7
.096498
.9971064
i
_
38
heckprob m Maximum-likelihood
Here
we see that ignoring
the selection
probit estimation with selection
mechanism
(comparing
the phat
variable
with the true
ptrue variable) results in predicted probabilities that are much higher than the true values. Looking at the marginal predicted probabilities from the model with sample selection, however, results in more accurate
predictions.
<1
Saved Results in e():
heckprob saves Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
x_
e(k_eq)
number of equations
e (ch±2_c)
X_ for comparison test
e (k_dv) e (N_cens)
number of dependent variables number of censored observations
e (p_c) e (p)
p-value for comparison test significance of comparison test
e (df_.m) e(ll)
model degrees of freedom log likelihood
e (rho) e(ic)
p number of iterations
e(ll_O)
log likelihood, constant-only model
e (rank)
rank of
e(ll_c)
log likelihood, comparison model number of clusters
e(rank0)
rank of e(V) for constant-only model
e (cmd)
heckprob
e (opt)
e(depvar)
name(s) of dependent variable(s)
e(chi2type)
type of optimization Wald or LR; type of model x 2 test
e(title)
title in estimation output weight type
e(chi2_ct)
type of comparison X_ test
e (wtype)
e (offset)
offset
e(uexp) e(clustvar)
weight expression name of cluster variable
e(predict) e(cnslist)
program used to implement predict constraint numbers
e(vcetype)
covariance estimation method
e (user)
name of likelihood-evaluator program e (V)
variance-covariance
e (N_clust)
e (V)
Macros
Matrices e (b)
coefficient vector
e(ilog)
iteration log (up to 20 iterations)
matrix
of the estimators
Functions e(sample)
marks estimation sample
Methods and Formulas heckprob is implementedas an ado-file. Vande Venand VanPragg (1981)provide an introduction and an explanation The
probit
of this model.
equation
is
vj = (x9 + ulj > 0) The selection equation is zj'T + u2i
> 0
where
ul _ N(0, 1) uz _ N(O. 1)
corr(ul,u2)- p
heckprob-- Maximum-!ikelih0odprobit estimationwith selection
39
The log likelihood is
IES
_ti=0
+
{1- (z,-y+ offsetT)}
where S is the set of observations for which 9i is observed, (1)20 is the cumulative bivariate normal distribution function (with mean [0 0 ]'), _0 is the standard cumulative normal, and wi is an optional weight for observation i. In the maximum likelihood estimation, p is not directly estimated. Directly estimated is atanh p: i
7
From the form of the likelihood, it is clear that if p = 0, then the log likelihood for the probit model with sample selection is equal to the sum of the probit model for the outcome 9 and the selection model. A likelihood-ratio test may therefore be performed by comparing the likelihood of the full model with the sum of the log likelihoods fo_ the probit and selection models,
References Greene,W. H. 2000. EconometricAnalysis.4th ed. Upper Sa_le River, NJ: Prentice-Hall. Beckman.J. t979. Sampleselectionbias as a specificationerror. Economettica47: 153-161. Pindyek.R. and D. Rubinfeld.1998. EconometricModelsand EconomicForecasts.4th ed. New York:McGraw-Hill. Vande Ven,W. R M. M. and B. M. S. VanPragg. 1981.The demandfor deductiblesin private health insurance:A probit modelwith sample selection.JournaIof Econometric_17: 229-252.
Also See Complementary:
[R] adjust, [R] constraint, [R] l_com, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] Xi
Related:
[R] heckman, JR] probit, [R] treatreg
Background:
[u] [U] Iv] [u]
16.5 Accessing coefficients and standard errors, 23 Estimation and post-est_ation commands, 23.H Obtaining robust var] ance estimates, 23.12 Obta_iningscores
vw __'° ti"
Title he
u Obtain on-line help In
I
I
I
Syntax Windows, Macintosh, and Unix(GUI): help [ command or topic name ]
whelp[command or fopicname] Unix(console & GUI):
{help lma.n} [command or topic name ]
Description The help command displays help information on the specified command or topic. If help is not followed by a command or a topic name, a description of how to use the help system is displayed. Stata for Unix users may type help or mmamthey mean the same thing. Stata for Windows, Stata for Macintosh, and Stata for Unix(GUI) users may click on the Help menu. They may also type whelp something to display the help topic for something in Stata's Viewer. whelp typed by itself displays the table of contents for the on-line help.
Remarks See [U] 8 Stata's on-line help and search facilities for a complete description of how to use help. Q Technical Note When you type help something, Stata first looks along the S_ADOpath for something.hip; [U] 20.5 Where does Stata look for ado-files?. If nothing is found, it then looks in state.hip the topic.
Also See Complementary:
[R] search
Related:
[R] net search
Background:
[GSM]4 Help, [GSW]4 Help, [GSU] 4 Help, [U] 8 Stata's on-line help and search facilities
40
see for vl
Title i [ hetprOb - llliMaximum-l_etih°°d r 1 _ heter°skedastic "'i!_pr°bit estimati°n l llllllll I ,I I I
II I
i
i
ntax
eerar het(varlist,
[offset(varname)
c!luster(varname) nolrtest
'd [noconstant
level(#)
asis_robust
score (newvarl [newvar2:]) noskip offset (varname)
constraints
by ... : may be used with hetFob; fweights, iweights,
])
(numlist) nolqg maximize_options ] see [R] by.
and pweights are allowed; see [U] 14il.6 weight.
This command shares the features of all estimation commands;see [U] 23 Estimation and post-estimation commands.
Syntaxforpredict
i
predict[O?e]newvarname[ifexp][in r_ge] [, { p I xb [ sigma } nooffset] These statistics are available both in and out of sample; type predict the estimation sample.
...
if e(samp!e)
...
i
if wanted only for
scription hetprob
estimates a maximum-likelihoodhetero_kedasticprobit model.
See [R] logistic for a list of related estimation commands.
Options het(varlist [, of.fset(varname)]) specifies the independent variables and the offset variable, if there is one, in the variance function, her () is not optional. noconstant
suppresses the constant term (intercept}in the model.
level (#) specifiesthe confidencelevel, in percent, foi confidenceintervals. The default is level (95) or as set by set level; see [U] 23.5 Specifying !he width of confidence intervals. as is forces retention of perfect predictor variablesand their associatedperfectly predictedobservations and may produce instabilities in maximization; see [R] probit. robust specifies that the Huber/White/sandwichestinmtor of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust varianee estimates, robust combined with cluster () allows observations which are not independent within cluster (although the)' must be independent between clusters). If you specify pweights, robust is implied: see[U] 23.13 Weighted estimation, 41
i
42
hetprob -- Maximum-likelihood heteroskedastic probit estimation
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups, varname specifies to which group each observation belongs; e,g., cluster(personid) in data with repeated observations on individuals, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates, cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. but see the svyprobit command in [r<] svy estimators for a command designed especially for survey data. cluster() by itself.
implies robust;
specifying
robust
cluster()
is equivalent
to typing cluster()
score (newvarl [newvar2 ] ) creates newvarl containing uj = OlnLj/0(xj b) for each observation j in the sample. The score" vector is _ OlnLj/Ob = _ wjujxj; i.e.. the product of newvar with each covariate summed over observations. The second new variable, newvar2, contains vj = OlnLj/0(zjq,).
See [U] 23.12 Obtaining
scores.
noskip requests fitting of the constant-only model and calculation of the corresponding likelihood-ratio X 2 statistic for testing significance of the full model. By default, a Wald X 2 statistic is computed for testing the significance of the full model. offset (varname) to be 1.
specifies that varname is to be included in the model with coefficient constrained
nolrtest specifies that a Wald test of whether lnsigma2 LR test.
-- 0 should be performed
instead of the
constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. nolog
suppresses
maximize_options specify them.
the iteration log. control the maximization
process:
see [R] maximize.
You should never have to
Options for predict p, the default, calculates
the probability
of a positive outcome.
xb calculates the linear prediction. sigma
calculates the standard deviation.
noof fset is relevant only if you specified off set (varname) for het prob. It modifies the calculations made by predict so that they ignore the offset variable: the linear prediction is treated as xjb rather than xjb + offsetj.
Remarks hetprob performs maximum likelihood estimation of the heteroskedastic probit model, a generalization of the probit model. Let yj, j = 1,.., N be a binary outcome variable taking on the value 0 (failure) or 1 (success). In the probit model, the probability that yj takes on the value 1 is modeled as a nonlinear function of a linear combination of thc k independent variables xj -- (z 1j, x2_ .... , xkj): Pr(yj
- 1) - _b(xjb)
hetprob- Maximum-likelihoo_heteroskedasticprobitestimation
43
in which _0 is the cumulative distribution function (CDF) of a standard normal random variable. that is, a normally distributed (Gaussian) random varihble with mean 0 and variance 1. The linear combination of the independent variables, xjb, is cdmmonly called the index fimction or index. Heteroskedastic probit generalizes the probit model b_ generalizing _I,0 to a normal CDF with a variance no longer fixed at 1 but allowed to vary as a h_nction of the independent variables, hetprob models the variance as a multiplicative function of _ese m variables zj = (zlj, z2j,..., Zmj), following Harvey (1976): 2 i 2
={exp(zj )} :i I
Thus, the probability of success as a function of all the independent variables is Pr(yj=l
=@
xj
expzj-y Z
From this expression it is clear that, unlike the index Xjb, no constant term can be present in zj7 ]f the model is to be identifiable. i Suppose the binary outcomes yj are generated by tltresholding an unobserved random variable t_, which is normally distributed with mean xjb and varihnce 1 such that
YJ=
01
ifw_ 0
;This process gives the probit model: Pr(yj = 1) = Pr(wj
N 0) = _(xjb)
Now suppose that the unobserved wj are heteroskedasiic with variance crj2= {exp(zjb)}
2
Relaxing the homoskedastic assumption of the probit rhodel in this manner yields our muItiplicative heteroskedastic probit model:
Pr(yj
= 1)=
_{xj_/exp(zj'7)}
Z
> Example For this example, we generate simulated data for a simple heteroskedastic probit model and then estimate the coefficients using hetprob: • set obs
obs
was
O,
1000 now
1000
set
seed
1234567
gen
X = l-2*uniform()
gen
xhet
= uniform()
gen
sigma
gen
p = norm((O.3+2*x)/s±gma)
= exp(1.5*xhet)
gen
y = cond(un±form()<=p,l,O)
_,
_t;'
44
hetprob -- Maximum-likelihood heteroskedastic probit estimation
• hetprob Fitting
y x, het(xhet) comparison
model:
Iteration
O:
log
likelihood
= -688.53208
Iteration
1 :
log likelihood
= -592.23614
Iteration
2:
log likelihood
= -591.50687
Iteration
3:
log likelihood
= -591.50674
Fitting
full
model:
Iteration
O:
log
likelihood
= -591.50674
Iteration
I:
log
likelihood
= -572. 12219
Iteration
2:
log
likelihood
=
Iteration
3:
log
likelihood
= -569.48921
Iteration
4:
log
likelihood
= -569.47828
Iteration
5:
log
likelihood
= -569.47827
Heteroskedastic
probit
-570.7742
model
Number of obs Zero outcomes Nonzero
Log
likelihood
= -569.4783
y
Coef.
x
Std.
Err.
z
= =
outcomes
1000 452
=
548
Wald
chi2 (I)
=
78.66
Prob
> chi2
=
0.0000
P>Izl
[95Y, Conf.
Interval]
Y 2. 228031
.2512073
8,87
O. 000
i, 735673
2. 720388
_cons
.2493822
.0862833
2.89
O. 004
.08027
.4184943
xhet
I. 602537
.2640131
6.07
O. 000
I.085081
2. 119993
Prob
= 0.0000
insiEma2
Likelihood
ratio
test
of Insigma2=O:
chi2(1)
=
44.06
Above we created two variables, x and xhet,and then simulated
> chi2
the model
Pr(y=11=F{(80+ ,x)/exp( lxhet)} for/3o = 0.3,/31 = 2, and 71 -- 1.5. According to hetprob's output, all coefficients are significant and, as we would expect, the Wald test of the full model versus the constant-only model, e.g., the index consisting of/3o + filx versus that of just /30, is significant with X2(1) = 79. Likewise, the likelihood-ratio test of heteroskedasticity which tests the full model with heteroskedasticity against the full model without is significant with X2(1) = 44. See [R] maximize for further explanation of the output. Note that for this simple model hetprob took five iterations to converge. As stated elsewhere (Greene 2000, 829), this is a difficult model to fit and it is not uncommon for it to require many iterations or for the optimizer to print out warning and informative messages during the optimization. Slow convergence is especially common for models in which one or more of the independent variables appear in both the index and variance functions.
<1
Q Technical Note Stata interprets a value of 0 as a negative outcome (failure) and missing) as positive outcomes (successes). Thus, if your dependent and 1, 0 is interpreted as failure and 1 as success. If your dependent 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated
treats all other values (except variable takes on the values 0 variable takes on the values 0, as successes.
O
.............
S'
RObust standard errors If you [tJ] 23.11 re-estimate robust to
Maximum-likelihooq heteroskedasticprobit estimation specify the hetprob robust --option, hetprob reports robust standard errors as described 45 in Obtaining robust variance estimates. TO illustrate the effect of this option we will our coefficients using the same model and data in our example, only this time adding our hetprob command:
Example • hetprob
y x,
het(xhet)
robust
nolog
Heteroskedastic probit model
Log likelihood = -569.4783
Number of obs Zero outcomes Nonzero outcomes
= = =
I000 452 548
Wald chi2(I) Prob > chi2
= =
65.23 0.0000
Robust y
Coef.
x _cons
2. 22803 .249382 t
Std. Err.
z
P>Izl
[95_,Conf. Interval]
8.08 2.96
O,000 O.003
l. 687355 .0840853
Y .2758597 .0843367
2.768705 .4146789
i insigma2 xhet
i 1. 602537
Wald test of insigma2=O:
i
.2671326
6. O0 chi2(1) =
O.000 35.99
1.078967
2. 126107
Prob > chi2 = 0.0000
The robust standard errors for two of the three parameters are larger than the previously reported conventional standard errors. This is to be expected, even though (by construction) we have perfect model specification, since this option trades off efficient estimation of the coefficient variancecovariance matrix for robustness against misspecificat_on. 4 Specifying the cluster() option relaxes the usual assumption of independence between observations to the weaker assumption of independence jusi between clusters, that is, hetprob, robust cluster() is robust with respect to within-cluster coffelation, There is a cost in terms of efficiency with this option, since in this case hetprob inefficiefitly sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation (as do the xtgee population-averaged models).
Obtaining predicted values
]
Once you have estimated a model, you can use the predict command to obtain the predicted probabilities for both the estimation sample and other samples; see [U] 23 Estimation and postestimation commands and [R] predict, predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the index function combination xjb, where x7 are the independent variables in the jth observation and b is the estimated parameter vector. With the sigma option, predict calculates the predicted standard deviation oj = exp(zj2().
_;:,,,-
46
hetprob -- Maximum-likelihood
heteroskedastic
probit estimation
> Example We use predict to corapute the predicted model in order to compare these with the actual
probabilities and standard
deviations
based
on our
values:
predict phat (optionp assumed; Pr(y)) gen diff_p = phat - p • summarize diff_p Variable I
Obs
Mean
diff_p ]
1000
-.0107081
Std. Dev,
Min
Max
.0131869 -.0466331
.010482
• predict sigmahat, sigma gen diff_s = sigmahat - sigma . summarize diff s Variable
Ohs
Mean
Std. Dev.
diff_s
i000
.1558882
.1363698
Min
Max
,0000417
.4819107
Saved Results hetprob
saves
in e():
Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
2
e(k_eq)
number of equations
e(chi2_c)
X 2 for heteroskedasticity LR test
e(k_dv)
number of dependent variables
e (p_c)
p-value for heteroskedasticity LR test
e(N..f) e(N_s)
number of zero outcomes number of nonzero outcomes
e(df..m_c)
degrees of freedom for heteroskedasticity LR test
e(df..m) e (11)
model degrees of freedom log likelihood
e(p) e (ic)
significance number of iterations
e(ll_O) e(ll_c) e(N_clust)
log likelihood, constant-only model log likelihood, comparison model number of clusters
e(rank) e(rankO)
rank of e(V) rank of e(V) for constant-only model
e (cmd)
hetprob
e (opt)
type of optimization
e(depvar) e(title)
name of dependent variable title in estimation output
e(chi2type) e(chi2_ct)
Watd or LR; type of model x 2 test Wald or LR; type of model x 2 test
e(clustvar)
name of cluster variable
e(method) e(vcetype) e(user)
requested estimation method covariance estimation method name of iike]ihood-evaluator program
e(offset#) e(predict) e(cnslist)
offset for equation # program used to implement predict constraint numbers
e(b)
coefficient vector
e(V)
variance-covariance
e(ilog)
iteration log (up to 20 iterations)
Macros
corresponding to e(chi2_c)
Matrices
Functions e(sample)
marks estimation sample
the estimators
matrix of
I i i
hetprob--Maximum-likelih¢od heteroskedasticprobit estimation
47
i
Methodsand Formulas ?
hetprobis implemented as an ado-file. !
The heteroskedastic probit model is a generalizaiion of the probit model since it allows the scale
t
variables. of the inverse link function to vary from observation to observation as a function of the independent The log-likelihood function for the heteroskedas|ic probit model is
lnL = _ wj In_;{xjb/exp(zT)}+ }-'_wj In [1- _{xjb/exp(zT)}] jeS
jffS
where S is the set of all observations j such that yj -¢ 0 and is maximized as described in IN] maximize.
Wj
denotes the optional weights. In L
References Greene, W. H. 2000. Econometric Harvey. A. 1976. Estimating
Analysis. 4th ed. Upper _addle River. NJ: Prentice-Hall. i regression models with multiplicative heteroscedasticity. Ecooometrica
44: 461-465,
AlsoSee Complementary:
[R] adjust, [R] constraint, [R! lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xi
Related:
[R] biprobit, [R] ciogit, [R] casum, [R] glm, [R] glogit, JR] logistic, [R] logit, [R] mlogit, [R] olog!t, [R] probit, [R] scobit, [R] xtprobit
Background:
[U] 16.5 Accessing coefficienb and standard errors, [U] 23 Estimation and post-_timation commands, [u] 23.11 Obtaining robust variance estimates, [u] 23.12 Obtaining scores, [R] maximize
i
....... "_'"',F J
Title hilite -- Highlight a subset of points in a two-way scatterplot
Syntax hilito
yvar xvar [ifexp] [in range], hilite(exp2)
[ graph_options ]
Description The hilitecommand draws a two-way scatterplot highlighting the observations selected by exp2.
Options hilite(exp2) is not optional. It specifies an expression identifying the observations to be highlighted. graph_options are any of the options allowed with graph, twoway; see [G] graph options.
Remarks > Example You have data on 956 U.S. cities, including average January temperature, average July temperature, and region. The region variable is coded 1, 2, 3, and 4, with 4 standing for the West. You wish to make a graph showing the relationship between January and July temperatures, highlighting the fourth region: . hilite tempjan tempjuly, hilite(region==4)
region==4
ylabel xlabel
highlighted
80=
= =
©
60"
'
_ :,"'.3 =_ 4e
._.¢'""
C
•
g (g _'
",
20
io_ a •
.....
g=*
" .
i"_r
• J _
". ,tNi'_': " : "; _'.2"_.: , "-.:
=
,:
4
.; .
e
0
1 Average
July
48
Temperalure
,, :_
hilite -- Highlighta subset of pointsin a two-way scatterplot It is possible to use graph
to product graphs like ffiis, but hilite is often more convenient.
49
q
[3Technical Note By default, hilite uses'.' for the plotting Lvmbbl and additionally highlights using the o symbol. Its default is equivalent to specifying sTabol(.o)as one of the graph_options. You can vary the symbols used, but you must specify exactly two symbols. The first is used to plot all the data and the second is used for overplotting the highlighted Subset.
Methodsand Formulas hilite is implemented as an ado-file.
References Weesie,J. 1999.dr38: Enhancementto the hilite command.Stata TechnicalBulletin 50: 17-20. Reprintedin Stara TechnicalBulletin Reprints,vol. 9, pp. 98-101.
AlsoSee Related:
[R] separate
Background:
Stata Graphics Manual
'"::"
Title I
hist-
Categorical
variable histogram
[
II
Syntax hist
varna,ne [weight]
[if exp] [in range]
[. i._ncr(#) graph_options
]
fweights are allowed; see [U] 14.1.6 weight.
Description hist graphs a histogram of varname, the result being quite similar to graph hist, however, is intended for use with integer-coded categorical variables. hist determines the number of bins automatically, labels are centered below the corresponding bar.
the z-axis
hist may only be used with categorical variables maximum(varname) - minimum(varname) < 50.
with
varname,
is automatically a range
of less
histogram.
labeled, and the than
50;
i.e.,
Options incr(#) specifies how the z-axis is to be labeled, incr(1), the default if varname reflects 25 or fewer categories, labels the minimum, minimum + 1, minimum 4- 2..... maximum, incr (2), the default if there are more than 25 categories, would label the minimum, minimum + 2, ..., etc. graph_options xlabel(), saving().
refers to any of the options allowed with graph's histogram style excluding bin (), and xscale(). These do include, for instance, freq, ylabel(), by(), total, and See [G] histogram.
Remarks b, Example You have a categorical variable rep78 reflecting the repair records of automobiles. 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, and 5 - Excellent. You could type
(Continued
on next page)
5O
It is coded
h_t -- Categoricalvariablehistogram
51
• graph rep78, histogram bin(5)
to obtain a histogram. Youshould specie, bin(5) because your categorical variable takes on 5 values and you want one bar per value. (You could omit the option in this case, but only because the default value of bin() is 5; if you had 4 or 6 bars, you would have to specify it; see [G]histogram.) In any case, the resulting graph, while technically correct, ii aesthetically displdasing because the numeric code 1 is on the left edge of the first bar while the numeric code 5 is on the fight edge of the last bar. Using hist; is better: • hist rep78
434783
0
Repair
Record
1878
not only centers the numeric codes underneath the corresponding bar, it also automatically labels all the bars. hist
You are cautioned: hist is not a general replacement for graph, histogram, hist is intended for use with categorical data only, which is to say, floncontinuousdata. If you wanted a histogram of automobile
prices,
for instance,
you
would
still want
_o use the graph,
histogram
command.
;:
,_r,.
52
hist -- Categorical variable histogram
I> Example You may use any of Research and Survevs on election day data in Lipset (1993)--you • hist
candi
the options you would with graph, histogram. Using data collected by Voter based on questionnaires completed by 15,490 voters from 300 polling places originally printed in the New York Times• November 5. 1992 and reprinted draw the following graph:
[freq=pop],
by(inc) total ylab ytine noaxis title (Exit Polling By Family Income)
$50-75k
$75k+
6
= o o It.
.6
o,,_' .... L, ' Candidale
Exit Polling
voted
for,
1992
by Family
Income
[] Technical Note In both of these examples, each bar is labeled; if vour categorical you may not want to label them all. Typing
variable takes on many values,
hist myvar, incr(2)
would labeleveryotherbar.Specifying incr(3) would labeleverythirdbar,and so on,
Methods and Formulas hist is implemented
as an ado-file.
References Lipset, S. M. ]993. The significance
of the I992
Also See Related:
[R] spikeplot, [G] histogram
Background:
Stata Graphics Manual
election.
Political
Science
and Politic,_ 26_1_: 7-16,
Title hotel -- Ho_lling's
generalized means test
Syntax hotel varlist [weigh_ [iJ exp] [in range] [, by(varname)notable aweights
and fweights
]
are allow _d: see [U] 14,1.6 weight
DescriptiOn hotelperforms Hotelling's T-squared test for testing whether a set of means is zero or, alternatively, equal between two groups
i
Options by(varname) specifies a var able identifying two groups; the test of equality of means between groups is performed. If by '.) is not specified, a test of means being jointly zero is performed. i
t
notablesuppresses printing
table of the means being compared.
Remarks hotel performs Hotelling's T-squared test of whether a set of means is zero, or two sets of means are equal. It is a multivariate est that reduces to a standard t test if only one variable is specified.
I, t i I
_ Example You wi_h to _est whether a new fuel additive improves gas mileage in both stop-and-go and highway situatiotls. Taking tw_lye cars, you fill them with gas and run them on a highway-style track, recordingtheir gas mileage. Y_)uthen refill them and run them on a stop-and-go style track. Finally, you repeat the two runs but this time use fuel with the additive. Your dataset:is . describe Contains d_ta from gasexp.dta obS : 12 vats : size:
i
variable n_me
i ! i
id bmpgl ampgl bmpg2 ampg2
_
5 288 (c _.9% of memory free) storage type float float float float float;
lisplay _ormat
value label
/,9. Og /,9.Og /,9.0g /,9.0g /,9.0g
13 Jul
2000
13:22
variable label car id trackl before additiv_ trackl after additive_ track 2 before additive track 2 after additive
Sortdd by :
53
r'_
54
hotel -- Hoteiling's T-squared generalized means test [
To perfor_ zero:
the statistical test, you jointly test whether the differences
• g_n |
diffl
= ampgl
- bmpgl
g_n dill2 = ampg2 | hgtel diffl dill2
bmpg2
Variable
0bs
diffl dill2
12 12
I
Mean
Std.
1.75 2.083333
Dev.
2.70101 2.906367
1-g_oup Hotelling's T-squared = 9.6980676 F t, st statistic: ((12-2)/(12-1)(2)) x 9.6980676 H0:
Vector
of means is equal ¢o a vector F(2,10) = 4.4082
Prob
The meat
> F(2,10)
=
in before-and-after
Min
Max
-3 -3.5
5.5
results are
5
= 4.4082126
of zeros
0.0424
are different at the 4.24% significance level.
[] Technical _ote We used Hotelling's T-squared test because we were testing two differences jointly. Had there been onlvlone difference, we could have used a standard t test which would have yielded the same results as_totelling's''
test:
* W_ could have performed ampgl = bmpgl
the
test
like
this:
t test
Vari
Obs
Mean 22.75
12
• tt_st
20.68449
24.81551
2.730301
19.26525
22.73475
12
1.75
•7797144
2.70101
.0338602
3.46614
mean(ampgl
- bmpgl)
Ha:
= mean(diff)
mean(dill) ~= 0 t = 2,2444
P >
Itl =
mean(dill) > 0 _ = 2.2444
P > t =
0,0232
this: = 0
t test Std.
dlffl
12
1.75
.7797144
of freedom:
Ha: mean < 0 t = 2.2444 0.9768 this:
Err.
Std.
Dev.
2.70101
[95_
Conf,
.0338602
Interval] 3.46614
11 He:
Or like
= 0 Ha:
0.0463
Mean
t =
Interval]
3.250874
Obs
PI<
Conf,
.7881701
Variable
Degrees
[95_
.9384465
0,9768
dill1
One-_ample [
Dev.
21
Ha_ mean(dill) < 0 t = 2.2444
* Or like
Std.
Err.
12
Ho:
P < t =
Std.
meam(diffl) Ha:
P >
mean t =
Itl =
= 0
-= 0 2.2444 0.0463
Ha: mean > 0 t = 2.2444 P > t =
0.0232
""---
hotel-- Hotellin_sT-squaredgeneralizedmeanstest
55
. _otel dill1 i
Variable
i
[
0
Mean
diffl
Std.
1.75
Dev.
Min
Max
2.70101
-3
5
1-Sroup _otelling's T squared = 5.0373832 F _est statistic: ((i!2-I)/(12-I)(I))x 5.0373832 = 5.0373832
I
HO{ Vecter of means i 3 equal to a vector of zeros F(I,II) = 5.0374 Prob > F(I,II) = 0.0463
> Example Now"donsider a variation _n the experiment: rather than using 12 cars and running each car with and without the fuel additiv, you run 24 cars, 12 with the additive and 12 without. You have the following!dataset: . d_scribe
! i
I i
I
Contains data from ga: _xp2.dta o_s: 24 vats: 4 size: 480 97.4_ of memory free)
8 Sep 2000 12:19
[
! storage variable name type
display format
id mpgl mpg2 additive
_9.Og Z9.0g _9.0g _9.0g
float float float float
value label
Variable label
yesno
car id _rack 1 track 2 additive?
Sorted by: • tab additive additive?
Fr_q.
Percent
_um. i
no
12
50.:00
50.00
yes
12
50.00
100.00
Total
24
I00,00
jr
I
i
This is an_unpaired expefime_t because there is no natural pairing of the cars; we want to test that the rneanslof mpgl are equal for the two groups specified by additive as are the means of mpg2:
{
(Continued on next page)
_
=_
60
d9 -- ICD-9-CM diagnostic and procedure codes
] !
t ICD-9 codes and, if it does, icd9_] clean modifies the _ariable to contain the codes in either of two standard formats. Use of icd9[p]_clean is optional: all icd9[p] commands work equally well with cleaned or uncleaned codes. 'I_e_e are numerous ways of writing the same ICD-9 code, and icd9[p] clean is designed (1) to ensure c insistency, and (2) to make subsequent output look better. icd9[p] uncleaned) icd9[p] ge code. icd9 ICD-9 code
generate produces new variables based on existing variables containing (cleaned or ICD-9 codes, icd9[p] generate, main produces newvar containing the main code. aerate, description produces newvar containing a textual description of the ICD-9 p] generate, range() produces numeric newvar containing I if varname records an in the range listed and 0 otherwise.
icd9_p] lookup and icd9[p] search are utility routines that are useful interactively, icd9[p] lookup sin ply displays descriptions of the codes specified on the command line, so to find out what diagnostic _:913.1 means, you can type icd9 lookup e913.1. The data you have in memory are . I lrrelevant-_and remain unchanged when using icd9[p] lookup, icd9[p] search is like icd9[p] lookup ex!:ept that it turns the problem around; icd9[p] search looks for relevant ICD-9 codes from the dd_cription given on the command line. For instance, you could type icd9 search liver or icd9p
s._arch
liver
to obtain a list of codes containing the word "liver".
icd9[p] query displays the identity of the source from which the leD-9 codes were obtained and the textual _escription that icdg[p] uses. Note that! ICD-9 codes are commonly written in two bays,, with and without periods. For instance, with diagnostic codes, you can write 001, 86221. E8008. and V822, or you can write 001., 862.21, E800.8, and V82.2. With procedure codes, you can write 01, 50. 502. and 502l, or 0l., 50., 50.2. and 50.21. _he icd9[p] command does not care which syntax you use or even whether you are consistent. _ase also is irrelevant: v822, v82.2, v822. and v82.2 are all equivalent. Codes may be recorded w_h or without leading and trailing blanks.
Optionsfor use with icd9[p]check any tells ic the code, 230.52 option, conside list
59[p] check to verify that the codes fit the format of leD-9 codes but not to check whether are actually valid. This makes icd9_p] check run faster. For instance, diagnostic code _r 23052 if you prefer) looks valid, but there is no such ICD-9 code. Without the any 30.52 (or 23052) would be flagged as an error. With any. 230.52 (and 23052) is not d an error.
tells cd9[p] chock that invalid codes were found in the data. 1. 1.1.1. and perhaps 230.52 if any is n )t specified, are to be individually listed.
genaratet
ewvar) specifies that icd9[p]
check
is to create new variable newvar containing,
for
each observation, 0 if the code is valid and a number from 1 to I0 otherwise. The positive numberslindicate the kind of problem and correspond to the listing produced by icd9[p] check. For instance. 10 means the code could be valid, but it turns out not to be on the official list.
Options for use with icd9[p] clean dots specifies whether periods are to be included in the final format. Do you want the diagnostic codes recorded, for instance, as 86221 or 862.21? Without the dots option, the 86221 format would b_ used. With the dots option, the 862.21 format would be used. pad specifids that the codes are to be padded with spaces, front and back. to make the codes line up vertically.: in listings. Specifying pad makes the resulting codes look better when used with most other Stata commands.
i
icd9 -- ICD-9-CM diagnoSticand procedure codes
Options fOr i
61
with icd )[p]generate
main,descrip}ion,and ra _ge(icd9rangelist) specify what icd9[p] generate is to calculate. In all cases, varname specifies variable containing ICD,9 codes. main specifies ihat the malt .'ode is to be extracted from the IED-9 code. For procedure codes, the
i
i i
main code i_ the first tw_ characters. For diagnostic codes, the main Code is usually the first three or four characters (tlie characters before the dot if the code has dots) In any case, icdg[p] generate &Ses not care _hether the code is padded With blanks in front or how strangely it might be written; :icd9[p] gene rate will find the main code and extract it. The resulting variable is itself an ICD-Ocode and may be used with the other icd9[p] subcommands. This includes icd9[p] generate, ilain.
i ! i i_
descriptiondreates newva, containing descriptions of the ICD-9 codes.
I i{ i I
long is for _e with desc: iption.It specifies thai the new variable, in addition to containing the text describing the code, is to contain the code, too. Without long, newvar in an observation might contain "bro1_chus injury-closed". With long, it would contain " 862.21 t_ronchus injury-closed". end modifie_ long (speci _'ying end implies long) and places the code at the end of the string: "bronchus injury-closed 8I 2.21".
! i
i
range(icd9ran_elist) allows you to create indicator variables equal to l when the ICD-9 code is in the inclusive! range specifi _d.
Optionsfor usewith icd )[p]search
!
i
or specifies thai ICD-9 codes are to be searched for entries that contain any of the words specified after icd9[p I search,Th, default is to list only entries that contain all the words specified.
i
Remarks
I
code is
Let us begin _withdiagnost
codes--the
codes icd9 processes. The format of an ICD-9 diagnostic
[blanks { O-9,V,v} {0-9} {0-9} [. ] _0-9 r [0-9] jl [blanks] or
I
i i ; .
i
[blanks! E.e } { 0-9 } { 0-9 } {0-9 }[. ][0-9 [0-91 ] [blanl_s1 icd9 can dell with ICD-9 tiagnOstic codes written in any of the ways the above allows. Items in square brackets tare optional, the code might start with some number of blanks. Braces { } indicate required items. _he code the_ either has a digit from 0 to 9 the letter V (utmercase or lowercase) (first hne), or thei letter E (upl_ercase or lowercase ' '_.. second line).' After that, it has--two or mo re d_'g"_ts s, perhaps followed b a enod and th v u to tw e dvm ha s tollowed b ' more b' " i ! Y P " _t en it may ha "e p omor "_'siper p. " y
,anks .
l
_;;::
62
icd9 --
ICD-9-CM diagnostic and procedure codes
All of the following
meet the above
definition:
00: ool
')ol 001,9 O019 862 ._2 862,22 E80_). 2 e8092 V82|
Meeting t_ae above the above[definition, Examl_les|
definition of which
of currently
does not make the code valid. 15,186
defined
are currently
diagnostic
codes
There
are 233.100
possible
codes meeting
defined include
l
code
I
t i
I
description
001 001.0
cholera* cholera d/t vib cholerae
001.9 001.1
cholera cholera nos d/t vib el tot
999
complic medical care nec*
VOl
communicable dis contact*
V01. i VOI. 701.20 VOI. 3 VOl. 4 VOI. 5 VO1.6 VOl. 7 YO1.8 V01.9
tuberculosis contact cholera contactcontact poliomyelitis smallpox contact rubella contact rabies contact venereal dis contact viral dis contact nec communic dis contact nec communic dis contact nos
. . .
E800 E800.0 E800.1 E800.2 E800.3 E800.8 E800.9
rr rr rr rr rr rr rr
collision nos* collision nos-employ coll nos-passenger coll nos-pedestrian coll nos-ped cyclist coil nos-person nec coil nos-person nos
o , o
"Main ,eodes" refer to the part of the code to the left of the period. v82, and !_800 ..... E999 are main codes. There are 1.182 diagnostic
001,002 ..... main codes.
999. v0] .....
The m'Ain code corresponding to a detailed code can be obtained by taking the part of the code to the left lof the period, except for codes beginning with 176. 764. 765. v29. and v69. Those main codes are not defined, yet there are more detailed codes under them:
icd9 -- ICD-9-CM diagnostic and procedure codes I
cdde
d,:scription
[
1}'6 if(6.0 176,1
CDDE DOES NOT EXIST, but $ codes starting with 176 do exist: sl in - ka_si's sarcoma sl tisue - kpsi's srcma
764 754.0 754. O0
C )DE DOES NOT EXIST, but 44 codes starting with 7i54 do exist: It for-dates w/o let real* li :ht-for-dates winos
63
i
.5.
755 7_5.0 7_5. O0
C )DE DOES NOT EXIST, but 22 codes starting with %5 do exist: extreme immaturity* e_treme immatur wtnos
I
V_9 V:_9.0
O )DES DOES NOT EXIST, but 6 codes stating with V29 do exist: nt obsrv suspct infect
i
V_9.1
nt obsrv
I
V69 V_9.0 V619.1
O )DE DOES NOT EXIST, but 6 codes starting with V69 do exist: la k of physical exercse inirpprt diet eat habits
suspct
neurlgcl
"'_"
!
Our solution iis to define f)ur new codes:
t ! i
!
code
description
176 764 765 729 g69
kaposi's sarcoma (Stata)* light-for-dates (Stata)* immat & preterm (Stata)* nb suspct cnd (St,am)* lifestyle (stata)*
Thus, there are 15,186 + 5 = 15,191 diagnostic
i
Things are less confusing format of I CD-9iprocedure
!
I I
I
'
codes, of which
_'ith respect to procedure
codes_the
1,181 + 5 = 1,186 are main codes. codes processed
by icd9p.
The
co :les is [banks]
{0-9}{0-9}
[. ] [0-9 [0-9]]
[blanks]
Thus, there are i0,000 possil: e procedure codes, of which 4,275 are currently valid. The first two digits represent _e main co& of which 100 are feasible and 98 are currently used (00 and 17 are not used).
Descriptions The degcriptidns given for each of the codes is as found in the original source. The procedure codes • contain' the addition of fve new codes b_, us. An asterisk on the end of a description n_ d_cate_ "" _
i_ I
that the c°trespoiding ICD-9 tiagnostic code has subcategories. icd9[pJi quer_ reports thebriginal source of the information
1
on the codes:
r F J
64
icd9 -- ICD-9-CM diagnostic and procedure codes
• _cd9
query
_dta: I
i 1
Dataset from http://www.hcfa.gov/stats/pufiles.htm obtained 24aug1999 file http://www.hcfa,gov/stats/icdgv16.exe Codes 176, 764, 765, V29, and V69 defined
I
-- 765 176 kaposi's immat _ preterm sarcoma (Stata)* (Stata)* V29 nb suspct end (Stata)* V69 lifestyle (Stata)* cd9
query
J _d_a: Dataset obtained 24aug1999 • from http://www.hcfa.gov/stats/pufiles.htm file http://www.hcfa.gov/stats/icd9vl6.exe
> Example You t_ve a dataset containing up to three diagnostic codes and up to two procedures on a sample of 1,000 Ipatients: _se patients, _ist in 1/10 7.
patid I
I:
I_.
clear diagl 65450
diag2
3 2
710.02 23v.6
5 6 7 8 9
861.01 38601 705 v53.32 20200
procl 9383
proc2
37456
8383
17
2969
9337 7309 7878 0479
8385 951
i0
464.11
7548
diag3
E8247
20197
!
4641
Do not tD, to make sense of these data because, in constructing procedure codes were randomly selected.
this example,
the diagnostic
and
- _,-_Be_inlbvnoting that variable diagl is recorded sloppily--sometimes the dot notation is used and sometimes not, and sometimes there are leading blanks. That does not matter. We decide to begin by using icd9 cd9
clean clean
to clean up this variable:
diagl
di_gl contains invalid ICD-9 codes r (459) ;
icd9 clean refused because there are invalid codes among the 1.000 observations, check to find and flag the problem observations (or observation, as in this easel:
We can use icd9 :
i-|
!
[
) ) i
i
_....
icd9-_-ICD-9-CMdiagnostic and proce
. icd9 check diagl, gen(prob) diagllcontains i invalid ¢odes: I. Invalid placemer_ of period 2_ Too)many periods
t
0 0
I
3, 4_ 5i
Cod_ too short Cod# too long Invalid let chaz (not 0-9, E, or V)
0 0 0
i
6_
Invalid 2nd chax (not 0-9)
0
I
81 7_ 9.
Invalid 4th chat (not 0-9) Invalid Invalid 3rd 5th chat chat (not (not 0-9) 0-9)
0 0i
I
10,
Cod_ not defined
0
,ot.
i
i
. list pati_ diagl prob Lf prob
I
[
2.
patid 2
diagl 23v. 6
65
prob 7
Let's assume that _ve go back t_ the patient records and determine that this should have been coded 230.6: • replace d_agl = "230.6 (i re_l change made) . drop prob
if patid==2
We now tD,_againlto clean up t e formatting of the variable: • icd9 cleam diagl (643 dhange_ made) • lis_ in 1/10
_;
1, 2. 3. 4.
patid 1 2 3 4
diagl 65450 2306 71002 1026
diag2
I
5.
5
86101
6 7 8 9
38601 705 V5332 20200
2969
I )
6. 7. 8. 9.
[
10.
10
46411
20197
diag3
37456
procl 9383 8383
proc2 17
629
7548
E8247
9337 7309 7878 0479
8385 951
4641
)
Perhaps we prefer!the dot notati )n. icd9 clean can be used again on diagl, and now we will clean
l
Up diag2
and diag3:
• ted9 clea_ diagl, dots (936 _he/Ige_made) • icd9 clean diag2, dot_ (551 Changes made) • icd9 clea_ diag3, dote (i00 Changes made)
i
)
(Continued on next page)
_"
! i_d9 -- ICD-9-CM diagnostic and procedure codes
66
• lit
in
1
|
!
1/10
1
diagl 654.50
diag2
i.
patid
diag3
procl 9383
proc2
2.
2
230.6
374.56
3.
3
V10.02
8383
17
4.
4
102.6
5.
5
861.01
6.
6
386.01
7.
7
705
7309
8385
8.
8
V53.32
7878
951
9.
9
202.00
754.8
10.
10
464.11
201.97
629
We now turn to cleaning codes:
296.9
9337
E824.7
0479 4641
the procedure
codes.
We use icd9p
diag3
procl 93.83 83.83
(emphasis
on the p) to clean
these
l
. iccl9P clean procl, (816|changes made) |
dots
. ic_9p clean proc2, (140|changes made) !
dots
• li_t
in
1/10
I. i 2.
patid 1 2
diagl 654.50 230.6
diag2
3.
3
V10.02
4.
4
102.6
5.
5
861.01
6.
6
386.01
7.
7
705
73,09
83.85
8.
8
V53.32
78.78
95.1
9.
9
202.00
754.8
10.
10
464.11
201.97
374.56
296.9
icdl p check
E824.7
04.79 46.41
rules
clean
and icdgp
for the code:
clean
il does
not check
valid
ICD-9
procedure
codes;
168 missing
the code
is itself
proc2
contains
invalid
codes:
Invalid
2 3
Too many Code too
4 5
Code too long Invalid 1st char
(not
0-9)
0 0
6
Invalid
2nd
char
(not
0-9)
0
7
Invalid
3rd
char
(not
0-9)
0
8
Invalid
4th
char
(not
0-9)
0
Code
that
that the variable
values)
1
10
only verify
procl
contains
icd p check proc2
93,37
that both icd9
cleaned follows the construction icd_[p] check does that:
(proc
17
62.9
It is imDDKant to understand being valid.
proc2
not
placement
of period
0
periods short
0 0
1
defined
1
Total
Note that d_ag2 has an invalid generate( code. We;ould did above _ith icd9 check, .
find it using
icdg_] han create codes. For _nstance.
textual
new variables
containing
icd9p
descriptions
check,
generate(), just as we
of our diagnostic
and procedure
_
icd9 -- ICD-9-cM
i
diagnostic
and
proc_lure codes
67
• icd9 gen!tdl = diagl, desc . sort pared • list pared diagl tdl m 1/10 1. 2. 3. 4. 5 6
pa_id 1 2 3 4 5 6
diagl 654.50 230.6 VlO.02 102.6 861.01 386.01
tdl cerv incompet preg-unsp ca in situlanus nos hx-oral/ph_aTnxmalg nec yaws of bo_e _ joint heart contUsion-closed meniere dis co¢hlvestib
7 8 9
705 V53.32 202.00
disorders of sweat gland* ftng autmt¢ dfibrilla_or ndlr lymunsp xtrndl ors
7 8 9
I !
I
10 iCd9_] I0 464.11 ac tracheitis w obstruct Notethat _enerate, escription does notpreserve thesort order ofthedata(andneither does icdg[p] cheek unless you specify the any option),
Recall that pro6edure-codep:ioc2 had an invalid codel Even so, icdgp generate, is willing to_create a textual de: cription variable:
descript
. iedgp gen!tp2 : proc2, desc (I nonmissidg value inva!_idand so could not be labeled) . sortlpatid listipatid proc2 tp2 i:I1/10 i
pared
proc2
i
I.
' _:
2. 3,
i2 17 _3 5
i
5. 6. 7. 8.
_i
83.85 95.1
Y B
9.
itp2
musc/tend Img change nec form _ structur eye exam*
D
lo.
io
tp2 contains nothing when F "oc2 is 17 because 17 is not a valid procedure code. icdg[p] g*ner_te
can also reate variables containing main codes:
. icdg!gen main1 : diagl
I
main
listlpatid!diagl pati_ dinE1mainl [n I/I0 mainl 1. 1 654.50 654
I
3.
3 vtoo2 2 4
230.6 102.6
rio
5, 6,
_
861.01 386.01
861 386
7. 8.
7 8
705 V53.32
705 V53
10. 9.
109
464.11 202.00
464 202
2. 4. i
i
icdgp generate, :
230 102
_ain can sit ilarlygenerate main procedure codes. i
Sometimeslone i_ merely exa_fining an observation: • list
dins*
_f patid==56_
ion
_-
68
icd9-
ICD-9-CM diagnostic and procedure codes
I ...........
diagl 56 I.
diag2
diag3
526.4
If we woladered what 526.4 was, we could type ! . i_d9 |
lookup
1 m_tch
found:
526.4
icd9[p]
526.4
inflammation of jaw
]_ookup has the ability to list ranges of codes: I • i_d9
lookup
526/527
12 _atches found: 526 jaw diseases* 526.0 devel odontogenic cysts 526.1 fissural cysts of jaw 526.2 cysts of jaws nec 526.3 cent giant cell granulom 526.4 infl_mmation of jaw 526,5 alveolitis of jaw 526.8 other jaw diseases* 526.81 exostosis of jaw 526.89 [526.9 527
icd9[p]
st_arch /
jaw disease nec jaw disease nos salivary gland diseases*
has the ability to go from description to code:
• i_d9 search jaw
disease
|
4 m_tches found: |526 jaw diseases* 1526.8 other jaw diseases* 526.89 jaw disease nec
q 526.9
jaw disease nos
I
Saved Results icd9
ahd icd9p save in r()" Scalars r(e#) r(esum)
number of errors of type # total number
of errors
References Gould, W. 2p00. din76:ICD-9 diagnostic and procedure Technica I Bulletin Reprints. vol. 9. pp. 77-87. t l
codes.
Slate
Technical
Bulletir_ 54: 8-16.
Reprinted
in State
impute--- Predictmissingvalues
i
F
Syntax imputedepvdr
varlist [w_'ght]
Iif
exp] [in range],
g_enerate(newva,'l)
[ _ rp(nevar2)] aweights
and fweiShts
are allow_
see [V] 14.L6 weight.
Description impute mils in
missing vail _s; depvar is the variable whose missing values are to he imputed. vartist is tile list of variables m which the imputations are to be based and newvarl is the new '
i
i
variable to contain the imputations.
!
conducted dfficientty: this nece sitates a liner of 31 variables in varlist.
impute organi_zesthe casesby pa_ternsof missingdata so the missing-valheregressionscan be
Options generate(t/ewva'r
i [
I i
1) specifies he name of the new' variable to be created and is not optional.
varp(newvar2) specifies the n_me of a new variable to contain the variance (not the standard error'_ of the pr_edicti_n.
Remarks
i
In observation s where depva F is not equal to missing; newvarl is set equal to depvar and newvar2 (if specified) is set to zero. Whlere depvar is missing, neuvarl is imputed using the prediction from
i ! !
the best available subset of othelrwise present data. r_ewVar2(if specified) is setto the variance of the prediction. This variance is in tl_esense of predicts stdp option, although squared: see [R] predict. It is an estimate df how far thel prediction of the mean would differ from the actual data point were
l
_t known. This is not the only method or coping with missing data, but it is often much better than deleting cases with any migsing data, wl_ich is the default. For a discussion of different methods of imputation. see, for example, Little and Rubin (1987).
! i
Example imputemay be used in conjunction with, for instance, regression (or an5' estimation technique) to avoid the lo_s of an unacceptal:le number of cases due to missing data. Bear in mind, however, lhat the subsequent estimates may b : biased because any variable imputed by impute is only an estimate of the unknown, true value. In he case of linear regression, a reasonable bound (in fractional terms) for the bias!is gix,!enby the ratio of the mean of newvar2 to the variance of newvarl. Usualh'. the bias is toward zerO, meaning tl-at the effect of the variable will be underestimated. 69
.....
,v
,,,,_ut_
u
rr_olc(
missing
values
You have been hired by the restaurant industry to study expenditures on eating and drinking. You, have &ta on 898 U,S. cities: describe C_ntains data from emd.dta obs: _ars:
898 9 34,124 (96.6_
ize: v
1980 City Data 13 Jul 2000 14:00
iable name
type storage long float int float float
f_s it.eat iz:ome_pc l_.rsales_pc jaltemp
of memory free)
format
label
display _10.Og _9.0g _8.0g _9.0g _9.0g
value
variable label
precipitation
float
_9.0g
state/place code In(Dining sales per capita) Per capita money income in(retail sales per capita) Median January temperature (Fahrenheit) Annual precipitation (in.)
in,income me_ian_age hh_ize
float float float
_9.0g _9.0g _9.0g
in(median per capita income) Median age Persons per household
...... 1
So,ted by: !
You
beglnby running theregression: | • _egress In_eat In_rsales jantemp precip in_inc median_age Source
I
SS
df
MS
I Model I Residual ]
87.7285014 45.1948678
6 657
14.6214169 .068789753
E 132.923369
663
.200487736
Total I In_eat
Coef.
Std. Err.
t
P>It[
hhsize
Number of obs = F( 6, 657) =
664 212.55
Prob > F R-squared Adj R-squared Root MSE
0.0000 0.6600 0.6569 .26228
=
[95Z Conf. Interval]
In_:sales_pc jantemp pre:ipitat~n i.n_income
.6611241 .0019624 -.0014811 .I158486
.026623 .0007601 .0008433 .056352
24.83 2.58 -1.70 2.06
0,000 0.010 0.090 0-040
.6088476 .0004698 -.0030869 .0051969
.7134006 .003455 .000224Z .2265003
m,_dian_age hhsize .cons
-.0010863 -.0050407 -I.377592
.0002823 .0004243 .4777641
-3.85 -11.88 -2.88
0.000 0.000 0.004
-.0016407 -.0058739 -2.31572
-.0005319 -.0042076 -.459463
Despit, having data on 898 cities, your regression was estimated on only 664 cities_74% of the original 8 )8. Some 234 observations were unused due to missing data. In this case, when you type snrnmari: e, you discover that each of the independent variables has missing values, so the problem is not that one variable is missing in 26% of the observations, but that each of the variables is missing in some _servations. In fact, summarize revealed that each of the variables is missing in roughly 5% of th_ observations. We lost 26% of our data because, in aggregate. 26% of the observations have one )r more missing variables. Thus, |"+'eimpute each independent variable on the basis of the other independent variables: . i_pute In_rtl jantemp precip In_inc medage hhsize, gen(i_In_rtl) 4._0Y, (44) observations imputed impute } jantemp in rtl precip In_inc medage hhsize, gen(i_]antmp) 5.B0_, (53) observations imputed
f
impute -- Predict missing values
71
. impute _recip In rtL jantemp In_inc medage hhsize, gen(i_precip) 4.i56_(41) observati)ns imputed . impute In_inc In rtL jantemp precip medage!hl_size,gen(i_in_inc) 4.!34_ (39) observati)ns imputed . impute Medage In rt jantemp precip In inc hhsize, gen(i_medage) 4._5% (40) observati .ns imputed . impute lihsize In rt jantemp precip in_inc medage, gen(i_hhsize) 5._3_ (47) observati ,nsimputed
Thatdone,we can now re-estmatethe recession on the imputedvariables: • regress !in,eat i_Injsales Soul4ce Mod_l Residual ! _ Total
: i
In_eat
!
i _
S_
df
108.8_923 63.792_
i_jantmp i_precip i_in_inc i_median_ase i hhsize
.45
172.65:i145 Conf.
MS
6
18.1432051
891
,071596986
897
.192477308
Std. Err.
t
Number of obs = F( 6, 89_) = Prob > F =
898 253.41 0.0000
R-squared = Adj R-squexed = Root MSE =
0.6305 0.6280 .26758
P>ItI
[95% Conf. Interval]
i_im_rsales i_jantmp i_precip i_in_inc
.660_ )6 .0021G .9 -.0013_88 .095883
.0245827 .0006932 .0007646 .0510231
26,89 3.03 -1.74 1.88
0.000 0.002 0.083 0,061
.6126593 .0007414 -.0028275 -.0042764
.7091528 .0034625 .0001739 .1960024
i_median_age i_khsize _cons
-,0011_34 -.0052508 -I.143142
.0002584 .0003953 .4304284
-4.35 -13.28 -2.66
0.000 0.000 0,008
-.0016304 .0060267 -1.987914
-.0006163 -,004475 -.2983702
Notethat the regressionis no_ estimatedon all 898observations. <1
> Example impute canalsobe used 4th factor to extend[actorscoreestimatesto caseswith missing data.Forinstance, we havea /afiantof the automobile dataset (see[U] 9 Stata'son-linetutorials and sampledataSets)that conltinsa few additionalvariables.Wewill begin by factoringall but the price
vadable;
see [R] factor
• factor m_g-foreign, f ctors(4) (obs=66) (principal _actors; 4 factors retained) Eigenvalue Difference Proportion
Factor
Cumulative
3
i
1 2 3 4 5 6 7 8 9 I0
6. 99066 1.39528 O. 58952 O. 29870 0.24252 O. 12598 0.03628 -0.01457 -0.02732 -0.05591
5. 59588 0.80576 O. 29082 O. 05618 0.11654 O. 08970 0.05085 0.01275 O. 02860 0.05736
O. 7596 0.1516 O. 0641 O. 0325 O. 0264 O. 0137 0.0039 -0.0016 -0.0030 -0.006i
O.7596 0.9112 O. 9753 1. 0077 1. 0341 1.0478 1.0517 1.0501 1.0472 1.0411
tt I2 13
-0.11327 -0.11891 -0. 14605
O.00564 0,02714
-0.0i23 -0.0129 -0.0i59
1. 0288 1.0159 1.0000
r
72
impute
--
Predict missing values Factor Loadings
i Variable
1
mpg rep78 I rep77 headroom Irear_seat trunk
2
3
4
Uniqueness
-0.78200 -0.51076 -0.27332 0.56480 0.66135 0.72935
-0.02985 0.68322 0.70653 0.26549 0.20473 0.37095
-0.06546 -0.1i181 -0.32005 0.29651 0.36471 0.28176
0.33951 -0.01428 0.04710 0.16485 0.02062 0.12140
0,26803 0.25963 0.32145 0.49542 0.38727 0.23633
weight length turn ! displacement
0.95127 0.94621 0.88264 0.92199
0.10135 0.19595 -0.05607 0.06333
-0.18056 -0.05372 -0.08502 -0.17349
-0.09179 -0.10325 0.01169 -0.02554
0.04378 0.05274 0.21043 0.11518
_ar_ratio | order |foreign
-0.82782 -0.25907 -0.75728
0.06672 0.15344 0.30756
0.24558 0.01622 0.19130
-0.10994 0.14668 -0.29188
0.23787 0.88756 0.21014
I
There appear interpreta_on
to
be two we might
factors interpret
here. the
Let's pretend that we have given first factor as size. We now obtain
the first the factor
two factors scores:
an
||
. s_ore fl f2 (based on unrotated factors) (2 scorings not used) Scoring Coefficients 1 2
Variable I
mpg rep78 rep77 headroom _ear_seat I trunk
-0.02094 -0.03224 -0.11150 0.05530 0,03355 0.04603
0.11107 0.44562 0.27942 0.10017 0.02812 0.20622
I
0.12250 0.39997 0.04562 0.19281 -0.08534 0.00638
-0.13040 0.60223 -0.12825 0.11611 0.03528 0.06433
weight length turn displacement g_ar_ratio order
-0.06469
foreign Although nfissing observati(
is not v ]ues is:
(we
0.28292
revealed
by
this
output,
in 8 cases
would
see
that
if we typed
the
scores
summarize).
could
To
not
impute
. i_ _ute fl mpg-foreign, gen(i_fl) 10.91_ (8) observations imputed i, _ute f2 mpg-foreign, gen(i_f2) I0._1Z (8) observations imputed And
we _ ight
now
run
a regression
of price
(Continued
in terms
on next
of the
page)
two
thctors:
the
be calculated factor
scores
because
of
to all the
impute -- Predict imissingvalues i
. regre_s
price
i_f3
Source
i_f2 SS
df
MS
Number
of obs =
F( 2, Model
15¢_._23103
Residual
47£ 342293
Total
63_ )65396
price i_fl
t
73
74
71)=
2
79611551.5
Prob
71
6702004.13
R-squared:
=
0.2507
8699525.97
kdj R-squ_red Root MSE
= =
0.2296 2588.8
73
Err.
t
P>lt
> F
3oef.
Std.
3.347
315.7177
3.88
0.000
595.8234
{
[95Y. Conf.
=
ti.88 0.0000
Interval] 1854.87
i_f2
I
911:2878
339.9821
2.68
0.009
233.3827
1589.193
icons
J
626 1.285
301.7093
20.76
0.000
5660.694
6863,877
Methodsand Formulas imputeis implemented
Lsan ado-file.
Consider the command
repute y xl X2 ... Xk, gen(_)
When y is not missing,
varp(_).
=yand_=0.
Let y9 be an observatiol br which y is missing. A regressor list is formed containirig all x's for whic_ xij is not missing. If _e resulting list is empty, missing. !OtherWise a regres!iion of y on the list is dsdmated (see [R] regress) the predicted Value of yj (si,'e IN] predict), t,"j is defined as the square of the prediction, as Calculated by _redict, stdp; see [Ri predict.
from xl, x2 ..... xk _.3 and _j are set to and _j is defined as standard error of the
References i
Goldstein,R. 1996.sedl0: Patters of missingdata, Stata TechnicalBulletin32: 12-13, Reprintedin Stata Technical Bulletin Reprints.vol. 6. p. I 5.
i[
_.
1996. sedl0.I: Patternsof!missing data. update. Stata TechnicalBulletin 33: 2, Reprintedin Stata Technical BulletinReprints,vol. 6, pp. 15-116.
l,ittle.R. i. A. and D. B. Rubin. 1987.StatisticalAnalysis u4OJMissingData. New York:John Wiley& Sons. }
.
Mander.Ai asd D. Clayton.I999 sg116:Hotdeckimputation.Stata TechnicalBulletin 51: 32-34. Reprintedin Srata TechniralBulletin Reprints, v,)t.9, pp. 196-199. -_.
2000. sgll6A: Update to hotdeck imputation.Stata TechnicalBulletin 54: 26. Reprinted in Smta Technical BulletirtReprints, vol. 9, p. )9,
AlsoSee
i
Complementary:
[R] pr, diet
Related:
[R] ipelate, JR]regress
..... Title
,t ;
i
Quick reference for reading data into Stata
Description This er_try provides a quick reference for determining which method to use for reading non-Stata data into _hemoD,. See [U] 24 Commands to input clam for more details.
Remarks Summary bfthe different methods insheet o inshe, t reads text (ASCII) files created by a spreadsheet o The da a must be tab-separated space-s ,_parated. o A sing] _ observation
or comma-separated,
or a database program.
but not both simultaneously,
nor can it be
must be on only one line.
o The fir t line in the file can optionally contain the names of the variables. infile (fre_ format)--infile without a dictionary o The da
can be space-separated,
o Strings with embedded separat, d). o A singl _observation line.
tab-separated,
or comma-separated.
spaces or commas must be enclosed
in quotes (even if tab- or comma-
can be on more than one line or there can even be multiple observations
infix (fixe(_ format) o The da!
must be in fixed-column
format.
o A singl
observation
o infix
as simpler syntax than infile
can be on more than one line. (fixed format).
infile (fixe 1 format)--infile with a dictionary o The daa may be in fixed-column o A singl _ observation o infil_
format.
can be on more than one line.
(fixed format) has the most capabilities
74
for reading data.
per
infile-- Quicl_referencefor readingdata intoStata
75
I
Examples
I
l
> Example
topof exampl.raw i
1 0 0 0
0 0 I O
1 I 0
John Smith Paul Lin Jan Doe f Julie McDonald
m m f
endof exampl.raw-contains tab-separated data. The type command with the showtabs
option shows the tabs:
type eXampl.rau, slowtabs 1O1John Smithm OO.IPaulLin_T>m OIOJan Doe<3>f oO.Julie Mc[onaldf Z
It could be read in by • insheet a b c name gender using exampl
Example topof examp2.raw--
i
a,b, c,name, gender 1,0,I ,John Smith,m 0,0,I ,Pa_l Lin,m O,l,O,Jan Doe,f 0,0, Julie McDonald,:
!
endof examp2.rawcould be read in by
i
" insheet
using
examl,2
q
Example topof examp3.raw 1 0 0 0
0 0 I 0
i 1 0
"John Smith" m "Paul Lin" m "Jan Doe" f "Julie McDonald"
f
t
endof examp3.raw
contains tab-separated data _'ith strings in double quotes. ;
. type
examp3.raw, s]lowtabs
lO"John Sm th"m OO"Paul Li:"m O<:T>IO"JanDoe f OO."Julie M _Donald"f
It could be read in by i
• infile byte (a b c
I strl5 name strl gender hsing examp3
_
76
infile -- Quick reference for reading data into Stata
or
I
!
• zlsheet
a b c name
gender
using
examp3
!
Or
i_file
using
dict3
where the dictionary
dict3.dct
contains top of dict3.dct
/ infJle
dictionary
using
byte
a
byte
b
byte str15
c name
strl
gender
examp3
{
} end of dict3.dct
..
q
> Example top of examp4.raw 1 0 1 "John
Smith"
0 0 1 "Paul
Lin"
0 1
"Jan
Doe"
0 0 .
"3ulie
m
m
! !
f
McDonald"
f
end of examp4.rawcould be _ad in by • ii file
byte
• infile
using
(a b c) strl5
name
strl
gender
using
examp4
or dict4
i
l where
the _dictionary dict4,
dct
contains
i
-- top of dict4.dct
| inf_le
dictionary
using
byte
a
byte
b
byte strl5
c name
strl
gender
examp4
{
} end of dict4.dct
<3
> Example mp of examp5.raw I01_ John
Smith
O01z Paul
Lin
010J Jan Doe O0
Julie
McDonald end of examp5.raw
• i_fix
could be
a I b 2 c 3 str gender
:ad in by
4 str
name
5-19
using
examp5
i
infile-- Quickirdferencefor readingdata into Stata
77
or • imfix
}
using
dict5a
where d_ct5a.dct
contains -- topof dict5aAct--
infix didtionary
usinl a
str sir
examp5 1
b
2
c
3
gende: name
4 5-19
{
i
i
endof dict5a.dct--
or . i_file Using
dictSb
where dict5b.dctcontains ! !
topof dict5b.dct--_ infile dictionary
using
examp5
{
%If
I
byte
a
!
byte
b
_.If
i
byte strl
c gent er
Zlf Xis
strl5
name
%15s
} endof dict5b.dct i
> Example top of examp6.raw line There
I : a heading are a
total
of 4 lines
of heading.
The next line contains a useful heading: ---4+ .... I.... + .... 2 .... + .... 3.... + .... 4 .... +1
0
1
m
Jo_hn Smith
i
0
0
I
m
Paul Lin
i
0 0
i 0
0
f f
Jan Doe Julie McDonald
i i• • !
--
endofexamp6.raw
could be read in by . i_file using
where diCt6a.dct
dict6a
contains mpof dM6a.dct
i _
infi_le didtionary -firstline (5)
_ i
ex_mp6
{
byte byte _col_m_(17)
I i
usin_
-co!kunn(33)
%lf
byte strl
ender
strl5
ame
Y.15s
} endof dict6aAc_
q
11
_
78
_nfile
_
Ouick
_n_
_r
reading
d_ta
ir'_to
Stli_l_
or could bte read in by | • i_ix
5 first a I b 9 c 17 str gender 25 sir name 33-46 using examp6
or could @ read in by • infix using dict6b l where dict6b.dct contains ,_
top of dict6b.dct
infi_ dictionary using examp6 5 fifst a 1
l
b str str
{
9
c
17
gender name
25 33-46
} end ofdict6b.dct
> Example I a b _ gender name I 0 1
top of examp7.raw
John m I Smith
ooI Paul|Lin 010
't
oe
Jan O0
'4 Juli
McDonald
I
end of exampT.raw
could be --r_adin by • in_ile using dictTa
where dic}Ta.dctcontains
---
top of dict7a.dct
infi_ dictionary using examp7 { _firs_line (2) byte a byte b i _linel(2)
byte
c
_line (3)
strl str15
gender name _.15s
} end of dict7a.dct
Or, if you _,'anted to include variable labels: • fl._1 e using dict7b • in
infile -- Quick reference for reading data into Stata
where dictTb,
dct
79
contair mp of dict7b.dct
ififile dictionaryu-cLngexamp7{ __irstline (2) byte "Question 1" byte "question2" byte "Question3" _iine(2) str! mder "Genderof subject" _line (3) strl5
ime
_.15s
} end of dict7b.dct infix
could also read this data: • infix
2 first
3 lines
a i b 3 c 5 str gender
2:1 str name
3:1-15
using
examp7
or itcould be read in by
. :-infixusing
dict7 i
where dictZc.dct contair_s | iz[fixdictionaryusing examp7{ 2 ifirst I a
str str
I
I
b I
3
g der name
2:1 3 :1-15
top of dict7c.dct
end of dictTcAct or it could be read in by
where" i i
s ; i_fix dictionaryusing examp7 { 2 first a I b 3 c 5 / str
g_
1
str
n_
1-15
top of dict7d.dct
/ !
}
i
end of dict7d.dct
AlsoSee Complemental:
Background:
[R]
tit, [R] infile (fixed format),
[R]
_fix (fixed format),
[u]
4 Commands
[R] input,
to input data
[R] infile (free format), [R] insheet
Title _,_
I
infile _fixed format)|
Read ASCII (text) data in fixed format Iwith a dictionary I
]
Syntax infile
using
dfilename
[if exp] [in range]
[, _automatic
If dfilename
is specified
without
an extension,
.dot
is assumed.
;
lf filename2
is specified
without
an extension,
.raw
is assumed.
1
The synta
u_sing(fitename_)
clear
]
for a dictionary, a file created with an editor or word processor outside of Stata, is
[inf
.le]
top of dictionary
file
end of dictionary
file
dictionary [using filename] { * comments may be included freely _irecl(#) _firstlineof file (#) _lines (#) _line (#)
_newline[(') J
I
}
[_.pe]
varname
[:lblname]
['/.infmt]
["variable
labet"]
(you_data_might appearhere)
where ',in(mr is { 7,[#[.#]]{flgle} If using
ill,name /
is not specified,
If using ill, name is specified, extensiofi., raw is assumed.
] X[#]s }
the data are assumed
the data are assumed
to begin
on the line following
to be located
in filename.
the close brace.
If filename
is specified
without
an
Descriptk n infilc using reads data from a disk dataset that is not in Stata format, infile using does this by firs reading dfilename, called a dictionary, that describes the format of the data file, and then reads the ile containing the data. The dictionary is a file you create in an editor or word processor outside of Stata. The da
may be in the same file as the dictionary or in another file.
Anothe_ variation on infile omits the intermediate dictionary; see [R] infile (free format). This variation i_ easier to use. but will not read fixed-format files. On the other hand. although infile using will read free-format files, the variation is even better at it. An alternative to infile using for reading fixed-format files is infix; see [R] infix (fixed format). _nfix provides fewer features than infile using but is easier to use. I Stata his other commands for reading data. If you are not certain that infile are lookinl_ for, see [R] infile and [U] 24 Commands to input data.
80
i
using
is what you
6
•
i
infile (fixed format) -- Read ASCII (text)data in fixed format with a dictionary
81
Options automaticcauses Stata tc create value labels from the nonnumeric data it reads. It also automatically widens the display forn mt to fit the longest label.
i
using(filenamei) specific:s the name of a file containing the data. If using() is not specified, the data are assumed to fc low the dictionary in dfilename or, if the dictionary, specifies the name of Some other file, tha file is assumed to contain the data. If using(filenamei) is specified, fitename2 is used to ob din the data even if the dictionary itself says otherwise.
}! ; i
clear specifies that it is )kay for the new data tO replace what is currently in memory. To ensure that you do not lose solnething important, infi_le using will refuse to read new data if data are already in memory, cl_iar is one way you can tell infile using that it is okay. The other is to drop the data yourself _,_,typing drop _all before reading new data.
Dictionarydirective,, * marl_s comment lines. "_herever you wish to place a comment, begin the line with a *. Comments can !_appearmany times in the same dictionary. _lreci (#) is used only f_r reading da_asets that do not have end-of-litle delimiters (carriage return, line_'eed_or some coml:ination). Such files are often produced by mainframe computers and have bee n poorly translated from EBCDIC into ASCII. _lrecl() specifies the logical record !ength _lrecl()
[ i
requests thai infile
act as if a line ends every # characters.
_l_ecl() appears onl I once. and typically not at all, in a dictionara,. _firsttineoffile(#) _bbreviation ._.first())is also rarely specified. It states the line of the file where the data be_in. _:first() is not specified when the data follow the dictionary: Stata can!fi_ure that out for itself. _first () is instead specified when reading data from another file in _+hich the first line loes not contain data because of headers or other markers. F
_f_rst
() appears onl
once. and typically not at all, in a dictionary.
_line_. (#) states the nu: _ber of lines per observation in the file. Simple datasets typically have _li_aes(i). Large dat;lsets often have many lines (sometimes called records) per observation. _lines() is optional :yen when there is more ihan one line per observation because in:file can isometimes figure il out for itself. Still. if Alines(i) is not right for your data. it is best to spe_ifv the directive. _lines()
i
appears onl' once in a dictionary.
-line(#) tells infile tc .jump to line # of the observation. Distinguish _lines () from _line (). and consider a file with _lines (4). meaning four lines per observation. _line (2) says to go to the Second line of the observation. _line(4) says to go to the fourth line of the observation. You may.jump forward or b_ckward, infile does not care nor is there any inefficiency, in =,,oin_,_. forward to 21ine(a), reading: few variables, jumping back to _line(l), reading another variable, and jumping forward again to _line (3). It is not your responsib lity to ensure that, at the et_d of your dictionary, you are on _he last line of the !observation. infile knows how to get to |he next observation because it knows where you are iand it knows _lin._s(), the total number of lines per observation. _l_ne()
may appear, tnd typically does, many times in a dictionary.
f
I I I
II t
.................
I
.... ,
82 ...newlix goes _new to get
........
,
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary e [(#) ] is an alternative to _line (). _newl ine (1), which may be abbreviated _newline, orward one line. _newline(2) goes forward two lines. We do not reconmlend the use of ine() because _line() is better. If you are currently on line 2 of an observation and want to line 6, you could type _newline (4), but your meaning is clearer if you type _line (6).
_new: ine () may appear many times in a dictionary. _colnmr (#) jumps to column # on the current line. You may jump forward or backward within a line. _ column() may appear many times in a dictionary. _skip (#_) jumps forward # columns on the current line. _skip _skil_ () may appear many times in a dictionary.
() is just an alternative to _column
().
[t)_e] va/_ame [: lblname] [%infint] ["variable label"] instructs inf i le to read a variable. The simplest form f this instruction is the variable name itself: varname. First _nderstand
that at all times infile
is on some column
of some line of an observation.
infi_e starts by being on column 1 of line 1, so pretend that is where we are. Given the simplest directiige 'varname, infile goes through the following logic: Is the.lcurrent column blank? If so, skip forward until there is a nonblank column (or until the end o_the line). If we just skipped all the way to the end of the line, store a missing value in varna/_e. If we skipped to a nonblank column, be_in collecting what is there until we come to a blankbolumn or the end of the line. That is the data for varname, Now set the current column |
to wherever we are. !
The l@ic is a bit more complicated. For instance, when skipping forward to find the data, infile might _ncounter a quote. If so, it then collects the characters for the data by skipping forward until it find_ the matching quote. If you specified a %infmt, in:file skips the skipping-forward step and simpl_ collects the specified number of characters. Nevertheless, the general logic is (optionally) skip, _oltect, and reset. ! I
Remarks i ,!
in_il using follows a two-step process using d_ cript and
i_
1. infil and
to read your data. You type something
like ingile
using reads the file descript.dct, which tells infile about the format of the data:
1
2. infil_
using
then reads the data according to the instructions
recorded
in descript.dct.
descrip_.dct ! (the file could be named anything) is called a dictionary and descript, a text file I _ou create with an editor or word processor outside of Stata. As forithe data themselves, does not aatter.
they can be in the same file as the dictionary
det is just
or in a different file. It
Readingfi ee-formatfiles 1
' ;, I
Another variation of infile for reading free-format files is described in [R] infile (free format). We will _efer to the variation as infile without a dictionary. The distinction between the two variations is in the treatment of line breaks, infile without a dictionary does not consider them significant, infile with a dictionary does.
infile(fixedformat) -- ReadASCII(tAx!)data in fixed formatwith a dictionary
83
A' line, also known a', a record, physical record, or physical line (as opposed to observations, or logical records, or logica lines), is a string of characters followed by the line terminator. If you were to type the file, a line is _,hatwould appear on y0ur screen if your screen were infinitely wide. Your screen would have to be infinitely wide so that th_erewould be no possibility that a single line could take more than one line 9f 3,our screen, thus fooling you into thinking there are multiple lines when I
i i
i
i.
i
!
i
i
therei is only one. A logical line, on the other hand, is a sequence of one or more physical lines that represents a singl+ observation of yoar data. infile with a dictionary does not willy-nilly go to new physical linesi it goes to a new lJne between obser_,ations and it goes to a new line when you tell it to, but that is all. infile without a dictionary, on the other hand, goes to a new line whenever it needs to, which can be right in tl-e middIe of an observation. Thus, consider the following little bit of data which, we will tell you, is for three variables: 54 193 2
How do you interpret th,_se data? H, re is one interpretation: There are three observations. The first is 5, 4. and missing. The second is 1, 9, and 3. The thirc is 2. missing, and missing. That is the interpretation that ±nfile with a dictionary makes. H,re is another interp etation: There are two observations. The first is 5, 4, and 1. The second is 9, 3, and 2. That is the interpretation that infile without a dictionary makes. W_aichis right? We would have to ask the person who entered these data. The question is. are the line t_reaks significant? )o they mean anything? It' the line breaks are significant, we use infile with a dictionary. If the line breaks are not significant, we use infile without a dictionary The other distinction _etween the two infiles is that infile with a dictionary does not process comma-separated-value ormat. If your data are c0mma-separated, see [R] infile (free format) or [R] Msheet.
Example Omside of Stata yout ave typed into the file -,-highway"dct information on the accident rate per million vehicle miles aldng a stretch of highwav_ the speed limit on that highway, and the number of access points (on-ram_s and off-ramps) per rnile, Your file contains [
' "
top ofhigh_ay.dct,,example]
infile dictionary i{ acc rate Ispdlimit acc_pts
} 4.58 55 4.6 2.86 6O 4.4 1.61 . 2.2 3.02 60 4.7
endof highway.dct,examplet This file can be read by yping infile data: •
using
infile using hil;hvay
infile dictionary 1{ acc_rate Ispdlimit acc_pts }(4observations r4ad) |
!
highway.
Stata displays the dictionary and reads the
1 ...........
lnfile
J ............. 84
| st . II !
.................. (fixed format) -- Read ASCII (text) data in fixed format with a dictionary
ace_rate
I. 2 I. 3_. 4_
. '_ _
4.58 2.86 1.61 3.02
spdlimit 55 60 60
acc_pts 4.6 4.4 2.2 4.7
.Z
(> Example We ca: include variable labels in a dictionary so that after we infile the data, the data will be fully labe ed. We could change highway, dot to read i
"
;
I : i I ;
.
top of highway.dct,
example 2
end of highway.dct,
example 2
inf: le dictionary { * T]is is a comment and will be ignored by Stata * Y,u might type the source of the data here. acc_rate "Acc. Rate/Million Miles"
} 4.5_ 2.8_ 1.6_ 3.0"-
spdlimit acc_pts
"Speed Limit (mph)" "Access Pts/Mile"
55 4.6 60 4.4 . 2.2 60 4.7
Now wheJb we type infile
using
highway,
Stata not only reads the data but labels the variables.
<1 I > Example
l
We caniindicate the variable types in the dictionary. For instance, if we wanted to store acc_rate as a doub..e and spdlimit as a byte, we could change highway.dct to read top of highway.dct,
example 3
infi .e dictionary { * Th s is a comment and will be ignored by Stata * dm Yo imight type the source of the data here. le acc_rate "Acc. Rate/Million Miles" byt@ !
spdlimit acc_pts
"Speed Limit (mph)" "Access Pts/Mile"
2.861 60 4.4 1.61 . 2.2 3.02
60 4.7
Since we c 3 not indicate the variable type for acc_pts, (or the typ, specified by the set type command).
end of highway.dct
example 3
it is given the default variable type float
<1
"
tnfile (fixed for at) -- Read ASCII (text) data in fixed format with a dictionary
85
Example i
By specifyingthe typ,is, we can read stringvariablesas well as numericvariables.For instance, +
topof emp.dct
iinfile dictionary '
• data on employe( str20 name
I
int
age sex
"Name" "Age" "Sex coded
0 male
I female"
} I
!"Lisa Gilmore"
25
!Branten 32 1 "Bill Ross"
27 0 end of" emp.dct
The stringscan be delimiLedby singleor doublequotesand quotesmay be omittedaltogetherif the string _:contains no blanks or other special characters,
q t [3Technical Note You may attach value abels to variables in the dictionary using the colon notation: _opof emp2.dct infile
dictionary-
data on name, strl6 name i
se: , and age "Name"
sex: sexlbl int age
"Sex" "Age"
} _Arthur Doyle" Malt 22 _Mary Hope" Female 37 #Guy Fawkes" Male _ 8 #Sherry
Crooks"
Fel ale 25 end of emp2.dct
i
If you _want the value labels to be created automatically, you must specify the automatic option on the infile command. Tl-ese data could be read by typing infile using person2, automatic assuming the dictionary alld data are stored in the file person2.dct.
i
J
}
I t
1
i
_"Example The data need not be n the same file as the dictionary. We might leave the highway data in highway.raw and write dictionary called highway.dct describing it: topof highway.dct,example4 infile
dictionary
u _ing highway
* This
dictionary
r._ads the file highway.raw.
* file
were
* read
"dictionary ace_rate
called
spdlimit
}
aee_pts
lighway.txt,
{ the first
If the
line would
1sing highway.txt" _cc. Rate/Million Miles!' Speed " ccess
Limit
(mph)"
Pts/Mile" --
end of highway.dcl,example 4
I
86 intrile(fixed format) -- Read ASCII (text) data in fixed format with a dictionary _, Example The fir_tlineoffile following rakv dataset:
() directive allows you to ignore lines at the top of the file. Consider the top of mydata.raw
The f( flowing data was entered by Marsha Martinez. Helen troy. id in( _me educ sex age 1024 5000 HS Male 28 1025 27000 C Female 24
It was checked by
end of mydata.raw
Your diction1 u'y might read top of mydata.dct infil_ dictionary using mydata { _first (4) int id "Identification Number" income "Annual income" str2 educ "Highest educ level" str6 sex byte age
} end of mydata.dct
q
1 i
> Example The _lir_e () and _lines () directives instruc! Stata how to read your data when there are multiple records per _Sbservation. You have the following in mydata2 .raw: ri
top of mvdata2.raw
id incpme educ sex age 1024 2_000 HS Male | 28 1025 2_7000 C Femalei 1035 2 000 HS Male 32 1036 25000 C Female 25
1
You can read this with a dictionary mydata2, reads the daia: • infi_e using mydata2, clear
end of mydata2.raw
dct, which we will just let Stata list as it simultaneously
i infile(fixed formal -- Read ASCII (text) data in fixed format with a dictionary
87
z
in_ile dictionary usiag mydata2 { _first(2) _lines(3) int id "Identific ttion Number" income "Annual in :ome" sir2 educ "Highes _line(2) sir6 sex _line(3)
* * * *
Begin reading on line 2 Each obbervatiOn takes 3 lines. Since __ine is not specified, Stata assumes that it is I.
educ level" * Go to line 2 of the observation. * (values for sex are located on line 2) * Go to llne 3 of the observation. * (values for age are located on line 3)
int age
} (4 bbservations read)
. list Ii 2! 3_
id 1024 1025 1035
inc(ime 251)00 27_i00 26_I00
4i
1036
25( 00
Now, here is the really good ii
could jus( as wdll have
educ _S C HS
sex Male Female Male
age 28 24 32
C
Female
25
art: We read these variables in order but that was not necessaD_. We
usedrhedictionary:
top of mydata2p.dct
inf_le dictionary using mydata2 { _first (2) _lines (3) _line (1)
int
id "Identification number" income "_ual income"
_line(3) _line(2)
sti int st_
educ age sex
"Highest educ level"
} end of mydata2p.dct
We would obtain the same re_ults--and just as quickly--the only difference being that our variables in the fin_ dataset would be n the order specified: id, income, educ, age, and sex. q
Technical!Note i.
You can use _newline tO specify where breaks occur, if you prefer: Z
........ i
i
!_
topof highway.dct,example5--
infile dictionary { acc_rate "Ac :. Rate/Million Miles" spdlimit "S )eed Limit (mph)"
>
_newline acc_pts
"Ac :essPts/Mile"
4.58 55 4.6 2.861 60 4.4 1.61. 2.2 3.02 i 60 4.7 end of highway.dct, example 5
The line th)at reads '1. 61 .' ould have been read 1.61 (without the period), and the results would have been unchanged. Since _tictionaries do not go to new lines automatically, a missing value is assumed for all values not foulnd in the record.
!
88 ---
i_file (fixed format) -- Read ASCII (text) data in fixed format with a dictionary =
]
Readingfied-format files Values _n formatted data are sometimes packed one against the other with no intervening For instande, the highway data might appear as I top of highway.raw,example 6
,,
:',
blanks.
4.58_54.6
2.86 04.4 1.61| 2.2 3.02604.7
":
end of highway.raw,example6 The first f_ur columns of each record represent the accident rate; the next two columns, the speed limit; and _he last three columns, the number of access points per mile. To read:: these data, you must specify the %infmt in the dictionary. Numeric Y,infints are denoted by a leadir_g percent sign (%) followed optionally by a string of the form w or w.d, where w and d stand for @o integers. The first integer, w, specifies the width of the format. The second integer, d, specifies ti_enumber of digits that are to follow the decimal point. Logic requires that d be less than or equal tqw. Finally, a character denoting the format type (f, g, or e) is appended. For example, %9.2f spe_zifies an f format that is nine characters wide and has two digits following the decimal point.
Numericformats The f f_rmat indicates that infile is to attempt to read the data as a number. When you do not specify th_%infmt in the dictionary, infile assumes the %f format. The missing width w means that infille is to attempt to read the data in free format. At the _mrt of each observation, to 1, indic moves the occurrence is left at tl
infile
reads a record into its buffer and sets a column pointer
ating that it is currently on the first column. When infile processes a %f format, it "olumn pointer forward through white space. It then collects the characters up to the next of white space and attempts to interpret those characters as a number. The column pointer e first occurrence of white space following those characters, If the next variable is also
free forma I, the logic repeats. When ypu space. Instead, the result @ a is, on the first
explicitly specify the field width w, as in %wf, infile does not skip leading white it collects the next w characters starting at the column pointer and attempts to interpret number. The column pointer is left at the old value of the column pointer plus w, that character following the specified field.
Example If the d tta above are stored in highway, the data:
raw, you could create the following
infi e dictionary using highway { acc_rate Y,4f "Acc. Rate/Million spdlimit acc_pts
dictionary to read
top of highway.dct,
example 6
end of highwa?.dct,
example 6
Miles"
Y,2f "Speed Limit (mph)" Y,3f "Access Pts/Mile
} 1
Wh:ncolu_s you explicitly field width, not skip intervening and characters. The first are usedindicate for the the variable ace_rate,infile the does next two for spdlim-it, the last three for acc_pts.
<1
i
•
Q
|
....
infile (fixed format,l-- Read ASCII (text) data in fixed format with a dictionary
89
The d specification in the i'/,w.df indicates the number of implied decimal places in the data. For Technica_ Note instance, the string 212 read tin a _3.2f
format repre_ems the number 2.12. You should not specifv
d unless _¢ourdata have ele@nts of this form. The w alone is sufficient to tell ±nfile
how to read
i '_ i
data in which the decimal P_]int is explicitly indicated" When iyou specify d, it is taken on13 as a su_,_estion. If the decimal point is explicitly indicated in the data, ihat decimal point a_wa3s m errides the d specification. Decimal points are also not implied
t 1 I
if the data contain an E, e, I], or d, indicating scientific notation. Fields i!are right-just, fled Otefore lmptymo dec,mal points. Thus, as 0 2 by the _3. If format.
I
2
,
2 . and
2 are all read
a TechnicalNote The g and e formats are the same as the f format. YoUCanspecify any of these letters interchangeably. The letters g and e are inch ided as a convenience to those familiar with Fortran. In Fortran. the e format i_icates scientific n, .ration. For example, the number 250 could be indicated as 2,5E+02 or 2.5D402. Fortran prograr Imers would refer to this as an ET. 5 format, and in Stata. this format would _ indicated as 7'7.5. _. In Stata. however, you need specify only the field width w. so you could react this number usin 7'7f, 7,7g. or '/,7e.
i ! i I
The gi format is really a :or_ran output format that indicates a freer format than f. In Stata. the two formtats are identical. !
i ! i i
i !
Throughout this section,
a
Technical
ou may freely substitute the g or e formats for the f format.
Note
Be careful to distinguish b__tween'/,tints and '/,infints: '/,tints are also known as display formats--_hey describe how a variable is :o look when it is outputted; see [u] 15.5 Formats: controlling how data are!displayed. 7,ilTfi,ts are also known as input formats--they describe how a variable looks when it _s inputted. For instance, there is an output date format 7,d, but there is no corresponding input format. (See [U] 27 C, remands for dealing Mth dates for recommendations on how to read dates.) Fbr the other formats we have attempted to make the input and output definitions as similar as possible. Thus. we includ g. e. and f }',i_!fints,even though they all mean the same thing, since g, e, and f are also '/,tints.
String formats The s format is for read ng strings. The syntax is gu:s where the w is optional. If you do not specify the field width, your strings must be enclosed in quotes (single or double) or the_ must not contain Nny characters other than letters, numbers, and '_. This may surprise you, ,ut the s format can be: used for reading numeric variables and the f tbrmat c_n be used for rea ring string variables! When you specify the field width u, in the '/,'u,f format, all embedded blank., in the field are removed before the result is interpreted: They are not removed by the Xws /ormat
90
' ,I
_nfile(fixed format) -- Read ASCII (text) data in fixed format with a dictionary
For instance, the _3f format would read '- 2', '-2 ', or ' -2' as the number -2. The _3s format would notl be able to read '- 2' as a number, since the sign is separated from the digit, but it could read ' -2" or '-2 '. The %wf format removes blanks; datasets written by some FORTRAN programs separate the sign from the number. There ge, however, some slde-effects of this practice. The stnng 2 2 will be read as 22 by %3f format. Most FORTRAN compilers would read this number as 202. The %3s format would issue a wamingland store a missing value. Now c6nsider reading the string 'a b' into a string variable. Using a Z3s format, it will store as it appears t a b. Using a Y,3f format, however, it wilt be stored as ab--the middle blank will be removed. I Examples using the Xs format are provided line numbers.
below, right after we discuss specifying column and
|
Specifying column and line numbers _colu_() jumps to the specified column. For instance, the documentation of some dataset indicates that the variable age is recorded as a 2-digit number in column 47. You could read this by coding
I
_column(47)
age Y.2f
After this,i you are now at column recording _ex as 0 or 1,
you
[
_column(47)
49, so if immediately
_column(47) _column(49)
I
were a 1-digit number
age Y.2f sex Y.lf
could instead code
age Y.2f sex Y, lf
It makes np difference. If at column 50 were a 1-digit code for race, skip readirlg the sex code, you could code _column(47)
age
could code
or, if you tvanted to be explicit about it, you I I
following
and you wanted to read it but
age Y,2f
column(50) race Y, lf
You couldlequivalently
skip forward using _skip ():
1
_colunm(47)
age _2f
I
_skip(1)
race Zlf
One advar_tage of column() over _skip is that it lets you jump forward or backward in a record. |
If you war,ted to read race
and then age, you could code
_column(50) race Y, lf _column(47) age Y,2f
If the d tta you are reading have multiple lines per observation (sometimes said as multiple records per observ _tion), tell infile how many lines per record there are using _lines (): _ lines (4)
_lines () appears only ()nee in a dictionary. Good style says it should be placed near the top of the dictionary, but Stata does not care.
;infile (fixed format -- Read ASCII (text) d_ta in fixed format with a dictionary
91
When you want to go to a particular line, includb the _line() directive. In our example, let's assume race, sex, and age are recorded on the second line of each observation: _lines(4) _line (2) _column(47)
a_e Y,2f
_column(50)
Tce
}'.If
Let's assume id is recordedlon line 1. 1 lines (4)
|
_line(l) i I
i
_column(I) _line(2)
d
Y,4f
Y,tf _co1_(47) ace goX2_ _column(50)
_line() works like _colu as well be read by
m() in that you can jump forward or backwardl so these data could just
_lines(4) _line(2) _column(47) _column(S0)
_ge %2f ,race %If
_line (I) _colnmn(1)
_d
7,4f
Remember that this dataset aas 4 lines per observatibn and yet we have never referred to line (3) or line(4). That is okay./Jso note that, at the endof our dictionary, we are on line t. not 4. That is okay, loo, infile will stll get to the next observation correctly.
E3TechnicalNote
i l.
Anotl!er way to move bet' _een records'is _newline ().._newline () is to _line () as _skip () is Io _column(), which is to say, ..mewline () can only go forward. There is one difference: _skip() has its uses; ._.newline () L' useful only for backward capability with older versions of Stata. _skip()has its uses bec Lusesometimes one thinks in terms of columns and sometimes one thinks in tern'ts:of widths. Some d Ia documentation might very well include the sentence "At column 54 are recorded the answers to the 25 questions, one column alloted to each." If we want to read the answers io questions 1 and 5, it would indeed be na_tutalto code _column(54)
tl
_.If
_skip(3)
is %1_ [
i
Nobody has ever read data _ocumentation with the siatement, "Demographics are recorded on record 2and, 2 records after that, _re the income " values. " The ' " documentanon " would instead " " sa3,' " Record ,_ '_ contains the demographic ir formation and record 4, irJcome." The _newline() way of thinking Is based on what is convenien for the computer which does, after all, have to eject a certain number of records. That, however, no reason for making yot_ think that way.
i
Before that thought occ fred to us, Stata users specified _mewline() to go forward records. They stiil can, so their old [ictionaries will work. When you use _neTaline() and do not speci_ _ -lines(), it is your respoasibility to eject the right number of records so that, at the end of the dictionary, you are on the last record. In'this mode. when Stata re-executes the dictionary to process
i
the next iobservation, it doe: forward one record.
I
!
[
...................
B
I
...........
92
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary
'i':
Example of reading fixed-format files
:
> Example In thisIexample, "i each observation
i
Joh_ Dunbar 101 111111
P
i.
Sam! g. Newey,
i'
OlObOOOOO
!:
occupies two lines. The first two observations
Jr.
10001
101
North
10002
15663
42nd
in the dataset are
Street
Roustabout
Boulevard
The first )bservation tells us that the name of the respondent is John Dunbar; his id is 10001; his address i., 101 North 42nd Street; and that his answers to questions 1 through 10 are yes, no, yes, no, yes, _s, yes, yes, yes, and yes. The sei:ond observation tells us the name of the respondent is Sam K. Newey, Jr.; his id is 10002; his addre_ is 15663 Roustabout Boulevard; and that his answers to questions l through 10 were no, yes, no, _s, no, no, no, no, no, and no. (Probably John and Sam are not best friends.) In ord r to see the layout within the file, we can temporarily the appro!briate columns: ....
+ ....
Jol_
I ....
+ ....
2....
+ ....
3 ....
Dunbar
add two rulers to help our eyes see
+ ....
4 ....
+ ....
5 ....
+ ....
i0001
I01
North
42nd
Street
6 ....
10002
15663
+ ....
7 ....
+ ....
8
+ ....
7 ....
+ ....
8
lOlq 111111 Sam 010 ....
K. Newey, 000000 + ....
1....
Jr. + ....
2 ....
+ ....
3 ....
+ ....
4 ....
Roustabout + ....
5 ....
Boulevard + ....
6 ....
Each observation in the data appears in two physical lines within our text file. We had to check in our editorlto be sure that there really were newline characters (i.e., "hard returns") after the address. This is irr_ortant because some programs will wrap output for you and a single line may appear as many line_. The two seemingly identical files will differ in that one has a hard return and the other has a soft ireturn added only for display purposes. In our tidata, the name occupies columns 1 through 32; a person identifier occupies columns 33 through 37; and the address occupies columns 40 through 80. Our worksheet revealed that the widest address erred
in column 80.
The teat file containing these data is called fname, txt.
Our dictionary
file looks like
t
infi t e dictionary , I * Example * th_n
one
using
reading line.
fname.txt
in data The
where
next
line
{
top of fname.dct
observations
extend
tells
there
infile
across are
more
2 lines/obs:
_lin ,s(2)
,
_col mm(33) _ski _(2)
sir50
name
Z32s
"Name
long str50
id addr
Y,Sf Y,41s
"Person id" "Address"
of respondent"
_lin :(2) _col mm(1)
byte
ql
ZIg
"Question
!"
I
b}_te
q2
Y,lf
"Question
2"
i If
1
byte byte byte
q3 q4 q5
7,1f Y,lf Zlf
"Question "Question "Question
3" 4" 5"
byte
q6
Y,lf
"Question
6"
, I}
I"
7
infile (fixed form ) -- Read ASCII (text) data in fixed format with a dictionary byte
q8
%if
"Durst ion 8"
byte
q9
%If
"Du_stion
9"
byte
qlO
%If
"Question
I0"
93
} end of fname.dct
i
Up t6 five pieces of information may be supplied in the dictional3, for each variable: the location of the data, the storage tylie of the variable, the name of the variable, the input format, and the variable!label. !
i ! !
Thusl the str50 line sa_s that the first, variable, is to be given a.,, storage type of str50, should be called name, and have the '_ariable label Name of respondent. The %32s is the input format--this tells Stata how to read the _lata. The s tells Stata not to remove any embedded blanks: the 32 tells
li
Stata to go across 32 colur_ns when reading the data. The next line says that the second -,ariable- is to be assigned a storage type of long, named id, and labeled "Person id". Stata should start reading the information for this variable in column 33 The f tells Stata to remove any embedded blanks, and the 5 says to read across 5 columns.
I i • i i
The third variable is to De given a storage type of str50, called add.r, and labeled "Address". The _skip(2) directs Stat_ to skip 2 columns bef6re beginning to read the data for this variable, and the g4J.s instructs Stat_ to read across 41 colurnng and to not remove embedded blanks. line::(2)
instructs Stata o go to line 2 of the observation.
The remainder of the dzta is 0/1 coded--the answers to the questions. It would be convenient if we could use a shorthan to specify this portion of the dictionary, but we must supply explicit directives q
'3TechnicalNote i ! i
ii ! !
i
i
In the preceding exampl._, there were two pieces of information about location: where the data begin fo_ each variable (the _column(), _skip(), ,line ()) and how many columns the data span (the %32s, %5f, %41s, %lf). In our dictionary, some of this information was redundant. After reading name, Siata had finished with 32 columns of information. Unless instructed otherwise, Stata would _roceed to the next column--column 33--to begin reading information about id. The _column (33) was unnecessary. The _skip(2) was not.• however, unnecessary. Stata had read 37 columns of information and was ready to look at columl 38. Although the address information does not begin until column 40, columns "_8and 39 contain )lanks. Since these are leading blanks, instead of embedded blanks. $tata would jtist ignore them. Th .'re is no problem so far. The problem is with the %4is. If Stata begins reading _he address informa ion from column 38 and reads 41 columns, Stata would stop reading in column 78 (78 - 41 + 1 = 38), and the widest add?ess ends in column 80. We could have omitted the -skip(2) if we had specified an input format of X43s. The __l±ne(2) was necessary although we could have gotten to the second line by coding --newlineinstead. The _column(1)
could _ave been omitted. Afte_ the _line(),
Stata begins in column 1
See the following examp e for a dataset where both pieces of location infomaation are required.
r t
! mt.e (rlxeamrmatj -- Heaa ASCII (text) data in fixed format with a dictionary
7
D Example! The llowing file contains six variables in a variety of formats. Note that in the dictionary we read the _,ariables fifth and sixth out of order by forcing the column pointer. _
'r
topofexample.dct
in ile dictionary { -_I i double i ! i:
i .skip(2) ,column(21) ,_column(18)
i'_ _,
first second third
str4
%3f Zi.lf %6f
fourth %4s sixth 7,4.If fifth Y, if
1.2 L25.Te+252abcd 1 .232 1.3:35,7 52efgh2 5 1.4.457 52abcd 3 100. 1.5 155.TD+252efgh04 1.7 16 6 .57 52abed 5 t.71 [
end of example.dot
l
Assumin_| the above is stored in a file called example, dct, it can be infiled and listed by typing • i_file using example infile dictionary { i ! double i.skip(2)
str4
first second third fourth
7,Sf 7,2.if 7,6f
7,4s
sixth 7,4. If fifth _,2f
i.column(21) i_column(18)
} (5 _bservations
read)
list first i.2
second i.2
third 570
fourth abcd
sixth .232
fifth 1
2;{ 3 .i
1.3 I,4
1.3 I.4
5.7 57
efgh abcd
.5 i00
2 3
4.i 5.i
i.5 16.
i.5 1.6
570 .57
efgh abcd
I.7 1.71
4 5
1.
q {
Reading fi_(ed-block files
u Technical_ote The _l_'ecl (#) directive is used for reading datasets that do not have end,of-line delimiters (carriage return, lindfeed, or some combination): Such datasets are typical of IBM mainframes--where they are known as fixed block or FB. The word LRECLis IBM-mainframe jargon for logical record length. Fixed'block datasets are datasets where each # characters are to be interpreted as a record. For instance, @nsider the data 2 3 63
infile (fixed fo mat) -- Read ASCII (te_) data in fixed format with a dictionary'
95
In fixed-block format, tlq.'se data might be recorded top of mydata.ibm 1
212 423 63 end of mvdata.ibm
!
and you wouldbe told, m the side, thatthe LRECL is 4. If you then pass along that informationto inside,
it will be able
read the data: top of mydata.dct
infile dictionary using mydata.ibm { _Irecl(4)] int i_ int
}
a_e end of mydata.dct
When you do not spe( iCythe _lre¢l(#)
directive, in:file
i
assumes that each line ends with the
standardASCII delimiter whichcan be linefeed0r carriagereturn or linefeedfollowedby carriage return or carriage return 9llowed by finefeed). When you do specify _1reel data ih blocks of # char_ :ters and then acts as if that is a line.
i
(#), infile
reads the
A i:ornmon mistake ir processing fixed-block datasets is to be incorrect about the LRECL value, for instance, thinking the I.RECLis I60 when it is really 80. To understand what can happen, pretend we thought the LRECLin )ur data was 6 rather than 4, Taking the characters in groups of 6, the data
i
appearas 212
i
423 63 Stata has no way of verify ng that you have specifi_ the correct LRECLso, if the data appear incorrect,
verifyiyouhave the corre :t number. ThemaximumLRECLnfile allowsis 18,998withStatafor Unix,7,998with StataforWindows, and 3.998with Statafor /lacintosh.
References Gleason:, J. R. 1998. dm54: C tpturing comments from data dictionaries. in Siata Technical Bulletin Reprints, vol. 7. pp. 55-57. f
Stata Technical Bulletin 42: 3-4.
Gould, "iV,W. 1992. dml0: Inf ling data: Automatic dictionary' creation. Stata Technical Bulletin 9: 4-8. Stata Technical Bulletin Re_rints, vol. 2, pp. 28-34. Nash. J. D. 1994. dml9:
Me
4
ng raw data and dictionary, files. Stata Technical Bulletin 20: 3-5.
Technical Bulletin Reprints 1 vol. 4, pp. 22-25.
| AlsoSee Compl_mentary:
[R] utfile, [R] outsheet. [R] save
Related:
[R] afix (fixed format)
Background:
[u] ;4 Commands to input data, [R] afile
Reprinted
Reprinted in
Reprinted
in Stata
....................
i
Title '
...................
i
I infile ]i free format)-
Read unformatted ASCII (text)data [
i
]
Z'
Syntax ingil_
!i
]
varlist ['skip[(#)]
[varlist [_skip[(#)].
,.]I]
using filename [if exp] [in range]
If filename is specified without an extension, .raw is assumed
Descriptiln i
infil_reads into memory a disk dataset that is not in Stata format. Here _x!ediscuss using infile to read free-format data, meaning datasets where the knowledge of the forr_atting information is not necessary to make sense of them. Another variation on infile allows rea_ing fixed-format data; see [R] intile (fixed format). Yet another alternative is insheet, which is dasier to use if your data are tab- or comma, separated and contains one observation per line. Statalhas other commands for reading data, too. If you are not certain that infile is what you are lookin_ for, see IN] infile and [U] 24 Commands to input data, After tl_edata are read into Stata, the data can be saved as a Stata-format dataset; see [R] save.
Options
i
automati_causes Stata to create value labels from the nonnumeric data it reads. It also automatically widens the display format to fit the longest label. yvariable (#) specifies that the external data file is organized by variables rather than by observations. All the bbservations on the first variable appear, followed by all the observations on the second vanable_/ and so on. Time-series datasets sometimes come in this format 1
clear
spe}ifies that it is okay for the new data to replace what is currently in memory. To ensure
that youido not lose something important, infile will refuse to read new data if data are already in mem;ry, clear is one wa} you can tell infile that it is okay The other is to drop the data yourself_by typing drop _all before reading new data. i
Remarks infile_or,at least, the infilefeatures discussed here value forrn_tt.
reads data in free or comma-separated-
Remarkl are presented under the headings I I , ! t t
i 1
I
Reading free format data Reading comma-separated data Specifying variable types Reading string variables Skipping variables Skipping observations Reading time-series data
96
i I
infile (free formal -,--Read unformattedASCII (text) data
Readi
97
free format cmta
In ifree format, data • e separated by one or more white-space characters. White-space characters are blanks, tabs, and ne vlines (carriage return, linefeed, or carriage-retum_inefeed combinations). Thus. a single observatic may span any number:of lines. Numeric missing valu
Example
are indicated by single periods ('.').
J
In the file highway, r_w, you have information oft the accident rate per million vehicle miles along a stretch of highway, thelsoeed limit on that highway, and the number of access points (on-ramps and off-ramps) per mile.---Ifour file contains top of highway:raw, example1 _.58 55 4.6 2.86 60 4.4 i.61. 2.2
3.026o
4.7' endof highway•raw, example1 !
You can read these data by typing i infile ace_rate _ pdlimit {4 observations re_d)
acc_pts
using
highway
list I. 2. 3. 4.
acc_rate 4.58 2.86 1.61 3.02
s dlimit 55 60 60
acc.pts 4,6 4,4 2.2 4.7
Note that the spacing of tl e numbers in the original file is irrelevant.
[3TechniCalNote It isl not necessary that missing values be indicated by a single period. The third observation on the speed limit is missing in the previous example• The raw data file indicates this by recording a single period. Let's assume that instead the missing value was indicated by the word unknown. Thus, the raw data file appears a:
1
---
_ 4_58 55 4.6 2i86 60 4.4 1161 unknown 3 i02 60
top ofhighwayyaw,example22.2
4i7 endof highway.raw, example2 :
Here is the result of infilin
these data:
• _lnfite ace_rate s] llimit acc_pts using h_gh_ay ""/nk_ows" cannot be read as a number for spdlimit[3] (_ observations read
![
l
[1
98
iqfile (free format) m Read unformatted ASCII (text) data
infile wa_ed us that it did not know what to make of the word unknown, stored a missing, and then continhed to read the rest of the dataset. Thus, aside from the warning message, results are unchanged. ! Since not all packages indicate missing data in the same way, this feature can be useful when reading dat_ created by them. Whenever infile sees something it does not understand, it warns you, record_ a missing, and continues. If, on the other hand, the missing value were recorded not as unknown Nit as. say, 99. Stata would have no difficulty reading the number, but it would also store 99 rather than r_ssing. To convert such coded missing values to true missing values, see [R] mvencode.
i' .
[]
]
t
Reading comma-separated data In co_m_a-separated-value are are separated by either commas. You may intermix separated_'llue and free format.format, Missingdatavalues indicated bv single periods or by commamultiple commas which serve as place holders, or both. As with free format, a single observation may span any numbed of input lines.
Example We can imodify the format of highway, raw used in the previous example without affecting infile's ability to read it. The dataset can be read with the same command and the results would be the samd if the file instead contains 1
--
--
top of highway,raw. example 3
2.86, 4.58,1560,4.4 4.6 1.61, 12.2 3.02,_0 4.7 end of highway.raw,
example 3
<1
Specifying ,ariable types The variable names you type following the word infile are new variables. variable is
The syntax for a new
[type] new_varname[Llabel_name] A full discuJ;sion of this syntax can be found in [U] 14.4 varlists. As a quick review, new variables are, by defaldt, of type :float. This default can be overridden by preceding the variable name with a storage ty ,e (byte, int, long, float, double, or str#) or by using the set type command. A list of varia: _les placed in parentheses will be given the same type. For example. double_
_rst_var second_var...
causes first, var second_var _:
Ii
Iast_var)
... tast_var to all be of type double.
There is _lso a shorthand syntax for variable names with numeric suffixes. The varlist varl-var4 is equivalent to specifying varl vat2 var3 vat4.
I
,
infile (free format) -- Read unformattedASCII (text) data
99
> Example In!the highway exam _le, we could infile i
the data acc__rate,
spdtimit,
and acc_pts
and
force the variable spdli:Litto be of type int by typing •
infile
acc_rate
i(4 observations
int spdlimit
ace_pts
uaing
highway,
clear
r_ad)
We could force all vafia]:les to be of type double by typing • infile double(acc_rate (4 observations read)
spdlimit
aec_pts)
using
highway,
clear
We could call the three _ariables vl, v2, and v3 and make them all doubles bv typing i.
infile
double(vt-v3)
!(4 observations
using
highway,
clear
read)
q
Reading string variables B} explicitly specifyir g the types, we can read string variables as well as numeric variables.
Example Typing infile
str2(
; "Sherri Holliday" !Branton 32 1 "Bill
Ross"
name age sex using
myfile
would read top of myfile,raw
25 1
27,0
topof myfile.raw or even topof myfile.raw,variation2 "Sherri l,'Bill
golliday" 25,1 Ross', 27,
32
"Branton"
end of myfile.raw,
i
variation
2
Note ihat the spacing is i Televant and either single or double quotes may be used to delimit strings. The quotes_do not count when calculating the leng|h of strings, Quotes may be omitted altogether if the 'string contains no )lanks or other special characters (anything other than letters, numbers, or undergcores). Typing • infile str20 nam_ age sex using (3 observations re _d)
makes
name
a str20
an(
age
and
sex
myfile
floats.
• infile sir20 nam_ age int sex using (3 observations re Ld)
We
might have typed
myrtle
tomake sex an int or • infile striO nam._ int(age (3 observations re _d)
to make both age and sex ints.
seX) using
myfile
d
! ;_
100
infile (free format) -- Read unformatted ASCII (text) data
13Technical Note infile ican also handle nonnumeric data by using value labels. We will briefly review value labels, but you should see [U] 15.6.3 Value labels for a complete description. A value hbel
is a mapping from the set of integers to words. For instance, if you had a variable
J
called sex in your data that represented the sex of the individual, you might code 0 for mate and 1 for female. Yofi could then just remember that every time you see a value of 0 for sex, that observation refers to a male, whereas 1 refers to a female. Even better, you could inform State that 0 represents males and 1 represents lab_l
define
sexfmt
0 "Male"
females by typing
1 "Female"
Then you must tell State that this coding scheme is to be associated with the variable sex. This is typically ddne by typing • lab_l
values
sex
sex,mr
Thereafter, State will print Male rather than 0 and Female rather than 1 for this variable. State is :unique in that it has the ability to turn a value label around. Not only can it go from numeric c@es to words like "Male" and "Female", it can go from the words to the numeric code. We tell infite which value label goes with which variable by placing a colon (:) after the variable name and ts{ping the name of the value label. Before we do that, we use the label to inform S_tataof the coding.
define
command
Let's assume that we wish to infile a dataset containing the words Male and Female and that we wish to store numeric codes rather than the strings themselves. This will result in considerable data compression, especially if we store the numeric code as a byte. We have a dataset named persons .raw that contains name, sex, and age: top of persons.raw "ArthUrDoyle"Male 22 "Mary! Hope"Female37 "GuyFawkes"Male48 "Carrke
House"
Female
25 end of persons.raw
Here is hoW we read and encode it at the same time: label inf_le
define str16
sexfmt name
(4 observatlons
0 "Male" sex:sexfmt
i "Female" age using
persons
read)
list
i.
name
sex
age
Doyle
Male
22
Hope
Female
37
Guy Fawkes Carrie House
Male Female
48 25
Arthur
2.
Mary
3. 4.
The strl61in
the infile
command applies
only to the name variable: sex is a numeric variable.
as we can Orove by 1
lis_,
1. 2. 3. 4.
nolabel name
sex
age
Doyle
0
22
Mary Hope Guy Fawkes
1 0
37 48
1
25
Arthur
Carrie
House
_1
!
Int.e (_
tormm)-- -eaaumormmmaA:su. _jext)ama
,u,
D Technidal Note
i l
When infile is direct_d to use a value label arid it finds an entry in the file that does not match any ofithe codings record ;d in the label, it prints a warning message and stores missing for the observation. By specifyin_ the automatic option,you can instead have infileautomatically add entries to the value la _el, new Say!you have a dataset containing three variables. The first, region of the countr7, is a character string; the remaining two eariables, which we will just call varl and vat2, contain numbers. You have stored the data in a le called geog. raw: top of geog.raw
l
'_NE" _NCntrl"
31.23 29.52
South West
29.62 28.28
- -
87.78 98.92 114.69 218.92
.E
17.5o
44.3a
/_Cntrl
22.51
55.21 end of geog.raw
The easiest way to read tlris dataset is . infile str6 regica varl vat2 using geog
making region a string affable. You do not want to do this, however, because you are practicing for reaNng a dataset like his containing 20,000 observations. If region were numerically encoded and stored as a byte, th _.rewould be a 5-byte _aring per observation, reducing the size of the data by 100,000 bytes. Y( also do not want to bother with first creating the value label. Using the automatic option, infi: e creates the value label automatically as it encounters new regions. infile byte regi( a:regfmt varl vat2 usiflggeog, automatic (6 observations re_5) , list 1. 2. 3. 4. 5. 6,
vat1 31.23 29.52 29.62 28.28 17.5 22.51
region NE NCntrl South West NE NCntrl
vat2 87.78 98.92 114.69 218.92 44.33 55.21
±nfi_e automatically ere _tedand defined a new value label cal/ed regfmt. We can use tbe label list izomrnandto view i is contents: • label list regfmt :
regfm I 2 3 4
NE NCn ;rl Sou,;h Wes
It is not necessary that he value label be undefi_edprior to the use of infile with the automatic option. If the value label regfmt had been previOu._]ydefined as ;. label define reg Ymt 2 "West" i
the result of labellistafter the in_ilewould have been reEfmr : 2 3 4 5
West NE NCntrl South
•
.............
i
......
The automatic option is so convenient that you may see no reason for not using it. Here is one. 102 iinfile(free format)-- Read unformattedASCII(text) data Suppose _ou had a dataset containing, among other things, an individual's sex. You know that the sex variat_le is supposed to be coded male and female. If you read the data using the automatic 1 option and if one of the records contains fmlae, infile will blindly create a third sex rather than print a wdrning. 1 iIi Cl :
i t
.
Skippingariables Specifying _skip instead of a variable name directs infile to ignore the variable in that location. This feature makes it possible to extract manageable subsets from large disk datasets. A number of contiguou_ variables can be skipped by specifying _skip(#) Ignore. {
where # is the number of variables to
> Example In •the. _ighway example that started this section, the data file contained three variables: acc...xate, ] spcllamt; and acc_pts. You can read just the first two variables by typing t
• in_ile ace_rate
spdlimit _skip using highway
You can r_ad the first and last variables by typing in_ile ace_rate _skip acc_pts using highway, clear
You can r_ad just the first variable by typing :
• in_ile ace_rate _skip(2)using highway, clear {
ma_ be specified more than once. If you had a dataset containing four variables, say a, b, c, and d, _nd you wanted to read just a and c, you could type infile a _skip c _skip using filename, i _
,,
_skip
i I
1
Skipping observations Subsets! of observations can be extracted by specifying if exp, which also makes it possible to extract manageable subsets from large disk datasets. Do not, however, use the _variable __Nin exp. Use the ifirange modifier to refer to observation numbers within the disk dataset.
•
> Example
'
i
.Again r_ferring to the highway example, if you type • in] ile ace_rate spdlimit (2 oiservationsread)
ace_pts
iI
ace_rate>3
only obser 1 'ations for which ace_rate is greater than 3 will be infiled. You can type •
iniile
(30_
'
ace_rate
servations
to read onl_ the second,
spdlimit
acc_pts
in
2/4,
clear
read)
third,
and fourth
observations.
q }
I
infile (free format)-- Read unformattedASCII (text) data
Reading time-series ! ! ! i
103
clata
If you are dealing wilh time-series data, you may receive datasets organized by variables rather than by obser_'ations. All the observations on the first variable appear, followed by all the observations on the second variable, _nd so on. The byvariable(#)option specifies that the external data file is organized in this way. You specify the number of obseWations in the parentheses, since infile needs to know that numter in order to read the data properly. Alternatively, you can mark the end of one variable's data an:l the beginning of anottier's by placing a semicolon (';') in the raw data file. You may then specif a number larger than the number of observations in the dataset and leave it to 4nfile to determin, the actual number of observations. This method can also be used to read unbalanced data.
> Example YoU have time-series data on four years recorded in the file time.raw. information on year, amount, and cost and is organized by variable:
The dataset contains
top of time.raw i1980
1981 17 25
14 120
135
1982
198
30 150
180 end of time.raw
I
You can read these data _y typing :. infi!e year amot at cost using (4 observations re_d)
time,
byvariable(4)
list
I.
year 1980
amount 14
cost 120
2. 3.
1981
17
135
1982
25
4.
150
1983
30
180
If the ldata instead contai_ed semicolons marking the end of each series and had no information for amoum in 1983, the raw data might appear as 1980
I981
1982
14 17 25 ; i20 135 150 180
;
1983
; i
i
You could read these datI by typing • infile year amount (4 observations re_d) t . list
cost using
time,
I amount 14
cost 120
t.
year 1980
2.
1981
17
135
3.
1982
25
150
4.
1983
byvariable(lO0)
180
4
104
_
i:
infile (free format) -- Read unformatted ASCII (text) data
Also See! Complementary:
JR] outfile, JR] outsheet, [R] save
Related:
JR] infile (fixed format), JR] input, JR] insheet
Backgrodnd:
[U] 24 Commands to input data, [R] infile
[,n,x,,,.e0,oroa,,Re.d A=.extinxod fo nat t
_
i,,
i
i
i i
il
i
,
Syntax
in!fix specification sing filename [if exp] [in range] [, clear i i wherespecificationis # _irstlineoflile # lines #:
/ E
[byte lint Ifloat ilong ]double I str ] varlist and dfilename,
if it exists
[#-]#[-#]
contains
t
[
t
_
top of dictionary file --
I
infix dictionary Lsingfitename] { * comment, _recededby asterisk may appear freely specificafion_ _(yourdata might appear 5ere) end of dictionary file ......
4
If dfile_ame is specified wit
tan ex)ensiom .dot is assumed.
If fileng_me2or filename is specified without an extension, .rawis assumed. In the first svntax, if usingJ_lename 2 is not specified onthe command line and using file,atne is not specified in the _dictionarv_the data ard assumed to begin on the lifie following the close brace.
Description infix reads into memory
a disk dataset that is not in Stata format,
infix requires
that the data
_!
be in fixed-column forma You have alternatives t_ infix,
! !
(fixed format)--and it can read data in free format--see JR] infile (free format). Most people think infix is easier to use for reading fixed-format data, but infile has more features. If your data are not fi_ed-format,
another
is what you are looking In its first syntax, i !
i
is one. It can also read data in fixed-format--see
is insheet; See [R] insheet.
)r. see [R] irdile and [u]24 x reads the data in a two,step
Commands process.
[R] infile
If you are not certain that infix
to
input data.
You first create a disk file describing
how t_e data are recorde, t. You tell infix to read that file--called a dictionary--and from there infix goes on to read th_ data. The data can be in the same file as the dictionary, or a different file. In its second intermediate
i
inf
alternative
infile
syntax,
ou tell infix
how to read the data right on the command
file.
105
line with no
'i
106
:infix (fixedformat)
Read ASCII (text) data in fixed format
Options using(fi!ename2) specifies the name of a file containing the data. If using() is not specified, the data ate assumed to follow the dictionary in dfitename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. If using(fiIename2) is specified, filenamez is used to obtain the data even if the dictionary itself says otherwise. clear specifies that it is okay for the new data to replace what is currently in memory. To ensure that y+u do not lose something important, inf ix will refuse to read new data if data are already i in memory, clear is one way you can tell infix that it is okav. The other is to drop the data yourself by typing drop _all before reading new data. I
Specifications # first_ineoffile
(abbreviation first) is rarely specified. It states the line of the file where the
for its#lf, first is instead specified when only the data appear in a file and the first few lines of that fillecontain headers or other markers. data begin, first is not specified when the data follow the dictionary; infix can figure that out i firstl appears only once in the specifications.
i! i" :
# lines!states the number of lines per observation in the file. Simple datasets typically have '1 ; lines!. Large datasets often have many lines (sometimes called records) per observation, lines is optional even when there is more than one line per observation because infixcan sometimes figure _t out for itself. Still, if 1 lines is not fight for your data, it is best to specify the directive.
'
i, ,
lines Iappears only once in the specifications. #: tells infix to jump to line # of the observation. Consider a file with 4 lines, meaning four lines per observation. 2: says to go to the second line of the observation. 4: says to go to the fourth line of_the observation. You may jump forward or backward: infix does not care nor is there any inefficiency in going forward to 3:, reading a few variables, jumping back to 1:, reading anothei" variable, and jumping back again to 3 :. It is n0t your responsibility to ensure that, at the end of your specification, you are on the last line of!the observation, infix knows how to get to the next observation because it knows where you are and it knows lines, the total number of lines per observation #: may appear, and typically does, many times in the specifications. / is an alternative to #:. / goes forward one line. //goes forward two lines. We do not recommend the usd of / because #: is better. If you are currently on line 2 of an observation and want to get [_ "'
to linei6, you could type////, but your meaning is clearer if you type 6:. / may!appear many times in the specifications.
: : i
[byte I int Ifloat j long I double and, sdmetimes,_more than one.
I str ]varlist [#-]#[-#]
instructs infix
to read a variable
'_
Begin _y realizing that the simplest form of this is varname #, such as sex 20. That says that variabl_ varname is to be read from column # of the current line: variable sex is to be read from
: '_ t
column20 and here, sex is a one-digit number. varn " rn m fr m the column ran e s eclfied read ar_e#-#, such as age 21-23, says to readva a e o " g p " ; age frtm columns 21 through 23 and here, age is a three-digit number. You cab prefix the variable with a storage type. str name 25-44 means to read the string variable name _rom columns 25 through 44. If you do not specify str. the variable is assumed to be numeriC. "Youcan specify the numeric subtype if you wish.
infix (fixed format) _ Read A_Cll (text) data in fixeclformat-
i i
You can specify more than one variable, with or without a type. byte ql-q5 51-55 means read va_ables ql, q2, q3, q4. and q5 from column_; 51 through 55 and store the five variables as b_tes. Finally, you can spec fy the line on which the Variable(S) appear, age 2:21-23
i
107
says that age is
tO:be obtained from }he second line, column_ 21 through 23. Another way to do this is to put together the #: direct}ve _ith the input-_afiabte directive: 2: age 21-23. There is a difference. but not with respect t_ reading the variable age, Let s consider two alternatives: ;1: str name 25-4_ age 2:21-23 ql-q5 51-55 1:
[
str
name
25-44
2:
age
21"23
ql-q5
51-55
The difference is thai the first directive says variables ql through q5 are on line I whereas the seCond says they are an line 2. When the colon is p_t out front it says on which line variables are to be found when we do not explicitly say otherwise. Vc'hen the colon is put inside, it applies only to the variable under consideration.
Remafks There are two ways t9 use "infix il
One is to type the specifications that describe how to read the
fixed_format data on thelcommand line: .
infix
ace
rate
_-4
spdlimit
6-7
acc_pts
9-11
using
highway.raw /
The other is to type the specifications into a file Z
topof highway.dcI,exampleI
--i infix
dictionary acc rate spdlimit acc_pts
asing highway.raw t-4
{
3-7 I-II
} end of highway.dct,
example
I
and {hen, inside Stata. t, _e . infix
{
i
using
hig way.dct
Which you use makes r_o difference to Stata. The first form is more convenient if there are only a few variables and the second form is less prone to error if you are reading a big, complicated file The second form alkws two variations, the one we just showed--where file_and one where the data are in the same file as the dictionary:
the data are in another
topof highwav.dct,example2i
infix
dictionary acc_rate
{ i-4
spdlimit
_-7
acc_pts
)-II
} 4.58
55 .46
2.8660 1.61 3.02
4.4
2.2 60 4.7 --
>a ot6 that
in the first ex
ple, the top line of the file read infix
wheieas in the second toe line reads simply iMix {
dictionary.
data _.are it is implied t_at__the data follow the dictionary.
end of highway.tier
dictionary
example
using
2
highway.raw where the When you do not say.
108
infi]K(fixed format) -- Read ASCII (text) data in fixed format
'!.
> Example So let's complete the example we started. You have a dataset on the accident rate per million vehicle miles along a stretch of highway, the speed limit on that highway, and the number of access points per mile. You have created the dictionary file highway, dct which contains the dictionary and the data: top of highway.dct,example 2 infix d_ctionary { ace_rate I-4 spdlimit 6-7 acc_pts 9-11
} 4.58 2.86
55_ .46 6_ 4.4
1.61 3.02
i 2.2 6_ 4.7
|
! ! end of highway.dct,
example 2
You created this file outside of Stata using an editor or word processor. Inside Stata. you now read the data. infix lists the dictionary so you will know the directives it follows:
! i
: !
• infix_using highway infix dictionary { ace_rate 1-4 spdlimit 6-7 acc_pts 9-11
} (4 observations
read)
list 1. 2. 3. 4.
ace_rate 4.58 2.86 1.61 3.02
spdlimit 55 60 60
acc_pts .46 4.4 2.2 4.7
Note that we simply typed infix using highway rather than infix using highway.dct, When we do not specify the file extension, infix assumes we mean .dot. <1
Reading string variables When you do not say otherwise in your specification either on the command line or in the dictionary infix assumes variables are numeric. You specify that a variable is a string by placing str in front 9f its name: infix
id t-6
str name 7-36
age 38-39
str sex 41
uslng employee.raw
or top of emptoyee.dct infix d_etionary using employee.raw id t-6 isir name 7-36 age s_r sex
{
38-39 40
} end of empIoyee.dct
infix (fixedformat)--- Read ASCII(text) data in fixedformat
109
f Reading!multiple-lines-er-observation When a dataset has muir le lines per observation, sometimes said multiple records per observation. you specify the number ol lines per observation using lines and you specify on which line the elements appear using #:. . infix
2 lines
1: id 1-6
str name 7-36
2: age I-2
str sex 4
using emp2.raw
oF topofemp2Act iz_fixdictionary using emp2.raw { 2I:lines id sir name 2: age str sex
i"6 7' 36 1"i2 4
} end of emp2,dct
There _e lots of different _,ays to say the same thing.
,
> Example Consider the following
l
aw data:
i_ income educ / se_ age / rcode, 1024 25000 HS | Male 28 119503
top of mydata.raw answers
to questions
--
1-5
1025 27000 C Female 24 022113 1035 26000 HS Male 32 110321
f36 2sooo c Female 25 131232 ;
--
end of mydata.mw
This dmaset has 3 lines oer observation and the first line is just a comment. One possible set of specifi+ations to mad _is ktata'_is infix dictionary u i 2 first 3 lines I: id income str educ 2: str sex 3:
4
topof mydatalAct
ing mydata {
I-4 6-10 12-13 6-11
int age !13-14 rcode 16 ql-q5
7-16
I end of mydatal,dct
----_
although we pre_r i
110
infix (fixed format) -- Read ASCII (text) data in fixed format top of mydata2.dct infi_
dictionary
using
mydata
{
2 first 3 lines id
1:I-4
income
I: 6-10
' E_
sir
educ
1:12-13
i!i
sir
sex
2:6-11
I '
age rcode
2:13-14 3:6
ql-q5
3:7-16
} •end of mydata2.dct Either will read these data, so we will use the first and then explain why we prefer the second. • infix
using
mydatal
infix dictionary 2 first I:
using
lines id
mydata
{
1-4
income
6-10
str
educ
12-13
2:
str
sex
6-11
3:
int age rcode
13-14 6
ql-q5
7-16
} (4 observations • list
in
read)
I/2
Observation
1 id
1024
income
sex
Male
age
28
q2 q5
9 3
q! q4 Observation
1 0
25000
educ
HS
rcode
1
q3
5
2 id
1025
sex
Female
income
educ
C
age
27000 24
rcode
0
q3
1
ql
2
q2
2
q4
1
q5
3
Now, what is better about the second? What is better is that the location of each variable is completely documented on each line, in terms of both line number and column, Since infix does not care about the order in which we read the variables, we could take the dictionary, jumble the lines, and it would still work. For instance, .... infi:
dictionary first
using
mydata
top of mydata3.dct
{
lines str
sex
1
rcode
!
sir age id
educ
ql-q5 income
2:6-11 3:6 1:12-13 2:13-14 I: i-4 3:7-16 i: 6-10
}
t
end of mydam3.dct
!
[ ]
I
infix(fixedformat)--Read ASCII(text)datain fixedformat
111
wilt also read these data even though•for each observation, we start on line 2, go forward to line 3, jump back to line l, and end up on line 1. It is not even inefficient to do this because infix does not really jump to record 2, then record 3, then record 1 again, etc, infix takes what we say and organizes it efficiently. The order in which we say it makes no difference. Well, it does make one: the order of the variables in the resulting Stata dataset will be the order we specify. In this case the reordering is senselessbut, in real datasets, reordering variables is often desirable. Moreover, we often construct dictionaries, realize _at we omitted a variable, and then go back and modify them. By making each line complete in and of itself, we can add new variables anywhere in the dictionary and not worry that. because of our addition, something that occurs later will no longer read correctly. <1
Readingsubsetsof observations
i
If you wanted to read only the males from some raw data file, you might type • infix
id i-6
sir name 7-36
age 38-39
str sex 41
using employee.raw if sex=="M"
If your specification was instead recorded in a dictionary, you could type infix
using employee.dct i_ sex=="M"
In another dataset, if you wanted to read just the first t00 observations, you could type (
infix 2 lines > in i/i00
1:
id I-6
str name 7-36
2: age i-2
str sex 4
using empi.raw
Or, if the specification was instead recorded in a dictionary and you wanted observations 10l through 573, you could type • infix using emp2.dct in 101/573
Also See Complementary:
[R]outfile, [R] outsheet, [R] save
Related:
[R]intile (fixed format), [R]insheet
B_ckground:
[L] 24 Commands to input data, [R]intile
i
F °'; e
input -- Enter data from keyboard I
I II I
I III
I
I
I
Syntax input
[varlist]
[,_automatic label]
Description input allows you to type data directly into the dataset in memo_• alternative to input.
Also see [R] edit for a windowed
Options automatic causes Stata to create
value labels from the nonnumeric
data it encounters•
automatically widens the display format to fit the longest label. Specifying label even if you do not explicitly type the label option.
automatic
It also implies
label allows you to type the labels (strings) instead of the numeric values for variables associated with value labels. New value labels are not automatically created unless automatic is specified.
Remarks If there are no data in memory, when you type input you must specify a vartist• Stata will then prompt you to enter the new observations until you type end.
> Example You have data on the accident rate per million vehicle miles along a stretch of highway along with the speed limit on that highway. You wish to type these data directly into Stata: • input nothing to input r (104) ;
Typing input by itself does not provide enough information know the names of the variables you wish to create. • input ace_rate spdlimit 1. 2. 3. 4.
ace_rate 4.58 55 2.86 60 1.61 end
spdlimit
112
about your intentions.
Stata needs to
input -- Enter data from keyboard
113
:
! _ !
i
We typed input acc_rate spdlimit and Stata responded by repeating the variable names and then prompting us for the first observation. We then typed 4.58 and 55 and pressed Retth,'n. Stata prompte_ktusfor the second obsen, ation. We entered it and pressed Return. Stata prompted us for the third 6bservation. We knew that the accident rate is 1.61 per million vehicle miles, but we did not know the corresponding speed limit for the highway. We typed the number we knew, 1.61, followed by a period, the missing value indicator. When we pressed Return, Stata prompted us for the fourth 6bservation. We were finished entering our data, so we typed end in lowercase letters.
i i
We can now list
the data to verify that we have entered it correctly:
. list i. 2. 3.
acc_rate 4.58 2.86 1.61
spdlimit 55 60 Q
If you have data in memory and type input without a vartist, you will be prompted to enter adklitional information on all the variables. This continues until you type end. :
i
i
Examp You now have an additional observation you wish to add to the dataset. Typing input by itself tells Stata that you wish to add new observations: • i_ut 4, 5,
act_rate 3.02 60 end
spdlimit
St_ta rem/nded us of the names of our v_-iables and prompted us for the fomth observation. We entered 'the numbers 3,02 and 60 and pressed Return. Stats then prompted us for the fifth observation. We could add as many new observations as we wish. Since we needed to add only one observation, we typ_ _nd, Our dataset now has four observations. "xl
You may add new variables to the data in memory by typing input followed by the names of the new variables. Stata will begin by prompting yGu for the first observation, then the second, and so on, until you type end or enter the last observation.
'iExample i
! ,
In addition to the accident rate and speed limit, we now obtain data on the number of access points (omramps and off-ramps) per mile along each stretcl of highway. We wish to enter the new data.
I
• input acc_pts acc_pts t. 4.6 2. 4.4
3 2.2 I
i
4. _4.7
F
114 input -- Enter data from keyboard When we typed input acc_pts, Stata responded by prompting us for the first observation. There are 4.6 access points per mile for the first highway, so we entered 4.6 and pressed Return. Stata then prompted us for the second observation, and so on. We entered each of the numbers. When we entered the final observation, Stata automatically stopped prompting us--we did not have to type end. Stata knows that there are four observations in memory, and since we are adding a new variable, it stops automatically. We can, however, type end anytime we wish. If we do so, Stata fills the remaining observations on the new variables with m/ssing. To illustrate this, we enter one more variable to our data and then list the result: • input
junk
jun_ 1. 1 2. 2 3. end • list acc_rate 4.58
I.
spdlimit 55
acc_pts 4.6
60
4.4
2.86
2. 3,
1.61
4.
3• 02
junk 1 2
2.2 60
4.7
q
You can input string variables using input, but you must remember to explicitly indicate that the variables are strings by specifying the type of the variable before the variable's name.
> Example String variables are indicated by the types str#, where #represents the storage length, or maximum length, of the variable. For instance, a str4 variable has maximum length 4, meaning it can contain the strings a, ab, abe, and abed but not abede. Strings shorter than the maximum length can be stored in the variable, but strings longer than the maximum length cannot. You can create variables up to str80 in Stata. Since a str80 variable can store strings shorter than 80 characters, you might wonder why you should not make all your string variables str80. You do not want to do this because Stata allocates space for strings based on their maximum length. It would waste the computer's memory. Let's assume that we have no data in memory and wish to enter the following input
strl6
name
age
str6
name i.
"Arthur
2.
"Mary
3. Guy "Fawkes" 3.
"Guy
Hope"
Fawkes cannot
We first typed input sex a str6 variable.
sex age
sex
22 male
37
"female"
48 male be read
Fawkes"
4. "Kriste 5. end
:
Doyle"
data:
as a number
48 male
Yeager"
25 female
strl6 name age str6 sex, meaning that name is to be a strl6 variable and Since we did not specify anything about age, Stata made it a numeric variable.
Stata then prompted us to enter our data. On the first line, the name is Arthur Doyle, which we typed in double quotes. The double quotes are not really part of the string; they merely delimit the
_!lput
_
_l_,t;'w uam
llVli!
hVyL_)awu
J ,_,
beginning and end of the str ng. We followed that with Mr Doyle's age, 22, and his sex, male. We did not bpther to type doubk quotes around the word male because it contained no blanks or special characters. For the second o _servation,we did type the double quotes around female;it changed _othing. In the third observation w omitted the double quotes around the name, and Stata informed us that Fawkes c_uld not be read as number and repromptddus for the observation. When we omitted the double q_otes, Stata interpre:ed Guy as the name, Fa_rl_esas the age, and 48 as the sexl All of this would have been okay with Stata except for one problem: Fawkes looks nothing like a number, so Stata complained and gave :s another chance. This lime, we remembered to put the double quotes around ttie name.
i
Stata was satisfied, and _ continued. We entered lhe fourth observation and then typed end. Here is our dataset: • _ist 1. 2. 3. 4.
1 nam_ Arthur Doyle Mary Hope Guy Fawke. _ Kriste Yeagez
age 22 37 48 25
sex male female male female
q
I
>
I
Example Just as we indicated whic Lvariables Werestrings by placing a storage type in front of the variable name, we can indicate the .torage type of our numeric variables as well. Stata has five numeric storage types: byte, int, 1c ng, float, and double. When you do not specify the storage type. Stata assumes the variable is afl _at. Youmay want to review the definitions of numbers in [U] 15 Data.
! ' i i i i !i i i
,'_dditional Therei are two reasons you might The wantdefault to explicitly specify storage type: toforinduce precision or to co_vhy aserve memory. type float has the plenty of precision most circumstances because Stata performs all calculations in double precision no matter how the data are stored. I[ you were storing 9-digit Social Security Numbers, however, you would want to coerce a different storage type or else the last digit would be r0uhded, long would be the best choice: double would _,_orkequally well, b_]tit would waste memory. Sometimes you do not need to store a variable as float.If the variable contains only integers between -32,768 and 32,7i_6,it can be stored as an int and would take only half the space. If a variable contains only inti',gersbetween -127 and 126, it can be stored as a byte which would _:akeonly half again as mu( i space. For instance, in tile previous example we entered data for age ,_ithout explicitly specifyin, the storage type; hence, it was a float. It would have been better to _tore it as a byte.To do ti" tt. we would have typed input
strl6
name b _te age str6 nam _
_. "Arthur Doyle"
sex sex
12male
°I i
2. "Mary Hope" 37 'female" _. "Guy Fawkes" 48 male
i
4. "Kriste
Yeager"
age
25 female
5. end
i
Stata understands a number of shorthands. For instance,
_I
input
int(a
b) c
allows you to input three variables, Remember .{input
int
and c a float•
a b c
would make a an int *
a, b, and c, and makes both a and b ints
but both b and c floats.
. inputa longb double(cd) e would make a a float,b a long,c and d doubles,and e a float. Statahas a shorthandforvariable names withnumericsuffixes. Typingvl-v4 isequivalent to typing Vl v2 v3 v4. Thus, • linput
'
int(vl-v4)
inputs f6ur variables and stores them as ints. q
Q Technic,_l Note You may want to stop reading now. The rest of this section deals with using input with value labels. If you are not familiar with value labels, you should first read [U] 15.6.3 Value labels. Remdmber that value labels map numbers into words and vice versa. There are two aspects to the process. !First, we must define the association between numbers and words. We might tell Stata that 0 corresponds to male and 1 corresponds to female by typing label define sexlbl 0 "male" 1 "female". The correspondences are named, and in this case we have named the O_male l++female correspondence sexlbl. Next, iwe must associate this value label with a variable. If we had already entered the data and the variable was called sex, we would do this by typing label values sex sexlbl. We would have entered the data by typing O's and !'s, but at least now when we list the data, we would see the words rather than the underlying.numbers. We cab do better than that. After defining the value label, we can associate the value label with the type:variable at the time we input the data and tell Stata to use the value label to interpret what we l_bel • i_put
define strl6
I.
"Arthur
2.
"Mary
3.! "Guy
sexlbl name
byte(age
Hope"
1 "female"
sex:sexlbl),
name Doyle" 22 male
Fawkes"
4. "Kriste 5. end
0 "male"
age
label sex
37 "female" 48 male
Yeager"
25 female
After deft ing the value label, we typed our input command. Two things are noteworthy: We added the label option at the end of the command, and we typed sex:sexlbl for the name of the sex variable, T_e byte(...) around age and sex:sexlbl was not really necessary: it merely forced both age _nd sex to be stored as bytes. Let's first decipher sex : sexlbl, sex is the name of the variable we want to input. The : sexlbl part tells Stata thal the new variable is Lo be associated with the value label named sexlbl. The label option tells Stata that it is to look up any strings we type for labeled variables in their
input- Enter datafrom keyboard
117
corresponding value label and substitute the number when it stores the data. Thus, when we entered the first observation of ou • data, we typed male for Mr Doyle's sex even though the corresponding variable is numeric. Rather than complaining that ""mate" could not be read as a number", Stata accepted what we typed, 3oked up the number corresponding to male, and stored that number in the data.
i
The! fact that Stata has lctually stored a number rather than the words male or female is almost irrelevant. Whenever we ist the data or make a table, Stata will use the words male and female just as if those words were actually stored in the dht/set rather than their numeric codings: • list
I.
nm _e
age
se_
DoylLe
22
male
Ho],e
37
female
Guy Fawb is
48
male
95
female
Arthur
2.
Mary
3. i
, Kriste Yeag_ r tabulate sex sex
] req.
Percent
Cure,
male
2
50. O0
50.00
female
2
50. O0
I00. O0
Total
4
I00. O0
It is only almost irreleva at since we can make use of the underlying numbers in statistical analyses. For instance, if we were to ask Stata to calculate the mean of sex by typing sumrnarize sex, Stata would report 0.5. We woul interpret that to mean that one-half of our sample is female.
i i
Value labels are perman_ Itly associated with variables. Thus, once we associate a value label with a variaNe, we never have ti do so again. If we wanted to add another observation to these data, we
i
could type . input,
i i
label
i5. "Mark
Esman"
nam _ 26 male
age
sex
_. end
!_ i
[3Technical Note The automatic option ',utomates the definition of the value label. In the previous example, we _nformed Stata that male c, ,rresponds to 0 and female corresponds to 1 by typing label define sexlbl 0 "male" :t "female". It was not necessary to explicitly specify the mapping. Speci_,ing the aut6maticoption tells ;tara to interpret what we type as follows:
i i
ii
First, ;see if it is a numbeI If so, store that number and be done with it. If it is not a number, check
I I ! i_ i
I
the value label associated u th the variable in an attempt to interpret it. If an interpretation exists, store theIcorresponding nun: .tic code• If one does not exist, add a new numeric code corresponding to what was typed. Store th_ new number and update the value label so that the new correspondence is never t'orgotten. We can use these feature to reenter our age and sex data. Before reentering the data, we drop -all and label drop _all to prove tha_ we have nothing up our sleeve:
atop_an _abel
drop
_all
....
i
118
i input -- Enterdata from keyboard input
strl6
name
!
byte(age
sex:sexlbl),
name
i.
"Arthur
_.
"Mary
3.
"Guy
4.
"Kriste
Doyle" Hope"
22
37
Fawkes"
age
48
Yeager"
automatic sex
male
"female" male 25
female
. end i I
•
T We previouslydefinedthevaluelabelsexlbl so thatmale correspondedto 0 and female corresponded
to 1. Th+ label that Stata automatically created is slightly different but just as good: i
Sabel list sexlbl se: Ibl : I
male
0 2
female
Also See
'
Complementary:
[R] save
i
Related: i
[R] edit, [R] infile
Background:
[U] 24 Commands to input data
, i
t
I
i
, I
!
!_ !
i
_ in_iheet -- Read AS II (text)data created by a spreadsheet i i iHll i r i i iJ ii i iil ill [
i
Syntax i i
i i
i
insheet
[varlist] using
filename
[, _double [no]n_ames comma t__abclear
]
If filen_me is specifiedwithmt an extension, .raw is assumed.
Description
in_heet reads into rremory a disk dataset that is not in Stata format. ±nsheet is intended for readir_g files created by a spreadsheet or database program. Regardless of the creator, :i,nsheet reads text (ASCII) files where here is one observation per line and the values are separated by tabs or commas. In addition, the first line of the file can contain the variable names or not. The best thing
I i
[ about!insheet is that if you type . insheetusingill, name
i
insheet will read your lata; that's all there is to it. Stata has other comrr ands for reading data. If you are not sure that insheet
i
lookingfor, see [R] infih and [U] 24 Commands to input data. If y/ou want to save your data in "spregdsheet-style" forma
Options
is what you are
see [R] outsheet.
i
double forces Stata to st_age types.
t
tore variables as double_
rather than floats:
see IV] 15.2.2 Numeric
:-
[no]names informs Stata whether variable names are included on the first line of the file. Speci_,ing this option will speed insheet's processing--assuming you are right--but that is all. ±nsheet can determine for itse!f whether the file includes variable names.
1
comma tells Stata that the values are comma-separated. Specifying this option will speed insheet's pr0cessing--assumin_ you are right--but thai is all. insheetcan determine for itself whether the separation charact_ is a comma or a tab.
i i
!
tab prOcessing--assumin_ tells Stata the v_lues are right--but tab-separated. this can option will speed insheet's you are that Specifying ig all. insheet determine for itself whether the separation charact_:r is a tab or a comma.
i
clear specifies that it is okay for the new data |o replace what is currently in memory. To ensure that you do not lose sc mething important, insheetwill refuse to read new data if data are already in memory I clear
is _ne way you can tell ±nsheet
x_ourselfb_ typing drip
_all
that it is okay. The other is to drop _he data
before reading new data.
119
i
12o
Remarks
insheet-
Read ASCII (text) data created by a spreadsheet
There i_ nothing to using insheet.You type insheet and
insheet
using
filename
will read your data, That is, it will read your data if
1. It can find the file and 2. The file meets insheet's
expectations
as to the format in which it _s written.
Assuring I is easy enough; just realize that if you type infix using myfile, Stata interprets this as an instruction to read myfile.raw. If your file is called myfile.txt, type infix using myf ile. t,btt. As for the file's fo,-mat, most spreadsheets and some database programs write data in the form insheet expects, It is easy enough to look as we will show you--and it is even easier simply to try and see what happens. If typing • insheet
using
filenarrle
does not produce the desired result, you will have to try one of Stata's other infile commands: see [R] infile.
> Example You ha*e a raw data file on automobiles and can bd read by typing (5
called auto.raw.
This file was saved by a spreadsheet
insheet using auto vars, I0 obs)
That done, we can now look at what we just loaded: • describe Contains
data
obs:
I0 5
vats: size:
310
(99.8%
storage
of memory
free)
display
value
type
format
label
make_ price
strt3 int
%13s %8.0g
mpg
byte
Z8.0g
rep78
byte
%8.0g
foreign
strlO
ZlOs
variable
name
Sorted by: |Note:
dataset
has
changed
since
last
variable
label
saved
li_t
I. i 2._I 3,
make
price
mpg
AMC Concord AMC Pacer
4099 4749
22 17
3 3
foreign Domestic Domestic
Spirit
3799
22
4. Buick 5. Buick
Century Electra
4816 7827
20 15
3 4
Domestic Domestzc
6. Buick
LeSabre
5788
18
3
Domestzc
4453
26
7. !
AMC
rep78
Buick
Opel
Domestic
Domestic
insheet 8. BuickRegal 9. Buick Riviera 10. Buick
Read ASCII (text) data created by a spreadsheet
5189 10372
20 16
3 3
Domestic Domestic
4082
19
3
Domestic
Skylark
Note that these data contain a combination of string and numeric variables, insheet out by i_elfi
121
figured all that
i
i
3 Technical Note Now let's back up and look at the auto.raw screen: • Sype mal_e
These invisible
file. Stata's type command will display files to the
auto.raw mpg
rep78
foreign
AM¢ Concord
4099
22
3
Domestic
AMC Pacer
4749
17
3
Domestic
AMC Spirit
3799
22
Buick
Century
4816
20
3
Domestic
Buick Buick
Electra LeSabre
7827 5788
15 18
4 3
Domestic Domestic
Buick
Opel
4453
26
Buick Buick
Regal Riviera
5189 10372
20 16
3 3
Domestic Domestic
Buick
Skylark
4082
19
3
Domestic
data and
i
price
have
tab
hence
characters
Domestic
Domestic
between
indistinguishable
i !
]
i
values.
from
blanks,
Tab
characters
type's
showtabs
are
difficult option
to makes
see
since the
tabs
thev
are
visible: I
_ype
auto.raw,
showtabs
1
makepricempgrepZ8foreign AMC Concord4099223Domestic AMC Pacer4749173Domestic AMC Spirit379922.Domestic Buick Century4816203Domestic Buick Electra7827lS4Domestic Buick
LeSabre5788lS3Domestic
Buick
Opel4453<_>26.Domestic
Buick Buick
Rega!5189203Domestic KivieralO372i63Domestic
Buick
Skylark4082193Domestic
:
This is an example of the kind of data insheet is willing to read. The first line contains the variable names, Nthough that is not necessm/. What is nedessary is that the data values have tab characters between them. + insheet would be just as happy if the data values were separated by commas. Here is another vafiationi on auto. raw that insheet can read: type
auto2.raw
make,price,mpg,rep78,foreign AMC Concord,4099,22,3,Domestic AMC Pacer,4749,17,3,Domestic AMC Spirit,3799,22,,,Domestic Buick Buick
Century,48i6+,20,3,Domestic Electra,7827i, 15,4,Domestic
Buick
LeSabre,5788,18,3,Domestic
Buick
Opel,4453,26!,,Domestic
i
Buick Buick
Regal,5189,20,3,Domesti¢ Riviera,lO37_,16,3,Domestic
i
Buick
Skylark,4082.,19,3,Dom_sZic
["
It is way one easieror for the us other. human beings to see the commas rather than the tabs. but computers 122
insheet-
do not care O
Read ASCII (text) data created by a spreadsheet
!
D Example The file does not have to contain variable names. Here is another variation on auto. the first line. this time with commas rather than tabs separating the values:
raw without
type auto3, raw AMC Concord, 4099,22,3, Domestic AMC Pacer, 4749,17,3 ,Domestic (output omitted ) Buick
Skylark. 4082,19,3, Domestic
Here is what happens when we read it: insheet using auto3 you must start with an empty dataset r(I8);
Oops; we still have the data from the last example in memory. • insheet using auto3, clear (5 vars, I0 obs) . describe Contains data obs : vats : size:
variable name
10 5 310 (99.8Y,of memory free) storage type
display format
vl
strl3
Y,13s
v2 v3 v4 v5
int byte byte strlO
7,8.0g ]/,,8.0g XS. 0g Y, IOs
Sorted by : Note:
value label
variable label
dataset has changed since last saved
list vl AMC Concord AMC Pacer
v2 4099 4749
v3 22 17
v4 3 3
v5 Domestic Domestic
(output omitted ) i0. Buick Skylark
4082
19
3
Domestic
I. 2.
'j
The only difference is that rather than the variables being nicely named make, price, mpg, rep78, and foreign,they are named vl,v2, ..., v5. We could now give our variables nicer names: • rename vl make . rename v2 price
insheet-- Read ASCII (text) data created by a spreadsheet
123
!
Another alternative is to specie' the variable names when we read the data: • insheet make price mpg rep78 foreign u_ing auto3, clear
(5 vats,I0obs) list make AMC Concord AMC Pacer
i. 2.
price 4099 4749
mpg 22 17
rep78 3 3
4082
19
3
foreign Domestic Domestic
!
i
(outputomi,ed ) 10.
,i
Buick Skylark
Domestic
ii
If we use this approach, we must not specify too few variables • insheet make price mpg rep78 using aut03, clear too few variables specified error in line 11 of file r,(102) ;
i 4 I
|
or too many.
|
. insheet make price mpg rep78 foreign weight using auto3, clear too many variables specified e_ror in line 11 of file r,(103);
mat is why we recommend . insheet using
i
filename
/
It is not difficult to rename your variables afterwards should that be necessary,
q
> Example
I
About the only other thing that can go wrong is that the data are not appropriate for reading by insheet. Here is yet another version of the automobile data: type auto4.raw, showZabs "_AMCConcord" 4099 22 '_,AMC Pacer" 4749 17
3 3
Domestic Domestic
3 4
Domestic Domestic Domestic
"!AMCSpirit" '_BuickCentury" "Buick Electra"
3799 4816 7827
22 20 15
"Buick LeSabre" "Buick 0pel" "Buick Regal" ".'Buick Riviera"
5788 4453 5189 10372
18 26 20 16
3 3 3
Domestic Domestic Domestic Domestic
"_uick Skylark"
4082
19
3
Domestic
] ]
i
Note that we specified type's showtabs option and no tabs are shown. These data are not tabdelimited or comma-delimited and are not the kind of data insheet is designed to read. I,et's try insheetanyway:
] 1
(Continued on next page)
i
124
insheet -- Read ASCII (text) data created by a spreadsheet insheet using auto4, clear (I vat, I0 obs) describe Contains data obs : vats:
10 1
size:
430 (99.8Y,of memory free)
variable name
storage type
vl
display format
sir39
Sorted by: Note:
value label
variable label
7,39s
dataset has changed since last saved
• list vl I. AMC Concord 2. AMC Pacer (output omitted) 10, Buick Skylark
4099 4749
22 17
3 3
Domestic Domestic
4082
19
3
Domestic
When £nsheet tries to readdatathathave no tabsor commas, itisfooledintothinkingthedata contain justone variable. Ifyou had thesedata,you would have toreadthedatawithone ofStata's other commands such as infile (free format). <3
Also See Complementary:
[R] outfile, [R] outsheet, [R] save
Related:
[R] infile (free format)
Background:
[U] 24 Commands [R] intile
to input data,
Ilrt"w
1
e
r
I i;insect -- Display simple summary of data's ci laracteristics I IU ( i i II t _ _
m I,
_
I / "
I
i
yntax, :in.nspeict [varlist_ [±fexp] [in range] i byi ...
i
: may be used
with
inspect;
see [R] by.
)esCriiOn
i 1
i
The inspect command provides a quick summary of a numeric variable that differs from that provided by summ_arizeor tabulate. It reports ttie number of negative, zero, and positive values;
I
the nunlber of integers and nonintegers; the number of unique values: the number of missin_ and produces _asmall histogram. Its purpose is not anal)tical but to allow you to quickly gain familiarity with u_own data.
'
! l
ilt_vartisl, i
! i 1
Typing inspect
Example inspect
by itself produces an inspectio_ for all the variables in the dataset. If you specify
aninspectionofjustthosevafiablesisp resented. is not a replacement or substitute for s!!rnma/'ize and tabulate.
It is instead a data
management or information tool that lets you quickly g_in insight into the values stored ir_a variable, For instance, you receive data that purport to be on automobiles, and among the variables in tile dataset is one called mpg.Its variable label is l_ileage (mpg), which is surely suggestive. You i_aspect the variable • use
auto,
(1978
Automobile
inspect m]_g:
Data)
mpg
Mileage
! (mpg)
Number
of
Observations Non-
# #
Negative Zero
Total-
# #
Positive
74 .....
74
-
# #
#
#
74
74
-
#
#
#
,
i
clear
Total #
12
Missing
-
Integers -
-
41 (21 unique
Integers
74
values)
and you discover that the variable _s never missiflg; all 74 observations in the dataset have some value for rapg. Moreover. the values are all positive and they are all integers as well. Among those 74 observations are 2t unique (different) values. The variable ranges from 12 to 41. and you are provided with a small histogram that suggests the variable appears to be what it claims
i
lzs
[
_zo
znspec[ -- ulspiay slmpm summary ot data's characteristics
> Example Bob, a co-worker, presents you with some census data. Among the variables in the dataset is one called region which is labeled Census Region and is evidently a numeric variable. You inspect this variable. • use bobsdata • inspect region region:
Census region
Number of Observations Non-
#
# #
# # # #
Negative Zero Positive
#
#
#
#
Total
#
#
#
#
Hissing
# # #
5
Total 50 50
Integers
Integers
50 50
-
50
(5 unique values) region is labeled but I value is NOT documented in the label.
In this dataset something may be wrong, region takes on five unique values. The variable has a value label, however, and one of the observed values is not documented in the label. Perhaps there is a typographical error. A call to Bob would be in order.
> Example You call Bob and there was indeed an erron He fixes it and returns the data to you. Here is what inspect produces now: inspect region region:
# # #
Census region
# # # #
# # # # # #
Number of Observations NonNegative Zero Positive
# # # #
1
Total Missing 4
Total 50 50 -
Integers
Integers -
50 50
50
(4 unique values) region is labeled and all values are documented in the label.
4
I
inspect-- Displaysimplesummaryof data'scharacteristics
i
'
127
Example You receive data on the climate in 956 U.S. Cities. The variable tempjan records the Average January temperature in degrees Fahrenheit. The results of inspectare • inspect tempjan tempjan:
Average January temperature
Number of Observations Non-
i Negativ@ Zero Positiv4
# # # # # #
# # #
# # #
Total 954
Total Hissing
954 2
2.2 72.6 (More than 99 unique values)
Integers 78 78
Integers 876 ........ ! 876
i i
956
, i
In two of the 956 observations, tempjan is nlissiQg. Of the 954 cities that have a recorded tempjan, all are positive and 78 of them are integer valuesl tempjanvaries between 2.2 and 72.6. There are more than 99 unique values of tempj an in the dathset. (Stata stops counting unique values after 99.) <1
i
1
SaVediResults inspect
saves
i in r()"
I
Scalars r(N) r(N..Jaeg)
number of observatiqns number of negative 0bservations
r(N_O)
number of observations equal to 0
i ]
r (N_pc_s) number of positive observations r (N_negint) number of negative, Integer observations r(N_posint)
number of positive, integer observations
r(N_tmique) r(N_undoc)
number of unique values or. if more than 99 number of undocumented values or. if not labeled
AlsoSee
t
i i
Related: i
i !
[R] codebook, [R] compare, [R] describe, [R] Iv, [R] summarize, [R] table, [R] mbsmn, [R] tabulate
i ii
1
1 lue [ ipolate -- Linearly interpolate(extrapolate) I I values
]
I
]
I
Syntax ipolate war war by ...
[if
exp]
: may be used with ipolate;
[in range],
generate(newvar) [ epolate
]
see [R] by.
1
Description ipolate interpolated
creates newvar = yvar where yvar is not missing and fills in newvar (and optionally extrapolated) values of yvar where yvar is missing.
with linearly
Options generate (newvar) epolaZe
is not optional; it specifies the name of the new variable to be created.
specifes values are to be both interpolated and extrapolated.
Interpolation
only is the default.
Remarks > Example You have data points on y and x although, in some cases the observations on y are missing. You believe y is a function of x justifying filling in the missing values by linear interpolation: 1ist 1. 2. 3. 4. 5. 6. 7.
x 0 1
y 3
1.5 2 3 3.5 4
6
18
ipolate y x, gen(yl) (I missing value generated) • ipolate y x, gen(y2) epolate • list i. 2. 3. 4. 5. 6. 7.
x 0 1 1.5 2 3 3.5 4
y
yl
3
3 4.5 6 12 15 18
6
18
y2 0 3 4.5 6 12 15 18
128
8
!
!
i_i
l l
ipolate-- i_inearlyinterpolate(extrapolate)values
129
]
> E_amNe i
i
You have a dataset of circulation of a magazine from 1970 through 1993. Circulation is recorded in a Variablecalled circ and the year in year. in a few of the years, the circulation is not known
i
_o you want to fill it in by linear interpolation: . ipo!ate
!
cite year, gen(icirc)
Now assume you have data on the circulations of 50 magazines; the identity of the magazines is i recorded in magazine (which might be a string variable--it does not matter): i
by magazine: ipolate circ year, gen(ic_rc)
!i
! if the by ... : prefix is specified, interpolation is performed separately for each group.
]
> E_ample You have data on y and z although some of the y values are missing. You wish to smooth y(x) using lowess (see [R] ksm) and then fill in missifigwdues of y using interpolated values: I
. ksm y x, gen(yls) lowess 'ipolate yls x, gen(iyls)
i i i
q
"
:i]
/
]
MethodsandFormulas
:
i
ipolateis implemented as an ado-file. The value Y at x is found by finding the closest points (xo, yo) and (xt.yl), and xl > x, where Yo and 91 are observed, and +alculating
such that x0 < x
Y] - Yo y= _ (x-- Xo) +yo X 1 -- X 0
if epoZate is specified and if (xo, Yo) and (xt,yl) cannot be found on both sides of x, the two closest points on the same side of x are found arid the same formula is applied.
t
!
AlSO _e Complementary:
[t_]ksm
i
I I
i
ii
I Itle
[ ivreg -- Instrumental ,, variablesmidtwo-stageleast " squaresregression
i
Syntax ivreg
depvar
[vartish
] (varlist2
= varlistiv)
[weight]
[if
exp]
[in
range]
[, level(#) _beta_hasconsnoconstant _robustcl.uster(varname) firs_ noheader eform(string)depname(varname) msel 1 by ... : may be used with ivreg; see [R] by. aweights, freights, iweights, and pweights are allowed; see [U] 14.1.6 weight. depvar, varlistt, varlist2, and varlistiv may contain time-series operators; see [U] 14.4.3 Time-series varlists. ivreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntaxfor predict predict
[l-ype] newvarname
[if
exp]
[in range!
[, statistic j
where statistic is
xb re s iduals
xjb, fitted values (the default) residuals
p_r(a,b)
Pr(a < Yi < b)
e(a,b)
E(yjla < Yi < b)
ystar(a,b) stdp stdf
(Yj-), yj = max{a, min(yj,b)} standard error of the prediction standard error of the forecast
where a and b may be numbers or variables; a equal to '.' These aresample. available both in and out of sample; type predict the statistics estimation
means -2; ...
b equal to ' ' means -roe.
if e(sample)
...
if wanted only for
Description ivreg estimates a linear regression model using instrumental variables (or two-stage least squares) of depvar on varlistl and varlist2 using varIisti,, (along with varlish) as instruments for varlist2. In the language of two-stage least squares, varlist_ and varlistiv are the exogenous varlist2 the endogenous variables.
130
variables and
ivreg -- Instrumentalvariablesand two-stage least squares regression
131
OptiOns level (#) specifies the confidence level, in percent, for confidence intervals. The default is level or as set by set level: see [U] 23.5 Specifying the width of confidence intervals.
(95)
i
beta asks that normalized beta coefficients be reported instead of confidence intervals,
i I
hascons indicates that a user-defined constant or its equivalent is specified among the independent variables. Some caution is recommended when specifying this option as resulting estimates may not be as accurate as they otherwise would be. See [R] regress for more information. noconstant
suppresses the constant term (interCept) in the regression.
robust specifies that the Huber/White/sandwieh estimator of variance is to be used in place of the traditional calculation (White, 1980). This alternative variance estimator produces consistent standard errors even if the data are weighted or if the residuals are not identically distributed. robust combined with cluster() further allows residuals which are not independent within ctu_ter (although the3, must be independent between clusters). See [u] 23.11 Obtaining robust variance estimates.
i i
i t
If you specie, pweights,
I
robust
is implied[ see [u] 23.13 Weighted estimation.
cluster (varname) specifies that the observations are independent across groups (clusters) but not necessarily independent within groups, varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals, cluster () affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficientsL see [U] 23.11 Obtaining robust variance estimates, cluster () can be used with pweights to produce estimates for unstratified cluster-sampled data, see [U] 23.13 Weighted estimation. Also see [R] svy estimators for a command designed especially for sur_'ey data. c!us't;er() by itself. first
implies robust:
specifying robust
cluster()
is equivalent to typing cluster()
requests that the first-stage regression results be displayed.
noheader suppresses the display of the ANOVAtable and summary statistics at the top of the output, di:splaying only the coefficient table, This option is often used in programs and ado-files,
'
eform(string) is used only in programs and ad0-files that employ ivreg to estimate models other than instrumental variable regression, eformO specifies the coefficient table is to be displayed in "exponentiated lbrm" as defined in [R] maximize and that string is to be used to label the exponentiated coefficients in the table. depname (varname) is used only in programs and ado-files that employ ivreg to estimate models other than instrumental variable regression, depname () may be specified only at estimation time. varname is recorded as the identity of the dependent variable even though the estimates are ca'tculated using depvar. This affects the labeling of the output--not the results calculated_but could affec_ subsequem calculations made by predict, where the residual would be calculated as deviations from varname rather than depvar, depname() is most typically used when depvar is a temporary variable (see [el macro) used as a proxy for varname. msel is used only in programs a_d ado-files that employ ivreg to estimate models other than instrumental variables regression, msel sets the mean square error to 1, thus forcing the variancecovariance matrix of the estimators to be (X'DX) -1 (see [R] matsize Methods and Fonnulas) and so affects calculated standard errors. Degrees of freedom for t statistics are calculated as _ ra_her than n - k.
t
132
ivreg -- Instrumental variables and two-stage least squares regression
Options for predict xb, the default, calculates
the linear prediction.
res±duals calculates the residuals; that is, gj -xjb. These are based on the estimated equation when the observed values of the endogenous variables are used--not the projections of the instruments onto the endogenous variables. pr(a,b) calculates interval (a, b).
Pr(a < xjb
+ uj < b), the probability
that yjlxj
would be observed
in the
a and b may be specified as numbers or variable names; tb and ub are variable names; pr(20,30) calculates Pr(20 < xjb + uj < 30); pr(lb,ub) calculates Pr(/b < xjb -+.uj < ub); and pr(20,ub) calculates Pr(20 < xjb + uj < ub). a =. means -_zxz; pr(. ,30) calculates Pr(xjb + uj < 30); pr(/b,30) calculates Pr(xjb + uj <: 30) in observations for which It) =. (and calculates Pr(/b < xjb + uj < 30) elsewhere). b =. means +c_; pr(20, .) calculates Pr(xjb + uj > 20); pr(20,ub) calculates Pr(xjb + uj > 20) in observations for which ub =. (and calculates Pr(20 < x3b + uj < ub) elsewhere). e(a,b) calculates E(xjb + uj I a < xjb + uj < b), the expected value of 9jlxj yj[xj being in the interval (a,b), which is to say, 9jlxj is censored. a and b are specified as they are for pr ().
conditional
on
ystar(a,b) calculates E(y_) where y_ = a if xjb + uj < a, y_ = b if xjb + uj > b, and 9j" = xjb + uj otherwise, which is to say, #_ is truncated, a and b are specified as they are for prO. strip calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covarlate pattern. This is also referred to as the standard error of the fitted value. sgdf calculates the standard error of the forecast. This is the standard error of the point prediction for a single observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdl are always larger than those by strip; see [R] regress Methods and Formulas.
Remarks ivreg performs instrumental variable_ regression (or two-stage least squares), and weighted instrumental variables regression. For a general discussion of two-stage least squares, see Johnston and DiNardo (1997), Kmenta (1997), and Wooldridge (2000). While computationally identical. Davidson and MacKinnon (1993, 209-224) present their discussion using instrumental variables terminology. Some of the earliest work on simultaneous systems can be found in Cowles Commission monographs--Koopmans and Marschak (1950) and Koopmans and Hood (1953)--with the first development of two-stage least squares appearing in Theil (1953) and Basmann (1957). The syntax for ±vreg assumes you want to estimate a single equation from a system of equations, or an equation for which you do not want to specify the functional form of the remaining system, if you want to estimate a full system of equations, either using two-stage least squares equation-by-equation or using three-stage least squares, see [R] reg3. An advantage of ±vreg is lhat you can estimate a single equation of a multiple-equation equations.
system without speci_ing
the functional form of the remaining
i;] ivreg -- Instrumentalvariables and two-stage least squares regression
133
EXample i
Let us assume that you wish to estimate hsngval = so + _lfaminc + _2reg2 + _3reg3 + _4reg4 rent = _0 + _lhsngval+ fi2pcturban + u
'
1
i 5%u have state data from the 1980 Census. housing, and rent is median monttfly gross income (famine) and region of the country flmction of hsngval and the percentage of
} i
+ E
hsng#al is the mextian dollar value of owner-occupied rent. You postulate that hsngval is a function of family (reg2 through reg4). You also postulate that rent is a the p_ulation living in urban areas (pcturban). i
If you are familiar with multi-equation modelS, you have probably already noted the triangular structure of our model. This triangular structure ]is not required. In fact, given the triangular (or i
recursive) structure of the model, if we were to assume that c and u were uncorrelated, either of the equations could be consistently estimated by ordinaD' least squares. This is strictly a characteristic of triangular systems and would not hold if hsngval were assumed to also depend on rent,regardless of assumptions about e and u. For a more detailed discussion of triangular systems see Kmenta (1997. _19-720). "
i
You tell Stata to estimate the rent equation by specifying the structural equation and the additional exogenous variables in a specific form. The depeddenT variable appears first and is followed by the exogenous variables in the structural model for rent, These are followed by a group of variables in parentheses separated by an equal sign. The variables to the left of the equal sign are the endogenous regressors in the structural model for rent and those to the right are the additional exogenous variables that will instrument for the endogenous variables. Only the additional exogenous variables need to be specified to the right of the equal sign: those already in the structural model wilt be automatically included as instruments, • As the following command shows, this is more difficult to describe than to perform. In this example, rent is the endogenous dependent variable, hsngvat is an endogenous regressor, and
i
• ivreg
i
I
rent
pcturban
(hsng_al = famine
:
i
'
]
i
r_g2-reg4)
famine,Instrumental reg2, reg3,variables reg4, and peturban are th_ exogenous variables. (2SLS) regression
R-squared = Number of obs = Adj R-squared =
0.5989 50 0.5818
1249.851)59
Root MSE
221862
18338.7_)17
Repidual Source
24565.7167 SS
47 df
Total
61243.12
49
F( 2, Prob > F
i
h_ngval pc_urban _cons Ins¢runented: Instrunents:
i
522,6741_23 MS
2
rent
1
42.66 0.0000
30677.4033
i .
47) = =
Model
Coef.
Std. Err.
t
P>lt[
.0022398 .081516
.0003388 '. 3081528
_. 61 _. 26
0.000 O. 793
120.7065
15.70688
_.68
O.000
=
[957,Conf. Interval] .0015583 -. 5384074 89.10834
.00292i3 .7014394 152.3047
hsngval i pc_urban fam_nc reg2 rag3 #eg4
i
I,
134
ivreg -- Instrumental variables and two-stage least squares regression
> Example Given the triangular nature of the estimated system, we might wonder if there is sufficient correlation between the disturbances to warrant estimation by instrumental variables. (We might have a similar question, even if the system were fully simultaneous.) Stata's hausman command (see [R] hausma.) will allow us to test whether there is sufficient difference between the coefficients of the instrumental variables regression and standard OLS to indicate that OLS would be inconsistent for our model. To perform the Hausman test, we use hausman to save the ivreg estimates, perform an OLS regression, and compare the two using hausmma. •
hausma11,
save
• regress
renl
hsngval
Source
pcturban SS
Model Residual
df
MS
Number
of obs
50
40983.5269
2
20491.7635
20259.5931
47
431.055172
K-squared
=
0.6692
1249.85959
Adj R-squared Root MSE
= =
0.6551 20.762
Total
61243,12
rent
49
Coef.
Std.
Err.
t
47)
=
F( 2, Prob > F
P>It _
[95_, Conf.
= =
47.54 0.0000
Interval]
hsngval
.0015205
.0002276
6.68
O. 000
.0010627
.0019784
pcturban _cons
.5248216 125.9033
.2490782 14. 18537
2.11 8.88
O. 040 O. 000
.0237408 97. 36603
i.025902 154.4406
hausman,
constant
sigmamore Coefficients Prior
Current
Difference
S.E.
hsngval
.0022398
.0015205
.0007193
pcturban _cons
.081516 120.7065 (b)
.5248216 125.9033 (B)
-.4433056 -5. 196801 (b-B)
I
b = less
efficient
B = fully Test:
Ho:
efficient
difference
estimates
in coefficients
chi2(I)
obtained
estimates not
The Hausman test clearly indicates that
=
OLS
previously
obtained
from
from
ivreg
regress
systematic
= (b-B)" [(V_b-V_B)'(-I)] = 12.08 Prob>chi2
.000207 .1275655 I. 49543 sqrt (diag(Vib-V_B))
(b,B)
O. 0005
is an inconsistent
estimator for this equation.
As opposed to a direct test of hsngval's endogeneity, Davidson and MacKinnon (1993. 236-242) have noted that this Hausman test is best interpreted as evaluating whether O[S is a consistent estimator for the model. The null hypothesis is that the model was generated by an OLS process and the test is performed under the assumption that the instrumental variables estimates are consistent. As an alternative to the Hausman test, Davidson and MacKinnon suggest an augmented regression test that is based on the same asymptotic requirements as the Hausman test. We can easily fornl the augmented regression by including the predicted values of each endogenous right-hand-side (rhs) variable, as a function of all exogenous variables, in a regression of the original model. For our hsngval model, we regress hsngval on all exogenous variables and include the prediction from this regression in an OLS regression of the hsngval equation. • regress hsngval (outputormttcd ) predict (option
faminc
reg2-reg4
hsng_hat xb assumed;
fitted
values)
pcturban
r
ivreg -
Instrumentalvadables and two-stage least squaresregression
.prediCt
hsng_res,
. regress
rent
res
hsng_al
Source
pcturbknhsng_hat SS
df
Number of obs F( 3 46) Prob > F R-squared Adj R-squared Root MSE
MS
: Model Residual
46189.152 15053.968
Total
61243.12
rent
Coef.
hs_val pctttrban hsns_hat _cons
135
.0006509 .0815159 .0015889 120.7065
3 46
15396.384 327.260173
49
1249.85959
Std. Err.
t
P>ttl
.0002947 .2438355 .0003984 12.42856
2.21 0.$3 3.99 9._1
0.032 0.740 0.000 0.000
= = = = = =
50 47.05 0.0000 0.7542 0.7382 18.09
[95_ Cong. Interval] .0000577 -.4092994 .000787 95.68912
.00124_2 .57233113 .0023908 145.72_9
Since we have only a single endogenous rhs variable, our test statistic is just the t statistic for the hsng__hat variable. If there were more than one endogenous rhs variable, we would need to perform a joint test of _illtheir predicted value regressors being zero. For this simple case, the test statement w_ld be • ,test _sng_hat (1) ,
Itsng_hat= 0.0 _( 1, 46) = Prob > F =
15.91 0.0002
While the p-value from the augmented regression test is somewhat lower than the p-value from the Hausman test, both tests clearly show that OLS is nor indicated for the rent equation (under the assumption that the instrumental variables estimator is a consistent estimator for our rent modeD.
_!Example Robust sta_ard • ivreg
rent
errors are availab_ with ±vreg: pcturban
(hsngval
= famine
reg_-reg4),
robust
IV (2SLS) regression with robust standard errors
--_
rent
Coef.
hsngval pcturban _cons
.0022398 .081516 120.7065
Robust Std. Err.
t
P>It I
.0006931 .4585635 15.7348
3.23 O.18 7.67
O.002 O.860 O.000
Number of obs = F( 2, 47) = Prob > F =
50 21._4 O.O(YO0
R-squared Root MSE
O.5989 22.882
= =
[95_ Conf. Interva_l] .0008455 -. 8409949 89.05217
.0036342 i.004027 152.3609
T
InstzRlmented: hsngval In_tra_ments: pcturban famine reg2 reg3 re$4
The robust star_darderror for the coefficiem on housing value is double what was previously estimated.
_
_
13u
wreg -- mstrumemal variables and two-stage least squares regression
Q Technical Note You may perform weighted two-stage instrumental variables qualifier with irreg. You may perform weighted or unweighted variable estimation, suppressing the constant, by specifying the constant is excluded from both the structural equation and the
estimation by specifying the [weight] two-stage least squares or instrumental noconstant option. In this case, the instrument list. Cl
Acknowledgments The robust estimate of variance with instrumental Mead Over, Dean Jolliffe, and Andrew Foster (1996).
variables was first implemented
in Stata by
Saved Results ivreg saves in e() : Scalars e (N) e (ross) e(df_m) e(rss) e(df.._r) Macros
number of observations mode] sum of squares mode] degrees of freedom residual sum of squares, residual degrees of freedom
e(r2) e (r2_a) e(F) e(rmse) e(N_clust)
e(cmd)
ivreg
e(clustvar)
e(version)
version number of ivreg name of dependent variable iv weight type weight expression
e(vcetype)
e(depva.r)
e(model) e(wtype) e (wexp) Matrices e (b)
coefficientvector
e(instd)
e(insts) e(predict)
e (V)
R-squared
adjusted R-squared F statistic root mem_square error number of clusters name of cluster variable covariance estimation method instrumented variable instruments program used to implement predict
variance-covanance matrix of the estimators
Functions e (sample)
marks estimation sample
Methods and Formulas ivreg
is implemented
as an ado-file.
Variables printed in lowercase and not boldfaced (e.g., x) are scalars. Variables printed in lowercase and boldfaced (e.g., x) are column vectors. Variables printed in uppercase and boldfaced (e.g., X) are matrices. Let v be a column vector of weights specified by the user. If no weights are specified, then v -- 1. Let w be a column vector of normalized weights. If no weights are specified or if the user specified fweights or iweights, w= v. Otherwise, w = {v/(I'v)}(ltl).
i
i
1 i
J
!
ivreg -- Instrumentalvariablesand two-stageleast squares regression
137
The number of observations, n, is defined as l'w. In the case of iweights, this is truncated to an integer. The sum of the weights is l'v. Define e = t if there is a constant in the regression and zero otherwise. Define k as the number of right-hand-side (rhs) variables (including the constant). Let X denote the matrix of observations on the ths variables, y the vector of observations on the left-hap,d-side (lhs) variable, and Z the matrix of observations on the instruments. In the following formulas, if the user specifies weights, then X'X, X ! y, y'y, Z'Z, Z'X, . and Z'y are replaced by X'DX; X'Dy, y'Dy, Z'DZ, Z'DX, and Z'Dy, respectively; where D is a diagonal matrix whose diagonal elements are the elements of w. We suppress the D below to simpli_ the notation. Define A as X'Z(Z'Z)-I(x'z)
j
i
' and a as X'Z(Z'Z)-IZ'y.
The coefficient vector b is defined as A-la. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies toall statistics listed below. The total sum of squares, ySS, equals y'y if there is no intercept and y'y - { (l'y)2/n The degrees of freedom are n - c. The error sum of squares, ESS, is defined as y'_ - 2bX'y
k.
i
aren
The mode/sum
+ b'X'Xb.
} otherwise.
The degrees of freedom
of squares, MSS, equals TSS- ESS. The degrees of freedom are k - c.
The mean square error, s2. is defined as ESS/(n 2 k). The root mean square error is s, its square root. If c - 1, then F is defined as F = (b - c)iA(b - c) (k _ 1)s 2 where e is a veclor of k - 1 zeros and h'th element l'y/n. this case, you may use the test
Otherwise, F is defined as missing. (In
command to construct any F test you wish )
]
The R-squared, R 2, is defined as R 2 = 1 - ESS/TSS.
i
The. adjusted R-squared, R2a, is 1 - (1 - R2)(n- c)/(n - k). If robust is not specified, the conventional estimate of variance is s2A -1.
i ]
For a discussion of robust variance estimates in the context of recession and regression with instrumental Variables. see [R] regress. Methods and Formulas. See this same section for a discussion of: the formulas for predict after irreg,
i i i]
References Baltagi,B. H. 1998. Econometrics.New York: Springer-Verlag. Basmann. R. L. t057. A generalizedclassical method of iinear estimationof coefficientsin a structural equation. -
Econometrica25: 77-83. Davidson.R, and J. G. MacKinnon.t993. Estimationand Inferencein Econometrics.New York:OX%rdUniver_iU Press:
Johnston.J. and J. DiNardo.I99Z EconometricMethods.4lh ed New York:McGraw-Hill. Kmenta,J. 1997. Elememsof Economemcs.2d ed. Ann Arbor:Universityof MichiganPress. Koopmans,T. C. and W C, Hood. I953. Studiesin EconometrkMethod,New York:John Witey& Son_. Koopmans.T. C and J. Marschak.I950. StatisticalInferencein DynamicEconomic Models. New York:John Wiley & SOns.
I
138
ivreg -- Instrumental variables and two-stage least squares regression
Over, M.. D. Jolliffe. and A. Foster. 1996. sg46: Huber correction for two-stage least squares estimates. Stata Technical Bulletin 29: 24-25. Reprinted in Sta_aTechnical Bulletin Reprints, vo]. 5. pp. 140-142. Theil, H. 1953. Repeated Least Squares Applied _oCompleteEquation Systems. Mimeographfrom the Central Planning Bureau, Hague. White, H. 1980. A heteroskedasticity-consistentcovariance matrix estimator and a direct test for heteroskedasticity. Econome_ca 48: 817-.838. Wooldridge, J. M. 2000. Introductory Econome;_cs: A Modern Approach. Cincinnati. OH: South-Western College Publishing.
Also See Complementary:
[R] adjust, [R] lincom, JR] linktest, [R] mfx, [R] predict. [R] testnl, [R] vce, [R] xi
Related:
[R] anova, [R] areg, JR] ensreg,
[R] mvreg, [R] qreg, [R] reg3, JR] regress,
[R] rreg, [R] sureg, [R] svy estimators, [P] _robust Background:
[U] [U] [U] [U]
[R] test,
[R] xtreg, [R] xtregar,
16.5 Accessing coefficients and standard errors. 23 Estimation and post-estimation commands, 23.11 Obtaining robust variance estimates, 23.13 Weighted estimation
1
!
Title
i jknife_-- J_kknife ]
i i
i 'i
,i
estimation
'
i
I
i
i ,I
_
exp_list [if exp] [in ra,ge]
[,
i
N I
,, f
i
Syntax jtmife
"cmd"
[r_class
1 e_class
t n(exp) ]
!
level(#) keep ] expJist
Contains
] i i
newvarnarne = (exp) (exp) eexp
] i i
eexp is specname [eqno]specname specnarne is _b
i I
_b[]
1
_se _se []
eqno is ## /lan'td
Distinguish between [ ], which are to be typed, and [], which indicate optional arguments.
Iscription jknife
performs jack&nile estimation.
cmd defines the statistical command to be executed, cmd must be bound in double quotes. Compound double quotes ("" and "") are needed if the command itself contains double quotes exp_[ist specifies the statistics to be retrieved after the execution of cmd. on which jackknife statistics will lie calculated.
Qptions rclass, eclaSS, and n(exp) specify where crnd saves the number of observations on which it based the calculated results. You are strongly advised tO specify one of these options.
i i
rclass
specifies that cmd saves the number of dbservations in r(N).
ectass
specifies that cmd saves the number of observations in e(N).
n(exp) allows you to specify an}, other expression that evaluates to the number of obser_'ations used. Specifying n(r(N)) is equivalent to spedifying option rclass. Speci_'ing n(e(N)) is equi'falent to specifying option eclass. If cmd saved the number of observations in r(Nl), specify n(_(Ni) ). Y
139
140
Jknife-- Jackknifeestimation
If you specify none of these options, jknife assumes that all observations in the dataset contribute to the calculated result. If that assumption is incorrect, the reported standard errors wilt be incorrect. For instance, say you specify • jknife "regress y xl x2 x3" coef=_b[x2]
and pretend that observation 42 in the dataset has x3 equal to missing. The 42nd observation plays no role in obtaining the estimates, but jknife has no way of knowing that and will use the wrong N. If, on the other hand, you specify jknife "regress y xl x2 x3" coef=_b[x2], e
will correctly notice that observation 42 plays no role. Option e is specified because regress is an estimation command and saves the number of observations used in e (N). When jknife runs the regression omitting the 42nd observation, jknife will observe that e(N) has the same value as when jknife previously ran the regression using all the observations. Thus, jknife will know that regress did not use the observation. jknife
In this example, it does not matter whether you specify option eclass eclass is easier.
or n (e (N)). but specifying
level(#) specifies the confidence level, in percent, for the confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. keep specifies that new variables are to be added to the dataset containing the pseudo-values of the requested statistics. For instance, if you typed . jknife "regress y xl x2 x3" coef=_b[x2], e keep
new variable cool would be added to the dataset containing the pseudo-values for _b [x2]. Let b be defined as the value of _b [x2] when all observations are used to estimate the model, and let b(j) be the value when the jth observation is omitted. The pseudo-values are defined as pseudovaluej = N • {b- b(j)} + b(j) where N is the number of observations used to produce b.
Remarks While the jackknife--developed in the late 1940s and earl}, 1950s--is of largely historical interest today, it is still useful in searching for overly influential observations. This feature is often forgotten. In any case, the jackknife is 1. an alternative, first-order unbiased estimator for a statistic; 2. a data-dependent way to calculate the standard error of the statistic and to obtain significance levels and confidence intervals; and 3. a way of producing measures called pseudo-values for each observation, reflecting the observation's influence on the overall statistic. The idea behind the simplest form of the jackknife the one implemented here is to calculate the statistic in question N times, each time omitting just one of the dataset's observations. Write S for the statistic calculated on the overall sample and S(j) for the statistic calculated when the jth observation is removed. If the statistic in question were the mean, then S=
(N - 1)5'(3) + sj N
i
_
jknife-- J_ffe where
estinl_.i_
141
is the value of the data in the jth observation. Solving for sj, we obtain
sj
sj = NS - (N-
1)S(j)
These are the pseudo-values the jackknife calculates, even though the statistic in question is not the mean. The jac_nife estimate is _, the average of the sj's, and its estimate of the standard error of the statistic is the corresponding standard error of the mean (Tukey 1958). The jackknife estimate of variance has been largely replaced by the bootstrap (see [R] bstrap), which is widely viewed as more efficient and robust. The use of iackknife pseudo-values to detect outliers is too o[ten forgotten and is something the bootstrap is unable to provide. See M0stetler and ,Tukey (1977, 133-t63) and Mooney and Duval (1993., 22-27) for more information.
I
JaCkknifedStandard deviation Example Moste!ler and Tukey (1977, 139-140) request a 95% confidence interval for the standard deviation of the eleven v_ilues: 0.t,
0.1,
0,1,
0.4,
0.5,
t.0,
1.1,
1.3,
1.9,
1.9,
4.7
Stata's summarize command calculates the mean and standard deviation and saves them as r (mean) and r (sd), To obtain the jackknifed standard deviation of the eleven values and to save the pseudovalues asia new variable sd, type • i input
x X
1.0.1 2. 0.1 3. 0.1 4. 0.4 5.0.5 6. i.O 7.!.1 8. t.3 9. 1.9
lo. 1.9 11. 12.
4.7 end
j:knife"summarize x" sd=r(sd), r keep command: statistic: n():
summarize x sd=r(sd) r(N)
Variable
Obs
jknife sd
overall
Statistic
i.489364 II
Std. Err.
[95Y,Conf. Interval]
.6244049
.0981028
2.880625
I.343469
Interpreting the, oulpu[, the standard deviation repoded by snmmaxize mpg is 1.34. The jackknife estimate is 1.49 with standard error 0.62. The 95% confidence interval for the standard deviation is .10 to 2.88. By spectfyii_g the keep option, jknife pseudo-valu_es.
creates a new variable in our dataset, sd. for the
• list -' ,
142
jknife-- Jackknife estimation x sd 1. 2. 3. 4. 5. 6. 7. 8. 9, 10. 11.
•I •1 •1 •4 •5 1 i.1 I.3 1.9 1.9 4.7
1.1399778 1.1399778 1,1399778 . 88931508 .8242672 • 63248882 •62031914 •62188884 .8354195 . 8354195 7.7039493
The jackknife estimate is the average of sd,so sd contains the individual "values" of our statistic. We can see that the last observation is substantially larger than the others, The last observation is certainly an outlier, but whether that reflects the considerable information it contains or indicates that it should be excluded from analysis is a decision that must be based on the context of the problem. In this case, Mosteller and Tukey created the dataset by sampling from an exponential distribution. so the observation is quite informative. ,q
> Example Let us repeat the above example using the automobile dataset, obtaining the standard error of the standard deviation of mpg. • use auto, clear (1978 Automobile Data) jknife "summarize mpg" sd=r(sd), r keep command: statistic: n() :
smmmarize mpg sd=r(sd) r(N)
Variable
Obs
Statistic
74
5.785503
Std. Err.
[95_ Conf. Interval]
sd overall jknife
5.817373
,607251
4.607124
Looking at sd more carefully, summarize sd, detail r(sd) pseudovalues
i_ 57 107 25_ 50Z 75_ 90_ 95_ 997
Percentiles 2.870485 2.870485 2.906249 3•328494
Smallest 2.870485 2.870485 2.870485 2.870485
3.948327 6.844408 9.597005 17.34316 38.60905
Obs Sum of Wgt. Mean
Largest 17.34316 19.76169 19.76169 38.60905
Std. Dev. Variance Skewness Kurtosis
74 74 5.817373 5.22377 27.28778 4.07202 23.37822
7.027623
_
jknife-- Jackknifeest_ • ]/istimake mpg
sd if sd > 30
,.pg
_ake 71•
_
143
Diesel
41
sd 38.60@052
Inthi s case,the_v Dieselistheonlydiesel carinourdataset.
q
!Collectingmultiple statistics i>Example : jknife is not limited to collecting just one ;tatistic. For instance, you can use s-T.marize, detail and then obtain the jackknife estimate of _e standard deviation and skewness, m_rnmarize, detail saves the standard deviation in r(sd) and the skewness in r(skewness), so you might type i
• _se (I_78
auto, clear _utomobile
• jkni_e
"summarize
comm_: statistic n():
Data) mpg, detail"
summarize :
sd=r(sd)skew=r(skewness),
r
mpg, detail
sd=r (sd) skew=r (skewness) r(N)
Variable
Obs
Statistic
74
5.78550_
Std. Err.
[95_, Conf.
Interval]
sd overall jknif
e
5. 817379
•607251
4. 607124
7• 027623
.3525056
1. 694686
skew 74
overall
.948717_ 1. 023596
jknife
.3367242
q
!Collectingcoefficients and standard errors Example
, jkni_eCan also collect coefficients and standard errors from estimation commands. For instance, using auto. klta we wish to obtain the jackknife e,_timate of the coefficients and standard errors from a regression in which we model the mileage of a _ar by its weight and trunk space. To do this. we refer io the Coefficients and the standard errors as _b [weight], _b [trunk], _se [weight], and _se [ttrumk] in the exp_list, or we could simplify them by using the extended expressions _b and
i 1
_SO. • Use _uto, clear (1978 iAutomobile Data) • _kniife "reg mpg weight
trunk"
co_iman61:
reg mpg weight
statistic:
b_weight=_b
_b _se, e
trunk
[weight]
se_weight=_se [weight] b_trunk=_b [trunk] b cons=
n() :
b[_cons]
se
Zrtmk=_se
se
cons=_se
_(_)
[trunk] [_cons]
] i
144
jknife -- Jackknife estimation Variable
1
Obs
Statistic
74
-.0056527
Std. Err.
[95X Conf. Interval]
b_weight overall jknife
-.0056325
.O010216
-.0076684
-.0035965
se_weight overall
74
.0007016
jkaife
.0003496
.000111
.0001284
.0005708
b_trunk overall
74
-.096229 -.0973012
jknife
.1486236
-.3935075
.1989052
b_cons overall
74
39.68913
jknife
39.65612
1.873323
35.92259
43.38965
.0218525
.0196476
.1067514
.2690423
.2907921
1.363193
se_trunk overall
74
.1274771
jknife
.0631995
se_cons overall
1.65207
74
jknife
.8269927
q
Saved Results jknife saves in r(): of observations
used in calculating
#
r(N#)
number
r(stat#)
value of statistic # using all observations
statistic
f
r(me_n#) r(se#)
jackknife estimate of statistic # (mean of pseudo-values) standard error of the mean of statistic #
|
Methods and Formulas jknife is implemented
as an ado-file.
References Gould, W. 1995. sg34: Jackknife estimation. Reprints, vol. 4, pp. 165-170.
Stata Technical
Bulletin
24: 25-29.
Mooney, C. Z. and R. D. Duval. Park, CA: Sage Publications.
1993. Bootstrapping:
A Nonparametric
Mosteller, E Company.
1977. Data Amdysis
and Regression.
and J. W. Tukey.
Tukey, J. W. 1958. Bias and confidence 614.
in not-quite
large samples.
Reprinted
Approach Reading.
Abstract
to Statistical
Related:
[R] bstrap, [R] statsby
Background:
[U] 16.5 Accessing coefficients and standard errors, [u] 23 Estimation and post-estimation commands
Inference.
MA: Addison-Wesley
in Annals
Also See
in Stata Technical
of Mat,_ematical
Bulletin
Newbury Publishing
Statistics
29:
rT!tle jointly --- l_orm all p_rwise combinations within groups
Syntax joinby
[varIist] using
nqlabe_. ....... , update
filena,ne
replace
[, _atehed({none
_merge(varname)
] b_oth [ re_aster
using
})
]
DescripUon j oinby joiqs, within groups formed by varlist, observations of the dataset in memory withfiiename, a Stata-format dataset. By join we mean "form all parwise combinations", filename is required to be sorted by varti_t. Iffilename is specified without an extension, '. dta' is assumed. If rarlist isnot specified, joinby memory antt in filename.
takes as varligt the set of variables common to the dataset in
Observations unique to one or the other dataset are ignored unless unmatched () specifies differently. Whether you load one dataset and join the other or vice versa makes no difference in terms of the number of resalting observations. If there ar_ common variables between the two datasets, however, the combined dataset will contain the va]ues from the master data for those observations. This behavior can be modified with the update and replace options.
Options
z
unmatched({llone I both !master I using }) specifies whether observations unique to one of the datasets are to be kept, with the variables from the other dataset set to missing. Valid values are none both m_.stier using
all unmatched observations are ignored (default) unmatched observations from the master and using data are included unmatched obse_'ations from ihe master data are included unmatched observations from the using data are included
}
no!abe! Nevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset tephce label definitions already in memory. update i
varies the action that joinby
data_et is tield inviolate--values
takes when an observation is matched. By default, the master from the master data are retained when the same variables are
found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing. i '-
replace,
fillowed with update only, specifies that even when the master dataset contains nonmissin_ values, the_' are to be replaced with corresponding values from the using dataset when the corresp0ndjng values are not equal. A nonmissi_g value, however, will never be replaced with a
missing value. -merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _.merge (_merge) . To preserve compatibility with earlier versions of joir_b_, ___erge is only generated if unmatched is specified. 145
l
Remarks
146 joinby -- Form all pairwise combinations within groups The following, admittedly artificial, example illustrates j oinby.
> Example You have two datasets: child, dta and parent .dta, identifies the people who belong to the s_ne family. .
use
Both contain a family_id
child on
(Data
Children)
describe Contains
data
from
child.dta
obs:
5
Data
vats:
4
13
size:
50
(99.9_
storage variable
name
type
of memory
on Children
Jul
display
value
format
label
variable
family_id
int
_8.0g
Family
child_id
byte
_8.0g
Child
xl
byte
_8.0g
x2
int
_8.0g
by:
Sorted
2000
15:29
free)
label Id Number
Id Number
family_id
list family~d 1025
I.
child_id 3
xl II
x2 320
2.
1025
1
12
300
3.
1025
4
10
275
4.
1026
2
13
280
5.
1027
5
15
210
. use (Data
parent, clear on Parents)
describe Contains obs:
data
from
parent.dta 6
Data
vats:
4
13 Jul
size:
108 storage
(99.9_
of memory
on Parents 2000
15:31
free)
display
value
type
format
label
family_id
int
_8.0g
Family
Id Number
parent_id
float
_9.0g
Parent
Id Number
xl
float
_9.0g
x3
float
_9.0g
variable
Sorted
name
variable
by:
list
1.
fsmily-d 1030
2. 3.
1025 1025
4. 5. 6.
parent_id I0
xl 39
x3 600
11 12
20 27
643 721
1026
13
30
Z60
1026
14
26
668
1030
15
32
684
label
variable which
joinby-- Form ail pairwisecombinationswithin groups
147
You want tO "join" the information for the parents and their children. The data on parents are in memory;the data on children are on disk. dhild.dta has been sorted by family_id, but parerit._ti has not, so first we sort the parent _data on famity_id: • Sort
i family_id
• joinby
family_id
using
child
• describe Co,tails
data
o.bs:I
8
vats:
6 168
Data (99.4_, of memory
free)
i
i
storage
on Parents
,
display
value
type
format
label
family¢id
int
Y,8.0g
Family
Id Number
paz_nt_id
float
Y,9.0g
Parent
Id Number
Xl
float
%9.0g
x3
float
Y.9.0g
child__d
byte
XS.0g
x2
int
Y,8.0g
variable
name
Sorted iby : Npte:
dataset
has changed
variable
Child
since
last
label
Id Number
saved
l_st 1.
family-d 1025
parent_id 12
xl 27
x3 721
child_id 3
2;.
1025
11
20
643
3
320
3. 4.
1025 1025
12 11
27 20
721 643
1 1
300 300
5,
1025
li
20
643
4
275
6. 7.
1025 1026
12 13
27 30
721 760
4 2
275 280
8.
1026
14
26
668
2
280
x2 320
Notice that
I , I
1. fami_y__d of I027, which appears only in child.dta, and family_id only in Narent. dta, are not in the combined dataset. Observations variable(s) are not in both datasets are omitted.
of 1030, which appears for which the matching
2. The x_ v_riable is in both datasets. Values for this variable in the joined dataset are the values from par_nt.dta--the dataset in memory when we issued the joinby command. If we had cMld.d_a in memory and parent.dta on di_k when we requested joinby, the values for xl wouldiha_'e been from child.dta. Values from the dataset in memory take precedence over the datasel o_i disk. q
Methods joinby
Formulas ii implemented as an ado-file.
t
148
joinby-- Formall pairwisecombinationswithin groups
Acknowledgment joinbywas written by Jeroen Weesie, Department of Sociology, Utrecht University, Netherlands.
Also See Complementary:
[R] save
Related:
JR] append, [R] cross, JR] fillin, JR] merge
Background:
[U] 25 Commands for combining data
1
!
-¥itle kappa,
interrater agreement 4
1
i Syntax :
}
kap va_ai_el varname2 varnarne3 [...] [weigh3] [if exp] [in range] kappa i,ariist [if exp] [in range] fweights
i
a_e aliowed; see [U] 14,1.6 weight.
DescriptiOn kap (first s_,ntax)calculates the kappa-statistic measure of interrater agreement when there are two unique raters and two or more ratings. /
kapwgt: defines weights for use by kap in measuring the importance of disagreements. kap (secoqd syntax) and kappa calculate the kappa-statistic measure in the case of two or more (nonuniqu¢) r_atersand two outcomes, more than two outcomes when the number of raters is fixed, and more thah two outcomes when the number of raters varies, kap (second syntax) and kappa produce the same results: they merely differ in how they expect the data to be organized. kap assurrie's that each observation is a subject, varnamel contains the ratings by the first rater, varname2 'by ihe second rater, and so on. kappa also_assumesthat each obse_'ation is a subject. The variables,however, record the frequencies with which r@ngs were assigned. The first variable records the number of times the first rating was assigned, the gecond variable records the number of times the second rating was assigned, and so on.
Options tab displays a tabulation of the assessmentsby the two raters. wgt(wgtid) _pecifies that wgtid is to be used to weight disagreements. User-defined weights can be created using kapwgt: in that case, wgt() specifies the name of the user-defined matrix. For instance, you might define . kapwg_ mine i \, .8 1 \ 0 .8 I \ 0 0 .8 I'
and them . k_p
rgta
ratb,
wgt(mine)
14g
i i
150
kappa -- lnterrater agreement
In addition, two prerecorded
weights are available.
wgt(w) specifies weights 1 - [i -jl/(k - 1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings. wgt(w2)
specifies weights 1-{(i-
j)/(k-
1)} 2.
absolute is relevant only if wgt () is also specified; see wgt () above. Option absolute modifies how i, j, and k in the formulas below are defined and how corresponding entries are found in a user-defined weighting matrix. When absolute is not specified, i and j refer to the row and column index, not the ratings themselves. Say the ratings are recorded as {0, 1, 1.5, 2}. There are 4 ratings; k = 4 and i and j are still 1, 2, 3, and 4 in the formulas below. Index 3, for instance. corresponds to rating = 1.5. This is convenient but can, with some data, lead to difficulties. When absolute is specified, all ratings must be integers and they must be coded from the set {1,2, 3,...}. Not all values need be used; integer values that do not occur are simply assumed to be unobserved.
Remarks The kappa-statistic measure of agreement is scaled to be 0 when the amount of agreement is what would be expected to be observed by chance and 1 when there is perfect agreement. For intermediate values, Landis and Koch (1977a, 165) suggest the following interpretations: below 0.0 0.00-0.20 0.21-0.40 0.41-0.60 0.61- 0.80 0.81- 1.00
Poor Slight Fair Moderate Substantial Almost Perfect
The case of 2 raters > Example Consider the classification by two radMogists of 85 xeromammograms as normal, benign disease. suspicion of cancer, or cancer (a subset of the data from Boyd et al. 1982 and discussed in the context of kappa in Altman 1991, 403-405). . tabulate Radiologist A's assessment
rada
radb
Radiologist Normal
B's
benign
suspect
cancer
Total
12
0
0
33
benign
4
17
1
0
22
suspect cancer
3 0
9 0
15 0
2 1
29 1
38
16
3
85
Normal
Total
21
assessment
28
Our dataset contains two variables: Each observation is a patient.
rada,
radiologist A's assessment: radb.
We can obtain the kappa measure of interrater agreement
by typing
radiologisl B's assessment.
kap
-- lnterrat_ agreement
151
• kap rada radb ; Agreement i
Expected Agreement
Kappa
Std. Err,
30. 2z 0.472a 0.0694 !
Prob>Z
6.81
0.0000
Had each radiologist made his determination randomly (but with probabilities equal to the overall proportions), _we would expect the two radiologist_ to agree on 30.8% of the patients. In fact, they agreed on 6}.5% of the patients, or 47.3% of the way between random agreement and perfect a_reemenL _he amount of agreement indicates that we can reject the hypothesis that the?, are making their detetrni lations randomty.
Example l
Z
,I
i
There is a difference between two radiologists disagreeing whether a xeromammogram indicates cancer or th_ ,_uspicion of cancer and disagreeing whether it indicates cancer or is normal. The weighted kappa attempts to deal with this. kap provides two "prerecorded" weights, w and w2: . k_p _ada radb, wgt(_;) Ratlng_ weighted by: 1.0¢00 O, 666? 0.61_67 1.0000 O. 3_33 O. 6667 O. od,oo o. 3333
O.3333 0.6667 1.0000 o. 6667
i Expected /{gr¢_em_nt Agreement
l i i I
i !
O. 0000 0.3333 O. 6667 1. 0000
Kappa
Std. Err.
Z
Prob>Z
The w vJe_ghts are given by 1 - li - jt/(k - 1) where i and j index the rows of columns of the ratings by th_ two raters and k is the maxinmm _umber of possible ratings. The weighting matrix
i i
ratings normal, benign, suspicious, and cancerous. i the table. In our "case, the rows and columns of the 4 × 4 matrix correspond to the is prin_ed above A weight Of 1 indicates an observation should count as perfect agreement. The matrix has l s down the dia_ofials!--when both radiologists make the s_me assessment, they are in agreement. A weight of, say_0J66_7 means they are in two-thirds agreement. In our matrix they get that score if they are one aparl -+one radiologist assesses cancer and the other is merely suspicious, or one is suspicious and the o_herisays bemgn, and so on, An emry of 0.3333 means they are in one-third agreement or, } if you prefer,!two-thirds disagreement. That is the gcore attached when they are "two apart". Finally, they are in c_mplete disaueement when the weighi is zero, which happens only when the3, are three apart--one says cancer and the other says normal.: <1
i!
Example The other prerecorded weight is w2 where the weights are given by 1 - {(i• kap ttating_
i
_ada radb,
wgt(w2)
weighted by:
l.oqoo
0.8889
0.5556
0.0000
.s 89 oooo o.88s9 o.sss6 55_56 (),.oclooo. O. 8889 1. 0000 O.8889 s56 o.8889 oooo
j)/(_-
1)}2:
152
kappa -- lnterrater agreement Expected Agreement
Agreement 94.77X
The w2weight
Kappa
84.09X
0.6714
makes the categories
Std.
Err.
0.1079
even morealike
Z 6.22
Prob>Z 0.0000
and is probably inappropriate
here.
> Example In addition to prerecorded weights, you can define your own weights with the kapwgt command. For instance, you might feel that suspicious and cancerous are reasonably similar, benign and normal reasonably similar, but the suspicious/cancerous group is nothing like the benign/normal group: • kapwgt xm i \ .8 1 \ 0 0 1 \ 0 0 .8 1 • kapw_ xm 1.0000 0.8000 1.0000 O. 0000 O.0000 i.0000 O. 0000 O.0000 O.8000 1.0000
You name the weights--we named ours xm and after the weight name, you enter the lower triangle of the weighting matrix, using \ to separate rows. In our example we have four outcomes and so continued entering numbers until we had defined the fourth row of the weighting matrix. If you type kapwgt followed by a name and nothing else, it shows you the weights recorded under that name. Satisfied that we have entered them correctly, we now use the weights to recalculate kappa: . kap rada radb, wgt (xm) Ratings weighted by: I.0000 O.8000 O.8000 I.0000 O.0000 O.0000 O.0000 O.0000
Agreement 80.47)',
O.0000 O.0000 I.0000 O.8000
O.0000 0.0000 O. 8000 1.0000
Expected Agreement
Kappa
52.677.
O. 5874
Std, Err, O.0865
Z
Prob>Z
6.79
O.0000
4
[3 Technical Note In addition to weights for weighting the differences in categories, you can specify Stata's traditional weights for weighting the data. In the examples above, we have 85 observations in our dataset one for each patient. If all we knew was the table of outcomes that there were 21 patients rated normal by both radiologists, etc. it would be easier to enter the table into Stata and work from it. The easiest way to enter the data is with tabi; see [R] tabulate.
) ) ) ( ,
kappa-- lnterrateragreement
!
153
. tabi 21 12 0 0 \ 4 17 I 0 \ 3 9 15 2 \ 0 0 0 I, replace col row
1
2
3
4
Total
1 2
21 4
12 17
0 1
0 0
33 22
3 4
3 0
9 0
15 0
2 1
:29 1
3
85
,(
T_tal
28
Pearson cM2(9)
38 =
16
77.8111
Pr _ 0.000
tabi
felt obligated to tell us the Pearson X2 for this table, but we do not care about it. The important thing is tlaat,with the replace option, tabi left the table in memory: • list )in I/5 row 1 1
col 1 2
pop 21 12
3_
I
3
0
4_ 5,
1 2
4 1
0 4
1, 2.
The variable row is radiologist A's assessment: so assesse_ _ both. Thus, •:kap _ow col [freq=pop] ; Expected Agreement Agreement Kappa : j
Std. Err.
!
63.53'/,
30.827,
col,. radiologist B's assessment: and pop the number
O.4728
O.0694
Z
Prob>Z
6.81
O.0000
If we are going to keep these data, the names row and col are not indicative of what the data reflects. We could (seb [U] 15.6 Dam.set, variable, a,d value labels) • rename row rada • rename col radb . label var rada "Radiologist A's assessment" label var radb "Radiologist B's assessment" . label define assess I normal 2 benign 3 suspect 4 cancer l&be] values rada assess label values radb assess l&be] data "Altman p. 403"
kap's
tab
option, which can be used with or withont weighted data. shows the table of assessments: i
• kap _ada radb
[freq=pop],
Radiolqgist iA's assessment
Radiologist B's assessment normal benign suspect cancer
!
tab
Total
21
n
0
o
33
bez_ign
4
17
1
0
22
Suspect
3
9
15
2
29
cancer ) T_tal
0
0
0
I
1
28
38
18
3
85
_o_mal
)
]:_
Kappa -- mmrramr agreement Expected Agreement
Agreement 63.53_
Kappa
30.82_
Std.
0.4728
Err.
Z
0.0694
Prob>Z
6.81
0.0000
0
Q Technical Note You have data on individual patients. There are two raters and the possible
ratings are I, 2, 3,
and 4, but neither rater ever used rating 3: . tabulate ratera raterb raterb •
ratera
I
2
4
Total
1 2 4
6 5 1
4 3 1
3 3 26
13 11 28
12
8
32
52
Total
In this case, kap would determine the ratings are from the set {1,2, 4} because those were the only values observed, kap would expect a use_defined weighting matrix to be 3 x 3 and, were it not, kap would issue an error message. In the formula-based weights, the calculation would be based on i,j -- I, 2, 3 corresponding to the three observed ratings {1,2, 4}. Specifying the absolute option would make it clear that the ratings are 1, 2, 3, and 4; it just so happens that rating = 3 was never assigned. Were a use_defined weighting matrix also specified, kap would expect it to be 4 × 4 or larger (larger because one can think of the ratings being 1, 2, 3, 4, 5, ... and it just so happens that ratings 5, 6, ... were never observed just as rating -- 3 was not observed). In the formula-based weights, the calculation would be based on i,j -- I, 2, 4. • kap ratera raterb, wgt(w) Ratings weighted by: 1.0000 0.5000 0.0000 0,5000 1.0000 0.5000 0.0000 0.5000 1.0000
Agreement 79.81_
Expected Agreement 57.17Z
Kappa
Z
Prob>Z
4.52
0.0000
Z
Prob>Z
Std. Err.
0.5285
0.1169
. kap ratera raterb, wgt(w) absolute Ratings weighted by: 1.0000 0.6667 0.0000 0.6667 1.0000 0.3333 0.0000 0,3333 1.0000
Agreement
Expected Agreement
81.41Z
55.08X
Kappa 0.5862
Std. Err. 0.1209
4.85
0.0000
If all conceivable ratings are observed in the data, then whether absolute is specified makes no difference. For instance, if rater A assigns ratings { 1,2, 4} and rater B assigns {1,2, 3, 4}, then the complete set of assigned ratings is {1,2, 3, 4}, the same as absolute would specify. Without absolut e, it makes no difference whether the ratings are coded { 1,2, 3, 4}, {0.1.2, 3 }, {1,7, 9, 100}, {0, 1, t.5, 2.0}, or otherwise.
O
kappa-- lnterrateragreement
The case
155:,
more than two raters
In the c,se of more than two raters, the matha aatics are such that the two raters are not considered unique.!Fol " instance, if there are three raters, there is no assumption that the three raters who rate I
the are the the three ratersraters that rate thanflrst_suSject two r_iters case, it same can beasused with two whenthe thesecond. raters' Although identities we vary.call this the more The 'norlunique rater case can be usefully broken down into three subcases: (a) there are two possible raiings which we will call positive and negative; (b) there are more than two possible ratings but _thenumber of raters per subject is the same for all subjects; and (c) there are more than two possiblle ratings and the number of raters per subject varies, kappa handles all these cases. To emphasize that there is no assumption of constant identity of raters across subjects, the variables specified contain counts of the number of raters rating the subject into a particular category.
!
{ i
_ Example (Two; ratings.) Fleiss (1981, 227) offers the following hypothetical ratings by different sets of raters on 25}subjects:
Subject 1 2 3 4 5 6 7 8 9 t0 11 i
l
NO.of No. of raters pos. ratings 2 2 2 0 3 2 4 3 3 3 4 1 3 0 5 0 2 0 4 4 5 5
12 13
34
34
No. of No. of Subject raters pos. ratings 14 4 3 15 2 0 16 2 ' 2 17 3 1 18 2 t 19 4 t 20 5 4 21 3 2 22 4 0 23 3 0 24 3 3 25
2
2
We have entered these data into Stata and the variables are called subject, raters, and pos. kappa, however, re@ires that we specify variables containing the number of positive ratings and negative ratings; that i_s,pos and raters-pos: gen
_eg
kapp_
= raters-pos
pos neg
TWo4ou_comes, Kappa 0.5415
multiple
raters: Z 5.28
Prob>Z 0.0000
We wouldlha_e obtained the same results if we had typed kappa neg pos.
Example (More thanitwo ratings, constant number of raters,) Each of ten subjects is rated into one of three categories by five raters (Fleiss 1981, 230): li_t
I i
156
kappa-- Interrateragreement subject 1. 2. S. 4. 5. 6, 7. 8. 9. 10.
cat1 1 2 3 4 5 6 7 8 9 10
cat2 1 2 0 4 S 1 5 0 1 3
cats 4 0 0 0 0 4 0 4 0 0
0 S 5 1 2 0 0 1 4 2
We obtain the kappa statistic: • kappa earl-cat3 Outcome
Kappa
Z
Prob>Z
catI cat2 cat3
O.2917 0.6711 0.3490
2.92 6.71 3.49
O. 0018 0.0000 O. 0002
combined
0.4179
5.83
O.0000
The first part of the output shows the results of calculating kappa for each of the categories separately against an amalgam of the remaining categories. For instance, the cat I line is the two-rating kappa where positive is carl and negative is cat2 or catS. The test statistic, however, is calculated differently (see Methods and Formulas). The combined kappa is the appropriately weighted average of the individual kappas. Note that there is considerably less agreement about the rating of subjects into the first category than there is for the second. q
> Example Now suppose that we have the same data as in the previous example, but that it is organized differently: • list 1. 2. 3. 4. 5. 6. 7. 8. 9. i0.
subject 1 2 3 4 5 6 7 8 9 i0
raterl 1 1 3 I 1 1 1 2 1 1
In this case, you would kap
use
kap
rater2 2 I 3 1 1 2 I 2 3 1
rater3 2 3 3 1 1 2 1 2 3 1
rather than
kappa:
raterl rater2 raterS rater4 rater5
There are 5 raters per subject: Outcome
Kappa
Z
Prob>Z
1 2 3
0.2917 0.6711 O. 3490
2.92 6.71 3.49
0.0018 0.0000 O. 0002
combined
O. 4179
5.83
O. 0000
rater4 2 3 3 1 3 2 1 2 3 3
rater5 2 3 3 3 3 2 1 3 3 3
_
,
kappa -- Interrater agreement
157
Note that thfe information of which rater is which is not exploited when there are more than two raters.
q
_, Example (More' tha_ two ratings, vmo,ing number of raters!) In this unfortunate case, kappa can be calculated, but there is _o test statistic for testing against _ > 0. You do nothing differently--kappa calculates the total nun{bet of raters for each subject and, if it is not a constant, it suppresses the calculation of test statisttics[ .
1,ist
1,
subject 1
cat 1 1
cat 2 3
2.
2
2
0
3
3.
3
0
0
5
4.
4
4
0
1
5.
5
3
0
2
6.
6
1
4
0
7.
7
5
0
0
8_
8
0
4
1
9;
9
1
0
2
10.
10
3
0
2
• k_pp_
0
catl-cat3 Outcome
Kappa
cat i
O. 2685
cat2
O. 64,57
cat3
O. 2938
combined note:
cat3
Z
Prob>Z
O. 3816
}Number of ratings
per
subject
vary;: cannot
calculate
test
Istatistics,
q
Example This case _s similar to the previous example, but the data are organized differently: • list
I.
_ubject i
raterl I
rater2 2
rater3 2
2.
2
3.
3
4.
1
1
3
3
3
3
3
3
3
3
4
1
1
t
1
3
5-
5
1
1
1
3
3
6. 7.
6 7
1 1
2 1
2 1
2 1
2 1
8.
8
2
2
2
2
3
9.
9
1
3
10.
10
1
t
1
3
In this
cas_,
|
we
specify
kap,
instead
of
kappa:
rater4
rater5 2
3 3
i i
158
kappa -- Interrater agreement • kap raterl-rater5 There are between 3 and 5 (median = 5.00) raters per subject: 0utcome
Kappa
1 2 3
0.2685 0.6457 0.2938
Prob>Z
0.3816
combined note:
Z
Number of ratings per subject vary; cannot calculate test statistics.
Saved Results kap and kappa save in r(): Scalars r(N)
number
of subjects
(kap only)
r(prop_o) observed proportion of agreement (kap only) r(prop_e)expected proportion of agreement (kap only)
r(kappa)
kappa
r(z) r(se)
z statistic standard error for kappa statistic
Methods and Formulas kap, kapwgt,
and kappa
are implemented
as ado-files.
The kappa statistic was first proposed by Cohen (1960). The generalization for weights reflecting the relative seriousness of each possible disagreement is due to Cohen (1968). The analysis-of-variance approach for k = 2 and ra _> 2 is due to Landis and Koch (1977b). See Altman (1991, 403-409) or Dunn (2000. chapter 2) for an introductory treatment and Fleiss (198t, 212--236) for a more detailed treatment. All formulas below are as presented in Fleiss (1981). Let rn be the number of raters and let k be the number of rating outcomes.
kap: m = 2 Define wij (i = 1.... , k, j = 1,..., k) as the weights for agreement and disagreement (wgt ()) or, if not weighted, define wiz = 1 and wij = 0 for i ¢ j. If wgt (_r) is specified, u'ij -- 1-l i-jt/(k1). If wgt (_r2) is specified, wij -- 1 - {(i-j)/(k The observed proportion of agreement
1)} 2.
is k
k
Po = _ _ wijPij i=1 3=1
where Pij is the fraction of ratings i by the first rater and j by the second. The expected proportion of agreement is k 'De-
_ i=1
k _wijPi-P.j j=l
Ii
_
-- Interrater agreement
159
where Pi. = ___jPij and p.j = Y'_-iP/J" f Kappa is _iven by _ = (Po - Pe) / (I - Pe). The standard error of _ for testing against 0 is
s0 (1-
j
where n is th, number of subjects being rated and Ni. = _j statistic Z= _/'so is assumed to be distributed N(0, 1).
p.jwij
and ¥.j = _i Pi':wiJ" The test
kappa: m > 2,!k = 2 Each sUbjeCt i, i = 1,...,
n, is found by' xi of mi raters to be positive (the choice as to what is
labeled:positiVe being arbitrary). The overail proportion of positive ratings is _ = _i xi/(nN), between-s_bj@s mean square is (approximately)
B
where _
= _-_i rni/n.
_
The
!1 1
--
n
t
i
r°'i
and the w_thla-subject mean square is
W = n(_--
1 1i
E i
xi(mimi
xi) ] i
Kappa is thent defined
i
i
The standard !error for testing against 0 (Fleiss arid Cuzick 1979) is approximately calculated as 1 _'0 = (N.-
1)_
/
(_-
(2(_H z 1)+
_H)(1
equal to and
- @-_)
mpq
ii
i
where Nh, is the harmonic mean of rni and _ = 1 - _. The test st',ttistic Z = "/^_/Sois assumed to be distributed N(0, 1).
i
kappa: m >2,ik> 2 Let .rij be ithe number or ratings on subject i, i = 1,...,
n, into category j, j = 1,...,
k. Define
i
_j as the overall proportion of ratings in category j, _j = 1 - _._, and let _j be the kappa statistic given above f_r k = 2 when category j is compared with the amalgam of all other categories. Kappa
I
is (Landis an_ Koch 19778)
_=
A____PJ-qjr'JJ
t
160 kappa m lnterrater agreement In the case where the number of raters per subject _j xij is a constant m for all i, Fleiss, Nee, and Landis (1979) derived the following formulas for tile approximate standard errors. The standard error for testing
_j against
0 is
/ V /
and the standard
error
for testing
1)
g is
_= _. _f_jx/nm(m-J
2
i)
PJqJ
j
PJ'qJ('qJ-
_j)
References Altman, D. G. t991. Practical Statistics for Medical Research. London: Chapman & Hall. Boyd, N. F., C. Wolfson, M. Moskowitz, T. Carlile, M. Petitclerc, H. A. Ferri, E. Fishell. A. Gregoire, M. Kieman. J. D. Longley, I. S. Simor, and A. B. Miller. 1982. Observer variation in the interpretation of xeromammograms. lournaI of the National Cancer Institute 68: 357-63. Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement Psychological Bulletin 70: 213-220.
or partial credit.
Dunn. G. 2000. Statistics in Psychiatry. London: Arnold. Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2d ed. New York: John Wiley & Sons. Fleiss, J. L. and J. Cuzick. 1979. The reliability of dichotomous judgments: unequal numbers of judges per subject. Applie 4 Psychological Measurement 3: 537-542. Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological BuIletin 86: 974-977. Gould, W. 1997. stata49: Interrater agreement. Stata Technical Bulletin 40: 2-8. Reprinted in Stata TechnicaJ Bulletin Reprints, rot. 7, pp. 20-28. Landis, J. R. and G. G. Koch. 1977a. The measurement of observer agreement for categorical data. Biometrics 33: 159-174. 1977b. A one-way components of variance model for categorical data. Biometdcs 33: 671-679. Steichen, T. J. and N. J. Cox. 1998a. sg84: Concordance correlation coefficient. Stata Technical Bulletin 43: 35-39. Reprinted in Stata Technical Bultetin Reprints, vol. 8, pp. 137-143. 1998b. sg84.t: Concordance correlation coefficient, revisited. Stata Technical Bulletin 45: 21-23. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 143-145. 2000. sg84.2: Concordance correlation coefficient: update for Stata 6. Stata Technical Bulletin 54: 25-26. Reprinted in Stata Technical Bulletin Reprints, vol. 9. pp. 169-170.
Also See Related:
[R] tabulate
i
Title !
i
• kdenslty
Univariate kernel density estimation
i
Syntax kdensity varname [weight] [ifexp][inrange] [, nJ _raphgenerate(neuwarznewvardinsity) n(#) _width(#) [ ibiweight I cQsineI eppanlgausl] parzenl rectangle I triangle ] ! n(_rmal stud(#) at(varz) s_ymbol(...) _connect(...) title(string) [ [
fweigh_s
ggoph_optwns ? i J i and _weights are allowed;see [U] 14.1.6weighi.
Descriptien kdensity!produces
kernel density estimates and graphs the result. /
Options i
nograph suppresses the graph. This option is often Used in combination with the generate () option. generate(n_wvarz newvardensitv) stores the results of the estimation, newvardensity will contain
i t_ I
the densit>l estimate, newvarz will contain the pbints at which the density is estimated. n(#) specifie_ the number of points at which the d_nsity estimate is to be evaluated. The default is min(N; 501, where N is the number of observations in memory.
i i ! i
width(#) sNcifies the halfwidth of the kernel, the width of the density window around each point. If w() is hot specified, then the "optimal" width is calculated and used. The optimal width is the width [hat would minimize the mean integrated square error if the data were Gaussian and a Gaussiar_ kernel were used, so is not optimal in any global sense. In fact, for multimodal and highly skeived densities, this width is usually too wide and oversmooths the density (Silverman
i
1986). bi_reight,
I cbsine,
default, e_t_, [ l i
I i
i
gauss,
parzen,
rectangle,
and triangle
specify the kernel. By
specifying the Epanechnikov kernel, is used.
normal requd?ts that a normal density be overlaid on the density estimate for comparison. stud(#)
i.
speOfies that a Student's _ distribution with # degrees of freedom be overlaid on the density
estimate f_r comparison, at(varz) specifies a variable that contains the v_lues at which the density should be estimated. This optiot0 allows you to more easity obtain density estimates for different variables or different subsamples of a variable and then overlay the e_t_mated densmes for comparison. symbol(...)
i
epan,
!is graph,
is symbollo);
two,ray's symbol()
see [G]graph options.
°pti°h for specifying the plotting symbol. Tile default :
i
(
connect(...)isgraph, twoway'sconnect ()estimation optionforhow pointsareconnected. The default is 162 kdensity -- Univariate kernel density connect (1), meaning points are connected with straight lines: see [G] graph options. title(string) is graph, twoway's title() option for speci_'ing the title. The default title is "Kernel Density Estimate"; see [G] graph options. graph_options
are any of the other options allowed with graph,
twoway; see [G] graph
options.
Remarks Kernel density estimators approximate the density f(z) from observations on z. Histograms do this too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts the bar is centered at the midpoint of each intervalwand its height reflects the average number of data points in the interval. In more general kernel density estimates, the range is still divided into intervals and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap, One can think of sliding the intervalPcaUed a window along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a weight between 0 and 1 is assigned--based on the distance from the center of the window and the weighted values are summed. The function that determines these weights is called the kernel. Kernel density estimates have the advantages of being smooth and of being independent choice of origin (corresponding to the location of the bins in a histogram).
of the
See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data analysis tools.
Example Goeden investigate histogram. is roughly
(1978) reports data consisting of 316 length observations of coral trout. We wish to the underlying density of the lengths. To begin on familiar ground, we might draw a In [G] histogram, we suggest setting the bins to min(v/-_. 10-loglcn ). which for n = 316 18:
graph length, xlab ylab bin(18)
2-
15"
05"
m 0
length
kdensity -- Univariatekernel density estimation
163
The kernel density estimate, on the other hand, is smooth. . kdens_ty
length,
xlab
ylab
006 -_
004
121
£,02
i
\
1
o
Kernel
Density
tdngth Estimate
Kernel den_ity)stimators are. however, sensitive to an assumption just as are histograms. In histograms, we specify a _umber of bins. For kernel density estimators, we specify a width. In the graph above, we used the d_fault width, kdensity is smarter than graph, histogram in that its default width is not a fixed _:onstant. Even so, the default width is not necessarily bei. i kder_sity !ayes the width in the return scalar width, so typing display Doing this, wd discover that the width is approximately 20.
r(width)
reveals it.
! i
Widths are(ketail. isimilarThe to units the inverse of thearenumber of ofz, bins in histogram in analyzed. that smaller provide more of the width the uhits the avariable being The widths width is specified as ia halfwidth, meaning that the kernel density estimator with halfwidth 20 corresponds to sliding a w!ndow of size 40 across the data.
I
We can specify halfwidths for ourselves using the t the density as imuch. • kdens_ty
length,
epan
width()
i ]
option. Smaller widths do not smooth
w(lO)
xlab ylab
i
I
]
OOB I
oo6_I
\
/
,004
/
e
_oo
5
_Jo
,oo I_ng(h
Kernel
Density
Estimate
s_o
do
• kdensity length, epam xlab ylab w(15)
164
kdensity -- Univariate kernel density estimation
•
006
/_
.004 >.
j
0 200
"\ 3(_0
Kernel
4(_0 length
Density
560
6;0
Estimate
q
> Example When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate. We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of we £ght, the weight of the vehicles. Below, we compare the Epanechnikov and Parzen kernels. kdensity weight, epan nogr g(x epan) kdensity weight, parzen nogr g(x2 parzen) • label vat epan "Epamechnikov Density Estimate" • label vat parzen "Parzen Density Estimate" • gr epan parzen x, xlab ylab c(ll)
o Epanechnikov .0008
Density
Estimate
_ Parzen
Density
Estimate
"_
oooo
i'll!
0
1ooo
2o'oo
3ooo Weight
(l_s.)
4otoo
5ooo
!
kdensRy-_ Univariatekerneldensityestimation
165
!
We did not s_ecify a width and so obtained the d_fault width. That width is not a function of the selected kerneil but of data. See the Methods and Formulas section for the calculation of the optimal
I
width.
q
> Example In examining the density estimates, we may wi_h to overlay a normal density or a Student's t • ari_" Mng automobile weights, we can get an idea of the distance from normality density for col_p _u,,. U__ with the normal option. , kdens_ty
weight,
epam
normal
xlab ylab
,0006
i
.ooo4
,ooo2 t
°t 1 1000
il 2000
I 3_00 Weigh!
Kernel
Density
I 4000
5000
(}bs,)
IEstimate
Example Another conmon desire in examining density estimates is to compare two or more densities. In this example, _,e will compare the density estimatesof the weights for the foreign and domestic cars. I
kdensi_y
i
.• kdensi_y kdens@y
i
weight,
negr
weight weight
gen(x
fx)
if gen(f_O) if foreign==O, foreigxl==l, nogr nogr gen(fXl)
label
_ar fxO
"Domestic
label
_ar fxl
"Foreign
at(x) at(x)
cars" cars"
(Continued on twxt page)
"
166
• gr
fxO fxl c(ll) s(TS) xlab ylab kdensity --x, Univariate kernel density estimation
i :
Domestic
cals
D Foreign
cars
OOl.
{
I !
.0005
l
_a_
fX
'
0" 1000
20100
3000' Weight
40t00
5000r
(Ibs.)
<1
0 Technical Note Although all the examples we included had densities
less than I. the density may exceed t.
The probability density f(z) of a continuous variable z has the units and dimensions of the reciprocal of z. If z is measured in meters, f(x) has units 1/meter. The density is thus not measured on a probability scale, so it is possible for f(x) to exceed I. To see this, think of a uniform density on the interval 0 to 1. The area under the density curve is 1: this is the product of the density, which is constant at 1, and the range, which is 1. If then the variable is transformed by doubling, the area under the curve remains 1. and is the product of the density, constant at 0.5, and the range, which is 2. If conversely the variable is transformed by halving, the area under the curve also remains at 1, and is the product of the density, constant at 2, and the range, which is 0.5. (Strictly, the range is measured in certain units, and the density in the reciprocal of those units, but the units cancel on multiplication.) D
Saved Results kdensity saves in r(): Scalars r(width) kernel bandwidth r(n) number of points at which the estimate was evaluated r(scale) density bin width Macros r(kernel)
name of kernel
g
i
kdensity_ Univariatekerneldensityestimation
167
Methods iar,d Formulas kdensit_r is implemented as an ado-file. A kernel density estimate is formed by summihg the weighted values calculated with the kernel
}
function K is in
= nh"i=1 --
l
:
where we may define various kernel functions, kdens±ty includes seven different kernel functions, The Epanec_nikov is the default function if no Otherkernel is specified and the most efficient in minimizing _he mean integrated squared error. Kernel
l i
FOrmula
Biweight
K[z] =
ig 0ta(1
Cosine
K[z] = {1 + eds(27rz) 0
z2) _
= i
i !
l i
1I
g 0 { 3(1
lz2]/x/_
if lzl < 1/2 otherwise iflzl
Epanechnikov
K[z]
Gaussian
K[z]=
Parzen
K[z] =
8(1 _ [z[)3/3 { 04 _ 8_zz + 8[z13
if 1/2 < !z[ < 1 if [z[ _<1/2 otherwise
Rectangular
K[z]
,,f1/2 t 0
if Iz! < 1 otherwise
0
otherwise
l
_
,,
if Izl < 1 otherwise
_-z_ /2
Triangular K[z] = { 1 -[z I if Izi < 1 From the definitions given in the table one can see that the choice of h will drive how many values are included in estimating the density at each point. This value is called the window width or bandwidth I_fthe window width is not specified, then it is determined as
m = rain i
h=
varian_,
interquartile 1.349 rang%
0.9m
1
i i
.. i
)
i
where :r is tile variable tbr which we wish to estimate thekernel and n is the number of observations. Most researchers agree that the choice of kernel is not as important as the choice of bandwidth,
i
i
There is a grea_deal of literature on choosing bandwidths under various conditions: see. for example, Parzen (196_) or Tapia and Thompson (1978}. Als6 see Newton (1988) for a comparison with sample
i i
i
spectral den, itv estimation in time-series applications.
i
i
Acknowledgments 168 kdensity -- Univariate kernel density estimation We gratefully acknowledge the previous work by Isa/as H. Salgado-Ugarte Autonoma de Mexico. and Makoto Shimizu and Tom Taniuchi Ugarte, Shimizu. and Taniuchi (1993), Their article provides subject of univariate analysis.
kernel
density
estimation
and presents
of Universidad
Nacional
of the University of Tokyo; see Salgadothe reader with a good overview of the arguments
for its use in exploratory
data
References Fox, J. 1990. Describing univariate distributions. In Modem Methods of Data Analysis. ed. J. Fox and J• S. Long, 58-125. Newbury Park, CA: Sage Publications. Goeden, G. B. t978. A monograph of the coral trout, Plectropomus leopardus (Lac6pbde). Res. Bull. Fish• Serv. Queens1. 1:42 p. Newton, H. J. t988, TIMESLAB: A Time Series Analysis Laboratory. Belmont, CA: Wadsworth & Brooks/Cole. Parzen, E. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 32: 1065-1076. Salgado-Ugarte, I. H., M. Shimizu, and "12,Taniuchi. 1993. snp6: Exploring the shape of univariate data using kernel density estimators. Stata Technical Bulletin 16: 8-19. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 155-173. 1995. snp6•t: ASH, WARPing, and kernel density estimation for univariate data. Stata Technical Bulletin 26: 23-31. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 161-172. 1995. snp6.2: Practical rules for bandwidth selection in univariate density estimation. Stata Technical Bulletin 27: 5-19. Reprinted in Stata Technical Bulletin Reprints, vot. 5, pp, 172-190. • 1997. snpl3: Nonparametric assessment of multimodality for univariate data. Stata Technical Butletin 38: 27-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 232-243. Scott, D. W. I992. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons, Silverman, B. W. 1986. Density Estimation for Statistics and Data AnaIysis. London: Chapman & Hall. Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer-Verlag, Steichen, T. J. 1998. gr33: VioIin plots. Smm Technical Butletin 46: 13-18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 57-65, Tapia, R. A. and J. R. Thompson. 1978. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press. Wand, M. P and M, C. Jones. 1995. Kernel Smoothing. London: Chapman and Hall.
Also See Related:
[R] hist
Background:
Smta Graphics
Manual
Title I
ksm
--
Sm°°thing
iJ_cluding
l°v_;ess
i
Syntax ksm 3,v,tr xvar [ifexp] ]
[inrange]
[, line weight !o,wessb_wwidth(#) loogit
a_dj_stgenerate(newvar)nograph graph_options ]
i
Description
! ii
ksm car'ies out unweighted and locally weighted smoothing of yvar on .n,ar, displays the graph, and optionally saves the smoothed variable. Among ksm's capabilities are lowess (robust locally weighted r,:gression, Cleveland 1979). See Cleveland (1993, 94-101) for a discussion of lowess.
i i
%Vsmini: ksm is computationally intensive and may therefore take a tong time to run on a slow computer. Lowess calculations on 1,000 observations, for instance, require estimating 1.000
l
regressions_.
i
Options line speci_es running-line least-squares smoothing; the default is running mean. weight
specifies use of Cleveland's (1979) tricube weighting function: the default is unweighted.
lowess is_equivatent to specifying line I smootheir. bwidt;h(#)i
weight
and requests Cleveland's
specifies the bandwidth. Centered subsets of bwidth.
lowess running-line
3\r obse_'ations
are used for
} i
calcula@g smoothed values for each point in the data except for the end points, where smaller, uncente_d subsets are used. The greater the bwidth, the greater the smoothing. The default is
i i
0.8, logit transforms the smoothed war into logits. Predicted values less than .000t or greater than .9999 ar_ set to I/N and 1 - l/N. respectively, before taking logits, adjust adiusts the mean of the smoothed war to equal the mean of yvar by multiplying by an appropriate factor. This is useful when smoothing binary (0/1) data. i generate (inewvar) creates newvar containing the smoothed values of yvar in addition to or instead of displaying the graph. i t
t
i
nograph s_ppresses displaying the graph. ,eraph_optiSns are any of the options allowed wiih graph, 1{69
twoway; see [G] graph options.
] i
i ] i i ]
170
,,
ksm-
Smoothing including Iowess
Remarks The most common use of ksm is to provide lowess--locally weighted scatterplot smoothing. The basic idea is to create a new variable (newvar) that, for each yvar Yi, contains the corresponding smoothed value, The smoothed values are obtained by running a regression of war on xvar using only the data (zi, yi) and a small amount of the data near the point. In lowess, the regression is weighted so that the central point (zi,Yi) gets the highest weight and points farther away (based on the distance tzj - zil) receive less weight. The estimated regression is then used to predict the smoothed value _'i for Yi only. The procedure is repeated to obtain the remaining smoothed values, which means that a separate weighted regression is estimated for ever 3, point in the data. Lowess is a desirable smoother because of its localitymit tends to follow the data. Polynomial smoothing methods, for instance, are global in that what happens on the extreme left of a scatterplot can affect the fitted values on the extreme right.
Example The amount of smoothing is affected by the bwidth different values. For instance. • ksm hl
depth,
lowess
ylab
Lowess
xlab
and you are warned
s(Oi)
smoother,
bandwidth
= .8
14 o co
o
o
oo 0
12
o
o
-
o
0 o
000
°o
% 10
O0
0
0 0
0
_:
0
/i,
o
O0
_/
cO cOo
o o
o
8 o
o o
o
o
co
o
6
0
,60
26° depth
Now compare that with
(Continued
on next page)
360
,;o
to experiment
with
r i
,.o_
,,,
]
. ksmlhl depth, lowess ylab xlab s(Oi) bWidth(.4) Lowess
smoother,
bandwidth=
.4
i
o cl_
o
___oo
0
10
O0
_
0
0 0
O 0
0
o o oo fc_
o
eo
o°o
0
0
0%00//" o_ ooo
/ 0
C
0
O
0
oo
0
8
I c
c
o
o
co
6
I 300
i
l
4{_0
depth
In the first c _se,the default bandwidthof 0.8 is used, meaning that 80% of the data is used in smoothing each ,p°int- In the second case, we explicitly specifieda bandwidth of 0.4. Smaller bandwidths follow the ofigmalldata more cteselv.
Example !
Two ks_ options are especially useful with binary (0/1) data: adjust and logit, adjust adjusts the resultin_ curve (by multiplication) so that themean of the smoothed values is equal to the mean
l l
of the unsm_oothedvalues, logit specifies the smoothed curve is to be in terms of the log of _he odds ratio: i
i
. ksmiforeign mpg, lowess ylab xlab jitter(5) Lowess
smoother,
bandwidth
adjust
= .8
75
_//
& • o u.
5 /
i
// 25
/P--/
,
c
i i
1
,
X(o ,o ,o
o 2;
_ o
M_)leage (mpg)
_'o
,;
I
• ksm
f
172
foreign
mpg,
lowess
ylab
xlab
logit
ksm -- Smoothing LowesS including smoother, Iowess bandwidth
yline(O)
= .8
)l !
_g,
o
/ /t-
_
/
o,
/
/
/ -4
lo
2'0
3_o Mileage
4_
(mpg)
With binary data, if you do not use the logit option, it is a good idea to specify graph's jitter() option; see [G] graph options. Since the underlying data (whether the car was manufactured outside the United States in this case) take on only two values, raw data points are more likely to be on top of each other, thus making it impossible to tell how many points there are. graph's jitter() option adds some noise to the data to shift the points around. This noise affects only the location of points on the graph, not the lowess curve. When you do specify the logit
option, the display of the raw data is suppressed. <1
Q Technical Note ksm can be used for other than lowess smoothing. Lowess can be usefully thought of as a combination of two smoothing concepts: the use of predicted values from regression (rather than means) for imputing a smoothed value and the use of the tricube weighting function (rather than a constant weighting function), ksm allows you to combine these concepts freely. You can use line smoothing without weighting (specify line), or mean smoothing without weighting (specify no options), or mean smoothing with tricube weighting (specify weight). Specifying both weight and line is the same as specifying lowess. Q
Methodsand Formulas ksm is implemented
as an ado-file.
Let Yi and z_ be the two variables and assume the data are ordered i = 1.... , N 1. For each Yi, a smoothed value _t_ is calculated.
_'
so that z_ _< zi+l
for
The subset used in calculation of fl_ is indices i_ -- max(l, i k) through i.__= min(i + k, N), where k - L(N.bwidth-0.5)/2j. The weights for each of the observations between j -- i ...... i+ are either 1 (default) or the tricube (weight):
ksm -- Smoothingincludingi_
i
1_
where A = 1.000t max(z+ - x_, x, - z_). The smoothed value y] is then the (weighted) mean or the (weighted) regression prediction (line). !
Ackledgrnent
]
ksm was written by Patrick Royston of tile MRC Clinical Trials Unit, London.
i
ReferenCes ! Chamber_,J. M., W. S. Clevi_land,B. Kleiner,and E A. Tukey.I983. GraphicalMethodsfor Data AnalySis.Belmont, CA: WadsworthInternatiOnalGroup. Clevetan&W. S. 1979. Robust locally weightedregressionand smoothingscatterplots.Journal of t_ Amer/can Staris_catAssocia6on 74i 829-836. , 1993. VisualizingData. Summit,NJ: HobartPress. ............ . 1994. The Elementsdf GraphingData. Summit,NJ! HobartPress. Goodatl,i'C. 1990.A surveyOfsmoothingtechniques.In ModemMethodsof Data Analysis. ed. L Fox and J. S. Long, 126-!76. NewburyPark;CA: Sage Publications. Royston, P. 199t. gr6: Lowess smoothing.Stata TechnicalBulletin 3: 7-9. Reprinted in Stata TechnicalBulletin
i
Re ,s, voL1,or.41- 4.
Salgado'Ugarte,1. H. and M. Shimizu.1995.snp8: Robusiscatterplotsmoothing:enhancementsto Stata's ksm. Staia TechhicalBulletin25: 23-26. Reprintedin Stata TechnicalBulletin Reprints,vol, 5, pp. 190-194. Sasieni.P. 1994. snpT:Naturalcubic splines.Stata TechnicalBulletin22: 19-22. Repnntedm Stata TechnrcalBulletin Reprints,vol. 4, pp. 17b-174.
AlSo [
!
Related:
[R] ipOlate, [lk] smooth
BackgrOund:
Stata Graphics Manual
I i(le
ksmirnov -- Kolmogorov-Smimov I
equality of distributions I test
II
]
Syntax ksmirnov
varname = exp [if exp] [in range]
ksmkrnov varname [if exp] [in range], by(groupvar)
[ e_xact ]
Description ksmirnovperforms one- and two-sample Kolmogorov-Smirnov tests of the equality of distributions. In the first syntax, varname is tile variable whose distribution is being tested and exp must evaluate to the corresponding (theoretical) cumulative. In the second syntax, groupvar must take on two distinct values. The distribution of varname for the first value of groupvar is compared with that of the second value. When testing for normality, please see [R] sktest and [R] swilk.
Options by (groupvar) is not optional, in the second syntax. It specifies a binaD' variable that identifies the two groups. exact specifies the exact p-value is to be computed. This may take a long time if n > 50.
Remarks D Example You have data on x that resulted from two different experiments, labeled as group==1 group==2. Your data contain
and
list group 1. 2. 3. 4. 5. 6, 7.
x 2 0 3 4 5 8 10
2 1 2 1 1 2 2
You wish to use the two-sample Kolmogorov-Smirnov in the distribution of z for these two groups:
test to determine if there are any differences
- ksmirnov x, by(group) Two-sample Kolmogorov-Smirnov Smaller group 1: 2: Combined K-S :
D O.5000 -0. 1667 0,5000
test for equality of distribution P-value
Corrected
O.424 O.909 O. 785
O.735
174
functions:
ksmimov-- Kolmogorov-Smirnov equalityof distributions test
175
Thefirst lineteststhe hypothesis thatx for group] containssmaJtervaluesthangroup2. Thelargest difference between the distribution functions is 0.5. The approximatep-value for this is 0.424, which is not fignificant. The second line tests the hypothesis that x for group 1 contains Iarger values than group 2. The largest_ difference between the distribution p-vaLuefor this small difference is 0.909. functions in this direction is 0.1667. The approximate Finally,the approximate p-value for the combined test is 0.785, corrected to 0.735. The p-values ksmirnov calculates are based on the asymptotic distributions derived by Smimov (1939). These approximations are not very good for small samples (n < 50). They are too con_servative--real p-values tend to be substantially smaller. We have also included a less conservative approximationfor the nonidirectionalhypothesis based on an empirical continuity correction. That is the 0.734 reported in the third column• That number, too, is only an approximation. An exact value can be calculated using the exact option: • ksmirnov x, by(group) exact Two-sample Kolmogorov-Smirnov test for eo_ualityof distribution functions: Smaller group D P-value Exact 1;: 2: Combined K-S :
O.5000 -0.1667 O.5000
O. 424 0.909 O.785
O.657
I> Example Lefs now test whether x in the example above is distributed normally. Kolmogorov-Smirnov is not a particularly powerful test in testing for normality and we do not endorse such use of it; see [R] sktest and [R] swilk for better tests. In any case. we will test against a normal distribution with the same mean and standard deviation: • Summarize x Variable
Obs
Mean
Std. Dev.
x 7 4.571429 3.457222 _smirnov x = norm((x-4.571429)/3.457222)
Min
Max
0
10
One-sample Kolmogorov-Smirnov test against theoretical distribution norm( (x-4.571429)/3.457222) Smaller group x: Cumulative : Combined K-S:
D O. 1650 -0.1250 O.1650
P-value
Corrected
0.683 O. 803 O,991
O. 978
Since Stata has no way of knowing that you based this calculationon the calculated mean and standard deviation of x. the test statistics will be slightly conservative in addition to being approximations, Nevertheless. they cleartv indicate that the data cannot be distinguished from normally distributed data. q
]
, .v
i
rtoJls.
Jwvw
--
._v_lv_jvzvv--_.TlUnllUlVV
V_lMClllty
UI
Ul_itrlDU[lOrl$
[eS[
Saved Results ksmirnov
saves in r():
Scalars _
r(D_l)
D from line l
r(D)
combined D
r(p_l)
p-value from line 1
r(p)
combined ;,-value
r(D._2) r(p_2)
D from line 2 p-value from line 2
r(p_exact)
combined significance (x 2 or exact)
name of group from line 1
r(group2)
name of group from line 2
Macros r(groupl)
Methodsand Formulas ksmirnov is implemented as an ado-file. In general, the Kolmogorov-Smimov test (Kolmo_orov 1933; Smirnov 1939; also see Conover 1999, "Statistics of the Kolmogorov-Smimov type", 42)-465) is not very powerful against differences in the tails of distributions. In return for this, it is f_irly powerful for alternative hypotheses that involve lumpiness or clustering in the data. The directional hypotheses are evaluated with the statistics
where Fix ) and G(x) are the empirical distribution t_unctions for the sample being compared. The combined statistic is
The p-value for this statistic may be obtained by evaluating the asymptotic limiting distribution. Let m be the sample size for the first sample, and let rt be the sample size for the second sample. Smirnov (1939) shows that C_
lira rr_, _z_--40G
Pr{v/mn/(m.
n)Dra,n< z} = i-
2'
i=1
(-
1) i-1 exp (-
2i2z 2)
The first 5 terms form the approximation Pa used b_ Stata. The exact p-value is calculated by a counting algorithm; see Gibbons (1971, 127-131). A Corrected p-value wa_ obtained by modifying the asymptotic p-value using a numerical approximatic_n technique Z = ¢-1 (Pa)+
1.04/rain(m,
n)+ 2.09i/max(re,
p-value =
n)-
].35/v/_m/(m
+ n)
IL ksmimov
•
Kolmogorov-Smirnov
equality of dJstdbution,s test
177
References Conover, W. J. 1999. Pr_cOcaINonparameNc Statistics. 3d ed. New York: John Wiley & Sons. Gibb0n_, J. D. Y971.Nonparametric S_6srical Inference. New York: McGraw-Hill. Kolrn_g0rov, A. N. 1933. Sulla determinazione empirica di una legge di distribuzione. Giomale dell' fstituto Italiano degli Am_ari 4:83-91 Smirr_v, N. V. 1939. Estimate of deviation between ernt_irical distribution functions m two independeat samples (in R_sst_). Bulletin Moscow University 2(2): 3-16.
I
i
AlSo See
'
Related:
[R]runtest,[R]sktest,[R]swilk
ll_
! ltle [ kwallis -- [ Kruskal-Waltis equality of population,_•i rank test [ I /Ill
]
I
i
Syntax kwallis
varname [if exp] [in rangeJ, by(grou_var)
Description kwallistests the hypothesis that several sample_ are from the same population. In the syntax diagram above, varname refers to the variable recortling the outcome and groupvar refers to the variable denoting the population. Note that the by () "option" is not optional.
Remarks > Example You have data on the 50 states. The data contain tlhemedian age of the population medage and the region of the country region for each state. You _,vishto test for the equality of the median age distribution across all four regions simultaneously: • kwallis medage, by(region) Test: Equality of populations region NE N Cntrl South West chi-squared = probability =
_Obs 9 12 16 13
(Kruskal-Wallis _test)
_KankSum 376.50 294.O0 398.00 206.50
17.041 with 3 d.f. 0.0007
chi-squared with ties = probability = 0.0007
17.062 with 3 d.f.
From the output we see that we can reject the hypotl_esis that the populations are the same at any level below 0.07%.
Saved Results kwallissaves in r(): Scalars r(df)
degrees of _freedom
r (chi2)
X2
r(chi2__adj)
X 2 adjustedfor ties
178
kwallis -- Kruskal-Wallis equality of populationsrank test
179
MMhodSand Formulas kUallisis implemented as an ado-file. The Kruskat-Wallis test (Kruskat and Wallis 1952; atso see Conover 1999. 288-297 or Airman 1991, 213-21'5) is a multiple-sample generalization of the two-sample Wilcoxon (also called Mann-Whitney) rank sum test (Wilcoxon 1945: Mann and Whitney 1947). Samples of sizes n3, 3 = I,..., m, are combined and ranked in ascending order of magnitude. Tied values are assigned the average ranks. Let n denote the overall sample size and let Rj denote the sum of the ranks for the jth sample. The Kruskat-Wallis one-way analysis-of-variance test H is defined as
H=
12 _., P_ - 3(n + 1) n(n + l ) j_l' nj
The sampling distribution of H is approximately X2 with m - 1 degrees of freedom.
tl raneas A_ttman. D. G. I991. Practical Satistics
for Medical Research.
Conover. W. J. 1099. Practical Nonparametric
London: Chapman & Hall.
Statistics. 3d ed. New York: John Wiley & Sons.
Ktuskal, W. H and W A. Wallis. 1952. Use of ranks in one-criterion Stalistical Association 47: 583-621.
variance anah,sis. Journal
of the American
Mann. H?B. and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically the other. Annals of Mathematical Sta6stics 18: 50-60. Wilcoxon. F. 1945. Individual
comparisons by ranking methods. Biometrics
Also Sea Related:
[R] nptrend, [R] oneway, [R] runtest, [R] signrank
1: 80-83.
larger than
Title label -- Label manipulation I I O
II
g_
I
i
Syntax la_bel data
["label"!
label define Iblname # "label"
[# "label"...
] [, _addmodify nofix ]
label dir label drop {Ibtname label
list
[lblname
... ]l-all
[Iblname [Iblname ... ] ]
label save [Iblname [Iblname.,. label
}
values
varname
[lblname]
label variable varname
]] using [, nofix
filen_me
[, replace ]
]
["label"]
Description label data attaches a label (up to 80 characters) to the dataset in memory. Dataset labels are displayed when you use the dataset and when you describe it. If no label is specified, any existing label is removed. label define defines a list of up to 65,536 (1,000 for Small Stata) associations of integers and text called a value label. The value label is attached te variables by label values. label
dir
lists the names of value labels stored in memory.
label
drop
eliminates value labels.
label
list
lists the names and contents of value labels stored in memory.
label
save
saves value labels in a do-file.
label values attaches a value label to a variable. If no value label is specified, any existing value label is detached. The value label, however, is r_ot deleted. label variable attaches a label (up to 80 characters) existing variable label is removed.
to a variable.
If no label is specified, any
Options add allows additional #++ label correspondences to be added to Iblname. If add is not specified. only new Iblnames may be created. If add is specified, you may create new lblnames or add new entries to existing Iblnames.
i
180
i
allows modification or deletion of existing #_ label correspondences and also allows additional correspondences to be added, Specifying modify implies add even if you do not type tl_eiadd option.
modify
no:fix prevents display formats from being widened according to the maximum length of the value label. Consider label values myvar mylab and pretend that myvar has a Y.9.0gdisplay format right now. Pretend that the maximum length Ofthe strings in mylab is t2 characters Then label values would change the format of myvar from Y.9.0g to %t2.0g. no:fix prevents this, nofix is also allowed with label define, but it is relevant only when you are modifying an existing value label. Withoutthe no:fix option, label define finds all the variables that use this value label and considers widening their display formats, no:fix prevents this. replace
allows filenalne to be replaced even if it already exists.
Remarks See:[U] 15.6 Dataset, variable, and value labels for a complete description of labels. This entry deals Onlywith details not covered there. label dir lists the names of all defined value labels, The label list contents of_a value label.
command displays the
Example Although describe shows the names of the value labels, those value labels may not exist. Stata does not consider it an error to label the values of a variable with a nonexistent label. When this occurs. Stata still shows the association on describe but otherwise acts as if the variable's values are Unlabele&This way, you can associate a value label name with a variable before creating the corresponding label, Similarly, you can define labels that you have not yet use& label dir shows you the labels that you have actually defined: label dir y_sno _exlbl
We have two value labels stored in memory: one called yesnoand the other called sexlbl We can display the contents of a value label using the labellistcommand: label list yesno
y_sne: 1 yes 2 no
The value label yesno labels the values l as yes and 2 as no. If you do not specify the name of the value label on the label list value labels is produced: label list y,esno_: 1 yes 2 no sexlb_
: 0 Male 1 Pemaie
command, a listing of all
r
182
label -- Label manipulation
[] Technical Note Since Stata can have more value labels stored in memory than are actually
used in the dataset,
you may wonder what happens when you save the dat_set, tn that case. Stata stores with the dataset only those value labels actually associated with variables. When you use a dataset, Stata eliminates the dataset.
all the _mlue labels stored in memory before loading []
You can add new codings to an existing value label _sing the add option with the label define command. You can modify existing codings using the modify option.
_>Example The label yesno codes 1 as yes and 2 as no. Perhaps at some later time you wish to add a third coding: 3 as maybe. Typing label define without any options results in an error: label label
define
yesno
yesno
already
3 maybe defined
r(llO) ;
If you do not specify the add or modify options, label define can only be used to create new value labels. The add option lets you add codings to _n existing label: . label label
define list
yesno
3 maybe,
add
yesno
yesno : 1 yes 2 no 3 maybe
Perhaps you have accidentally mislabeled a value. For instance. 3 may not mean "maybe" may instead mean "don't know", add will not allow you to change an existing label: label
define
yesno
invalid attempt r(180) ;
3 "don't
to modify
know",
but
add
label
Instead, you specify the modify option: label
define
label
list
yesno
3 "don't
know",
modify
yesno
yesno: I yes no 3 don_t
know
In this way, Stata attempts to protect you from yourself. If you type label define without any options, you can only create a new value label--you cannot accidentally mutilate an existing one. If you specify the add option, you can add new labelings to a label, but you cannot accidentally change one of the existing labelings. If you specify the modify optiom which you may not abbreviate, you can do whatever you want. You can even use the modify option to eliminate numeric code to a null string, that is. '....
existing labelings.
To do this. you map the
label-
Label m_mipul_lo.
183
1
. label define yesno 3 .... , modify label list yesno yesno: i yes 2 no
You can eliminate entire value labels using the label
drop command.
Example We currentlY have two value labels stored in memory--sexlbl and yesno.The label dir comman_ reports that: • label,dir
y_sno sexlbl
The da_a_t that we have in memory uses only one of the labels
sexlbl,
describe
reports that:
describe Centains data from emp.dta obs: vats: size:
7 4 224 (99.87,o_ memory free)
variable name
storage type
displ&y format
n_me
str16
7,16s
empno sex salary
float float float
X9.0g X9.0g Y,9.0g
S_rted
value label
sexlbl
1992 Employee Data 14 Jul 2000 14:28
variable label
Employee mumber O--male;l=female Annual salary, exclusive of bonus
by:
We can:eliminate the yesno label by typing label drop yesno: • ilabelidrop yesno • !la_el dir sexlbl
We could elin_inate all the value labels in memory by typing • .label_rop _ail label Idir
Remember that the value label sexlbl,which no longer exists, was associated with the variable sex. E_en after dropping the value label, sexlbl is still associated with The variable:
_04
taoe_-- Lane_manlpulatton . describe
i i
obs: Contains vats: size:
data
variable
name
7 :from emp.dta 4 224 (99.8_
1992 of memory
2000
Data 14:28
free)
storage
display
value
t_e
format
label
name
sir16
empno
float
7.16s _9.0g
sex
float
7,9.0g
salary
float
_,9.0g
Sorted
Employee
14 Jul
sexlbl
variable
label
Employee
number
O--_nale; 1=re, hale Annual salary, bonus
exclusive
of
by :
As stated earlier, Stats does not mind if a nonexistent value label is associated with a variable. When Stats uses such a variable, it simply acts as if it is not labeled: • list
1. 2.
in
1/4 name
empno
sex
salary
Hank Rogers Pat Welch
57213 47229
0 1
24000 27000
57323
0
24000
57401
0
24500
3.
Bob
4.
Richard
Underhill Doyle
q The label save command creates a do-file containing label define commands for each label you specify. If you do not specify the Iblnames, all value labels are stored in the file. If you do not specify the extension for fitename, . do is assumed.
_, Example Labels are automatically stored with your dataset when you save it. Conversely, the use command drops all labels before loading the new dataset. You may occasionally wish to move a value label from one dataset to another. The label save command allows you to do this. For example, assume we currently have the value label yesno label
list
in memory:
yesno
yesno: I yes 2 no 3 maybe
You have a dataset stored on disk called survey, dta to which you wish to add this value label. One alternative is to use survey and then retype the label define yesno command. Retyping the label would not be too tedious in this case, but if the value label in memory mapped, say, the 50 states of the union, retyping it would be irksome, label save provides an alternative: label save yesno using file ynfile.do saved
ynfile
Typing label save yesno using ynfile the definition of the yesno label.
caused Stata to create a doqilc called ynfile,
If we want to see the contents of the file, we can use the Stata type
command:
do containing
label-
Label manip?iation
IFK
type ynfile.do l_be_ define yesno 1 ""yes'", modify labe_ define yesno 2 "no"", modify label define yesno 3 ""maybe"', modify
Weean:now use our new dataset, survey.dta: • :us_ survey (]Ioudeholdsurvey data) • la_el dir
Using the new dataset causes Stata to eliminate all value labels stored in memory. _ label yesno isnow gone.Sincewe saveditinthefile ynfile,do,however,we cangetitbackby typingeither do runexecute ynfile.Ifwe execute. Ifwe type run,y_file:,or _he file will silently: typedo.we willsecthecommandsinthefile , runiynfile . -lab_l yeSno
dir
The libel islnow just as if we had typed it from the keyboard. q
0 Techni_aiNbte Yola can also use the label save command to make the editing of value labels easier. You can save a: label in a file. leave Stata and use your word processor or editor to edit the label, and then return ,to Stata. Using do or run, you can load the edited values,
Gleason, J. R. J998. dm56: labels editor in Stata Technical ButletmA Reprints, vol, for 8, Windows pp. 5-t0. and Macintosh. Stata Technical Bulletin 43: 3-6.
--'vo1.1999"9, p.dm_6"l:]_.Update to labedk
Stata Technical Bulletin 5I: -.') Reprinted in Stata Technical Bulletin Reprints,
Weesie, L 1997. din47:vot. Verifying value label mappings. Stats Technical Bulletin 37: 7-8. Bull_tin Reprints, 7, pp. 39-40.
Also:See Background:
Reprinted
[u] 15.6 Dataset, variable, and value labels
Reprinted in Stata Technical
ladder -- Ladder of powers
Title []
Syntax ladder
varname [if
gladder
varname
qladder
varname
symbol(string)
exp]
[in range]
[if exp] [in range] [if
exp]
[in
margin(string)
range]
[" g_enerate(newvar) [, bin(#)
graph_options
noadjust
]
]
[, grid scale(#)
saving([ilename[,
replace])
]
by ... : may be used with ladder; see [R] by.
Description ladder searches a subset of the ladder of powers (Tukey 1977) for a transform that converts varname into a normally distributed variable, sktes_: is used to test for normality; see [R] sktest. Also see [el boxcox. gladder
displays nine histograms
of transforms Of varname according
to the ladder of powers.
qladder displays the quanfites of transforms of vamame according to the ladder of powers against the quantiles of a normal distribution.
Options generate(newvar) saves the transformed values co_esponding to the minimum chi-squared value from the table. Its use is not, in general, recommended since generate() is quite literal in its interpretation noadjust
of minimum, thus ignoring nearly equal but perhaps more interpretable
is the noadjust
option to sktest;
see [R] sktest.
bin (#) specifies the number of bins for the histograms. for you (see Methods and Formulas below). graph_options grid scale
transforms.
If not specified, an intelligent choice is made
are any of the options allowed with graph,
histogram;
see [G] histogram.
adds grid lines at the .05, .I0, .25, .50, .75, .90, and .95 quantiles. (#) specifies the size of text used to label the graphs, scale(1.25)
symbol(string)
is the default.
specifies the symbol used in the graph.
margin(string) specifies the margin to be placed around each graph as a percenl of graphical area. The default is 0. saving_Iename[,
replace
]) saves the graph.
i
186
,
lad_ler-- Ladder of powers
18/
IrrrkS Example You have data on the mileage rating of 74 automobiles and wish to find a transform that makes the varihble n_rmally distributed:
.-ila e =pg T_an_Io_mation
formula
chi2(2)
P(chi2)
cube square raw squaxe-_oot log reciprocal root r_icil_rodal
mpg'3 mpg'2 mpg sqrt(mpg) log (_g) I/sqrt(mpg) llmpg
43.59 27.03 10.95 4.94 O.87 0.20 2.36
(}. 000 O. 000 O.004 O.084 O. 647 0.905 O.307
recipro_al square rdci_rodal cube
i/(mpg-2) I/(mpg'3)
1I. 99 24.30
O. 002 O, 000
Had we t>ped '_adder mpg, gen (mpgx). the variable mpgx would have been automatically generated for us containing 1/_ m,'_pg.This is the perfect example of why you should not. in general, specify the generate() option. Note that we also cannot r_iect the hypothesis that the reciprocal of mpg is normally distributed and i/mpg gallons per mile has a better interpretation. I! is a measure of energy ¢onSun_ption. q
Example glad_lerexplores the same transforms as ladderbut presents results graphically: • gladd_
mpg
Histograms
Mi{eage(@pg) by Transformation
q
188
ladder -- Ladder of powers
Q Technical Note
!_
gladder is useful pedagogically, but some caution must be exercised when using it for research work, especially with large numbers of observations; For instance, consider the following data on the average July temperature in degrees Fahrenheit for 1954 U.S. cities:
ii+.i
. ladder tempjuty Transformation
formula
chi2 (2)
cube square raw square-root log reciprocal root reciprocal reciprocal square reciprocal cube
tempjuly_3 tempjuly'2 tempj uly sqrt (tempiuly) log (tempjuly) 1/sqrt (tempjuly) I/tempjuly 1/(tempjuly'2) I/(tempjuly'3)
47.49 19.70 3.83 1.83 5.40 13.72 26.36 64.43
P (chi2) O.000 O.000 0.147 O.400 O. 067 O.001 O. 000 O. 000 O. 000
The period in the last line indicates that the ;g2 is very large; see [R] sktest. From the table, we see that there is certainly a difference, normality-wise, between the square and square-root transform. If, however, you can see the difference between the transforms in the diagram below, you have better eyes than we do: • gladder tempjuly
eubo
squir*
_9_23
B_eo28
menl_+_/
337_
sqn
5B,_
,og
_
_q+l
+_ .+?044
+ + 7 S+2_14
-+
+ i ++7471
o 4.0+++P'
m¥@rll
'
*
• ++1+m3
+
l/Ioum+l
_S89_
i . °,+++,,
+++055
.2704
t
t I +o,o,,,+,
. +_+3+31+p
!tcm_l
:+J..... oo_.,+
Average
Histograms
duty
+,_°, +J
tem_eralure
by Trafisformation
CI
Example
+i
A better graph for seeing normality is the quantile-normal
graph which can be produced by qladder.
ladder-- Ladder of powers
,1
qladder tempjuly
g2_}0_6 I
--a_
1381B1 15 _ 13B'_31
B7 BO._6
.... 721058
r
3110.81 311:0 81
sqrl
7 68
58,1 5B.1481
8215.65
Jog ....
53
9.63357
4.0J_963
_nverse
-.017212 _ ..016441
93,6
tlsqrt
4.54141
-.128761
I tsquere
$.
t -000296 - 01035
-.00{_263
Average
Ouantile-Normat
-,000098
91.9594
-.102562
1/cube
-5.1e.06 -4.2e-06
-7 4e-07
Ju y temperature
PlOts by Transformation
This graph shows that for the square transform, the upper tail, and only the upper tail, diverges from,what would be expected, This is detected bs_sktest as a problem with skewness, as we would learn from using sktest to examine tempjuly squared and square-rooted, ,3
:SavedtResults l_dd_r saves in r(): Scalars r(l_)
number of observations
r(cube)
X_ for cube transformation
r(P_cube)
significance level for cube transformation
r(square)
_2 for square transformation
r(P_square)
significance level for square transformation
r (raw)
X_ for untransfom_ed: data
r(P_raw)
significance level for iuntransformed data
r(sqrt)
_:e for square-root
r(P_sqrt)
significance level for square-root
r(log)
.x: for log transformation
r(P_log)
significance level for log transformation
r(invsqrt)
:ts for reciprocal root transformation
r(P_invsqrt)
significance level for reciprocal
r(inv)
X2 for reciprocal transformation
r(P_inv)
significance level for reciprocal
r(invsq)
;_2 for reciprocal square transformation
r(P_invsq)
significance
r(invcube)
k 2 [br reciprocal cube transformation
r(P_invcube)
s_gnificance level for reciprocal cube transformation
transformation transformation
root transformation transfommtion
level tbr reciprocal square transformation
190
ladder -- Ladder of powers
Methods and Formulas ladder,gladder,and qladder are implemented For ladder, results are as reported by sktest; transform with the minimum X 2 value is chosen.
as ado-files. see [R] sktest. If generate
() is specified, the
gladder sets the number of bins to min(v/-_ , 10 log_o n), rounded to the closest integer, where n is the number of unique values of varname. See [G] histogram for a discussion of the optimal number of bins. Also see Findley (1990) for a ladder-of-powers variable transformation program that produces one-way graphs with overlaid box plots, in addition to histograms with overlaid normals. Buchner and Findley (1990) discuss ladder-of-powers transformations as one aspect of preliminary data analysis. Also see Hamilton (1992, 18-23).
Acknowledgment qladder
was written by Jeroen Weesie, Utrecht University,
Netherlands.
References Buchner, D. M. and T. W. Findley. 1990. Research in physical medicine and rehabilitation: viii preliminary data analysis. American Journal of Physical Medicine and Reliabilitation 69:154-169. Findley, T. W. 1990. sod3: Variable transformation and evaluation. Smta TechnicalBu!letin 2: t5. Reprinted in Smm TechnicalBulletin Reprints, vol. 1, pp. 85-86. Hamilton. L. C. 1992. Regression with Graphics. Pacific Gro_'e.CA: Brooks/Cole Publishing Company. Tukey, J. W. 1977. Exploratory Dam Analysis. Reading, MA: Addison-Wesley Publishing Company.
Also See Related:
IR] hoxcox, [R] diagplots,
Background:
Stata Graphics Manual
[R] Insl_ew0, [R] Iv, [R] sktest
Title
level _ Set default confidence level
Syntatx S_t!leVel#
DescriptiOn •
i
set It#el specifies the default confidence level for confidence intervals for all commands that repoia Confidence intervals. The initial value is 95. meaning 95ck confidence lnter_a],. " , _ # may . be betweeri 10 and 99.
Remarks i
I
TO change the width of confidence intervals reported by a particular command, it is not necessary to re_et!theldefault confidence level. All commands that report confidence intervals have a level (#) option. 'W_en you do not specify the option, the confidence intervals are calculated for the default level sei b_ set level
or 95% if you have not reset it.
> Example You tlse the c± command to obtain the confidence interval for the mean of mpg: . ,ci
Impg
mpg _ariable
I
74 Obs
21.2973 Mean
.6725511 Std. Err.
19.9569 [95Y, Conf.
22.63769 Interval]
[90]/,Conf.
Interval]
To obtain _0% confidence intervals, you could type • {el_pg, level(90) _ariable
Obs
mpg
74
Mean 21.2973
Std.
Err.
.6725511
20.17683
Std.
[90Y, Conf.
22.41776
Or iset i level
90
• :ci _pg g,_riable
Obs
mpg
74
Mean 21.2973
Err.
,6725511
20. 17683
Interval] 22.41776
If y'ou opt for the second alternative, the next time you estimate a model (.say with regress). OOe/_ confidence intervals will be reported. If you wanted 95% confidence intervals, you could specify' level (95) I on the estimation command or you dould reset the default by typing set level 95. 1
191
Also See
192 level -- Set default confidence level Complementary: [R] query
ii
Related:
[R] ci
Background:
[U] 23 Estimation and post-estimation commands. [O] 23.5 Specifying the width of confidence intervals
•
,
!
' J i
-- Quick reference for limits ,i i, ri i I i r i_lll
H
I,,,,lliHillllll,
i
i
!
Descdp{tidn ' i T!fisientry provides a quick reference for the size limits in Stata. Note that most of these limits are So _igt that you will never encounter them.
Remarks Maximum size limits for Small Stata and Intercooled Stata
_
l
_
,
Small Stata
Nu_be[ of observations Nu_bei of variables WiSh 0f a dataset Valde df matsize : I Numberof characters _n a command
about t,000 99 200 40
67,800
50
50
t,600 100 8
1,600 100 8
66 50
200 t 50
256 80 5
512 80 5
3,400
67,784
32
32
1,000 37,296
3,500 135,600
_ngthlof variable name Length tof ado-command name
32 32
32 32
Eer_gthtof global macro name _ngth!ofi local macro name
32 31
32 3t
Lengthi of a string variable
80
80
Numberf 1)ilirr ited°fconditionSby memory.inan if statement (Confi,ued on nero page)
30
100
Nu,be t of elements in a numlist Nurbbe i of unique time-series operators in a command Numbe_cof seasonal suboperators per time-series operator Nur0be [i of dyadic operators in an expression Nurpbe I of numeric literals in an expression Nur0'be_-,of string literals in an expression _ngth of string in string expression Numbe[ of sum functions in an expression Numbe_ of characters in a macro Nu_be_ of nested do-files Nu,be_ of lines in a program Nu_be[ of characters in a program
I
2.147,483,647 (1) 2,047 8,192 8(30
3,500
Nu_be i of options for a command
[i
Intercooled Stata
1
]
1 ' 1
i
1 ] t
Z
193 i i
!
194
limits -- Quick reference for limits Maximum size limits for specific commands Small State
Intercooled Stata
8 4
8 4
anova
Number of terms in anova model test statement Number of terms in the repeated() option char
Maximum length of a single characteristic
3.400
67,784
constraint Number of constraints encode
1,000
1,000
1,000
65,536
and decode
Number of uniquevaluesfora string variable estimates hold Number of stored estimation results
10
10
13
13
_.N/2
_.NI2
graph (See State Graphics Manual for graph limits) greigen Number of eigenvalues
plotted by greigen
grmeanby Number of unique values in varlist hist Number of unique values in varname impute Number of variables in varlist
(Table continued
on next page)
50
50
31
3t
!
V
limits -- Quick reference for lhnits
195
Maximum size limits for _pecific commands, continued
Small Intercooted St,am Stata iRecord length without data dictionary Record length with data dictionary (_)
none 7.998
none 7,998
none 7,998
none 7.998
80 80 80 32 1.000
80 80 80 32 65.536
Infix !Record length without data dictionary Record length with data dictionary (2) labal Length of dataset label Length of variable tabel Length of value label Length of name of value label Number of codings within a single Value label mat :ix Size of a single matrix
40x40
tnajimize options Number of iterations specified with iterate()
16.000
16.000
10
10
20
50
8
8
t.000 9.999 9.999
67.784 9.999 9.999
1.600
1.600
_l_git Number of outcomes in model
20
50
Op_obit Number of outcomes in model
20
50
800×800
mer_e
, ,
Number of variables that you can specify in a match-merge ml_git Number of outcomes in model nl4git and nlogittree Number of levels in model i_o_es
,
! Maximum length of a single note _ Number of notes attached to _dta Number of notes attached to each variable
nu_list i Number of elements in the numeric list
(2) :or Stata for Unix. the maximum record length is 19.998.
i'l
i _!
Maximum size limits for specific commands, limits -- Quick reference for limits
196
continued
plot Number of columns specified with column() option Number of lines specified with lines() option reg3, set
sureg, and other system estimators Number of equations
Small Stata
[ntercoolcd Stata
133 83
133 83
40
800
500K
500K
5
5
4
4
3.000
3,000
500 160 20
3,000 300 20
375
375
40
800
adosize
Maximum
amount of memory that ado-files may consume
sts graph Number of by variables (3) tabdisp and table Number of by variables Number of margins; that is, the sum of the rows, columns, supercolumns, and by groups tabulate Number of rows for a one-way table (4) Number of rows for a two-way table (4) Number of columns for a two-way table tabulate,
summarize
Number of cells xt estimation commands Number of time periods
(3) May be restricted to fewer depending on other options specified. (4) For lntercooled Stata for the Macintosh, limits are 2.000 for the number of rows for a one-way table and 180 for number of rows for a two-way table.
Also See
[I
Related:
[R] matsize,
[R] memory
Background:
[U] 7 Setting the size of memory
Title
,
[
i
I
II
0
I
U
I[
I
lin_o,exp [,level
iF
I
IU i
i
I
[I "
(#) or hr irr rrr e__form]
exp is apy 1:near combination of coefficients that is valid syntax for test: riot _ont_fin any additive constants or equal signs.
S¢¢ [R] test, Note, however, that exp must
Descrilti(m
'
li_o_, computes point estimates, standard errors, t or z statistics, p-values, and confidence intervals )r a linear combination of coefficients after any estimation command except a.nova. Results can optionally be displayed as odds ratios, hazard ratios, incidence rate ratios, or relative risk ratios. I
The. s_ y estimation commands for survey data have their own special command, svylc, estimating linear combinations: see [R] svylc.
OptiOnS leveli( ) specifies the confidence level, in percent, for confidence intervals. I
'
i
"
l
,
l i
[
Syntax
I
i
c m -- Linear combinations of estimators
or ;Is s_t by set level:
The default is level
for
(95)
see [U] 23.5 Specifying the width of confidence intervals.
or. hr_, ilr, rrr, and eform all do the same thing:, they all report coefficient estimates as exp(3) rather than ft. Standard errors and c?nfidenc_ intervals are similarly transformed. Note that or is the default after logistic.The onl 3 differehce in these options is how the output is labeled.
, :
Option
Label
Exl_lanation
Example commands
or
Odds Ratio
Odds ratio
logistic,
hr
_Iaz. Ratio
Ha_d
stcox,
irr
IRR
Incidence Rate Ratio
poisson
rrr
_
Relmive Rate Ratio
mlogit
eform
exp(b)
Gerieric label
ratio
logit streg
Remarks After itting a model and obtaining estimate_ for coefficients/3t,/32, .... 3_-, one often wants to view esti! aates for linear combinations of the 3i!, such as 31 32. lincomcan display estimates for an)" lilaem combination of the form ¢131 + C2/_2 + " '- * Ck'3k. Any valid works. estimation command for which iinco: works after ans' test except anova expressio_ tbr test Syntax l (see [R] test-_is a valid expression for lincom,There is only one exception to this rule: lincom does not allow ddditive constants: i.e.. it cannot display estimates for co -t- C131_-" "+ ck/3_,-when co _ O. line@
is useful for viewing odds ratios, lhazard ratios, etc.. for one group ti.e., one set _f
covariatel)1 relative to another group (i.e.. another set of covariates). See examples below 197
D Example We estimate a linear regTession: i!I_
- regress y xl x2 x3
l Source I
!" ili_ !i
Model Residual
,
Total
l
SS
.
df
MS
3259,3561 1627.56282
3 144
i086,45203 II .3025196
4886.91892
147
33.24434_4
Y
Coef.
t
/ i
= = = = =
148 96.12 0.0000 0.6670 0.6600 3.3619
Std. Err.
I
l xt x2 x3 _cons
Number of obs F( 3, 144) Prob > F R-sql/ared Adj R-squared Root MSE
1.457113 2. 221682 -.006139 36.10135
1. 07461 .8610358 .0005543 4.382693
1.36 2.58 -11,08 8,24
P> [t I O. 177 O. 01I O,000 O,000
[95Y,Conf, Interval] -. 6669339 • 5197797 -,0072345 27.43863
3. 581161 3. 923583 -.0050435 44. 76407
To see the difference of the coefficients of x2 and xl, type lincom x2 - xl (1)
- xl
+ x2 = 0.0
y
Coef.
(1)
Std. Err.
.7645682
• 9950282
The expression can be any linear combination
t O. 77
P> ItI O. 444
[95Y,Coal. Interval] -1.20218
2. 731316
without a constant.
lineom 3,xl + 500,x3 (I)
3.0 xl + 500,0 x3 = 0.0
y
Coef.
(1)
1. 301825
Std. Err. 3. 396624
t O. 38
P>It I O. 702
[957,Conf. Interval] -5. 411858
8.015507
Expressions with additive constants are not allowed lincom xl
-
1
additive constant terms not allowed r (198) ;
norarenonlinear expressions. • lincom X2/xl not possible with test r(131) ;
<3 Q Technical Note lincom uses the same shorthands for coefficients as does test (see [R] test). When you type xl, for instance, lincom knows that you mean the coefficient of xl. The formal syntax for referencing this coefficient actually _b [xl], or alternatively _coef [xl]. So, more forma]ly, in the last example we could have istyped 2incom 3*_b[xl] + 500*_b[x3]
l
rl_
iincom-
!
Linear combinationsof estimators
t99
OddsAfter rati(,s andregression, incidence ratios l_gistic the orrate option can be specified I i
, _
with lincom to display odds ratios for any effect.: InCidence rate ratios after commands such as poisson can be obtained in a similar fashion by spe_cif_ing the irr option.
> Example t
Consider the low birth weight dataset from H0smer and Lemeshow (1989, Table 4. I). We estimate a logistic regression of low birth weight (variable low) on the following variables:
i
_
Vail _ble
Description
Coding
age
age in vears
bla,:k
race black
1 if black, 0 otherwise
Oth,;r
race other
1 if race other, 0 otherwise
Smol:e
smoking status
1 if smoker, 0 if nonsmoker
tit
history of hypertension
1 if yes. 0 if no
u±
uterine irritability
t if yes, 0 if no
t'wd
maternal weight before pregnancy
] if weight < t10 lb., 0 otherwise
aige_.wd
age × twd
smol:elwd ptd
smoke history x of lwd premature labor
l if yes.. 0 if no
i . l
We firsI estimate a model without the interaction terms agelwd
and smoketwd (Hosmer and
Lemest_ovq1 1989, Table 4.8) using logit ,
Io [it low
age lwd black
other
smoke p_d
ht ui
-117.336
I%er .tion O:
log likelihood
=
I_er _tion i:
log likelihood
= -99.4311]_4
l_eration
2:
log likelihood
= -98.785718
Ieerltion
3:
log likelihood
=
I%er _tion 4:
log likelihood
= -98.777998
L_gi
-98.7_8
estimates
Number
! i
=
189
LR chi2(8)
=
37.12
> chi2
=
0.0000
:
0.1582
Prob Log
.ikelihood
: -98.777998
Pseudo
of obs
R2
, I
:
,
low
i
age twd
i
-. 0464796 .8420615
Std. Err. .0373888 .4055338
z
P>[zI
-1.24 2.08
0.214 O. 038
[957, Conf. -. 1197603 ,0472299
Interval] .0268011 1,636893
black
I.073456
.5150752
2.08
O. 037
.0639273
2.082985
other smoke
.815367 .8071996
.4452979 .404446
1.83 2. O0
0.067 O. 046
-. 0574008 .0145001
I. 688135 i.599899
ptd
i i
Coef.
i .281678
.4621157
2.77
0,006
.3759478
2. 187408
ht
:1. 435227
.6482699
2.21
0.027
.1646415
2. 705813
ui _cons
.65762S6 -1.216781
.4666192 .9556797
I. 41 _1.27
-. 2569313 -3.089878
1. 572182 .6563!7
O. 159 0.203
To _et the odds ratio for black smokers relative to white nonsmokers (the reference group), type
• lincom (1) i: I
200
black
black
+ smoke,
+ smoke
or
= 0.0
Iincom -- Linear combinations of estimators
ili
low
0dds
Ratio
Std.
z
Err.
il
P>,z,
[957, Conf.
Interval]
o0o, lincom computedcxp(_black+ blacknonsmokers,type lincom (I)
smoke
- black
- black, + smoke
low
Odds
(1)
_smoke)
_-
6.56.To seetheoddsratioforwhitesmokersrelative to
or
= 0.0
Ratio
Bid.
.7662425
Err.
z
.4430176
-0.46
P>IzJ
[95% Conf.
0.645
.2467334
Interval] 2.379603
Now let's add the interaction terms to the model (Hosmer and Lemeshow 1989, Table 4.10). This time we will use logistic rather than legit. By defaulL logistic displays odds ratios. . logistic Logit
Log
low
age
black
other
smoke
ht ui
Iwd
estimates
likelihood
=
low
-96.00616
Odds
Ratio
Std.
Err.
z
ptd
agelwd
smokelwd
Number of obs LR chi2(10) Prob > chi2
= = =
189 42.66 0.0000
Pseudo
=
0.1818
R2
P>|zl
[95_ Conf.
Interval]
,8408967 1.068277
1.005344 8,167462
age black
.9194513 2.95383
.041896 1.532788
-1.84 2.09
0.065 0.037
other
2.137589
.9919132
1.64
0.102
.8608713
5,307749
smoke
3.168096
1.452377
2.52
0.012
1.289956
7.780755
ht
3.893141
2.5752
2.05
0.040
1.064768
14.2346
ui
2.071284
.9931385
1.52
0.129
.8092928
5.301191
lwd
.1772934
.3312383
-0.93
0.354
.0045539
6.902359
ptd
3.426633
1.615282
2.61
0.009
1.360252
8.632086
1.15883 .2447849
.09602 .2003996
1.78 -1.72
0.075 0.086
.9851216 .0491956
1.36317 1.217988
agelwd smokelwd
Hosmer and Lemeshow (1989, Table 4.13) consider the effects of smoking (smoke -:- 1) and low maternal weight prior to pregnancy (lwd = 1). The effect of smoking among non-low-weight mothers (lwd -- 0) is given by the odds ratio 3.17 for smoke in the logistic output. The effect of smoking among low-weight mothers is given by • lincom (I)
smoke
smoke
low (1)
+ smokelwd
+ smokelwd
Odds
= 0.0
Ratio
.7755022
Std.
Err.
.5749508
z
P>Izl
[957 Conf.
-0.34
0.732
.1813465
Note that we did not have to specify the or option
After
logistic,
lincom
Interval] 3.316322
assumes or by default.
The effect of low-weight (Iwd = 1) is more complicated since we fit an age x lwd interaction. We must specify' the age of mothers for the effect. The effect among 30-year old nonsmokers is given by t
i _• !|_
i
'
_
i
lin¢om-- Linearcombinations of estimators
i _incom l_d + 30*agelwd (ii) lwd + 30.0 agelwd
i
i.
low
I
(t)
i
t
=
201
0.0
Odds Ratio 14. ?669
Std,
Err.
13. 56689
z 2.93
P>lz[
[95X Conf.
O.003
2. 439266
Interval] 89. 39625
..........
"
lincom _omputed exp(fllwd+30,_agelwd) = 14_.8.It seemsodd that we entered it as lwd+ 30*agelwd. but remember that lwd and agelwd are just'lincom's (and test'S) shorthand for _b[twd] and
i
_b [age_wd].
i
I
_ i
We could
i
i !
typed
(ii) incomlwd _b[1wd] + 30.0+ agelwd 30*_b[agelwd] = 0.0
low (1)
i
!
have
I
Odds Ratio 14. 7669
Std. Err. 13. 56689
z 2.93
P> Iz I O. 003
[957,Conf. Interval] 2. 439266 89. 39625
,
Multiple- quation models lincpm also works with multiple-equation models. The only difference is how"y'ou refer to the coefficiehts. Recall thai for multiple-equation models, coefficients are referenced using the syntax [e_no] varname where e,_nois the equation number or equation_nameand varname is the corresponding variable name for the cpefficient: see [U] 16.5 Accessing coefficients and standard errors and JR] test for detaih. ExampM, Consider the example from [R] mlogit (Taflov et al. 1989: Wells et al. I989).
!
. _logit insure age male nonwhite site2 site3,
nolog
Mu_tinomial regresszon
Number of obs LR chi2(I0)
= =
615 42.99
Lo i likelihood = -534.36165
Pseudo Prob > R2 chi2
= =
0,0387 O.0000
} !
insure
Coef.
age
-.Ol 1745
.0061946
-I.90
O.058
-.0238862
.0003962
male nonwhite
.5616934 .9747768
.2027465 .2363213
2.77 4.12
O.006 0.000
.1643175 .5115955
.9590693 1,437958
site2 site3 cons
.1130359 -. 5879879 .2697127
•2101903 .2279351 .3284422
O,54 -2.58 O.82
O.591 O. 010 O.412
.2989296 -1.034733 -.3740222
.5250013 -. 1412433 .9134476
I
age male
-.0077961 .4518496
.01t4416 .3674867
-0.68 1.23
0.496 0.219
-.0302217 -.268411
.0146294 I.17211
i
nonwhit e
.2170589
.4256361
O.51
O.610
-.6171725
i.05129
i
site2 site3 _cons
-1.211563 -.2078123 -1.286943
.4705127 .3662926 .5923219
-2.57 -0.57 -2.17
O.010 0.570 0.030
-2.133751 -.9257327 -2.447872
-.2893747 .510108 -. 1260135
i "
'
Std. Err.
z
P> Izl
[95Y,Conf. Interval]
Prepaid
i
Un insure
l (C _tcome insur!==Indem iB the comparison group) i
i' _
202 Linear combinations of estimators To see thelincom estimate-- of the sum of the coefficient of male and the coefficient Prepaid outcome, type • lincom (I)
of nonwhite
for the
[Prepaid]male + [Prepaid]nonwhite
[Prepaid]male + [Prepaid]nonwhite = 0.0
insure I (1)
Coef.
I
Std. Err.
1.53647
.3272489
z
P>Izf
4.70
0. 000
[95Y,Conf. Interval] lr
.
.8950741
2.177866
To view the estimate as a ratio of relative risks (see [R] miogit for the definition and interpretation), specify the rrr option. lincom (1)
[Prepaid]
male
[Prepaid]male
+ [Prepaid]
insure I (i) I
nonwhite,
+ [Prepaid]nonwhite
KPd{
rrr = 0.0
Std. Err.
4.648154
z
1.521103
4.70
P> [zl
[95X Conf. Interval]
0.000
2.447517
8.827451
q
Saved Results lincom saves in r(): Scalars r (estimate) r(se)
point estimate estimate of standard
r(df)
degrees
error
of freedom
Methods and Formulas lincom is implemented
as an ado-file.
References Hosmer, D. W., Jr,, and S. Lemeshow. edition forthcoming in 2001,)
f989. Applied
Logistic
Regression.
Tarlov, A. R,, J. E. Ware. Jr,, S. Greenfield. E. C. Nelson, E. Perrin. study, Journal of the American Medical Association 262: 925-930,
New York: John Wiley & Sons. (Second
and M. Zubkoff.
t989.
The medical
outcomes
Wells. K. E., R. D. Hays, M, A. Burnam, W, H. Rogers, S. Greenfield, and J. E. Ware. Jr. t989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302,
Also See Related:
[R] svylc, [R] svytest, [R] test, [R] testni
Background:
[u] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and pOst-estimation commands
I
rR r-
Title
II
l[nktes -- Specification link test for single-equation models i
!
i
I [
I
I
J
[
iI
I
i
I [
I
T[
I
I I
Syntax
i
lin.kt,
st [if exp] [in range] [, estimation_options ]
_qaen i: cap and in range are not specified,the link _estis performedon the same sample as the previous estimation.
Descripti(,n iin.kt_ st performs a link test for model specificationafter any single-equationestimationcommand such as 1, .gistic.
i
etc.;
regress,
see [R]
estimation commands.
I
:: ! ;
Options estimation_options must be the same option_ specified with the underlying estimation command.
] :
i
Remarks • i
The fotm of the linkltest implemented here if based on an idea of Tukey (1949) which was further descfibedlby Pregibon !(1980), elaborating on ,/,ork in his unpublished tl_esis (Pregibon t979). See Methods _nd Formulas! below for more details.
We at mpt to explifin Exampletl , the mileage ratings Of cars in our automobile dataset using the weight. engine displacement, a_d whether the car is manufactured outside the U.S.: r,_gress mpg
Source Model ;
w_ight
i !1619,71935
Residual
_ Total
mpg i
weight
23.740114 ; 12443.45946 _ SS
'
! i
foreign _cons
Coef.
-.0067745
displacement I
displ
.0019286 i-1.600631 41.84795
t
foreign
3
539.906448
F( 3, Prob > F
70
Ii.7_77159 33.4720474 MS
73 dI
Std. Err. .0011665 .0100701 1.113648 2.350704
t
P>It I
70)
= =
45.88 0.0000
R-squared
=
0,6629
Adj R-squared Root Number MSE of obs
= = =
0.6484 3.4304 74
[95X Conf.
Interval]
-5.81
0.000
-.0091011
0.19
0.849
-.0181556
.0220129
-1.44 17.80
0.155 O. 000
-3.821732 37.15962
.6204699 46.53628
-.0044479
204
linktest -- Specification link test for single-equation models
Based on the R 2. we are reasonably pleased with this model. If our model really is specified correctly, then were we to regress mpg on the prediction and the prediction squared, the prediction squared should have no explanatory power. This is what link"cost does: linktest Source
_
SS
df
f
Model Residual
] |
mpg
Number of F( 2,
1670.71514
2
835.357572
Prob
772.744316
71
10.8837228
73
33.4720474
I Total
MS
2
443
I
.45946
Coef.
I
Std.
Err.
obs 71)
= =
> F
74 76.75
=
0.0000
R-squared
=
0.6837
Adj K-squared Root MSE
= =
0.6748 3.299
t
P>Itl
[95_
Conf.
-0.63
0,532
-i.724283
2.09 2.16
0.041 0.034
.6211541 .0026664
Interval]
i
_hat _cons _hatsq
]
-.4127198
t
14.00705 .0338198
We find that the prediction good as we thought.
.6577736 6,713276 .015624
.8988434 27.39294 .0649732
squared does have explanatory, power, so our specification
is not as
Although linktest is formally a test of the specification of the dependent variable, it is often interpreted as a test that. conditional on the specification, the independent variables are specified incorrectly. We will follow that interpretation and now include weight-squared in our model: • gen
weight2
regress
= weight*weight
mpg
Source
weight I
Model
weight2 SS
displ
foreign
df
MS
Number F( 4,
of obs 69)
74 39.37
1699.02634
4
424.756584
Prob
=
0.0000
744.433124
69
10.7888859
K-squared
=
0.6953
Total
2443.45946
73
33.4720474
Adj R-squared Root MSE
= =
0.6777 3.2846
mpg
Coef.
Std.
Residual
Err.
t
P>rtl
> F
= =
[95_
Conf.
Interval]
weight
-.0173257
.0040488
-4.28
0.000
-.0254028
-.0092486
weight2
1.87e-06
6.89e-07
2.71
0.008
4.93e-07
3.24e-08
-.0101625
.0106236
-0.96
0.342
-.031356
foreign
-2.560016
1.123506
-2.28
0.026
-4.801349
-.3186832
_cons
58.23575
6.449882
9,03
0.000
45.36859
71.10291
displacement
.011031
And now we perform the link test on our new model: linktest
Model
1699.39489
Kesidual
i
744. 06457
Total
I
Source
[
2443.45946
SS
2
849.697445
Prob
=
0.0000
71
I0.4797827
R-squared
=
O. 6955
33.4720474
Adj R-squared Root MSE F( 2, 71)
= =
0.6869 3.2372 81.08
Number
=
73
df
MS
> F
of obs
74
'
gi
linktest-- specifi_ationlink test for single-equationmodels
i !•
i
mpg
Coef.
1
i,
hat hatsq
I 141987
.7612218
1.50
0.138
-.3758456
2.659821
i
_cons
- .0031916 -i.50305
.0170194 8.196444
-0.19 -0.18
O.852 O.855
-.0371272 -17.84629
.0307441 14.84019
'
t
P>Itl
[957,Conf. Interval]
! We now pass the link!'test.
! i
Std. Err.
205
•
> Exampl Abo_ we followe_ a standard misinterpretation of the link test when we discovered a problem, we focusted on the exl_lanatory variables of our model. It is at least worth considering varying exactly
i
what thi link test testS. The link test told us it]at our dependent variable was misspecified. For those
i
with _engineeringconsurr_tion__gallonsbackground, mpgperis mile--inindeed a terms strangeofmeasure.andIt woulddisptacement:make more sense to modelan@ergy weight ! _egress gpm _eight displ foreign i Source I 85 df i
i
] Model i Residual
i
Prob > F R_squared
=
0.7659
.01t957628
73
.000163803
Root MSE Adj R-squared
: =
.00632 0.7558
weight displacement I foreign
_cons
Std. Err.
t
P>lt I
[957.Conf, Interval]
.0000144
2.15e-06
6.72
O.000
.0000102
.0000187
.0000186 .0066981
.0000186 .0020531
I.O0 3.26
O.319 O. 002
-.0000184 .0026034
.0000557 .0107928
.0008917
.0043337
0.21
O. 838
-. 0077515
.009535
looks eve _ bit as reasonable as our original model.
inkiest _.
[ [
.003052654 .000039995
Coef.
_
,
Source
.
I li
SS
df
Residual
i .002782409 I Total .011957628 Model l ) i .009175219
gpm [i
Coef.
hat hatsq
I i I i
.6608413 3.275857
li
-_cons
))
.008365
irt a m( eparsimonio_s
MS
Number of obs =
F( 2,
!
i
74 76.33 0. 0000
3 70
gpm
This re+el
Number of obs = F( 3, 70) =
: .009157962 ;: .002799666
Total
! !
i
MS
71
.008039189
73 2
.000163803 .00_587609
Std. Err.:
t
P> It{
74
71) :
117.06
R-squared = Adj R-squ_red : Root = Prob MSE > F =
0.7673 0.7608 .00626 0.0000
[95_ Conf. Interval]
.5152751 4.936655 ;
1.28 0.66
0.204 0.509
-.3665877 -6.567553
1.68827 13.11927
.0130468
0.64
0.523
-.0176496
.0343795
speecmcanon " .
206
linktest -- Specification link test for single-equation models
> Example The link test can be used with any single-equatio/q estimation procedure, not solely regression. Let's mm our problem around and attempt to explain whether a car is manufactured outside the U.S. by its mileage rating and weight. To save paper, we will specify logit's nolog option, which suppresses the iteration log: . logit foreign mpg weight, nolog Logit estimates
Number of obs LR chi2 (2) Prob > chi2
= = =
74 35.72 0.0000
Log likelihood = -27.175156
Pseudo R2
=
0.3966
foreign
Coef.
mpg weight _cons
-.1685869 -.0039067 13.70837
Std. Err.
z
P>]z_
.0919174 .0010116 4.518707
-1.83 -3.86 _ 3.03
0.067 O.000 O.002
[95_. Conf. Interval] -.3487418 -.0058894 4.851864
.011568 -.001924 22. 56487
When you run linktest after logit,the result is another logit specification: linktest, nolog Logit estimates
Number of obs LR chi2(2) Pro5 > chi2
= = =
74 36.83 0.0000
Log likelihood = -26.615714
Pseudo R2
=
0.4090
foreign
Coef.
_hat _hatsq _cons
.8438531 -.1559115 .2630557
Std. Err. .2738759 .1568642 .4299598
z 3.08 -0.99 0.61
P>Iz_ 0.002 0.320 0.541
[95_ Conf. Interval] .3070661 -.4633596 -.57965
1.38064 .1515366 1.105761
The link test reveals no problems with our specification. If there had been a problem, we would have been Virtually forced to accept the misinterpretation of the link test we would have reconsidered our specification of the independent variables. When using logit, we have no control over the specification of the dependent variable other than to change likelihood functions. We admit to seeing a dataset once where the link test rejected the logit specification. We did change the likelihood function, re-estimating the model using probit, and satisfied the link test. Probit has thinner tails than logit. In general, however, you will not be so lucky. q
_3Technical Note You should specify exactly the same options with linktest as you do with the estimation command, although you do not have to follow this advice as literally as we did in the preceding example, logit's nolog option merely suppresses a part of the output, not what is estimated. We specified nolog both times to save paper. : '_
If you are testing a cox model with censored observations, however, you must specify the dead() option on linktest as well. If you are testing a tobit model, you must _pecify the censoring points
i
just as you do with the tobit;
command.
T
_
linktest
!
Specification linktest for single-equation models
207
i I . !
If youiare not sure which options are important, duplicate exactly what you specified on the command. estimatio_ _ If youido not specie' if exp or in range :with li_ktest, Stata will by default perform the link test _n the same s_unpleas the previous estimation. Suppose that you omitted some data when performin_ your estimation, but want to calculate the link test on all the data, which you might do if you belleved the moclelis appropriate for all !thedata. To do this, you would type 'linktest if
:{
e(
i
pl4) -=.'.
SavedRemrs linkt,mtsaves in ::(): Scalars r(t)
t statisticon _!aats_
r(df)degreesof freedom
linkt_stis not an estimation command in the sense that it leaves previous estimation results unchangeql.For instan@, one runs a regressiofi and then performs the link test. %,ping regress without a_gumentsarid the link test still replays the original regression.
i
In ternls of integrati g an estimation commafid with linktest, linktestassumes that the name of the estimation com_nand is stored in e(cmtt) and that the name of the dependent variable in e (depval_). After estirhation, it assumes that the number of degrees of freedom for the t test is given
i
by o(df_._)if the ma!ro is defined. If the estimation co_amandreports Z statistics instead of t statistics, then tinktestwill also report Z aatistics. TheiZ statistic, however, is still returned in r(t) and r(df) is set to a missing
i
vai ,e
I i
!
Methods ,nd ForMulas, linkt.=st is implemented as an ado-file. The link test is based on the idea that if a regression or
i , !
regressior-like equatioriis properlyspecified,on_ should not be able to find any additional independent variables :hatare signil_cantexcept by chance. One kind of specificationerr'or is called,a link error. In regression, this means that the dependent vmiable needs a transformation or "link function to
i !
properly relate to the i_dependent variables. Th_ idea of a link test is to add an independent variable to the eqt ation that is l_specialb likely to be significantif there is a link error,
i
Let
l
Y = f (X/3) be the m( del and _ be!the parameter estimatesl linktest
l } I I ! ! ,i
calculates
_hat= Xa and _hat_q= _hat2 The mod_l is then refit with these two variablesl and the test is based on the significance of _hatsa. This is tNe form suggelted by Pregibon (1979/based on an idea of Tukey (t949). Pregibon (1980) su_ests slightly different method that has tome to be known as "Pregibon s goodness-of-link tesf'. We _referredthe!olderversion because it is universally applicable, straightforward, and a good second-or ier approximation. It is universally applicable in the sense that it can be applied Io any' single-eq_,ationestimation technique whereas Pregibon s more recent tests are estimation-technique specific.!
|
i
=__
Pregibon, D. 1979. Data Analytic Methods for Generalized Linear Models. Ph.D. Dissertation 1980. Goodness of link tests for generalized linear modelS. Applied Statistics 29: 15-24. Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5: 232-242.
Also See Related:
[R] estimation
commands,
JR] lrtest,
[R] test, [R] testnl
University of Toronto.
_ !
Title
i
1
I
list --
Sy ntax I!
f
i
I i l
'
_list
Iv_rlist! [i:fe_]
i
[n o]_display
nolabel
noobs doublespace
]
Descrlptic_n di ;plays the v
es of variables, If no v_rlist is specified, the values of all the variables are
_,lso see bro_vsein [R] edit.
displayed.
Options . [nojdisplgy forces th_format into display or tabular (nodisplay) format. If you do not specify one its judgment of which would be most one of t_ese two options, then Stata chooses based on re.adabk nolabel
[
[in range][,
by ... : mai, be used with kist; see [R] by. The second vadist may _ntain_ime-seriesoperators;see [U114.4.3"l'ime.seHesvaNi_s.
list I
I
, s v lues of variables
uses the nur eric codes rather than the label values (strings) to be displayed.
noobs sup _ressesprintiJ g of the observation numbers. doublesp_tce requests mt a blank line be inserted between observations. This option is ignored in displa format.
Remarks l
i° ! i
I ! I
list,t,. )ed by itself lists all the observations and all the variables in the dataset. If you specify
varlist, onl those vafia_tes are listed. Specifyifig one or both of in range and if e_p limits the observatiot listed.
:;
_ Examplei
list h. s two outpu
listing a f_w variables, whereas the display format is suitable for listing an unlimited number of variables. _tata chooses automatically between those two formats: Obse_ :vation 1 li_t in 1/2 make rep78 weight
I
formats, known as tabular and display. The tabular format is suitable for
I _ispla-t
AMC Concord 3 2,930 121
price headroom
4,099 2.5
mpg trunk
22 11
length
186
turn
40
gear_r-o
3.58
foreign
Domestic
Observation ri
'
I ,
--,-
2
.,,o..
_,.r._ vauu_ Ul vitrlODleS make AMC Pacer price rep78 3 headroom
weight
3,350
displa-t . list
make
258 mpg
weight
displ
make I. AMC Concord 2. AMC Pacer ;
3. AMC
The
Spirit
length
_
mpg trunk
17S
gear_r~o rep78
_
4,749 3.0
turn
2.53
40
foreign
Domestic
in I/5
mpg 22 17
weight 2,930 3,350
displa~t 121 258 121
rep78 3 3
22
2,640
4. Buick
Century
20
3,250
196
3
5. Buick
Electra
15
4,080
350
4
first case is an example
of display
17 II
format;
[he second
is an example
of tabular
format.
The
tabular format is more readable and takes less space, but it is effective only if the variables can fit on a single line across the screen. Stata chose to list all twelve variables in display format, but when the varlist was restricted to five variables, Stata chose tabular format. tf you are dissatisfied with Stata's choice, you can make the decision yourself. Specify the display option to force display format and the nodisplay option to force tabular format.
<1 0 Technical Note When Stata lists a string variable in tabular output format, it always lists the variable right-justified. When Stata lists a string variable in display output format, it decides whether to li st the variable rightjustified or left-justified according to the display forrnht for the string variable; see [U] 15,5 Formats: controlling how data are displayed. In our previous! example, make has a display format of 7.-18s. describe
make storage
variable
name
make
display
value
type
format
label
strl8
7.-18s
variable
label
Make
Model
and
The negative sign in the 7'-18sinstructs Stata to left+justify this variable. If the display format had been 7,18s, Stata would have fight-justified the variable. Note that it appears from our listing above that :foreign describe it, we see that it is not: describe
foreign
but if we
foreign storage
variable
is also a string variable,
name
display
value
type
format
label
variable
byte
7.8.Og
origin
Car type
label
foreign is stored as a byte, but it has an associated value label named origin;see [U] 15.6.3 Value labels. Stata decides whether to right-justify or left-justify a numeric variable with an associated value label using the same rule as Stata uses for stnng variables: it looks at the display format of the variable. In this case, the display format of 7.8. Og tells Stata to right-justify the variable. If the display format had been 7,-8.0g, Stata would have left-justified this variable.
[3
i
iI
i
i "
_
_
Xst -- List values of variables
_ Technical _ote
! You car_ list the v_riables in any order that you desire. When you specify the varlist, list makes the ttisplay in the order you specify. You 'may also include variables more than once in the vartist.
Example In some !cases, you m_y wish to suppress the Observation numbers. You do this by specifying the
lie
make
mpg wight
noons make opti,,n:
i
displ
foreign
mpg
weight
in 51/55
noobs
: displa-t
foreign
Pont.
Sunbird
24
2,690
151
Domestic
Audi Pont. Audi
_000 Phoenix ;ox
17 19 23
2,830 3,420 2,070
131 231 97
Foreign Domestic Foreign
BWW
.>Oi
25
2,650
121
Foreign
1 _ Example
1
You can Imake the list easier to read by specifOng the doublespaceoption:
I
i lis_ make make '
i
Pont. iPhoenix
19
3,420
231
Domestic
i
Pont. Igunbird
24
2,690
151
Domestic
Audi_000
17
2,830
131 Foreign
Audi
'ox
23
2,070
97
Foreign
BMW 3:!0i
25
2,650
121
Foreign
}
i } ! i_
mpg weight displ foreign in 51/55, noobs double mpg weight 4ispla~t foreign
21Technical Note
You can !suppress the use of value labels by specifying the nolabel option. For instance, the variable foreign in the _:xamples above really contains numeric codes. 0 meaning Domestic and 1 meaning Foreign.When you list the variable however, you see the corresponding value labels rather than the underlyin_ numeric code: lis_
foreign
I
51
iforeign _omestic
i
52.
_omesl;ic
I
211
53. 54.
iF°reign _Foreign
55.
IForeign
Specifying t e nolabel
in
1/55
i 1 1
ption displays the underlying numeric codes:
list
_!
212
_,
_
5i. 52. 53. 54. 55.
foreign
in
51/55,
nolabel
listforeign -- List values of variables 0 0 1 1 1
0
References Riley, A. R. 1993. dml5: Interactively list values of variable,s.Stata Technical Bulletin 16: 2-6. Reprinted in Stata TechnicalBulletin Reprints. vol. 3, pp. 37-41. Royston, P. and P. Sasieni. 1994. dml6: Compact listing Of a single variable. Stata Technical Bulletin 17: 7-8. Reprinted in Smta Technical Bulletin Reprints, vol. 3, pp_41-43. Weesie, J. t999. din68: Display of variablesin blocks. Stata TechnicalBulletin 50: 3-4. Reprinted in Stata Technical Bulletin Reprints. vol. 9, pp. 27-29.
Also See Related:
[R] edit, [P] display,
[P] tabdisp
i
Ins
! i
; i i i _I I
j i
"_
I
0 -- Find z
iit
1
_l
I
I
ire-skewness log or BoxLCox transform
lnske'_O ,,ewvar = ._xp [if exp] [in range] [, level(#) I
_delta(#)
_zero(#) 1
delta(#)
--
Syntax bcskei_O newvar = _.rp [if 1
e_,7_ ] [ill range] [ .
m
level(#)
--
--
zero(#)
] d
Deseripti n of inske_10creates n@var = ln(+exp - k). choosing k and the sign of exp so that the skewness newvar is zero. bcske_FO creates n 'vat= (exp _ - 1)/A, .the Box-Cox power transformation (B x and Cox 1964), ch_osing A so t_at the skewness of ned,vat is zero. exp must be strictly positi_°c. Also see
[R] boxeo
for maximu_n likelihood estimation of A
Options level (#) specifies the confidence level for a confidence interval for k (lnskewO) or A (bcskewO). Unlike usual, the ccnfidence interval is calculated only if level() is specified. As usual, # is _ecified as an integ,.'r; 95 means 95% confidence intervals. The level() option is honored onl>_ if the umber of observations exceeds 7. delta(#) specifies the increment used for calculating the derivative of the skewness function with respect to k (lnske'gO) or A (bcskewO). The default values are 0.02 for lnskewO and 0.0I for bcske_O. zero(#) s_ecifies a vah Eefor skewness to determine convergence that is small enough to be considered zero arld is by defau it 0.001.
Remarks
Example
1
Using dur automobih_ dataset (see [U] 9 Statai's on-line tutorials and sample datasets), we want to generatd a new variab le equal to ln(mpg- k) t6 be approximately normally distributed, mpg records the miles r gallon for _ach of our cars. One feature of the normal distribution is that it has skewness
• in_kewO lnmpg Transfor_
mpg k
[95Y,Cdnf. Interval]
Skewness
(not calculated)
-7.05e-06
....
in(mpg-k)
5.383659
214
InskewO-- Find zero-skewness log or Box-Cox transform
This created the new variable lnmpg = ln(mpg - 5.384): describe Inmpg
variable name
storage type
Inmpg
display format
float
value label
X9.0g
Since we did not specify the level we could have typed
variable label in(mpg-5.383659)
() option, no confidence
interval was calculated.
At the outset,
InskewO inmpg = mpg, level(95) Transform
I
In(mpg-k)
[
k 5.383659
[95_
Conf. Interval]
-17. 12339
Skewness
9.892416
-7.05e-06
The confidence interval is calculated under the assumption that In(mpgk) really does have a normal distribution. It would be perfectly reasonable to use Inskew0 even if we did not believe the transformed variable would have a normal distribution--if we literally wanted the zero-skewness transform--although in that case the confidence inte_'al would be an approximation of unknown quality to the true confidence interval. If we now wanted to test the believability of the confidence interval, we could also test our new variable lnmpg u!sing swilk with the !nnormal option. q
El Technical Note lnskewO (and bcskewO) reports the resulting skewness of the variable merely to reassure you of the accuracy of its results. In our above example, lnskew0 found k such that the resulting skewness was -7- 10-6, a number close enough to zero for all practical purposes. If you wanted to make it even smaller, you could specify the zero () option. Typing lnskewO new=mpg, zero (le-8) changes the estimated k to -5.383552 from -5.383659 and reduces the calculated skewness to -2.10 -11 When you request a confidence interval, it is possibl+ that lnskew0 will report the lower confidence interval as '. ', which should be taken as indicating the lower confidence limit kL = -oc. (This cannot happen with bcskew0.) As an example, consider a sample of size n on z and assume the skewness of z is positive, but not significantly so at the desired significance level, say 5%. Then no matter how large and negative you make kz,, there is no value extreme enough to make the skewness of ln(x - kL) equal the corresponding percentile (97.5 for a 95% confidence interval) of the distribution of skewness in a normal distribution of the same sample size. You cannot because the distribution of ln(z - kL) tends to that of zpapart from location and scale shift--as z --+ oc. This "problem" never applies to the upper confidence limit ku because the skewness of ln(:c - ku) tends to -co as k tends upwards to the minimum value of z.
Example In the above example, using lnskewO a natural
zero
and
we
are
shifting
variable such as temperature zero is indeed arbitrary.
that
with a variable like mpg is probably zero
arbitrarily.
measured in Fahrenheit
On
the
other
hrmd,
use
undesirable,
mpg has
of tnskew0
with
or Celsius would be more appropriate
a
as the
i i
_
lnskewO-- Find zerO-skewnesSlog or Box-Cox transform 215 For a var able like mpg it makes more sense touse the Box-Cox power transform (Box and Cox 1964): y(_)=
Y_-I
A is free to :ake on any ,,alue. but note that y(1) _ y_ bcskew0 works like 1:1skew0:
1. y(0) _ In(y), and yf-l_
= 1 - 1/y.
• bcs_ewO bcmpg = ipg, level(95) 4 i Transform (_pg'L-1)/L
L
[95_,Conf. Interval]
-. 3673283
-1. 212752
Skewness
,4339645
0001898
,i
It is worth n( ting that the _ _,%confidence interval i,cludes ), = - ] (A is labeled L in the output), which has a rather nore pteasin_ interpretation--gallons iper mile--rather than (mpg-'3673 - 1)/(-3673).• The confide;_ce interval, however, is calculated under the assumption that the power transformed
_
variable is rormally distributed. It makes perfect sense to use bcskewO even when one does not believe that the transforrred variable will be normallv distributed, but in that case the confidence interval is an approximaticn of unknown quality, ffone believes thai the transformed data are normally <1
i
distributed, me can alterwatively use boxcox to egtimate 3,: see [R] boxeox.
Saved ReslJits .
lnskewO and bcskew(, save in r() Scalars
!
mma)
h-(InskewO)
r(g_ !} },.
r (I__bda) r(ll_) r(ul_)
)_(bcskewO) lower bound of tonfidence interval upper lxmnd of confidence interval
r(sl:ewness)
resulting skewness of transformed
variable
Methods armdFormulas lnskewOiand
bcskew(
are implemented as ad0-files.
i
Skewness is as calcul_ :ed by summarize: see [R] summarize. Newton's method with numeric. uncentered cerivatives is _sed to estimate k (lnskew0) and )_ (bcskew0?. In the case of lnskew0.
i 1
lhe initial value i_ chosen so that the minimum of :r - k is l and thus ln(z with )_ = 1
Acknowle lnskewOiand
k) is 0. bcskewO star1_
ment bcskewC were written by Patrick Royston of the
MRC
Clinical Trial< [_nit. London.
216
Inskew0 m Find zero-skewness log or Box-Cox transform
References Box. G. E. E and D, R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26: 211-243.
Also See Related:
[R] boxcox, JR] swilk
Complementary:
[R] ladder
•
Title log -- E_:ho copy of
!
Syntax:
_
l_og
!
_
usi,g
to file or device
filename
, append
{ol
!
_eplace
[ Zext
smcl ] ]
}
cmdl og cmdlog _sing filenal,e [, append replace ] I
cmdlog
!
set
I
k
{onlofflcl
log_ype
{text_
set lin_size
lffilename
se}
smcl}
#
is ._pecified withoul an extension, one of the suffixes .smcl..log.
The extensibn
.smct
or .txt
is added.
or ,] og is added by log depending on whether the file format is SMCL or ASCII text
The extensibn .txt is add,',dby .cmdlog. In addition to 'ommandlog. _oumay accessthe capabilitie,_of log by pullingdow_ File and choo_ingLog.
Description
[
log allowis you to mal]e a full record of your Stata session. A log is a file containing what you !_
type a_d Staia's output. 1 cmdlog 4lows you to!make a record of what you type during your Stata session. A command
; i
log contains bnly what yo type and so is a subse! of a full log. You can rOake full logs md command logs simttltaneously, one or the other, or neither. Neither is
!
produced un_l you tell St;.ta to start I¢>gging. Command logs are ah,,ays straight ASCII text files and this makes them easy to convert into
, !
do-files. (In ihis respect, it would make more sen.4e if the default extension of a command log file was .do be&use commaz]d lo_osare do-files. The default is .txt. nOt .do. howe_er, to keep you
i i
from accidenialty overwriting your important do-files.) Full logs !are recordedlin one of two formats: SMCL (Stata Markup and Control Language) or
_° i
text (meaning ASCII). The default is SMCL. but s_t logtype can change that, or you can specify an option to state the forrrm you wish. We recommend SMCL because it preserves fonts and colors. SMCL logs c_n be convert,_d to ASCII text or to other formats using the translate command: see [R] translate; translate can also be used to produce printable versions of SMCL IO_S.or you can print SMCL l_gs by pullin_ down File and choosing Log. SMCL logs can be viewed in the viewer, as can any file: !see [R] view.
: ! i
21_
_
zl _
tog -- Ec.o copy of session to file or device
log or cmdlog,
typed without arguments, reports the status of logging.
log using and cmdlog using open a log file. log close and cmdlog close close the file. Between times, log off and cmdlog off, and log on and cmdlog on can temporarily suspend and resume logging. set logtype specifies the default format in which full logs are to be recorded. Initially, full logs are set to be recorded in SMCL format. set linesize specifies the width of the screen currently being used (and so really has nothing to do with logging). Few Stata commands currently respect linesize, but this will change in the future.
Options Optionsfor use with both log and logcmd append specifies that results are to be appended onto the end of an already existing file. If the file does not already exist, a new file is created. replace specifies that filename, if it already exists, is to be overwritten, and so is an alternative to append. When you do not specify either replace or append, the file is assumed to be new and an error message is issued and logging is not started.
Options for use with log text and smcl specify the format in which the log is to be recorded. The default is complicated describe but is what you would expect: If you specify the file as filename.smcl, (regardless of the value of set logtype).
to
then the default is to write the log in SMCL format
If you specify the file asfilename, log, then the default is to write the log in text format (regardless of the value of the set logtype). If you type filename without an extension and specify neither the smcl or text options, the default is to write the file according to the value of set logtype. If you have not reset set logtype, then that default is SMCL. In addition, the filename you specified will be fixed to read filename, sracl if a SMCL log is being created or fiiename, log if a text log is being created. If you specify either of the options text or smcl, then what you specify determines how the log is written. Iffilename was specified without an extension, the appropriate extension is added for you.
Remarks For a detailed explanation
of logs, see [U] 18 Printing
and preserving
output.
Note that when you open a fulI log, the default is to show the name of the file and a time and date stamp: log
using
log
log: type :
opened
L
on:
myfile
C: \data\proj smcl 12 Dec
2000,
Ikmyfile. 19:28:23
smcl
log _ Echo copy of sessionto file or device
i
219
The above information ' ,'ill appear in the log. If you do not want this information to appear, precede
i
the comm_ nd by quiet . qu etly
l
quietly
!
Ly:
log using myfile
'_ill not suppr,;ss an}, error messages qr anything else you need to know.
i
Simila@ when you :lose a fuel log, the default is to show the full information:
i
. lo*
I
i
close
i log- c:\_t_\proj l\_y_ile, s_l
clo!ed on
12
c 2000,
12:32:41
and lhat information wili appear in the log. If you want to suppress that, type quietly log close,
i I
SavedReSults log
and cmdlog sav_ in r ()" Macros
i
r (filename) name of file I
AlsoSee
} _ i {
I
! i
r(s_atus)
on or off
r(type)
text or smcl
i
Complemehtary:
[Ri translate; [R] more, [R] query
Baekgrounh:
17 Logs: Printing and saving output [G:;W] 17 Logs: Printing and saving output, [G:;U] 17 Logs: Printing and saving output, [G!M]
[U 10 -more-- conditions, [Ui 14.6 File-naming conventions, [UI 18 Printing and preserving output
' 1
Title [ I°gistic
-- L°gisfic , regressi°n
,
i
t
Syntax logistic
depvar varlist [weight]
cluster
(varname)
maximize_options lfit
[depvar] all
lstat lroc all
[depvar]
genprob
asis
[if
exp]
[in
range]
[if exp] _in range]
[weight]
[if
(varname)
coef
[, group(#)table
outsample
expl
[in
[. cutoff(*)all
range]
[, nograph
beta(mamame)
]
graph_options
]
[weight]
beta(matname)
offset
robust
]
[weight]
(varname)
[, _level(#)
]
[weight]
beta(mamame)
Isens [depvar]
all
score (newvarname)
beta(matname) [depvar]
[i£ exp] [in range]
[if
gensens
expl
[in
(varname)
range]
[. nograph
genspec
(varname)
graph_options replace
]
by ... : may be used with logistic; see [R] by. logistic allows fweights andpweights; lfit, lstat, lro¢, and lsens allow only fweights; see [U] 14.1.6weight. logistic shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands. logistic may be used with sw to perform stepwise estimation; see [R] sw.
(Continued
on next page)
220
yntax fir predict predict
[type] ,ewvarname [if exp] [in range] [, statistic rules
asif
nooffset
]
where slatistic is p xb strip * d_eta * deviance
, ___2 , ddeviemce , hat , number r,esiduals , rstandard
probability of a positive outcome (the default) xib, fitted values standard error of the prediction Pregiborl(198t) A 13influence statistic deviance residual Hosmer and Lemeshow (1989) A X2 influence statistic Hosmer and Lemeshow (1989) A D influence statistic Pregibon (1981) leverage sequential number of the covariate pattern Pearson residuals; adjusted for number sharing covafiate pattern standardized Pearson residuals: adjusted for number sharing covariate pattern
Unstarred statistics are available both in and out of sample; type predict ... if e(sataple) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.
DeScription logisticesumates a logistic regression of _lepvaron vartist, where depvar is a 0tl variable lot, more precisely, a 0/non-0 variable). Withoutarguments, logistic redisplays the last logistic estimates, logistic displays estimatesas odds ratios; to view coefficients,type logit after running logistic. To obtain odds ratios for any covariate pattern relative to another, see JR] lineom. ].fi_ displays either the Pearson goodness-of-fit test or the Hosmer-Lemeshow goodness-of-fit test is'_at displays various summary statistics, including the classification table, lroc graphs and calculates the area under the ROe curve. lsens graphs sensitivity and specificity versus probability cutoff and optionally creates new variables containing these data lfit, lstat, lroc, and lsens can produce Statisticsand graphs either for the estimation sample or for;any set of observations. However, they always use the estimation sample by default. When weights, if, or in are used with logistic, it ig not necessary to repeat them with,these commands when you want statistics computed for the estimition sample. Specify if, in. or the all option onb,' whe_nyou want statistics computed for a set of obsen_ationsother than the estimation sample. Specify wmghts (only fweights are allowed with these commands) only when you want to use a different set oftweights. Bydefault, if it. lstat, lroc, and lsens use the fastmodelestimated by logistic. Alternatively, the model can be specified by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to ttie commands, The lfit,
lstat,
lroc. and lsens commands may also be used after logit
or probit.
Here is a list of other estimation commands that may be of interest. See |R] estimation commands for a complete list. See Gould (2000_for a discussion of the interpretation of logistic regression.
I)
222
lOgistic --
LOgiStiC regression
blogit
[R] glogit
Maximum-likelihood logit regression for grouped data
bprobit
[R] glogit
Maximum-likelihood probit regression for grouped data
clogit
[R] ciogit
Conditional (fixed-effects) logistic regression
cloglog
[R] cloglog
Maximum-likelihood complementary log-log estimation
glra
[R] glm
Generalized linear models
glogit
[R] glogit
Weighted least-squares togit regression for grouped data
gprobit
[R] glogit
Weighted least-squares probit regression for grouped data
heckprob
[R] heekproh
Maximum-likelihood probit estimation with selection
hetprob
[R] hetprob
Maximum-likelihood heteroskedastic probit estimation
logit
IR] logit
Maximum-likelihood logit regression
mlogit
[R] mlogit
Maximum-likelihoo_l multinomial (polytomous) logistic regression
nlogit
[R] nlogit
Maximum-likelihood nested logit estimation
ologit
[R] ologit
Maximum-likelihood ordered logit regression
oprobit
[RI oprobit
Maximum-likelihood ordered probit regression
probit
[R] probit
Maximum-likelihood probit regression
scobit
[R] scobit
Maximum-likelihood skewed logit estimation
svylogit
[R] svy estimators
Survey version of logit
svymlogit
[R] svy estimators
Survey version of mlogit
svyologit
[R] svy estimators
Survey version of ologit
svyoprobit
[R] svy estimators
Survey version of oprobit
svyprobit
[R] svy estimators
Survey version of probit
xtclog
[R] xtclog
Random-effects and population-averaged ctoglog models
xtlogit
[R] xtlogit
Fixed-effects, random-effects, and population-averaged
xtprobit
[R] xtprobit
Random-effects and population-averaged
xtgee
[R] xtgee
GEE population-averaged generalized linem: models
logit models
probit models
Options Optionsfor logistic level
(#) specifies
the confidence
or as set by set
level;
level, in percent,
for confidence
see [U] 23.5 Specifying
the width
intervals.
Tile default
of confidence
is level
(95)
intervals,
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates, robust combined with cluster() be independent
allows between
If you specify
pweights,
cluster(varname) not necessarily cluster(personid) estimated estimated
observations clusters).
which
are not independent
robust
is implied;
see [U] 23.13
specifies within
that
groups, in data
the
observations
Weighted
independent
(although
they must
estimation. across
groups
varname specifies to which group each observation with repeated observations on individuals, cluster()
standard errors and variance-covariance coefficients; see [U] 23.11 Obtaining
used with pweights to produce command in [R] svy estimators
are
within cluster
matrix of the estimators robust variance estimates,
(clusters)
but
belongs; e.g., affects the
(VCE) but not the cluster() can be
estimates for unstratified cluster-sampled data, but see the svylogit for a command designed especially for survey data.
logistic-- Logisticregremsion c_hister () implies robust' by itself.
specifying robust cluster()
is equivalent to typing cluster
223 ()
scorei(newvarname) creates newvar containing uj = 01nLj/0(xjb) for each observation j in the sample. The score vector is _ 01nLj/ab = _ujxj; i.e., the product of ne_'var with each covariate summed over observations. See [U] 23.12 Obtaining scores. asis forces retentionof perfectpredictor variables and their associatedperfectly predictedobservations and may produce instabilities in maximization; see [R] probit (sic). offset (varname) specifies that varname is to be included in the model with coefficientconstrained tolJe 1. coef causes logistic to report the estimated coefficients rather than the ratios (exponentiated coefficieJas),coef may be specified when the _odel is estimated or used later to redisplay results. c0ef affects only how resuks are displayed ahd not how they are estimated. marimize_options control the maximization process; see [RI maximize. You should never have to specify Uhem.
'
Options!forlilt, Istat,troc,andIsens group(#) (ifit onl_y)specifies the number of quantiles to be used to group the data for the Hosmer-Lemeshow goodness-of-fit test. groqp(lO) is typically specified. If this option is not _iven, ttie Pearson goodness-of-fit test is computed using the covariate patterns in the data as groups.
I_
table (If it only) displays a table of the groups used for the Hosmer-Lemeshow or Pearson goodness-of-fit test '_,ithpredicted probabilitieS,observed and expected counts for both outcomes. anldtotal_ for each group, oulzsample (lfit only) adjusts the degrees of freedom for the Pearson and Hosmer-Lemeshow goodness-of-fittests for samples outside of the estimation sample. See the section Samples other thsn_estimation sample later in this entry. all requests that the statistic be computed for all observations in the data ignoring any if or in restrictions specified with logistic. beta(matn_lme) specifies a row vector containing coefficients for a logistic model. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See the section Models o/her than last estimated model later in this entry. cutoff (#) (1star only) specifies the value for determining whether an observation has a predicted positive outcome. An observation is classified as positive if its predicted probability is > #. The default is_0.5. nograph (1roe and lsens) suppresses graphical output. eraph_options (1roe and lsens_ are any of the options allowed with graph, lzwoway;see [G] graph options. genprob (va'rname). gensens (varname), and gelaspec (varname) (lsens only) specily the names of new variables created to contain, respectively, the probability cutoffs and the corresponding ser_sitivityand specificity. replace (lsens only) requests tha) if existing variables are specified for genprob(), or geaspec (), they should be ovem'ritten.
'i
gensens(),
1
Optionsfor predict p, the default, calculates the probability
of a positive outcome.
xb calculates the linear prediction. std_p calculates the standard error of the linear prediction. dbeta calculates the Pregibon (1981) A_ influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer and Lemeshow (1989)jargon, this statistic is M-asymptotic, that is, adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer and Lemeshow (1989) AX2 influence statistic reflecting the decrease in the Pearson X2 due to deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer and Lemeshow (1989) AD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern. hat calculates the Pregibon (1981) leverage or the diagonal elements of the hat matrix adjusted for the number of observations that share the same covariate pattern. number numbers the covariate patterns observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered l, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer and Lemeshow for the number of observations that share the same covariate pattern.
(1989) and adjusted
rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow and adjusted for the number of observations that share the same covariate pattern.
(1989)
rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing fot excluded observations. See JR] legit for an example. asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameter from the model. See [R] logit for an example. nooffset is relevant only if you specified offset (vamame) for logistic. It modifies the calculations made by predict so that they ignore the Offset variable: the linear prediction is treated as x3b rather than x ab + offset a.
Remarks Remarks are presented under the headings logistic and logit Robust estimate of variance lilt lstat lroc lsens Samples other than estimation sample Models other than last estimated model predict after logistic
]
1 225
- uxjmtm lOgisciadd Iogit i
logistic provides an alternative and preferr_ way to estimate maximum-likelihood logit models, the other Choice being logit described in [R] iogit, First, let us dispose of some confusing terminology. We use the words logit and logistic to mean the same thing: maximum likelihood estimation. To some, one or the other of these words connotes trarisfOrming the dependent variable and using weighted least squares to estimate the model, but that is riot ho'& we use either word here. Thus, the logit and logistic commands produce the same res_tlts, The logistic command is generally preferred to logit because logistic presents the estimates in terms 6f odds ratios rather than coefficients. To a few, this may seem a disadvantage, but you can type logb:t without arguments after logistic to see the underlying coefficients. Nevertfieless. [R] log'it is still worth reading because logistic shares the same features as logit. incl_ud_ngomitting variables due to collinearity or one-way causation. For an introduction to logistic regression, see Lemeshow and Hosmer (1998) or Pagano and Gauvreau (2000, 470-487); for a thorough discussion, see Hosmer and Lemeshow (t989: second edition_ foghcoming in 2001).
> Example Colisidtr the following dataset from a study of risk factors associated with low birth weight des¢ribed ]n Hosmer and Lemeshow (1989, appendix 1). ., ddscribe Contains data from Ibw.dta ob_: 189 ,vaz]s: Ii :size:
_ari_ble name . ]
3,402 storage type
Hosmer _ Lemeshow data 18 Jul 2000 16:27
(95.!% of memory f_ee) di6play format
valu_ label
variable label
race
fd
int
_8,0g
identification code
]_bw v(ge l%_t Z_tce
byte byte int byte
_8.Og XS.0g _8.Og _8.0g
birth weight<25004 age of mother weight at last menstrual period race
s_nok_ 1_2 _t f_v
b_e byte byte byte byte
_8.04 _8.04 _8.04 _8.04 _,8.04
b_t
int
_8.04
, !
smoked during pregnancy prematttrelabor history (count) has history of hypertension presence, u_erine irritability number of visits to physician during let trimester birth weight (grams)
Sbrt_d by :
They _ant!to investigate thecausesoflow bi_hWeightInthis dataset, race isa categorical variable indicating _vhether a person is white (race = 1), black (race --- 2), or other (race -- 3). The authors want irldichtor_(dummy) variables for race included in the regression. (One of the dummies, of course, 'mu)st be omitted.) Thus, before we can _stimate the model, we must create the race dummy The_e ale a number of ways we could do this.: but the easiest is to let another Stata command, xi. do i! fdr uI. we type xi: in front of our logistic command and in our varlist include not race ) l
;; ._i:i
_o i. race,ug,suc -- cogmuc regression but to indicate we want the indicator
variables for this categorical
variable;
see [R] xi for
the full details. . xi: logistic low age lwt i.race smoke ptl ht ui i.race _Irace_l-3 (naturally coded; _Irace_l omitted) Legit estimates
Log likelihood =
-100.724
Number of obs LK chi2(8) Prob > chi2
= = =
189 33.22 0.0001
Pseudo K2
=
0.1416
,r
low
Odds Katie
Std. Err.
age lwt _Irace_2 _Irate_3
.9732636 .9849634 3.534767 2.368079
.0354759 .0068217 1.860737 1.039949
ptl smoke ht ui
1.719161 6.249602 2.517698 2.1351
.5952579 4.322408 1.00916 .9808153
z
P>lz[
[95_ Conf. Interval]
-0.74 -2.19 2.40 1.9_
0.457 0.029 0.016 0.050
.9061578 .9716834 1.259736 1.001356
1.045339 .9984249 9.918406 5.600207
1.5: 2.36i 2. 1.6_
0.118 0.021 0.008 0.099
.8721455 1.611152 1.147676 .8677528
3.388787 5.523162 24.24199 5.2534
The odds ratios are for a one-unit change in the variable. If we wanted the odds ratio for age to be in terms of 4-year intervals, we would gen age4 = age/4 . xi: logistic Io_ age4 lwt i.race smoke ptl ht ui (ou_utomit_d)
After logistic,
we can type logit
to see the mode in terms of coet_cients
and standard errors:
logit Legit estimates
Log likelihood =
-100.724
low
Coef.
age lwt _Irace_2 _Irace_3 smoke
-.0271003 -.0151508 1.262647 .8620792 .9233448
ptl ht ui cons
.5418366 1.832518 .7585135 .4612239
Std. Err. .0364504 .0069259 .5264101 .4391531 .4008266
z
Number of obs LRchi2(8) Prob > chi2
= = =
189 33.22 0.0001
Pseudo R2
=
0.1416
P>Jzl
[95_ Conf. Interval]
-0.74 -2.t9 2.40 1.96 2.30
0.457 0.029 0.016 0.050 0.021
-.0985418 -.0287253 .2309024 .0013548 .1377391
.0443412 -.0015763 2.294392 1.722804 1.708951
.346249 .6916292 .4593768
1.56 2.65 1.65
0.118 0.008 0.099
-.136799 .4769494 -.1418484
1.220472 3.188086 1.658875
1.20459
0.38
0.702
-t.899729
2.822176
If we wanted to see the logistic output again, we would type logistic without arguments. <3
> Example You can specie' the confidence interval for the odds ratios with the level () option, and you can do this either at estimation time or when you replay the model. For instance, to see our previous models with narrower, 90% confidence intervals,
_
logistic-- Logistic reg_
!
• lqgistic, Logft
227
level(90)
estimates
Log likelihood
=
-100.724
Number of obs LR chi2 {8)
= =
Prob
=
0,0001
=
O. 1416
> chi2
Pseudo
R2
189 33.22
Robust low
Odds
age lwt _Irate
2
Ratio
Std.
Err.
.9732636 .9849634
.0329376 .0070209
z -0.80 -2.13
P>Izl
[95_, Conf.
Interval]
O. 423 O. 034
.9108015 .9712984
1.040009 .9988206
3.534767
I.793616
2.49
O.013
I.307504
9.556051
_Irace_3
2.368079
I.026563
1.99
O.047
i.012512
5. 538501
smoke
2. 517698
,9736416
2.39
O. 017
1.179852
5,372537
ptl ht
1. 719161 6.249602
.7072902 4.102026
1.32 2.79
O. 188 O. 005
.7675715 1. 726445
3. 850476 22. 6231
ul
2.1351
I. 042775
1.55
O. 120
.8197749
5. 560858
<]
RobuStestimateof variance If you specify robust. Stata reports the robust estimate of variance described in [U]23,11 Obtaining rob_ist variance estimates Here is the model previously estimated with the robust estimate of variance: xi: logistic
LOgi_
smoke _ptl ht
low age lwt i.race
i.rate
_trace_l-3 estimates
i_)g likelihood
ui, robust
(liaturaally coded;
_Irace_1
omitted)
Number of obs _ald chi2 (8)
=
-100.724
= =
189 29.02
Proh > chi2
=
0. 0003
Pseudo
=
0.1416
R2
Robust low
Odds Ratio
Std.
Err.
z
P>}zl
[957, Conf.
Interval]
0.423 0.034
.9108015 .9712984
1.040009 ,9988206
r
age lwt iIrace_2 JIrace_3 smoke ptl ht ui
.9732636 .9849634
.0329376 .0070209
-0.80 -2, 13
3. 534767 2. 368079
I.793616 1.026563
2.49 i.99
0.013 O.047
I.307504 1.012512
9.556051 5.538501
2. 517698
.9736416
2.39
0. 017
1.179852
5.372537
1.7t9161
.7072902
1.32
O. 188
.7675715
3. ff50476
6,249602 2. 1351
4. 102026 1.042775
2,79 I. 55
0.005 O. 120
1.726445 .8197749
22.6231 5.560858
Additionally, robust allows you to specify cluster() and is then able, within cluster, to relax the assumpiion of independence. To illustrate this, we have made some fictional additions to the low-birth-Weight data. Pretend [hat these data are not a random sample of mothers but instead are a random sample of mothers+from a random sztmple of hospitals. In fact, that may be true--we do not know the history of these_dam but we can pretend in any case.
i i
H0spital$ specialize and it would not be too incorrect to say that some hospitals specialize in more difficult cases. We are going to show two extremes. In one, all hospitals are alike but we are going to estimate bnder the possibility that they might differ. In the other, hospitals are strikingly different In bc/_hCases, we assume patients are drawn from 20 hospitals. In both examples, we will estimate the same model and we will type the same command to estim_ate!it. !Below are the same data we have been using but with a new variable hospid, which ident_fie_ frbm which of the 20 hospitals each patient was drawn (and which we have made up):
_r.o
" F_
,ug,_uu-
Logm.c
regressl0N
. xi: logistic low age lwt i.race smoke ptliht ui, robust cluster(hospid) i.race _Irace_1-3 (naturally coded; _Irace_l omitted) Logit estimates
Log likelihood =
-100,724
Number of obs Wald chii(8) Prob > chi2
= = =
189 49.67 0.0000
Pseudo R2
=
0.1416
(standard errors adjusted for clustering on hospid) Robust low
Ratio
Std. Err.
age lwt _Irace_2 _Irace_3 smoke
.9732636 .9849634 3,534767 2.368079 2.517698
.0397476 .0057101 2.013285 .8451325 .8284259
ptl ht ui
1.719161 6.249602 2,1351
.6676221 4.066275 1o093144
z
P>_zl
[957 Conf. Interval]
-0.66 -2.6!1 2.2_ 2.42 2.81
0,507 0,009 0.027 0.016 0.005
.898396 .9738352 1,157563 1.176562 1.321062
1.05437 .9962187 10,79386 4.766257 4.79826
1.40 2.82 1.48
0.163 0.005 0,138
.8030814 1,745911 .7827337
3.680219 22.37086 5.824014
The standard errors are quite similar to the standard etTors we have previously obtained, whether we used the robust or the conventional estimators. In this example, we invented the hospital ids randomly. Here are the results of the estimation with the same data but with a different set of hospital ids: . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid) i.race _Irace_l-3 (naturally coded; _Irace_1 omitted) Logit estimates
Log likelihood =
-100.724
Number of obs Wald chii(8) Prob > chi2
= = =
189 7,19 0.5167
Pseudo R2
=
0.1416
(standard errors adjusted for clustering on hospid) Robust Std. Err.
low
0dds Ratio
age lwt _Irace 2 _Irate 3 smoke
.9732636 .9849634 3.534767 2.368079 2.517698
.0293064 .0106123 3.120338 1.297738 1.570287
ptl ht ui
1.719161 6.249602 2.1351
.6799153 7.165454 1.411977
z
P>[zJ
[957 Conf. Interval]
-0,90 -1,41 1.43 1.57 1.48
0.368 0.160 0.153 0,116 0.139
.9174862 ,9643817 .6265521 .8089594 .7414969
1.032432 1.005984 19.9418 6.932114 8.548654
1.37: 1.60 1.15
0.171 0.110 0.251
.7919046 .660558 .5841231
3.732161 59.12808 7.804266
Note the strikingly larger standard errors. What happened? In these data, women most likely to have low-birth-weight babies are sent to certain hospitals and the decision on likeliness is based not just on age, smoking history, etc., but on other things that doctors can see but that are not recorded in our data. Thus, merely because a woman is at one of the centers identifies her to be more likely to have a low-birth-weight baby. So much for our fictional example. The rest of this' section uses the real low-birth-weight data. To remind you, the last model we left off with was
r"
logistic-- Loglstlcregression i
i •
229
,
• Xi: logistic low age lwt i.race smoRe ptl ht ui i._ace _Irace_1-3 (naturally coded; _Irace_l omitted) Logit estimates
Log likelihood =
Number of obs LR chi2(8) Prob > chi2 Pseudo R2
-100.724
low
Odds Ratio
age l'.*t _!race_2 _Irace_3 smoke
.9732636 .9849634 3.534767 2.368079 2. 517698
.0354759 .0068217 1.860737 1.039949 1. 00916
1.719161 6.249602 2.1351
.5952579 4.322408 .9808153
ptl ht ui
Std. Err.
z -0.74 -2.19 2.40 1.96 2.30 I.56 2.65 1,65
= = = =
189 33.22 0.0001 0.1416
P> iz I
[95_ Conf. l_terval]
O.457 O.029 O.O16 O.050 0.021
.9061578 .9716834 1.259736 1.001356 1. 147676
t. 045339 .9984249 9.918406 5.600207 5. 523162
O.118 0.008 O. 099
.8721455 1.611152 .8677528
3,388787 24.24199 5. 2534
lilt Ifit Computes i goodness-of-fit tests, either the Pearson X2 test or the Hosmer-Lemeshow i i
test.
By _de(ault, lfit. Istat,lroc, and lsenS compute statistics for the estimation sample using the llast rdodel estimated by logistic. However, samples other than the estimation sample can be spetifibd;t'l.seethe section Samples other than esflmation sample later in this entry. Models other thanthe )last mbdel estimated by logistic can also be specified; see the section Models other than last estimated )model
> Example
i 1
lfi_c, fyped without options, presents the Pearson X2 goodness-of-fit test for the .estimated model. The: Pdarsbn k 2 goodness-of-fit test is a test of:the observed against expected number of responses usir_g ¢elI_ defined by the covafiate patterns; see predict with the numbers option below for the defiiaitibn bfcovariate patterns. ._if_.t L_gi_tic model for low, goodness-of-fit test I number of observations = 189 _umi_r of covariate patterns = Pearson chi2(t73) = Prob > chi2 =
182 i_9.24 0.3567
Our mddell fits reasonably well. We should note, however, that the number of covafiate patterns is close td the{number of observations, making the applicability of the Pearson X2 test questionable, but not riec_ssa_ily inappropriate. Hosmer and Lemeshow (1989) suggest regrouping the data by ordering on the predicted probabilities and then forming, say, I0 nearly equal-size groups. 1fit with the group(_
o_tion does this:
.!l_i_, group(to) L_gis_ic model for low, goodness-of-fit test -_(_abl_collapsed on quantiles of estimated probabilities) i ) number of observations = 189 number of groups = :_ iHosmer-Lemeshow ohJ2(8) = i
Prob > chi2 =
10 @.65 0.2904
,
;
230 Logistic regression Again, welogistic cannot --reject our model. If you specify the tableoption, Ifit displays the groups along with the expected and observed number of positive responses (low-birth-weight babies):
_'
Ifit, group(lO) table Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) _Group
_Prob
_Obs_ i
_Exp_l
_Obs_O
_Exp_O
_Total
1 2
0.0827 O. 1276
0 2
1.2 2.0
19 17
17.8 17.0
3
0.2015
6
3.2
13
15.8
19
4
0.2432
1
4.3
18
14.7
19
5
O. 2792
7
4.9
12
14.1
19
6
O. 3138
7
5.6
12
13.4
19
7 8
O. 3872 0.4828
6 7
6.5 8.2
13 12
12.5 i0.8
19 19
9
0.5941
10
10.3
9
8.7
19
0.8391
13
12.8
5
5.2
18
10
number
of observations
number Hosmer-Lemeshow
=
189
of groups = chi2 (8) =
Prob
> chi2
19 19
i0 9.65
=
0.2984
q
Q Technical Note ifit with the group() option puts all observations with the same predicted probabilities into the same group. If, as in the previous example, we request 10 groups, the groups that lfit makes are [P0,Plo], (Pl0_P20], (P20,P30], -.-, (P90_Pl00], where Pk is the kth percentile of the predicted probabilities, with Po the minimum and Ploo the maximum. If there are large numbers of ties at the quantile boundaries, as will frequently happen if all independent variables are categorical and there are only a few of them, the sizes of the groups will be uneven. If the totals in some of the groups are small, the X 2 statistic for the Hosmer-Lemeshow test may be unreliable. In this case, either fewer groups shtutd be specified or the Pearson goodness-of-fit test may be a better choice. El
> Example The tableoption can be used without the group()option. We would not want to specify this for our current model because there were 182 covafiate patterns in the data. caused by the inclusion of the two continuous variables age and lwt in the model. As an aside, we estimate a simpler model and specify table with lfit: logistic Logit
Log
low
_Irate_2
_Irate_3
smoke
ui
estimates
likelihood low
= -107.93404 Odds
Ratio
Std.
Err.
z
Number of obs LR chi2(4) Prob > chi2
= = =
189 18.80 0.0009
Pseudo
=
0.0801
R2
P>Iz_
[95_
Conf.
1.166749
Interval]
_Irace_2
3.052746
1.498084
2.27
0.023
_Irace_3
2.922593
1.189226
2.64
0.008
1.31646
6.488269
2.945742
1.101835
2.89
0.004
1.41517
6.131701
2.419131
1.047358
2.04
0.041
1.03546
5.651783
smoke ui
7.987368
_
.
i
i
tf_t,
logistic-- Logistic regression
_Exp_O
Total
l
_1 12
O. 1230 0.2533
I3
4.9 1.0
373
35.1 3.0
404
!
!4, ':5
O. 2923 0.2997
15 3
12.6 3.9
28 10
30.4 9.1
43 13
i8 i9
O. 5087 O. 5469
2 2
1.5 4.4
1 6
I. 5 3.6
3 8
_0
O.5577
6
5.6
4
4.4
10
_I
0.7449
3
3.0
1
1.0
4
16 0.4978 !7 0.4998
__ro_p | _I i2 i3 14 i5 !6 17 i8 19 dO _I
I
231
tab
LCgi_tic model for low, goodness-of-fit test __rodp Prob _Obs_l _Exp_I _Obs_O
! I
4 4
4.0 4.5
_Prob O. 1230 O. 2533 O.2907 O. 2923 O. 2997 O. 4978 O. 4998 O.5087 O. 5469
_Irace_2 0 0 0 0 1 O 0 1 0
_irace_3 O O 1 O O 1 0 O 1
O.5577 O.7449
1 0
0 1
number of observations _umter of covariate patterns Pearson chi2(6) Prob > chi2
= = = =
4 5
4.0 4.5
smoke O 0 0 1 0 0 I O 1
ui 0 1 O O O 1 1 I 0
1 1
0 1
i
8 9
18_ 5.71 0.4569
3 Technical l_ote i I
,
tog_st_c and lfit keep track of the estimation sample. If you type logistic if x==l. then when y6u t)pe lfit the statistics will be calculated on the x==l subsample of the data automatically. t
You isho_hldspecify if or in with lfit only when you wish to calculate statistics for a set of : " i observaiion_ other than the estimation sample. Sde the section Samples other than estimation sample later m _h_ entry. If _.thez l_gistic model was estimated with iweights, 1fit properly accounts for the weights . { in it_ cdtcu[ations. (Note: ifit does not allow pweights.) You do not have to specify the weights when y6u Nn ifit.Weights should only be sp_ified with ifit when you wish {o use a different set of v_eig_ts.
(Continued on next page)
,
i J
FI
istat232
logistic -- Logistic regression
> Example istat presents the classification
statistics and classification
table after logistic.
• Istat Logistic
model
for
low True
Classified
I
Total +
l
Total
38
118
156
59 21
130 12
189 33
>=
Sensitivity
Pr(
+l D)
35,59_
Specificity
Pr(-I-D)
.5
90.77Z
+)
63.64_
-)
75.64_
-I D)
64.41_
Positive
predictive
value
Pr(
Negative
predictive
value
Pr(-DI
DI
False
+ rate
for
true
~D
Pr(+[~D)
False
- rate
for
true
D
Pr(
False
+ rate
for
classified
+
Pr(~DI
+)
36.36_
False
- rate
for classified
-
Pr(
-)
24.36_
default, Istat
Isens
-D
Classified + if predicted Pr(D) True D defined as low -= 0
Correctly
By
D
command
D[
9.23_
classified
uses can
Z3.54_
a cutoff of 0.5, although
be used
to review
you
the potenti_
can
vm-y
this with
cutoffs; see isens
the cutoff
() option. The
below.
q
iroc For other receiver operating characteristic
(ROC) commands and a complete description,
see [R] roe.
lroc graphs the (ROC) curvema graph of sensitivity versus one minus specificity as the cutoff c is varied and calculates the area under it. Sensitivity is the fraction of observed positive-outcome cases that are correctly classified; specificity is the fraction of observed negative-outcome cases that are correctly classified. W-hen the purpose of the analysis is classification, one must choose a cutoff. The curve starts at (0, 0), corresponding to c = 1, and continues to (l, t ). corresponding to c -= 0. A model with no predictive power would be a 45° line. The greater the predictive power, the more bowed the curve, and hence the area beneath the curve is often used as a measure of the predictive power. A model with no predictive power has area 0.5: a perfect model has area 1. The ROC curve was first discussed in signal detection theory (Peterson. Birdsall. and Fox 1954) and then was quickly introduced into psychology (Tanner and Swets I954) It has since been applied in other fields, particularly medicine (for instance, Metz 1978). For a classic text on ROC techniques, see Green and Swets (1974).
i
i
logistic-
Logisticregression
233
•_ Exampl ROC;cuVves are typically used when the poinf; of the analysis is classification, which it is not in _
our io_,-bi_th-welght model. Nevertheless, the R0C curve is • i lr!c L_gi_tic i
-
model
for low
n_mb_r of observations a_ea Iunder ROC curve
, I
I
:
189 0.7462
: ;
Area
i
under
ROC curve
= 0.7_62
i O0
/
/
/
6.75
t"
/
0,50
We see ithal the area under the curve is 0.7462.
isens '
q
_ i
lseris also plots sensitivity and specificity; it plots both sensitivity and specificity versus probability cuto_.fc_ T_e graph is equivalent to what you would get from Istat
if you varied the cutoff probability
from! 0io_. • lse_s
Sensi'_ivify
i
i001
"6
0.75
,
o Specificity
i
I
!
I
1
m t_ 050 0,25
i-
i
t
o.00!
t P_obabihty
1 Cutoff
;_ I
Isens will optionally create new variables specificity. 234 logistic -- Logistic regression isens,
genprob(p)gensens(sens)
containing
genspec(sp_c)
the probability
cutoff,
sensitivity,
and
nograph
Note that the variables created will have M + 2 distinct nonmissing covariate patterns plus one for c = 0 and another fore = I.
values, one for each of the M
Samples other than estimation sample Ifit, Istat,Iroc, and Isens can be used withl samples other than the estimation sample. By default, these commands remember the estimation sample used with the last logistic command. To override this, simply use an if or in restriction to select another set of observations, or specify the all option to force the command to use all the observations in the dataset. If you use lfit with a sample that is completely different from the estimation sample (i.e., no overlap), you should also specify the outsample option so that the X 2 statistic properly adjusts the degrees of freedom upward. For an overlapping sample, the conservative thing to do is to leave the degrees of freedom the same as they are for the estimation sample.
> Example Suppose that we wish to develop a model for predicting low-birth-weight babies. One approach to developing a prediction model would be to divide our data into two groups, a developmental sample and a validation sample. See Lemeshow and Le Gall (1994) and Tilford et al. (1995) for more information on developing prediction models and severity scoring systems. We will do this with the low-birth-weight the data into two samples. .
use
lbw,
(Hosmer . set
data)
1
r = uniform()
gen • sort gen
r group
(95 mlssing
= I if _n <= _N/9 values
. replace
group
(95 real
changes
generated)
= 2 if group == . made)
Then we estimate a model using the first sample (group==l), • xi: logistic i.race Legit
Log
First, we randomly divide
clear
_ Lemeshow
seed
data we considered previously.
low
age lwt i.race _Irace_l-3
estimates
likelihood
= -44.293342
smoke
our developmental
ptl ht ui if group==l (naturally coded; _Irace_l
sample.
omitted)
Number of obs LR chi2 (8) Prob > chi2
= = =
94 29.14 0.0003
Pseudo
=
0.2475
R2
°
;
logistic -- Logisticregression
! i
low
Odds Ratio
age
.91542
Std°
z
P>lzl
[95Z Conf.
Interval]
.0553937
-t. 46
O, 144
.8130414
1.03069
3.78442 .01t2295
2.17 -_.25
O. O. 030 025
1. 170327 .9526649
21.90913 .9966874
I
_I:[ace_21 i lwt smoke
.909912
.5252898
-0. I6
O. 870
.2934966
2.820953
l °_ f
i ptl _i_ace_3 i ht ! ui
3. 033543 2.606209 21.07656
1.507048 1.657608 22.64788
_, 23 t. 51 2.84
.988479
.6699458
O. 025 O. 132 O. 005 O. 986
1 • 145718 .7492483 2. 565304 .2618557
8.03198 9.065522 173. t652 3.731409
i!
i i [
To test
_
5.063678 .9744276
Err.
-0,02
_" " " c_hbratlon m the developmental sample, the Hosmer-Lemeshow
goodness-of-fit test is
I group (10) calcul!te l_it d u_ng ifit. for
Log_ist_c model low, goodness-of-fit t_st (T_bleicollapsed on quantiles of estimated probabilities) Inumber of observations = 94 I
10
number of groups = osmer-Lemeshow chi2 (8) = Prob > chi2 =
6_67 0_5721
Note tha we did not specify an if statement _-lth Ifit since we wanted to use the estimation sample. ;inqe ! the test is nonsignificant, we are satisfied with the fit of our model. Running _roc gives a measure of the discrimination: .
oc,
nograph
Logistic model for low 94 0.8158
number of observations = ar_a u.%derROC curve =
No_: we lest the calibration of our model by lJerforming a goodness-of-fit test on the validation sample. We _pecify the outsampleopUon so that the degrees of freedom are 10 rather than 8, Lo@istic . !fitI ifmodel group==2, for low, group(iO) goodness-of-fit table outsample t_st (Table collapsed on quantiles of estimateflprobabilities) _G_ou_ ' II
_Prob 0.0725 O. 1202 O. 1549 O.1888
_Obs_l 1 4 3 1
_ExpI 0.4 O.8 1.3 1.5
_Obs_O 9 5 7 8
_Exp_O 9.6 8.2 8.7 7.5
_Total 10 9 I0 9
O. 3258
4
2.7
5
I
I
O.2609 O.42t7
3 2
2.2 3.7
7 8
6.3 6. 3 7.8
9 10 10
I
181
O. 6265 0.4915 0.9737
4 3 4
5.5 4.1 7.1
6 65
4.5 4.9 1.9
10 9 9
I
235
I number of observations nu.iIlber groups [ of osmer-Lemeshow chi2(lO) Prob > chi2
=
95
= = =
I0' 28',03 0.0018
.._
We must acknowledge that our model does not fit well on the vaIidation 236 logistic -- Logistic discrimination in the validation regression sample is appreciably lower as well. • iroc
if group==2,
Logistic number i
area
model
nograph
for
low
of observations
under
ROC
sample. The model's
curve
=
95
=
O. 5839
,
q
Models other than last estimated model By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. specify other models using the beta() option.
One can
i> Example Suppose that someone publishes the following logistic model of low birth weight: Pr(low
= 1) = F(-0.02
age - 0.01 lwt + 1.3 black
where F is the cumulative logistic distribution. are the equivalent of what logit produces.
+ 1.1 smoke + 0.= ptl
Note that these coefficients
-t 1.8 ht + 0.8 ui + 0.5) are not odds ratios: they
We can see whether this model fits our data. First, we enter the coefficients as a row vector and label its columns with the names of the independent variables plus _cons for the constant (see [el matrix define and [P] matrix rowname). matrix
input
• matrix
b =
colnames
C-.02
-.01
b = age
lwt
1.3 black
1.1
.5 1.8
smoke
.8 .5)
pt]. ht
ui
_cons
We run lfit using the beta() option to specify b. The dependent variable is entered right after the command name, and the outsample option gives the proper degrees of freedom. . ifit
low,
Logistic (Table
beta(b)
model
for
collapsed number
group(lO) low,
goodness-of-fit
on quantiles
of observaZions
number Hosmer-Lemeshow
outsample
=
of groups = chi2 (I0) =
Prob
> chi2
:
Although the fit of the model is poor, lroc •iroc
low,
Logistic number area
beta(b)
model of
under
for
probabilities)
189 i0 27.33 0.0023
shows that it does exhibit some predictive
ability.
nograph low
observations ROC
test
of estimated
curve
= =
189 0.7275
q
logistic-- Logisticregression
237
predict after logistic p#edictis used after logisticto obtain predicted probabilities, residuals, and influence statistics for tt_e ostimation sample. The suggested diagnostic graphs below are from Hosmer and Lemeshow (1989). Where they are more elaborately explained. Also see Collett (1991. 120-t60) for a thorough discussion cjf model checking.
predict _wiihout options Typing p_edict p after estimation calculates _he predicted probability of a positive outcome. We ptevibusly ran the model logisticlow age Ivt _Irace_2 _Irace_3 smoke ptl ht ui. We obtain tl_epredicted probabilities of a positive outcome by typing • _re4ict
P
(o]_ti_n
p assumed;
• dum_arize
Pr(tow))
p low Obs
V_iable p low
189 189
Mean .3121693 .3121693
Std. Dev. .1918915 .4646093
Max
Hin .0272559
.8391283 0
I
predibt _vit_ the xb and stdp options predict with the xb option calculates the linear combination xjb, where xj are the independent variaNes! in he jth observation and b is the estimated parameter vector. This is sometimes known as the incle_ fu _ction since the cumulative distribution function indexed at this value is the probability of a _siiive outcome. With the _tdp option, predict calculates the standard error of the prediction, which is not adjusted tbr replidated covariate patterns in the data. The itifluence statistics described below are adjusted for replicated c(_variate patterns in the data.
predict Wit_ the residuals option predict _can calculate more than predicted probabilities. The Pearson residual is defined as the square root bf the contribution of the covariate pattern to the Pearson X2 goodness-obfit statistic. signed adcor_ing_to whether the observed number of positive responses within the covanate pattern is less th_n Or greater than expected. For instance, lz_red_ct N_rize
r,
residuals r, detail Pearson
i '
residual
_ercentiles
Smallest
IZ_
-1.750923
-2.283885
5_
-1.129907
-1.750923
IOZ
-,9581174
-1.636279
Obs
25_:
-,6545911
-1.636279
Sum of
50Z
-.3806923
189 Wgt.
Mea_ Largest 2.23879
Std.
189 -.0242299
Dev.
,9970949
75_
.8162894
90ZI
1.510355
2.317558
Variance
.9941981
95Z 99Z
1.747948 3.002206
3,002206 3,126763
Skewness Kurtosis
.8618271 3.038448
238 notice logistic-Logistic We the prevalence of aregression few, large positive residuals: t
'"
• sort list
r id r 10w
p age
race
in -5/1
185.
id 33
r
low 1
186.
57
2.23879
1
187.
16
2.317558
1
188.
77
3.002206
1
189.
36
3.126763
1
2.224501
p
age 19
race white
,166329
15
white
.1569594
27
other
.0998678
26
white
.0927932
24
white
.1681123
predict with the number option Covariate patterns play an important role in logistic regression, Two observations are said to share the same covariate pattern if the independent variables for the two observations are identical. Although one thinks of having individual observations, the statistical information in the sample can be summarized by the covariate patterns, the number of observations with that covariate pattern, and the number of positive outcomes within the pattern. Depending on the model, the number of covariate patterns can approach or be equal to the number of observations or it can be considerably less. All the residual and diagnostic statistics calculated by Stata are in terms of covariate patterns, not observations. That is, all observations with the same covariate pattern are given the same residual and diagnostic statistics. Hosmer and Lemeshow (1989) argue that such "M-asymptotic" statistics are more useful than "N-asymptotic" statistics. To understand the difference, think of an observed positive outcome with predicted probability of 0.8. Taking the observation in isolation, the "residual" must be positive--we expected 0.8 positive responses and observed 1. This may indeed be the "correct" residual, but not necessarily. Under the M-asymptotic definition, we ask how many successes we observed across all observations with this covariate pattern. If that number were, say, 6, and there were a total of 10 observations with this covariate pattern, then the residual is negative for the covariate pattern we expected 8 positive outcomes but observed 6. predict makes this kind of calculation and then attaches the same residual to all observations in the covariate pattern. Thus, there may be occasions when you want to find all observations number allows you to do this: predict
pattern,
• summarize
We
number
pattern
Variable
Obs
pattern
189
previously
estimated
Mean
89.2328
the model
ui over 189 observations.
predict
sharing a covariate pattern.
logistic
Std.
Dev.
Min
53. 16573
low
age
1
lwt
_Irace_2
Max
182
_Irace_3
smoke
ptl
ht
There are 182 covariate patterns in our data.
with the deviance
option
The deviance residual is defined as the square rooT:of the contribution to the likelihood-ratio test statistic of a saturated model versus the fitted model. It has slightly different properties from the Pearson residual (see Hosmer and Lemeshow, 1989): predict
d,
de_iance
•
......
........
logistic-- Logistic regression Summarize
Percentiles
5_
residual
Smallest
-1.843472
1_
-1.911621
-i. 33477
-1.843472
10_ '_
-!. 148316
-I .843472
Dbs
25_
-.8445325
-1.674869
Sum of _/gt.
50_
-.5202702
Mean Largest
189 189 -. 1228811
Std. Dev.
I.049237
175_ : 90_
.9129041 1,541558
1.894089 1. 924457
Variance
1. 100898
!95_ !99_
i.673338 2. 146583
2. 146583 2. 180542
Skewness Kurtosis
.6598857 2. 036938
predict With the rstandard option z PearsOn residuals do not have a standard deviation equal to t. a fine point, rstandard Pearson _esiduals normalized to have expected: standard deviation equal to !. i redict
rs,
ummarize
generates
rstandard r rs
Variable
Obs
Mean
Std. Dev,
Mix
Max
r
189
-.0242299
.9970949
-2.283885
3. 126763
rs
189
-.0279135
i,026406
-2.4478
3.149081
I
'
• +rrelate
i(o=189>
r rs r
r
t, 0000
rs
O. 9998
rs
1. 0000
Rememblr that we previously created r containing the (unstandardized) Pearson residuals, In these data, wh&her you use standardized or unstandardized residuals does not much matter, I
pred_t
ith the hat option
, : _
hat @culates the leverage of a covariate pattern a scaled measure of distance in terms of the in_tep_ndent variables. LaNe values indicate covariate patterns "far" from the average covariate patlern--_patterns that can have a large effect on the estimated model even if the corresponding residual i[ small. This suggests the following:
[
}
239
d, detail deviance
[
i
i
(Continuefl on next page)
,_
240
.
predict logistich, graph h r,
hatLogis_c regression border yline(O) ylab xlab
°g 15
o
_,
o o000
0
o
o
cO0
oooo
oo
I
o
o
_,
o °°
o
o
o
o
_
vj °
oo
O-
Pearson
residual
The points to the left of the vertical line are obserx,_ed negative outcomes: in this case, our data contain almost as many covariate patterns as observatiens, so most covariate patterns are unique. In such unique patterns, we observe either 0 or 1 success and expect p, thus forcing the sign of the residual. If we had fewer covariate patterns, which is to say, if we did not have continuous variables in our model, there would be no such interpretation arid we would not have drawn the vertical line at O. Points on the left and right edges of the graph represent large residuals--covariate patterns that are not fitted well by our model. Points at the top of our graph represent high leverage patterns. When analyzing the influence of observations on the model, we are most interested in patterns with high leverage and small residuals patterns that might otherwise escape our attention. predict with the dx2 option There are many ways to measure influence of which hat is one example, dx2 measures the decrease in the Pearson X 2 goodness-of-fit statistic that would be caused by deleting an observation (and all others sharing the covafiate pattern):
(Continued
on next page)
{
}
• _re_ict graph
dx2,
dx2
dx2 p, border
ylab
xlab
Paraphrasing Hosmer and Lemeshow (1989), the points going from the top left to the bottom right, correspond to covariate patterns with the number of positive outcomes equal to the number in the group; the points on the other curve correspond to 0 positive outcomes. In our data. most of the covariale patterns are unique, so the points tend to lie along one or the other curves: the points that are off' the curves correspond to the few repeated covariate patterns in our data in which all the outcomes a_e not the same.
! i
We exa_ainethis graph for large values of dx2--there are two at the top left. I i
i
predct w th the ddeviance option Anothel measure of influenceis the change in _thedeviance residuals due to deletion of a covarJate pattern:,
!
pr_dict As
with
d_2,
dd, ddeviance one
typically graphs ddevi_uce:
against the
probabi]ir}, of a positive outcome.
We
direct you ito Hosmer and Lemeshow (I989) foran example and the interpretation of this graph. predi_ With the dbeta option One_of the more direct measures of influence of interest to model fitters is the Pregibon (1981} ,me tsure, a measure of the change in the!coefficientvector that would be caused by deleting an observ_tion (and all others sharing the covartate pattern): dbe_a
i
(Continued on next page}
. predict
242
db, dbeta
logistic -- Logistic regression
ilt!
graph
db p, border
ylab
xlab
I
.75
{
I
J
I
"
-o p
o
_ _
o o o o
o
o
oo
o
o
c_mlmm_
.25 t
o
Q
o
o
_o
a_l_J_,::_o
o_Oo
o
o
o
J
o
eOo_
_
''
o 1
"T
Pr(low)
One observation .
sort
• list
has a large effect on the estimated coefficients.
in I
Observation
189
id lwt
188 95
ptl
3
ht
0
fry
0
bwt
_3637
0 117
p d
.839i1283 -1. 9111621
r rs
-2. 283885 -2.4478
5.99_726
dd
4. 197658
_Irace_3 pattern h db
dx2
low race
dx2
.1294439 ,8909163
Hosmer and Lemeshow graph
We can easily find this point:
db
p [w=db],
0 _ite
age smoke
25 1
ui
I
_Irace_2
0
(1989) suggest a graph that _combines two of the influence measures: border
Symbol
ylab
xlab
size proportional
tl("Symbol
size
proportional
to dSeta
I
Pro'fowl,
We can easily spot the most influential points by the dbeta
and dx2 measures.
to dBeta")
-
] i i
•
.
'_
logistic -- Logistic
regression
243
SavedReSults i
ti ! ! ! I
_og_s Scalars ic saves e(N) e df_.m) e r2_p)
in e(): number of observations model de_ees of freedom pseudo R-sqeared
tog likelihood, constant-only model number of clusters X2
e(ll_O) e(N_clust) e(chi2)
log likelihood
e Ii) :Macro_ e_ e_depvar) el wtype)
logistic name of dependent variable weight type
e(clustvar) name of cluster variable e(vcetype) covariance estimation method e(chi2type) Wald or LR: type of model X2
e(wexp)
weight expression offset
e(predict)
program used to implement predict
coefficient vector
e (V)
variance-covariance matrix of the estimators
eloffset)
test
Matrices e_b)
Functichls e sample)
marks estimation sample
_fi_s:_vesin r(): Scalars
_st_t i
rmmber of observations
r(df)
degrees of freedom
r(m)
number of covariate patterns _r groups
r(chi2)
X2
_aves in r ():
Scalars r(P_c_rr)
i r(P-n
I
r(N)
4)
r(P_p(_) r(P_.n_)
!roc
percent correctly classified
r(P_lp)
putative predictive value
sensitivity
r(P_ln)
negative predictive
specificity
r(P_Op)
false
false positive rate given true negative false negative rate given true positive'
r(P_On)
false negative rote given classified negalivc
positive
Vall}e
rate given
classified
positive
s_ves in r(): Scalars
i
r(N)
number of observations
r(area)
area under the ROC curve
in i
r():
Isens
Scalars r(N) number of observations
_
,,
Methods 3ndFormulas logis';ic,
lfit,
lstat,
lroc.
and lsens
are implemented
as ado-files. l! o
l
Define xj as the (row) vector of independent Variables, augmented by 1. and b as the correspond _v estimated 3arameter (column) vector, The logistic regression model is estimated bv logit: see {RI loRit for demih
!
of estimation.
....
The odds ratio corresponding to the ith coefficient is ¢i --- exp(bi). The standard error of the odds ratio = g'_si,-- where si isregression the standard error of bi estimated by logit. zo_ is sS Iogmuc Logistic
_!
Define lj = xj b as the predicted index of the jth observation. positive outcome is
The predicted probability
of a
exp,±i) 1+ exp(I )
Pj
If it Let 2_r be the total number of covariate patterns among the N observations. View the data as collapsed on covariate patterns j = 1, 2.... , M and define mj as the total number of observations having covariate pattern j and yj as the total number of positive responses among observations with covariate pattern j. Define pj as the predicted probability of a positive outcome in covariate pattern
j.
The Pearson X2 goodness-of-fit
statistic is M
x2=
(uj mjpj)2 j=, mjpj(i-
This X2 statistic has approximately M - k degrees of: freedom for the estimation sample, where k is the number of independent variables including the constant. For a sample outside of the estimation sample, the statistic has M degrees of freedom. The Hosmer-Lemeshow
goodness-of-fit
X 2 (Hosmer and Lemeshow
1980; Lemeshow and Hosmer
1982; Hosmer, Lemeshow, and Klar 1988) is calculated similarly, except rather than using the M covariate patterns as the group definition, the quanti!les of the predicted probabilities are used to form groups. Let G = # be the number of quantiles requested with group (#). The smallest index 1 <_q(i) <_ M such that q(i)
-G
j=l
gives Pq(i) as the upper boundary of the ith quantile for i = 1, 2..... first index.
G. Let q(0) -- 1 denote the
The groups are then !Pq(o),Pq(1)],
(Pq(1),Pq(2)],
...,
If the table option is given, the upper boundaries Pq(1) .... group number on the output.
(Pq(G-1),Pq(c,)] , PqfG) of the groups appear next to the
The resulting X 2 statistic has approximately G- 2 degrees of freedom for the estimation For a sample outside of the estimation sample, the statistic has G degrees of freedom.
sample.
m
logistic -- Logistic regression
predicitafk_r logistic =_ !,. i_
Index j will now be used to index observations, not covariate patterns. Redefine mj for each observatiol as the total number of observations sharing j's covariate pattern. Redefine yj as the total number of positive responses among observations sharing j's covariate pattern_
}
a'son residual for the jth observation is defined
rj= i
For mj > l, the deviance residual dj is defined
, i
dj==t=
_
i ;
i i
dy =
i i
!
i
0
(mj - yj)In
( )/ mj(1-pj)
In the limiting cases, the deviance residual
{-v/2mjll_(l-pj)[ ",/2mjllnp_l
if yj yj=O if = mj
The unldjusted diagonal elements of the hat matrix huj are given by hv_ = (XVXt)jj, where V is the e _timated covafiance matrix of parameters. The adjusted diagonal elements hj created by hat are th,m hj = mjpj (1 - pj) hu_. The-stalldardized Pearson residual rsj is rj/V/I ThePregibon
(1981)
- hi.
_-_j influence statistic is rj _hj
t
I _
yjl
is gi_venb3_
li ; : _:
II ( l
mj - yj
where file dgn is the same as the sign of (yj - mjpj). i
i!
245
(1- hj)
i The corresponding change in the Pearson X_ is r 2 is _Dj
=-Idj/(1 - hi).
sj. The corresponding change in the deviance residual
Istat and Isens
Again, et j index observations. Define c as the cutoff () specified by the user or. if not specified, as 0.5. Lel 7)3be the predicted probability of a positive outcome and yj be the actual outcome, which we will treat as 0 or I. although Stata treats it as 0 and non-0, excluding missing observations. A prediction is classified as positive if Pa >- c and otherwise is classified as negative. The classificati_m is correct if it is positive and yj -I or if it is negative and yj = O. .S'en._lttv_t3; is the fraction of yj = I observmions that are correctly classified. Specificity is the percent of !
!
= 0 obser_ations that are correctly classified.
Iroc
246
logistic --
Logistic regression
1
The ROC curve is a graph of specificity againstl (I - sensitivity), This is guaranteed to be a monotone nondecreasing function, since the number :of correctly predicted successes increases, and
,
the number
i
The area under the ROC curve is the area on the bottom of this graph, and is determined by integrating the curve. The vertices of the curve are determined by sorting the data according to the predicted index, and the integral is computed using the trapezoidal rule.
of correctly
predicted
failures
decreases
as the classification
cutoff
c decreases.
References Brady, A. R. 1998. sbe21: Adjusted population attributable fractions from logistic regression. State Technical Bulletin 42: 8-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 137-143. Cleves, M. and A. Tosetto, 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. State Technical Bulletin 55: 20-23. Collett, D, 1991. Modelling Binary Data. London: Chapman & Hall. Garrett, J. M. 1997. sbe14: Odds ratios and confidence intervat_ for logistic regression models with effect modification• State Technical Bulletin 36: 15-22. Reprinted in State Technical Bulletin Reprints, vol. 6, pp. 104-114. Gould. W. W. 2000. sgI24: Interpreting logistic regression in all its forms. State Technical Bulletin 53: 19-29. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 257-270. Green. D. M. and J. A. Swets. t974. Signal Detection Theorj and Psychophysics. rev. ed. Huntington, NY: Krieger. Hilbe, J. t997. sg63: Logistic regression: standardized coefficients and partiai correlations. State Technical Bulletin 35: 21-22. Reprinted in State Technical Bulletin Reprints, voI.' 6, pp. 162-163. Hosmer, D. W., Jr., and S. Lemeshow. 1980. Goodness-of+fit tests for the multiple logistic regression Communications in Statistics A9: 1043-1069.
model.
t989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.) Hosmer. D. W.. Jr.. S. Lemeshow, and J. Klar. 1988. Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometric Journal 30:: 911-924. Irala-Est_vez. J. de and M. A. Mart{nez. 2000. sg125: Automatic estimation of interaction effects and their confidence intervals. State Technical BuIletin 53: 29-31. Reprinted in State Technical Bulletin Reprints, vol. 9, pp. 270-273. Lemeshow. S. and D. W. Hosmer, Jr. 1982. A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology 115:: 92-106. • 1998. Logistic regression. In Encyclopedia of Biostatistics. ed. R Armitage and T. Colton. 2316-2327. John Wiley & Sons.
New York:
Lemeshow. S. and J.-R. Le Gall. 1994. Modeling the severity of illness of 1CU patients: a systems update. Journal of the American Medical Association 272: 1049-1055. Metz, C. E. 1978. Basic principles of ROC analysis. Seminarsi rn Nuclear Medicine 8: 283-298. Pagano, M. and K. Gauvreau. 2000. Principles of Biostatistics. 2d ed. Pacific Grove. CA: Brooks/Cole. Paul, C. 1998. sg92: Logistic regression for data including muitiple imputations. State Technical Bulletin 45: 28-30. Reprinted in State Technical Bulletin Reprints, vol. 8, pp. 180-183. Pearce, M. S. 2000. sgt48: Profile likelihood confidence intervals for explanatory variables in logistic regression. Stata Technical Bulletin 56: 45--47, Peterson, W. W.. T. G. Birdsalt, and W. C. Fox. 1954. The theory of signal detection. Trans. ItLE Professional Group on Intbrmalion Theory', PGIT-4: t71-212. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705-724. Tanner. W. P., Jr., and J. A. Swets. 1954. A decision-making theory of visual detection. P._ychological Review 61: 401-409_ TdIbrd. J. M., P. K. Roberson, and D. H. Fiser. 1995. sbel2: Using lilt and lroc to evaluate mortality prediction models. State Technical Bulletin 28: |4-18. Reprinted in 5;rata Technical BulletiJ7 Reprints. vol. 5. pp. 77-81.
[,
i
i
t I
I
logistic-- Logistic regression
247
i
i Tobias.A. 000. she36:Summarystatisticsreport for diagnostictests. StataTechnicalBulletin56: 16-18. Z Tobias.A. ,md M, J, Campbell.t998. sgg0: Akaike's |nformationcriterionand Schwarz'scriterion.State TechtTica! i Bullern 45: 23-25. Reprintedin Steta TechnicalBulletinReprints,vol. 8, pp. 174-177. Wee_ie,J. 998. sg87: Windmeijer'sgoodness-of-fittest for logistic regession. Stat_ TechnicalButle6n 44:_2-27 Reprinte) in State TechnicalBulletinReprints,vol. 8, pp. 153-t60. _
i
:i
Also See i Complem rotary:
[R] adjust, [R] lincom, [_] linktest, [R] Irtest JR] mfx JR] predict [R] roc, [R] sw, [R] test, JR] testnt, JR] vce, [R] xi
Related:
[R] brier, [R] dogit, [R] dloglog, [R] cusum, [R] glm, [R] giogit, [R] logit, [R] nlogit, [R] _obit, [R] scobit, [R] svy estimators
Bad,ground:
[U] 16.5 Accessing coet_ients and standard errors [U] 23 Estimation and p6st-estimafion commands,
i
[U] 23.11 Obtaining robdst variance estimates, [U] 23,12 Obtaining sco_s, [R] maximize
i
]
l
i
i
i
/
i
,is
Iogit -- Maximum-likelihood
I
III
logit estimation
I I
i
Syntax logit
depvar
noconstant asis by ....
[indepvars] or robust
moaimize_options
[weight] cluster
[if
exp]
( varname)
[in range] s¢ore
[, level(#)nocoef
(newvamame
) off set ( vanrame)
]
may be used with logit; see [R] by. iweights, and pweights are allowed; see [U] 14.1;6 weight. shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands. may be used with sw to perform stepwise estimation; see [R] sw.
fweights,
logit logit
Syntax for predict [type] newvarname
predict
[if
exp]
[in range]
[, statistic rules
asif
nooffset
]
where statistic is p xb stdp • dbeta , deviance • dx2 , ddeviance • _hat • number residuals • rstandard
probability of a positive outcome (the default) x/b, fitted values standard error of the predioIion Pregibon (1991) A/3 influence statistic deviance residual Hosmer and Lemeshow (1989) A X 2 influence statistic Hosmer and Lemeshow (1989) A D influence statistic Pregibon (1981) leverage sequential number of the c0variate pattern Pearson residuals; adjusted _for number sharing covariate pattern standardized Pearson residues: adjusted for number sharing covariate pattern
Unstarred statistics are available both in and out of sample; type predict .,. if e(sample) ... if wanted only for the estimation sample, Starred statistics are calculated only Torthe estimation sample even when if e(sample) is not specified.
Description logit estimates a maximum-likelihood
logit model,
Also see [R] logistic; logistic displays estimates as odds ratios. A number of auxiliary commands that can be run after Iogit, probit, or logistic estimation are described in [R] logistic. A list of related estimation commands is given in [R] logistic.
248
i T
IogB-- Maximum-fikelihoodlogit estimation
249
options level (#) specifies the confide_e level, in percent, for confidence intervals. The default is level
(95)
or as s_,tby set level: see [V] 23.5 Specif!ing: the width of confidence intervals. nocoef siecifies that the coefficient table is not to be displayed. This option is sometimes used by
i l
progran writers but is of no use interactively.
l
noconsta
i i
_ , i
_t suppresses the constant term (intercept) in the logit model.
or report the estimated coefficients transformeff to odds ratios, i.e., eb rather than b. Standard errors and:confidence inter_,als are similarly transfohned. This option affects how results are displayed, not ho_, they are efitimated, or may be specified at estimation or when replaying previously estimated results. robust s_ecifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditiopal calculatioh; see [u] 23.11 Obtai0ing robust variance estimates, robust combined with c3iuster () allows observations which a_e not independent within cluster (although they must be independent betv_en clusters). If you _pecify pweights, robustis implied_ see [U] 23.13 Weighted estimation. See: fR_logistic for txamptes using this opti0n.
!
clu_ster(Ivamame)
s_cifies
that the observations are independent across groups (clusters_ but
not necessarily within groups, vamame specifies to which group each observation belongs: e.g., c tust#r(personid) in data with repeated iobservations on individuals, cluster() affects the estimated standard 6rrors and variance-covariance matrix of the estimators (VCE), but not the estimatkd coefficients; see [u] 23.11 Obtaining robust variance estimates, cluster () can be
! i
used wlth pweightsto produce estimates for unstratified cluster-sampled data, but see the svylogit
!
commind in [R] sv_(.estimators for a comm_nd designed especially for survev, data.
i l
by itself. () implies robust; specifying robust ¢lust+r See [Rj logistic for examples using this optitn.
I
score(n ,
I
i
cluster
() is equivalent to typing cluster()
|wvarname) creates newvar containing uj = OlnL ./O(xjb)
sample. The score Vector is _
OlnLj/Ob
= _ ujxj',
for each observation j in the
i.e., the product of newrar w_th each
covari_te summed o;¢er observations. See [U] 23.12 Ol_taining scores. offset(_larname) specifies that varname is to be included in the model with coefficient constrained to be 1_. as is fore ._sretention of perfect predictor variables and their associated perfectly predicted observations and m y produce instabilities in maximization: see [R] probit. maximize.options _pecif3 them.
i
,
control the maximization process: see [R] maximize. You should never have t_
Optionsfcbrpredict p, the default,_ calculates the probability of a positive outcome.
I
xb calcul ttes the linea_ prediction,
l
strip cal, ulates the standard error of the linear prediction I
250 :_
Iogit -- Maximum-likelihood Iogit estimation
dbeta calculates the Pregibon (198 l) Aft influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same eovariate pattern. In Hosmer and Lemeshow (1989) jargon_ this statistic is M-asymptotic, that is, adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer and Lemeshow (1989) ?<X2 influence statistic, reflecting the decrease in the Pearson X2 due to the deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer and Lemeshow (t989) £xD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern. hat
calculates the Pregibon (1981) leverage or the diagonal elemems of the hat matrix adjusted for the number of observations that share the same cOvariate pattern.
number numbers the covariate patterns--observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered 1, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer and Lemeshow for the number of observations that share the same covariate pattern.
(1989) and adjusted
rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow and adjusted for the number of observations that share the same covariate pattern.
(1989)
rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing fop excluded observations. asif requests that Stata ignore the rules and the exclusion criteria, and calculate predictions for all observations possible using the estimated parameter from the model. nooffset is relevant only if you specified offset(v_trname) for logit. It modifies the calculations made by predict so that they ignore the offset _ariable: the linear prediction is treated as x_b rather than xjb + offsetj.
Remarks logit performs maximum likelihood estimation of models with dichotomous hand-side) variables coded as 0/l (or. more precisely, coded as 0 and not-0).
dependent
(left-
Example You have data on the make. weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a logit model explaining whether a car is foreign based on its weight and mileage. Here is an overview of your data:
Io_lit-- Maximum-likeVnoodlogit estimation • d _scribe Contains o3s :
data
from
auto.dta 74
vats: i
7 Jul 2000
1,998
(99.7_, of memory
storage
display
value
format
label
.
mak
strl8
_,-18s
Make
mpg
int
_8. Og
Mileage
i
_ei ht for _ign
int byte
_,8.Ogc _,8.Og
name
I
_or _ed by: Note:
i
, ilspect
foreign dataset
Car _ype
Number
Integers
Zero [ve Posit
52 22
82 22
Total Missing
74
74
#
Negative
i•
_ I
#
# #
i: :
1
'
(2
unique
Integers
74
values)
foreign
The vari
of Observations Non-
Total
i
# #
(mpg)
saved
:
i
and Model
foreign
foreign:
i
label
Weight (ibs. ) Car type
last
i
is labeled
amd all values are documented
in the label.
le foreign takes on two unique values. 0 and 1 The value (1 denotes a domestic car
and 1 de_otes a foreign car. i
The njodel you wish to estimate is
!
Pr(foreign= I)= F(flo -*-_lweight4-/32mpg)
i
i
i
.,here Flz) = e=/(1 + e':) is the cumulative logistic distribution. To estlmate" this model, you type I
l_git
i
I i f
foreign!weight
leg likelihood l%g likelihood
= -45.03_21 = -29.898_68
Iteration
2:
lSg
likelihood
= -27.495771
Iteration Iteration !teration
3: 4: 5:
l@g log log
likelihood likelihood likelihood
= -27.184D06 = -27.175_66 = -27. 175156
Lo
it
estimates:_
foreign
Number
I I
Coef.
of obs
LR chi2(2) Prob > chi2 Pseudo R2
_ _ _ -27.175156
Lo_l | likelihood
:
mpg
Ite "ation O: Ire "ation I:
I
i
variable
origin
has cha_tged since
Data
13:5i
free)
type
variable .
1978 Automobile
4
si.,e:
i
i
25t
Std. Err.
weight mpg
i_-.0039067 -.1685869
.0919174 .00t0116
_cons
! t3.70837
4. 518707
z
P>Iz]
=
74
= = =
35.72 0.0000 0.3966
[95Y, Conf.
-3.86 -I .83
O. 000 0.067
-. 3487418 0058894
3.03
O. 002
4,851864
Interval]
-. .011568 001924 22.56487
You find that heavier cars are less likely to be foreign and that cars yielding also less likely to be foreign, at least holding the weight of the car constant. 252 Iogit- Maximum-likeUhood Iogit estimation
F l
i
See [R] maximize
for an explanation
better gas mileage are
<1
of the outpui.
D Technical Note Stata interprets a value of 0 as a negative outcome (failure) and missing) as positive outcomes (successes). Thus, if your dependent and 1, 0 is interpreted as failure and 1 as success. If :your dependent 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated If you prefer a more formal mathematical the model
statement,
treats all other values (except variable takes on the values 0 variable takes on the values 0, as successes.
when you type logit
g x, Stata estimates
: exp(xjf_)
p (yj# oIxj)= I+ rl
Model identification The logit command has one more feature, and it is probably the most useful, logit will automatically check the model for identification and. if it is underidentified, drop whatever variables and observations are necessary for estimation to proceed.
> Example Have you ever estimated a logit model where one or more of your independent predicted one or the other outcome.'?
variables perfectly
For instance, consider the following data: Outcome y 0 0 0 1
Independent
Variable x 1 1 0 0
Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent vari_tble is one. In our data Pr(y = 0 ] x = 1) = 1, which in turn means that the logit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem, Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization, The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that. at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either (1) the whole thing comes crashing tO the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meantime, you have been waiting. In addition, the estimates that you finally receive, if you receive any at all, may be nothing more than numerical roundoff.
i
i _
"
lOgit-- Maximum-flkelihoodlogit estimation 253 Stata watches for these sorts of problems, alerts you, fixes them. and properly estimates the model. Let's return to our automobile data. Among the variables we have in the data is one called repair that take ; on three values. A value of t indicates that the car has a poor repair record, 2 indicates an avera :e record, antl 3 indicates a better-than-average record. Here is a tabulation of our data:
[}
tabulate
'.
foregtgn
repair repair
foreign
1
2
3
Total
)omestic
10
27
9
46
i
Foreign
0
3
9
12
I
Total
I0
30
18
58
i_ _
i
! !
Notice t:mt all the cai's with poor repair recor:ds (repair==1) are domestic. If we were to attempt to predk foreign on the basis of the repair records, the predicted probability for the repair==l category _;ould have io be zero. This in turn means that the Iogit coefficient must be minus infinity, andLet's that try; would b_zzing. Statasetonmost this computer problem, programs First, we _ake up two new variables, rep_is_l that indi :ate the repair category. , _enerate repjs_l generate rep_is_2 I
= (repair==1) (repair==2)
The stat#ment generate rep_is_l = (repa_r==l)creates a new variable, rap_is_l, that takes on the v_lue l when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_isI+2t that takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate!our logit model. See [R] probit for tl_e corresponding probit model.
}
_ogit for rep_is_l rap_is_2
I }
No_e: rep_is_l_=O predicts failure pezfectly rep_is_l _droppedand i0 obs not used
i i
It,.'ration 0: It,_ration 1: It,;ration2:
llog likelihood = -26.99_087 _og ]_og likelihood likelihood = -22.48_187 -22.230498
i i !
It,;ration3: It_:ration4: It,._ration 5: It,._ration 6: It,._ration 7:
_og _og ]_og _og _og
i I i
_{ !
and rap_is_2.
likelihood likelihood likelihood likelihood likelihood
= = = = =
-22.22g139 -22.229138 -22.22_138 -22.22g138 -22.229138
Lo_it estimates
Number of obs LR chi2(1)
= =
48 9.53
Lo
Pseudo R2
=
O.1765
likelihood = -22.229138 i
i foreign rep_is_2 _cons
Coef. -2.197225 ! 3.89e- 16
Std. Err. .7698003 .4714045
z -2.85 O.O0
Prob > chi2 = 0.0020 P> [z[ [95% Conf. Interval3 O.004 1. 000
-3. 706006 -. 9239359
-. 6884436 .9239359
Remember that all thd cars with poor repair rdcords (rep_is_1)are domestic, so the model cannot be estimated, or at lehst it cannol be estimated if we restrict ourselves to finite coefficients. State noted theft lact. It sad i Note: rep..as_l-=0 predicts fmlure perfectly . Th_s ,s Statas mathemat_cal-y precise ay of saying _:hat we said in English. When rep_is_l i._not equal toO. the car is domestic.
_l !
Stata then went ll_llllVlll on to say,lli_VlllllVVv "rep_is_l iv_||dropped and t0 obs not used". This is Stata eliminating the iv1 I¥_IL l_Lll||(l[Jl_J| problem. First. the variable rep_is_l had to be removed from the model because it would have an infinite coefficient. Then, the I0 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated. q
[] Technical Note Stata is pretty smart about catching problems like this. It will catch "one-way causation by a dummy variable", as we demonstrated above. Stata also watches for "two-way causation"; that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated. Stata also checks your data for collinear variables; it will say "'so-and-so dropped due to collinearity". No observations need to be eliminated in this case, arid model estimation will proceed without the offending variable. It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after im operation, and if we included in our model age, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model. logit
(and logistic
note:
4 failures
and probit) and
0 successes
will also occasionally display messages such as completely
determined.
There are two causes for a message like this. Let us deal with the most unlikely case first. This case occurs when a continuous variable (or a combination df a continuous variable with other continuous or dummy variables) is simply a great predictor of the dependent variable. Consider Stata's auto. dta dataset with 6 observations removed. . use
auto
(1978
Automobile
Data)
drop if foreign==O _ gear_ratio>3.1 (6 observations deleted) • logit Logit
Log
foreign
mpg
gear_ratio,
nolog
likelihood
Number
= -6.4874814
foreign mpg weight gear
note:
weight
estimates
Coef.
Err.
=
68
=
Prob
> chi2
=
0.0000
R2
=
0,8484
Pseudo
Std.
of obs
LR chi2 (3)
z
P>Izl
[95% Conf,
72.64
Interval]
-.4944907
.2655508
-1.86
0,063
-i,014961
.0259792
-.0060919
.003i01
-1.96
0.049
-.0121696
-.000014
ratio
15,70509
8.166234
1.92
0.054
-.3004359
31.71061
_cons
-21.39527
25.41486
-0.84
0.400
-71.20747
28.41694
4 failures
and
0 successes
completely
determined.
Iogit -- Maximum-likelihood Ioglt estimation :+ •
I , i
Note that t[ ere are no missing standard errors in the output. If you receive the "completely
determined"
message ar d have one or more missing standard errors in your output, see the second case discussed
;
below. Note g_ar_ratao +large coefficient,logit thought that the 4 observationswith the smallest predictedprobabilitieswereessentiallypredictedperfectly. 1
.(option predict p passumed; !
Pr(foreign))
. so_t"p . li_t p in i/4
!
+
255
i
p
1. 2. 3.
1.34e-I0 6.26e-09 7.84e-09
4. |
!.49e-08
If this hlappensto you,there is no need to dd anything.Computationally,the modelis sound.It
+
i
is the seco d case discussed
+
The se@ndcase occurswhenthe independenttermsare all dummyvariablesor continuousones with repea_edvalues (e.g.. age). In this case, one or more of the estimatedcoefficientswill have missingst_dard errors.:Forexample,considerthis datasetconsistingof 5 observations.
?
• li_t i
y 0 0 1 0 i
1. 2. 3, 4. 5.
I ¢
below that requires
xl 0 1 i 0 0
careful examination.
x2 0 0 0 i i
. lo_it y xl x2, nolog. Logi_ 7 estimates ! i 1 i
Numberof obs LR chi2(2) Prob > chi2 Pseudo R2
-2.7725887
Log likelihood • 1 Coef-. 8td. Err.
++
i8. 26157 t8.26157
{
co
-_8.26157
2 1.414214
P>Izl
[95Y, Conf. Interval]
9 13
0.000
14.34164
-i12.91
0.000
note: i failureand 0 successescompletelydetermined.
i
. predict p (optionp assumed Pr(y)) xl
y
-15,48976
0
0
0
+
2. 3. 4. 5.
0 t 0 1
1 1 0 0
0 0 1 I
Two thiSgs are happeaing
i
covariate
i
dropped,
p
x2
i.
i+
-21.03338
22.1815
, li. _ _
•
5 I,18 0.5530 O. 1761
z
+
+
= = = =
1',. 17e-08
.5 .5 .5 .5
here. The first is tl_at logit is able to fit the outcome
(y = 0) for the
p_ttern+ xl = 0 and x2 = 0 (i.e., the first observation) perfectly. It is this observation that letel' dete + is the "1 f__ilure ...con_ } rm'ned". The second thing to note is that if this observation is t,n
!+
xl, x2, arid the constant
'i
are colli_ar.
....................-_oo---........,o_J,_ "Wmxmmum-.KellnOO0Ioglt estimation
This is the cause of the message "completely determined" and the missing standard errors. It happens when you have a covariate pattern (or patterns) with only one outcome, and there is collinearity when the observations corresponding to this covariate pattern are dropped. If this happens to you, confirm the causes. First identify the covariate pattern with only one outcome. (For your data, replace xl and x2 with the independent variables of your model.) • egen pattern = group(x1 x2) quietly logit y xl x2 predict p (option p assumed; Pr(y)) • snmraarize p Variable
Obs
Mean
p
5
.4
Std. Dev. .22360_8
Min
Max
I. 17e-08
.5
If successes were completely determined, that means there are predicted probabilities 1. If failures were completely determined, that means there are predicted probabilities O. The latter is the case here, so we locate the corresponding value of pat_;ern:
that are almost that are almost
• tab pattern if p < le-7 x2) 1 Total group (xl 1.
Freq.
Percent
Cum.
1
i00.O0
i00.O0
1
100.O0
Once we omit this covafiate pattern from the estimation sample, logit can deal with the collinearity: logit y xl x2 if pattern-=l, nolog note: x2 dropped due to collinearity Number of obs LR chi2 (I) Prob > chi2 Pseudo R2
Logit estimates
Log likelihood = -2.7725887
= = = =
4 0.00 1.0000 0.0000
|
y [
Coef.
xl _cons
0 0
Std. Err. 2 I.414214
z O.O0 O.O0
P>lz{ 1.000 I.000
[95'/, Conf. Interval] -3.919928 -2.771808
3.919928 2 _771808
We omit the collinear variable. Then we must decide whether to include or to omit the observations with pattern
= 1. We could include them
logit y xl, nolog Logit estimates
Number of obs LR chi2(1) Prob > chi2
= = =
5 O. 14 0.7098
Log likelihood = -3.2958369
Pseudo R2
=
0.0206
Y I _cons xl j
Coef. -. 6931472 .6931472
Std. Err. I.224742 1. 870827
or exclude them: • logit y xl if pattern~=l, nolog
z -0.57 O. 37
P> lz[ O.571 O. 711
[957,Conf. Interval] -3.093597 -2.973605
I.707302 4. 3599
Iog_ -- Maximum-likelihoodIoglt estimation ,
Logi
estimates _ i
Log _ikelihood '
i
= =
Prob
=
1.0000
=
0.0000
> chi2
Pseudo
R2
4 0.00
1
Y
i
!
= -2.7725887
Number of obs LR chi2 (I)
257
Coef,
xI
0
_cons
0
Std.
Err. 2
1,4142/4
z
P> ]zt
[95_ Conf.
Interval]
O. O0
1.000
-3.919928
3.919928
0.00
1.000
-2.771808
2.771808
If the _ovariate pattern that predicts outcome perfectly is meaningful, you may want to exclude these observations from the model. In this case_ one would report covariate pattern such and such predicted 6utcome perfectly and that the best model for the rest of the data is .... But. more likely. the perfec_ prediction was simply the result of h_ving too many predictors in the model. In this case. one wouldI omit the extraneous variable(s) from further consideration and report the best model for all the datA. 23
'
i
Obtaining redicted values
i
Once y_u have estimated a logit model, you can obtain the predicted probabilities using the predict )remand for both the estimation sample and other sampI_s: see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.
i
predict without arguments calculates the predicted probability of a positive outcome: i.e.. Pr'd/j = .) = F(xjb), With the xb option, it calculates the linear combination xjb, where x_
!
are the in, ependent variableSasin the jth observation and b is the estimated parameter vectOr.atThis is sometin_es known the index function since the ,mmulative distribution function indexed this
i
value is thI probability of a positive outcome. In bothtcases, State remembers any "rules'" used to identify the model and calculates missing for excluded dbservations unless rules or asif is Specified. This is covered in the following example.
e
!
i
!
For inf_rmation about the other statistics available after predict,
> Example In the _revious example, we estimated the 10git model logit foreign
Pr(foreign))
i
(_0 _issing
values
generated)
i
s_arize
foreign
! i
rep_is_l
rep_is_2.
To obtain _redicted probabilities: . predict p (option p assumed;
,
see [R] logistic.
_ariable
!
:foreign p
p 0bs
Mean
58 48
,2068966 .25
Std. Dev. .4086186 .1956984
Min
Max
0 .t
! .5
State rem_ mbers any "'rules" used to identify ihe model and sets predictions to missing for any excluded , bservations, in the previous examplel logit dropped the variable rep_is_l from our model anc excluded l0 observations. Thus. when we typed predict p. those same I0 observations were a_ai_ excluded an_t their predictions were _et to missing.
-
predict'srules option will use the rules in file prediction. During estimation, we were told "rep__is_t-=O predicts failure perfectly", so the rule is that when rep_is_l is not zero, one should zoe _oglt-- MaxJmum-l|lmlthoodIogit predict 0 probability of success or a positive estimation outcome: predict p2,
rules
• summarize foreign p p2 Variable [
)
foreign p p2
predict's asif for all observations
Obs 58 48 58
Mean .2068966 .25 .2068966
Std. Dev. .4086186 .1956984 •2016268
Min
Max
0 .I 0
1 .5 .5
option will ignore the rules and the exclusion criteria and calculate predictions possible using the estimated parameters from the model:
• predict p3, asif . summarize foreign p p2 p3 Variable
Obs
Mean
foreign p p2 p3
58 48 58 58
.2068966 .25 .2068966 .2931034
S%d, D_v.
Min
Max
.4086186 .195698_4 .2016268 .2016268
0 .1 0 .1
1 .5 .5 .5
Which is right? What predict does by default is the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case. however, you should re-estimate the model to include the excluded observations. q
Saved Results logit
saves in
e():
Scalars e (N)
number
e ('2.1)
log likelihood
e(df_m) e (r2_p) e (N_clust)
model degrees of freedom pseudo R-squared number of clusters
of observations
e(_l_O) e (chi2)
log likelihood, X2
e(cmd)
logit
e(vcetype)
covariance
e(depvar) e (wtype)
name of dependent variable
e(chi2type) e(offset)
Wald offset
e(wexp) e(clustvar)
weight expression name of cluster variable
e(predict)
program
coefficient
e(V)
variance-covariance estimators
constant-only
model
Macros
weight
type
estimation
method
or LR: type of model X_ test used to implement
predict
Matrices e(b)
vector
Functions e(sample)
marks estimation
sample
matrix
of the
t
Iogitj- Maximum-likelihoodIoglt estimation
259
V
; i
.
Methods
Formulas
The wo_ logit is due to Berkson (1944) and is by analogy with the word probit, For an introduction to probit a_d logit, see, for example. Aldrich and Nelson (1984), Hamilton (1992Z Johnston and DiNardo (1_97), Long (1997), or Powers and Xie (2000). The likelihood function for logit is 1
InL=
Ewj
lnF(xjb)
+ Zwiln{1-F(xjb)}
jES
i !
j_S
where S is!the set of all observations j such that yj _ O, F(z) = eZ/(l optional wdights. In L is maximized as described in [R] maximize.
+ eZ), and wj denotes the •
If robusl standard errors are requested, the dalculation described in Methods and Formulas of [R] regresstis carried forward with uj = !1 - F(xjb)}xj for the positive outcomes and -F(xjb)xj for the neghtive outcomes, qc is given b5 its asymptotic-like formula,
Reference
.Aldrich.J. 0' and F. D. Nelson. 1984. Linear Probab_lit);Logit, and Probit Models. Newbury Park. CA: Sage
i o
Publicatiohs. Berkson,J. @44. Applicationof the logisticfunctionto l'iio-assay.Journalof the AmericanStatisticalA._ociation39: 357-365.' Cleves,M. a}d A. Tosetto 2000. sg139:Logisticregressionwhenbinary outcomeis measuredwith uncertainty.Stata T_chmcallBulletin55: 20-23.
',
Hami['ton.L.!C 1992. f_egres_ionwith Graphics.PacificGrove.CA: Brooks/ColePublishingCompany.
i
ltosmer. D +'.. Jr.. and S. Lemeshow.1989. AppliedLOgisticRegression.New York:John Wiley & Sons. (Second editionforthcoming"in 200I.) Johr_ston.J. ]nd J. DiNardo. I997. EconometricMethods.4th ed. New York:McGraw-Hill. Judge,G. G..!W.E Griffiths,R. C. Hill.H. L/itkepohl.andT.-C.Lee. 1985. The Theoryand Practiceof Econometrics. 2d ed. Nlw York:John Wiley& Sons. Long. J. S_|997. RegressiobModels for Cate_,oricaland Limited Dependent Variables.ThousandOaks. CA: Sage
i
, i
Publicatic_as. Powers.D, _, and Y. Xie. 2000. StatisticalMezhodsfor CategoricalDataAnah,sis, San Die__o.CA: AcademicPress Pre_ilyon.D.!1981. Logisticregressiondiagnostics.Annals of Statistics9: 705-72&
Also Compleme
atary:
i
[R] clogit, [R] cloglog, [R] cusum, [R] glm, [R] giogit, [R] logistic. [Ri nlogit, [R] probit, [R] scobit. [R] svy estimators. [R] xtelog.
I
[R]
i
i
Related:
[R] adjust, [R] lincom. [R] linktest. [R] lrtest. [R] mfx. [R] predict, [R] roe. [R] sw, [R] test, [R] testnl, [R] vce. [R] xi
Backgrounch
xtgee, [R] xtlogit, [R] _tprobit
[u] 16.5 Accessing coefficients and standard errors, [U_]23 Estimation and post-estimation commands, [U_23.11 Obtaining robu_ variance estimates. [U_23,12 Obtaining scores, [R_maximize
_
Ioneway
:,
--
Large
one-way
ANOVA,
random
effects,
I I
and reliability
I
Syntax loneway
response_var
group_var
[weight t [if
exp]
[in
range I [, mean median exact
l_evel(#) ] by ...
: may be used with loneway; see [R] by.
aweights
are allowed; see [U] 14.1.6 weight.
Description loneway estimates of levels
one-way
of group_var
analysis-of-variance
and presents
different
(ANOVA) models
ancillary,
statistics
Feature
from
on datasets one,ray
with a large number (see [R] oneway):
oneway loneway
Estimate one-way model on fewer than 376 levels on more than 376 levels Bartlett's test for equal variance Multiple-comparison tests Intragroup correlation and S.E. Intragroup correlation Confidence interval Est. reliability of group-averaged score Est. S.D. of group effect Est. S.D. within group
x x
x x x
x x x x x x x
Options mean specifies that the expected value of the Fk-l.N,-k distribution Fm in the estimation of p instead of the default value of 1. median specifies that the median of the Fk-l_N-k distribution in the estimation of p instead of the default value of 1. exact;
requests
confidence not used. level
(#)
that exact intervals.
specifies
default is level(95) intervals.
confidence
This option
the confidence
intervals is allowed
level,
or as set by set
be computed,
level;
be used as the reference
as opposed
only if the groups
in percent,
be used as the reference
for confidence
are equal intervals
see [U] 23.5 Specifying
260
to the default
point
of the coefficients. width
Fm
asymptotic
in size and weights
the
point
are The
of confidence
r
i
_
Re.m
loneway-
i
Large one-wayANOVA,random effects,and reliability
261
> Example lonewa't's output looks like that of oneway except that, at the end, additional information is presented. Jsing our automobile dataset (see [U]'9 Stata's on-line tutorials and sample datasets), we have eated a (numeric) variable called ma_u:facturer_grp identifying the manufacturer of each car an within each manufacturer we have retained a maximum of four models, selecting those with',the h Jcest mpg. We can compute the intradlass correlation of mpg for all manufacturers with at least You models as follows: . 'loneway mpg manufacturer_grp if nummake == 4 One-way Analysis of VarianCe for mpg: Mileage (mpg)
S_urce
SS
df
Number of obs =
36
R-squared :
0. 5228
MS
F
Between|manufactu/~p Withi_ manufactur_p
621.88889 567.75
8 27
77,736111 21.027778
Total
1189. 6389
35
33.989683
Intraclass correlation 0.402T0
Asy. S.E O.18770
Prob > F
3.70
O.0049
[957 Conf. Interval] O.03481
0.77060
Estimatec_SD of manufactur_p effect 3.765247 Estimated SD within manufactur-p 4.585605 Est. reliability of a manufactur-p mean .7294979 (evau%uatedat 11=4.00)
q
In additi(,n to the standard one-way ANOVAoutput, lonewayproduces the R-squared, estimated standardde,,iation of the group effect, estimated standard deviation within group, the intragroup correlation he estimated reliability of the group-a_eraged mean, and, in the case of unweighted data. the asyrr/ptc :ic standard error and confidence interval for the intragroup correlation.
R-squared The R-squared is, of course, simply the underlying R 2 for a regression of response_var on the levels of ¢rqlup_var. or mpg on the various manu(acturers in this case.
The random effects ANOVA model loneway assumes that we observe a variable Yij measured for r_, elements within k groups or classes such that Yij
::=
,tZ+ Ct" i -I-
6ij,
i = 1,!2,...,
k.
3 = 1.2 .....
ni
and %. and _]ij are independent zero-mean randon3 variables with variance a,] and cr2, respectively. This is the random-effects ANOVAmodel, also kno_'n as the components of variance model, in which it is t}_picall31assumed thak the Yij are normally d_stributed.
!
The interpretation '!
with respect to our example is that the observed value of our response
variable,
mpg, is created in two steps. First, the ith manufacturer is chosen and a value c_i is determined--the !o,Large one-way reliability typical mpgtoneway for that --manufacturer less ANUVA, the overallrandom mpg/_. effects, Then aand deviation, eij, is chosen for the jth model within this manufacturer. This is how much that particular automobile differs from the typical mpg value for models from this manufacturer. For our sample of 36 car models, the estimated standard deviations are cr,_ = 3.8 and cr, -- 4.6. Thus, a little more than half of the variation in mpg between cars is attributable to the car model with the rest attributable to differences between manufacturers. These standard deviations differ from those that would be produced by a (standard) fixed-effects regression in that the regression would require the sum within each manufacturer of the eij, ei. for the ith manufacturer, to be zero while these estimates merely impose the constraint that the sum is expected to be zero.
Intraclass correlation There are various estimators of the intraclass correlation, such as the pairwise estimator, which is defined as the Pearson product-moment correlation computed over all possible pairs of observations that can be constructed within groups. For a discussion of various estimators, see Donner (1986). loneway computes what is termed the analysis of variance, or ANOVA, estimator. This intraclass correlation is the theoretical upper bound on the variation in response_var that is explainable by group_var, of which R-squared is an overestimate because of the serendipity of fitting. Note that this correlation is comparable to an R-squared you do not have to square it. In our example, the intra-manu correlation, the correlation of mpg within manufacturer, is 0.40. Since aweights weren't used and the default correlation was computed, i.e., the mean and median options were not specified, loneway also provided the asymptotic confidence interval and standard error of the intraclass correlation estimate.
Estimatedreliability of the group-averagedscore The estimated reliability of the group-averaged score or mean has an interpretation similar to that of the intragroup correlation; it is a comparable number if we average response_var by group_vat, or rapg by manu in our example. It is the theoretical upper bound of a regression of manufactureraveraged mpg on characteristics of manufacturers. Why would we want to collapse our 36-observation dataset into a 9-observation dataset of manufacturer averages? Because the 36 observations might be a mirage. When General Motors builds cars, do they sometimes put a Pontiac label and sometimes a Chevrolet label on them, so that it appears in our data as if we have two cars when we really have only one. replicated? If that were the case, and if it were the case for many other manufacturers, then we would be forced to admit that we do not have data on 36 cars: we instead have data on 9 manufacturer-averaged characteristics.
Saved Results loneway
saves in r O :
Scalars r(N) r(rho) r(lb) r(ub)
number of observations intraclass correlation lower bound of 95% CI for rho upper bound of 95% CI for rho
r(rho_t) r(se) r(sd_w) r(sd_b)
estimated reliability asymp. SE of intraclass correLati_m estimated SD within group estimated SD of group effect
!
'loneway-- Large one-wayANOVA,random effects, and reliability
263
Metl!ods and Formulas is implemented as an ado-file.
lo_e_ !
The r_ean squares in the lone_ay's
ANOVAtable are computed as follows:
Mso= _i wi(_,.-9.)_/(k- t) an_
MS,= _ _ _,j(y,j- _,.)2/(u-k) •
1"
j
in which
j i
i
j
i
= E expected wij w.. values = wi. these Yi. m_an = E squares wiiyij/wi, The c0rre_:i. ;ponding of are
t
and
_.. =
wi.ff_./w..
2 + go% and E(MS_)= _2
E(MS_,) = a 2 in Which
_..- Z,wUw k-1
Note that in the unweighted case, we get
N- Z,-_/N g=
k-1
i
As expecti d, g = rn for the case of no weights _mdequal group sizes in the data, i.e., n_ = m for all i. l_ep[acilLgthe expected values with the obse_,ed values and solving yields the ANOVAestimates of a_ and cry. Substituting these into the defini[ion of the intraclass correlation 2
P= _ + G_ yields the _NOVA estimator of the intraclass correlation: IFobsPA
=
_bbs
--
1 1+ 9
Note that 7obs is the observed value of the F statistic from the ANOVAtable. For the case of no weights ar d equal hi, PA = roh, the intragroup correlation defined by Kish (1965). Two slightly different e:timators are available through the mean and median options (Gleason 19971. If either of these optioas is specified, the estimate of p becomes
•
0= Fob_ _-(__ i-)Fm
i
}
) ' :
For _he:rae..n option, Fm= E(Fk-1,._'-K) = (_}r_ k)/(N - k - 2), i.e., the expected value of the ANO\__,tab e's F statistic. For the median optioh. Fm is simply the median of the F statistic. Note thal setting F,, to I gives PA, so for large samples these different point estimators are essentially the samd. Als_, since the iniyaclass correlation of the random-effects model is by definition nonnegative.
I
:
for any of he three possible point estimators p is truncated to zero if Fobs is less than F,_.
r_ ' it ,:i
For the case of no weighting, interval estimators for PA are computed. If the groups are equal-sized 264 ni equal) Ionewayeffects, exact and reliability (all and the Large exact one-way option isANOVA, specified,random the following (under the assumption that the Yij are normally distributed) 100(1 a)% confidence interval is computed:
-
[
Fobs - FmF_,
Fobs -- FmFz
Fobs + (9 - 1)FmFu'
Fobs + (9 - 1)FmFt
]
with F,_ - 1, Fl = Fa/2,k_l,N_k, and Fu - Fl_a/2,k_l,N_k, F.,k--l,N--k being the cumulative distribution function for the F distribution with k - 1 and N - k degrees of freedom. Note that if mean or median is specified, Fm is defined as above. If the groups are equal-sized and exact is not specified, the following asymptotic 100(1 - a)% confidence interval for PA is computed: [PA --ZaI2V(pA),PA + zaI2V(pA)] where Zal2 is the t00(1 -a/2) percentile of the standard normal distribution and V(pA) is the asymptotic standard error of p defined below. Note that this confidence interval is also available for the case of unequal groups. It is not applicable, and therefore not computed, for the estimates of p provided by the mean and median options. Again, since the intraclass coefficient is nonnegative, if the lower bound is negative for either confidence interval, it is truncated to zero. As might be expected, the coverage probability of a truncated interval is higher than its nominal value. The asymptotic standard error of PA, assuming the Yij are normally distributed, is also computed when appropriate, namely for unweighted data and when PA is computed (neither the mean nor the median options are specified):
V(pA)
= 2(1_P)2i
(A + B + C)
with A = {1 + p(gN-k
1)} 2
B = (1 - p){1 + p(2gk-1 2
1)}
2
C = p {_-_ ni " 2N-1E:nf
(k- 1)2
n_)2}
and PA is substituted for p (Donner 1986). The estimated reliability of the group-averaged score, known as the Spearman-Brown formula in the psychometric literature (Winer, Brown. and Michels t991, 1014), is
prediction
tO Pt--
1 -t- (tfor group size t. loneway
1)p
computes Pt for t -= 9.
The estimated standard deviation of the group effect is aa -- v/(MSa - MSe)/g. This comes from the assumption that an observation is derived by adding a group effect to a within-group effect. The estimated standard deviation within group is the square root of the mean square due to error, or x/--M--Se.
Ioneway -- Large one-wa_ ANOVA, random effects, and reliability
265
AcknOWledgment We wo_lld like to thank John Gleason of Syracuse vernon
University
for his contributions
to this improved
of loneway.
Referencts Donner, A. 1986. A review of inference procedures for'the intraclass correlation coefficient in the one-way random effects ITodel. International Statistical Review 54: 67;-82. Gteason, L !_. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25-3t Reprinte_ in Stata Technical Bulletin Reprints. vol. 6, pp I67-176. Kish, L.; 19_5. Survey Sampling. New York: John Wiley & Sons. Win¢r, B. L D. R. Brown. and K. M Michels. 199I. Statistical Principles in Experimental Design. 3d ed. New York: McOraw -Hill.
Also See Related:
[R] onewa_d
'io
lorenz -- Inequality measures
............
[;i"
; _
1
II
IIII H I
i
i
ii
III
II
i
Remarks Stata should have commands for inequality measures, but at the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands, many of which have been published in the Smm Technical Bulletin (STB),
Issue
insert
author(s)
command
description
STB-48
gr35
N.J. Cox
psr., qsm, pdagum._ qdagum
Diagnostic plots for assessing Singh-Maddala and Dagum distributions fitted by MLE
STB-23
sg30
E, Whitehouse
lorenz, inequal_ atkinson, relsginl
Measures of inequality in Stata
STB-23
sg31
R. Goldstein
rspread
Measures of diversity: absolute and relative
STB-48
sgl04
S.P. Jenkins
sumdist, xfrac,
Analysis of income distributions •
ineqdeca, geivars, i ineqfac, povdeco STB-48
sgl06
S. R Jenkins
smfit, dagumfiti
Fitting Singh-Maddala and Dagum distributions by maximum likelihood
STB-48
sgl07
S. R Jenkins, E Van Kerm
glcurve
Generalized Lorenz curves and related graphs
STB-49
sgl07.1
S. E Jenkins, P. Van Kerrn
glcurve
update; install this version
STB-48
sgl08
P. Van Kerm
poverty
Computing poverty indices
STB-51
sgtI5
D. Jolliffe, B. Krushelnytskyy
ineqerr
Bootstrap standard errors for indices of inequality
STB-51
sgll7
D. Jolliffe, A. Semykina
sepoy
Robust standard errors for the Foster-GreerThorbexke class of poverty indices
Additiona] commands may be available; enter Stata and typ_ search
inequality
measures.
To download and install from the Internet the Jenkins isumdistcommand, for instance, you could 1. Pull down Help and select STB and User-written Programs. 2. Click on http://www.stata.com. 3. Click on stb, 4. Click on stb48. 5. Click on sg 104. 6. Click on click here to install. or you could instead do the following: 266
i
- '
-
lorenz -- Inequality measures
1, !Na_lgate
to the appropriate
_. Type net
267
STB issue:
_rom http://vwv,
stata,
com
stata,
com/stb/stb48
| Type net cd stb Type net cd stb48 or
). Type net
from http://_w,
2. Ty] e net
describe
3. Ty
±nsta_[1
net
sgl04 sgl04
Refemncq s Cox, N. J. I999, gr35: Diagnostic plots for assessing Singh_Maddala and Dagum distributions fitted by MLE, Stata Technic_1 Bulletin 48: 2-4, Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 72-74. Goldstei_, _. 1995. sg3]: Measures of diversity: absolute and relative. Stata Technical Bulletin 23: 23-26. Reprin'ted in Stata Technical Bulletin Reprints, voL 4, pp. 150_-154. Jenkins. S. _. 1999a. sgl04t Analysis of income distributions. Stata Technical Bulletin 48: 4-18. Reprinted in Stata Tec1_ica' Bulletin Reprints, vol. 8, pp. 243-260. -. 19991 sg]06: Fitting Singh-Maddal_ & Dagum distributions by maximum likelihood. Stata Technical Bulletin 48: t9-5. Reprinted in Stata Technical Bulletin Reprints. rot. 8, pp. 26t-268. Jenldns. _S. • and P. Van Kdrm. 1999a, sgl07: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 48: 25- 9. Reprinted in Stata Technical Bulletin Re_qrints,vol, 8, pp. 269-274. --
lff°J9 sgl07.t: Generalized Lorenz cur'¢es and related graphs. Stata Technical Bulletin 49: 23, Reprinted in S_ata Tetfinical Bulletin • epnnts, voL 9, p. 171.
Jolliffe, D. nd B. Krushelrtytskyy. 1999 sgll5: Bootstrap standard errors for indices of inequality, Stata Technical 13_lletin ;1: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 191-196, Jotliffe, D. nd A. Semykintt. 1999 sgll7: Robust stant_u'd errors for the Foster-Greer-Thorbecke class of poverty i_ices. 'tara Technical Bulletin_51: 34-36. Reprinted in Stata Technical Bulletin Reprints. vol. 9. pp, 200-203." Van Kerm. 1999. sg]08: Computing povert_ indices. St_ta Technical Bulletin 48: 29-33. Reprinted in Stata Technical Bulletin _eprints, vol. 8,_pp. 274-278. Whitethouse,_E. 1995, sg30: !Measures of inequality in Sltata. Stata Technical Bulletin 23: 20-23. Reprinted in Stata Te_chnicallBulletinReprir_s. vol. 4, pp. 146-150.
! ,
I
Irtest -- Likelihood-ratio I
test after model estimation i
I
II
I
I
Syntax irtest [, saving(name) using(name)
m_odel(name)dr(#) ]
where name may be a name or a number but may not exceed 4 characters.
Description irtest saves information about and performs lil_elihood-ratio tests between pairs of maximum likelihood models such as those estimated by cox, ]_ogit, logistic, poisson, etc. lrtest may be used with any estimation command that reports a tog-likelihood value or, equivalently, displays output like that described in [R] maximize. lrtest, typed without arguments, performs a likelihood-ratio test of the most recently estimated model against the model previously saved by lrtest ,i saving(0). It is your responsibility to ensure that the most recently estimated model is nested within the previously saved model. lrtest
provides an important alternative
to test'for
maximum likelihood
models.
Options saving(name) specifies that the summary statistics as:;ociated with the most recently estimated model are to be saved as name. If no other options are pecified, the statistics are saved and no test is performed. The larger model is typically saved by typing lrtest, saving(0). using(name) specifies the name of the larger mode_ against which a model is to be tested. If this option is not specified, using(O) is assumed. model (name) specifies the name of the nested model (a constrained specified, the most recently estimated model is used.
model) to be tested. If not
df (#) is seldom specified: it overrides the automatic degrees-of-freedom
calculation.
L
Remarks The standard use of Irtest is 1. Estimate the larger model using one of Stata's estimation saving(O). 2. Estimate an alternative,
nested model (a constrained
commands
and then type lrtest,
model) and then type lrtest.
Example You have data on infants born with low birth weights along with characteristics of the mother (Hosmer and Lemeshow 1989 and more fully described in JR] logistic). You estimate the following model: 268
_
irtest -- LiWelihood-ratio test after model estimation
269
. logistic low age lwt race2 race3 smoke ptl ht ui

Logistic estimates                              Number of obs   =        189
                                                LR chi2(8)      =      33.22
                                                Prob > chi2     =     0.0001
Log likelihood = -100.724                       Pseudo R2       =     0.1416

------------------------------------------------------------------------------
     low   Odds Ratio   Std. Err.       z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------------------------
     age     .9732636   .0354759    -0.74    0.457      .9061578    1.045339
     lwt     .9849634   .0068217    -2.19    0.029      .9716834    .9984249
   race2     3.534767   1.860737     2.40    0.016      1.259736    9.918406
   race3     2.368079   1.039949     1.96    0.050      1.001356    5.600207
   smoke     2.517698    1.00916     2.30    0.021      1.147676    5.523162
     ptl     1.719161   .5952579     1.56    0.118      .8721455    3.388787
      ht     6.249602   4.322408     2.65    0.008      1.611152    24.24199
      ui       2.1351   .9808153     1.65    0.099      .8677528      5.2534
------------------------------------------------------------------------------

You now wish to test the constraint that the coefficients on age, lwt, ptl, and ht are all zero (or, equivalently in this case, that the odds ratios are all 1). One solution is

. test age lwt ptl ht
 ( 1)  age = 0.0
 ( 2)  lwt = 0.0
 ( 3)  ptl = 0.0
 ( 4)  ht = 0.0

          chi2(  4) =   12.38
        Prob > chi2 =  0.0147
This test is based on the inverse of the information matrix and is therefore based on a quadratic approximation to the likelihood function; see [R] test. A more precise test would be to re-estimate the model applying the proposed constraints and then calculate the likelihood-ratio test. lrtest assists you in doing this. You first save the statistics associated with the current model:

. lrtest, saving(0)
The"nam_" 0 was not _h°sen arbitrarily, although we could have chosen any name. Why we chose 0 will bec _me clear sb+rtly. Having saved the information on the current model, we now estimate the constrained model, ,_,hich in this case is themodel omitting age, l_,,,"t:,ptL and ht: Io istic low r_ce2 race3 smoke ui L_gi
estimates
Number of obs LR chi2(4) Prob > chi2
Dog Likelihood = '-107.93404 low race2 race3 smoke ui
Pseudo
R2
= -=
189 18.80 0.0009
=
0.0801
Std. Err.
z
P>Izt
[957,Conf. Interval]
3.052746
I.498084
2.27
O.023
I.166749
7.987368
12.922593 12.945742 2.419131
I.189226 I.101835 1.047358
2.64 2.89 2.04
0. 008 O.004 0.041
1.31646 i.41517 i.03546
6.488269 6.131701 5.651783
Odds Ratio
That done. typing Irteit will compare this model with the model we previously saved: Ir zest Logi 3tic:
likelJhood-ratio test
chi2(4) = Prob > chi2 _
14.42 0.0061
¢"# 'J LqI_OL -I.,.II_V_IIIIVq.PU--i CIILI_I IIIUUI_I _,_ILI||lidIIqQIFI The more !!precise syntax for theCILIU test |ql_Ol is Irtest, usihg(O),meaning that the current model is to be compared with the model saved as 0. The name 0, a_ we previously said. is special when you do not specify the name of the using() model, using(b) is assumed. Thus. saving the original model as 0 saved us some typing when we performed the test.
Comparing results, test reported that age, lwt, ptl, and ht were jointly significant at the 1.5% level; lrtest reports that they are significant at the 0.6% level. lrtest's results should be viewed as the more accurate.
Example
Typing lrtest, saving(0) and later lrtest by itself is the way lrtest is most commonly used, although here is how we might use the other options:

. logit chd age age2 sex          (estimate full model)
. lrtest, saving(0)               (save results)
. logit chd age sex               (estimate simpler model)
. lrtest                          (obtain test)
. lrtest, saving(1)               (save logit results as 1)
. logit chd sex                   (estimate simplest model)
. lrtest                          (compare with full model)
. lrtest, using(1)                (compare with model 1)
. lrtest, model(1)                (repeat test against full model)
Example
Returning to the low birth weight data in the first example, you now wish to test that the coefficient on race2 is equal to that on race3. The base model is still stored under the name 0, so you need only estimate the constrained model and perform the test. Letting z be the index of the logit, the base model is

    z = \beta_0 + \beta_1 \mathrm{age} + \beta_2 \mathrm{lwt} + \beta_3 \mathrm{race2} + \beta_4 \mathrm{race3} + \cdots

If \beta_3 = \beta_4, this can be written

    z = \beta_0 + \beta_1 \mathrm{age} + \beta_2 \mathrm{lwt} + \beta_3 (\mathrm{race2} + \mathrm{race3}) + \cdots

To estimate the constrained model, we create a variable equal to the sum of race2 and race3 and estimate the model including that variable in their place:

. gen race23 = race2 + race3
. logistic low age lwt race23 smoke ptl ht ui

Logistic estimates                              Number of obs   =        189
                                                LR chi2(7)      =      32.67
                                                Prob > chi2     =     0.0000
Log likelihood = -100.9997                      Pseudo R2       =     0.1392

------------------------------------------------------------------------------
     low   Odds Ratio   Std. Err.       z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------------------------
     age     .9716799   .0352638    -0.79    0.429      .9049649    1.043313
     lwt     .9864971   .0064627    -2.08    0.038      .9739114    .9992453
  race23     2.728186   1.080206     2.53    0.011      1.255586    5.927907
   smoke     2.664498   1.052379     2.48    0.013      1.228633    5.778414
     ptl     1.709129   .5924775     1.55    0.122      .8663666    3.371691
      ht     6.116391   4.215585     2.63    0.009       1.58425    23.61385
      ui      2.09936   .9699702     1.61    0.108      .8487997    5.192407
------------------------------------------------------------------------------

Comparing this model with our original model, we obtain

. lrtest
Logistic:  likelihood-ratio test                  chi2(1)     =       0.55
                                                  Prob > chi2 =     0.4577
By comparison, typing test race2=race3 after estimating our base model results in a significance level of .4572.
Saved Results
lrtest saves in r():

Scalars
    r(p)       two-sided p-value
    r(df)      degrees of freedom
    r(chi2)    chi-squared

Programmers desiring that an estimation command be compatible with lrtest should note that lrtest requires the following macros to be defined:

    e(cmd)     name of estimation command
    e(ll)      log-likelihood value
    e(df_m)    model degrees of freedom
    e(N)       number of observations
Methods and Formulas
lrtest is implemented as an ado-file.

Let L_0 and L_1 be the log-likelihood values associated with the full and constrained models, respectively. Then \chi^2 = -2(L_1 - L_0) with d_0 - d_1 degrees of freedom, where d_0 and d_1 are the model degrees of freedom associated with the full and constrained models (Judge et al. 1985, 216-217).
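The calculation is easy to reproduce by hand from the saved results. The following sketch uses the low birth weight models from the first example; the scalar name ll_full is ours, introduced only for illustration:

. logistic low age lwt race2 race3 smoke ptl ht ui
. scalar ll_full = e(ll)
. logistic low race2 race3 smoke ui
. display "chi2(4)     = " -2*(e(ll) - ll_full)
. display "Prob > chi2 = " chiprob(4, -2*(e(ll) - ll_full))

The displayed values should agree with what lrtest reported above: chi2(4) = 14.42 with a significance level of 0.0061.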
References
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Pérez-Hoyos, S. and A. Tobias. 1999. sg111: A modified likelihood-ratio test command. Stata Technical Bulletin 49: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 171-173.
Wang, Z. 2000. sg133: Sequential and drop one term likelihood-ratio tests. Stata Technical Bulletin 54: 46-47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 332-334.
Also See
Related:  [R] estimation commands, [R] linktest, [R] test, [R] testnl
Title
    ltable -- Life tables for survival data
Syntax
    ltable timevar [deadvar] [weight] [if exp] [in range] [, by(groupvar)
        survival failure hazard intervals(interval) test level(#)
        tvid(varname) noadjust notab graph graph_options noconf ]

fweights are allowed; see [U] 14.1.6 weight.
Description
ltable displays and graphs life tables for individual-level or aggregate data and optionally presents the likelihood-ratio and log-rank tests for equivalence of groups. ltable also allows examining the empirical hazard function through aggregation. Also see [R] st sts for alternative commands.

timevar specifies the time of failure or censoring. If deadvar is not specified, all values of timevar are interpreted as failure times; otherwise, timevar is interpreted as a failure time where deadvar ≠ 0 and as a censoring time otherwise. Observations with timevar or deadvar equal to missing are ignored.

Note carefully that deadvar does not specify the number of failures. An observation with deadvar equal to 1 or 50 has the same interpretation: the observation records one failure. Specify frequency weights for aggregated data (e.g., ltable time [freq=number]).
Options
by(groupvar) creates separate tables (or graphs within the same image) for each value of groupvar; groupvar may be string or numeric.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [R] level.

survival, failure, and hazard indicate the table to be displayed. If not specified, the default is the survival table. Specifying failure would display the cumulative failure table. Specifying survival failure would display both the survival and the cumulative failure table. If graph is specified, multiple tables may not be requested.

intervals(interval) specifies the time intervals into which the data are to be aggregated for tabular presentation. A single numeric argument is interpreted as the width of the interval. For instance, interval(2) aggregates data into the time intervals 0 ≤ t < 2, 2 ≤ t < 4, and so on. Not specifying interval() is equivalent to specifying interval(1). Since in most data failure times are recorded as integers, this amounts to no aggregation except that implied by the recording of the time variable, and so produces Kaplan-Meier product-limit estimates of the survival curve (with an actuarial adjustment; see the noadjust option below). Also see [R] st sts list. Although it is possible to examine survival and failure without aggregation, some form of aggregation is almost always required for examining the hazard.
When more than one argument is specified, time intervals are aggregated as specified. For instance, interval(0,2,8,16) aggregates data into the intervals 0 ≤ t < 2, 2 ≤ t < 8, 8 ≤ t < 16, and (if necessary) the open-ended interval t ≥ 16.
interval(w) is equivalent to interval(0,7,15,30,60,90,180,360,540,720), corresponding to one week, (roughly) two weeks, one month, two months, three months, six months, 1 year, 1.5 years, and 2 years when failure times are recorded in days. The w suggests widening intervals.

test presents two χ² measures of the differences between groups when by() is specified. test does nothing if by() is not specified.
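For instance, with the two-group rat data used in the examples below, the homogeneity tests could be requested like this (a sketch; see the Examples):

. ltable t died, by(group) test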
tvid(varname) is for use with longitudinal data with time-varying parameters as processed by cox; see [R] cox. Each subject appears in the data more than once, and equal values of varname identify observations referring to the same subject. When tvid() is specified, only the last observation on each subject is used in making the table. The order of the data does not matter, and "last" here means the last observation chronologically.

noadjust suppresses the actuarial adjustment for deaths and censored observations. The default is to consider the adjusted number at risk at the start of the interval as the total at the start minus (the number dead or censored)/2. If noadjust is specified, the number at risk is simply the total at the start, corresponding to the standard Kaplan and Meier assumption. noadjust should be specified when using ltable to list results corresponding to those produced by sts list; see [R] st sts list.

notab suppresses displaying the table. This option is often used with graph.
graph requests that the table be presented graphically as well as in tabular form; when notab is also specified, only the graph is presented. When specifying graph, only one table can be calculated and graphed at a time; see survival, failure, and hazard above.

graph_options are any of the options allowed with graph, twoway; see [G] graph options. When noconf is specified, twoway's connect() and symbol() options may be specified with one argument; the default is connect(l) symbol(O). When noconf is not specified, the connect() and symbol() options may be specified with one or three arguments. The default is connect(l||) and symbol(Oii), drawing the confidence band as vertical lines at each point. When you specify one argument, you modify the first argument of the default. When you specify three, you completely control the graph. Thus, connect(lll) would draw the confidence band as a separate curve around the survival, failure, or hazard.

noconf suppresses graphing the confidence intervals around survival, failure, or hazard.
Remarks
Life tables date back to the seventeenth century; Edmund Halley (1693) is often credited with their development. ltable is for use with "cohort" data and, although one often thinks of such tables as following a population from the "birth" of the first member to the "death" of the last, more generally such tables can be thought of as a reasonable way to list any kind of survival data. For an introductory discussion of life tables, see Pagano and Gauvreau (2000, 489-495); for an intermediate discussion, see, for example, Armitage and Berry (1994, 470-477) or Selvin (1996, 311-355); and for a more complete discussion, see Chiang (1984).
Example
In Pike (1966), two groups of rats were exposed to a carcinogen and the number of days to death from vaginal cancer was recorded (reprinted in Kalbfleisch and Prentice 1980, 2):
            Group 1                             Group 2
  143   164   188   188   190        142   155   163   198   205
  192   206   209   213   216        232   232   233   233   233
  220   227   230   234   246        233   239   240   261   280
  265   304   216*  244*             280   296   296   323   204*
                                     344*

The * on a few of the entries indicates that the observation was censored as of the recorded day: the rat had still not died due to vaginal cancer but was withdrawn from the experiment for other reasons.
Having entered these data into Stata, the first few observations are

. list in 1/5

        group       t    died
  1.        1     143       1
  2.        1     164       1
  3.        1     188       1
  4.        1     188       1
  5.        1     190       1
That is, the first observation records a rat from group 1 that died on the 143rd day. The variable died records whether that rat died or was withdrawn (censored):

. list if died==0

        group       t    died
 18.        1     216       0
 19.        1     244       0
 39.        2     204       0
 40.        2     344       0
Four rats, two from each group, did not die but were withdrawn. The survival table for group 1 is

. ltable t died if group==1

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  143    144        19        1      0     0.9474   0.0512     0.6812   0.9924
  164    165        18        1      0     0.8947   0.0704     0.6408   0.9726
  188    189        17        2      0     0.7895   0.0935     0.5319   0.9153
  190    191        15        1      0     0.7368   0.1010     0.4789   0.8810
  192    193        14        1      0     0.6842   0.1066     0.4279   0.8439
  206    207        13        1      0     0.6316   0.1107     0.3790   0.8044
  209    210        12        1      0     0.5789   0.1133     0.3321   0.7626
  213    214        11        1      0     0.5263   0.1145     0.2872   0.7188
  216    217        10        1      1     0.4709   0.1151     0.2410   0.6713
  220    221         8        1      0     0.4120   0.1148     0.1937   0.6194
  227    228         7        1      0     0.3532   0.1125     0.1502   0.5648
  230    231         6        1      0     0.2943   0.1080     0.1105   0.5070
  234    235         5        1      0     0.2355   0.1012     0.0751   0.4459
  244    245         4        0      1     0.2355   0.1012     0.0751   0.4459
  246    247         3        1      0     0.1570   0.0931     0.0312   0.3721
  265    266         2        1      0     0.0785   0.0724     0.0056   0.2864
  304    305         1        1      0     0.0000
The reported survival rates are the survival rates at the end of the interval. Thus, 94.7% of rats survived 144 days or more.
Technical Note
If you compare the table just printed with the corresponding table in Kalbfleisch and Prentice (1980, 14), you will notice that the survival estimates differ beginning with the interval 216-217, the first interval containing a censored observation. ltable treats censored observations as if they were withdrawn halfway through the interval. The table printed in Kalbfleisch and Prentice treated censored observations as if they were withdrawn at the end of the interval, even though Kalbfleisch and Prentice (1980, 15) mention how results could be adjusted for censoring.

In this case, the same results as printed in Kalbfleisch and Prentice could be obtained by incrementing the time of withdrawal by 1 for the four censored observations. We say "in this case" because there were no deaths on the incremented dates. For instance, one of the rats was withdrawn on the 216th day, a day on which there was also a real death. There were no deaths on day 217, however, so moving the withdrawal forward one day is equivalent to assuming the withdrawal occurred at the end of the day 216-217 interval. If the adjustments are made and ltable is used to calculate survival in both groups, the results are as printed in Kalbfleisch and Prentice, except that for group 2 in the interval 240-241 they report the survival as .345 when they mean .354.

In any case, the one-half adjustment for withdrawals is generally accepted, but it is important to remember that it is only a crude adjustment that becomes cruder the wider the intervals.
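If instead you want the pure Kaplan-Meier treatment of censored observations, specify noadjust; the results then correspond to those of sts list. A sketch, assuming the rat data are in memory (the stset step is our own addition; see [R] st):

. ltable t died if group==1, noadjust
. stset t, failure(died)
. sts list if group==1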
Example
When you do not specify the intervals, ltable uses unit intervals. The only aggregation performed on the data was aggregation due to deaths or withdrawals occurring on the same "day". If we wanted to see the table aggregated into 30-day intervals, we would type

. ltable t died if group==1, interval(30)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    150        19        1      0     0.9474   0.0512     0.6812   0.9924
  150    180        18        1      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    270         4        2      1     0.1063   0.0786     0.0139   0.3090
  300    330         1        1      0     0.0000

The interval printed 120 150 means the interval including 120 and up to but not including 150. The reported survival rate is the survival rate just after the close of the interval. When you specify more than one number as the argument to interval(), you specify not the widths but the cutoff points themselves.
. ltable t died if group==1, interval(120,180,210,240,330)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    180        19        2      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    330         4        3      1     0.0354   0.0486     0.0006   0.2245
If any of the underlying failure or censoring times are larger than the last cutoff specified, they are treated as being in the open-ended interval:
t died
interval(_20,180,210,240)
Beg. Total
Deaths
Lost
210
17
6
0
0.5789
0.1133
0.3321
0.76_6
240
11
6
1
0.2481
0.1009
0.0847
0.4552
4
3
1
0.0354
0.0486
0.0006
0.2245
_nterval
1 0
if group==l,
i! Io
Survival
Std. Error
[95Z Conf.
Int,]
ooo00o4 00o
Whether the last interval is treated as open-ended or not makes no difference for survival and failure tables, but it does affect hazard tables. If the interval is open-ended, the hazard is not calculated for it.
Example
The by(varname) option specifies that separate tables are to be presented for each value of varname. Remember that our rat dataset contains two groups:

. ltable t died, by(group) interval(30)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
group = 1
  120    150        19        1      0     0.9474   0.0512     0.6812   0.9924
  150    180        18        1      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    270         4        2      1     0.1063   0.0786     0.0139   0.3090
  300    330         1        1      0     0.0000
group = 2
  120    150        21        1      0     0.9524   0.0465     0.7072   0.9932
  150    180        20        2      0     0.8571   0.0764     0.6197   0.9516
  180    210        18        2      1     0.7592   0.0939     0.5146   0.8920
  210    240        15        7      0     0.4049   0.1099     0.1963   0.6053
  240    270         8        2      0     0.3037   0.1031     0.1245   0.5057
  270    300         6        4      0     0.1012   0.0678     0.0172   0.2749
  300    330         2        1      0     0.0506   0.0493     0.0035   0.2073
  330    360         1        0      1     0.0506   0.0493     0.0035   0.2073
Example
A failure table is simply a different way of looking at a survival table; failure is 1 − survival:

. ltable t died if group==1, interval(30) failure

                  Beg.                      Cum.      Std.
  Interval       Total   Deaths   Lost   Failure     Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    150        19        1      0     0.0526   0.0512     0.0076   0.3188
  150    180        18        1      0     0.1053   0.0704     0.0274   0.3592
  180    210        17        6      0     0.4211   0.1133     0.2374   0.6679
  210    240        11        6      1     0.7519   0.1009     0.5448   0.9153
  240    270         4        2      1     0.8937   0.0786     0.6910   0.9861
  300    330         1        1      0     1.0000
Example
Selvin (1996, 332) presents follow-up data from Cutler and Ederer (1958) on six cohorts of kidney cancer patients. The goal is to estimate the 5-year survival probability.

  Year    Interval   Alive   Deaths   Lost   Withdrawn
  1946      0-1          9        4      1
            1-2          4        0      0
            2-3          4        0      0
            3-4          4        0      0
            4-5          4        0      0
            5-6          4        0      0           4
  1947      0-1         18        7      0
            1-2         11        0      0
            2-3         11        1      0
            3-4         10        2      2
            4-5          6        0      0           6
  1948      0-1         21       11      0
            1-2         10        1      2
            2-3          7        0      0
            3-4          7        0      0           7
  1949      0-1         34       12      0
            1-2         22        3      3
            2-3         16        1      0          15
  1950      0-1         19        5      1
            1-2         13        1      1          11
  1951      0-1         25        8      2          15

The following is the Stata dataset corresponding to the table:

. list

        year       t    died    pop
  1.    1946      .5       1      4
  2.    1946      .5       0      1
  3.    1946     5.5       0      4
  4.    1947      .5       1      7
  5.    1947     2.5       1      1
  etc.
As summary data may often come in the form shown above, it is worth understanding exactly how the data were translated for use with ltable. t records the time of death or censoring (lost to follow-up or withdrawal), died contains 1 if the observation records a death and 0 if it instead records lost or withdrawn patients, and pop records the number of patients in the category. The first line of the table stated that, in the 1946 cohort, there were 9 patients at the start of the interval 0-1 and, during the interval, 4 died and 1 was lost to follow-up. Thus, we entered in observation 1 that at t = .5, 4 patients died and, in observation 2, that at t = .5, 1 patient was censored. We ignored the information on the total population because ltable will figure that out for itself.
The second line of the table indicated that, in the interval 1-2, 4 patients were still alive at the beginning of the interval and, during the interval, 0 died or were lost to follow-up. Since no patients died or were censored, we entered nothing into our data. Similarly, we entered nothing for lines 3, 4, and 5 of the table. The last line for 1946 stated that, in the interval 5-6, 4 patients were alive at the beginning of the interval and that those 4 patients were withdrawn. In observation 3, we entered that there were 4 censorings at t = 5.5.
The fact that we chose to record the times of deaths or censoring as midpoints of intervals does not matter; we could just as well have recorded the times as 0.8 and 5.8. By default, ltable will form intervals 0-1, 1-2, and so on, and place observations into the intervals to which they belong. We suggest using 0.5 and 5.5 because those numbers correspond to the underlying assumptions made by ltable in making its calculations. Using midpoints reminds you of the assumptions.
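For instance, the three 1946 records could have been entered with input; this is only one of several ways the dataset might have been created:

. input year t died pop
  1946   .5  1  4
  1946   .5  0  1
  1946  5.5  0  4
  end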
To obtain the survival rates, we type

. ltable t died [freq=pop]

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
    0      1       126       47     19     0.5966   0.0455     0.5017   0.6792
    1      2        60        5     17     0.5386   0.0479     0.4405   0.6269
    2      3        38        2     15     0.5033   0.0508     0.4002   0.5977
    3      4        21        2      9     0.4423   0.0602     0.3225   0.5554
    4      5        10        0      6     0.4423   0.0602     0.3225   0.5554
    5      6         4        0      4     0.4423   0.0602     0.3225   0.5554
We estimate the 5-year survival rate as .4423 and the 95% confidence interval as .3225 to .5554.

Selvin (1996, 336), in presenting these results, lists the survival in the interval 0-1 as 1, in 1-2 as .597, in 2-3 as .539, and so on. That is, relative to us, he shifted the rates down one row and inserted a 1 in the first row. In his table, the survival rate is the survival rate at the start of the interval. In our table, the survival rate is the survival rate at the end of the interval (or, equivalently, at the start of the next interval). This is, of course, simply a difference in the way the numbers are presented and not in the numbers themselves.
Example
The discrete hazard function is the rate of failure: the number of failures occurring within a time interval divided by the width of the interval (assuming there are no censored observations). While the survival and failure tables are meaningful at the "individual" level, with intervals so narrow that each contains only a single failure, that is not true for the discrete hazard. If all intervals contained one death and if all intervals were of equal width, the hazard function would be 1/Δt and so appear to be a constant!

The empirically determined discrete hazard function can only be revealed by aggregation. Gross and Clark (1975, 37) print data on malignant melanoma at the M. D. Anderson Tumor Clinic between 1944 and 1960. The interval is the time from initial diagnosis:
  Interval    Number    Number lost     Number
  (years)     dying     to follow-up    withdrawn alive
    0-1         312          19               77
    1-2          96           3               71
    2-3          45           4               58
    3-4          29           3               27
    4-5           7           5               35
    5-6           9           1               36
    6-7           3           0               17
    7-8           1           2               10
    8-9           3           0                8
    9+           32           0                0
For our statistical purposes, there is no difference between the number lost to follow-up (patients who disappeared) and the number withdrawn alive (patients dropped by the researchers); both are censored. We have entered the data into Stata; here is a small amount of it:

. list in 1/6

           t     d    pop
  1.      .5     1    312
  2.      .5     0     19
  3.      .5     0     77
  4.     1.5     1     96
  5.     1.5     0      3
  6.     1.5     0     71

We entered each group's time of death or censoring as the midpoint of the intervals and entered the numbers of the table, recording d as 1 for deaths and 0 for censoring. The hazard table is

. ltable t d [freq=pop], hazard interval(0,1,2,3,4,5,6,7,8,9)
               Beg.     Cum.      Std.                Std.
  Interval    Total   Failure    Error    Hazard     Error    [95% Conf. Int.]
------------------------------------------------------------------------------
   0     1      913    0.3607   0.0163    0.4401    0.0243     0.3924   0.4877
   1     2      505    0.4918   0.0176    0.2286    0.0232     0.1831   0.2740
   2     3      335    0.5671   0.0182    0.1599    0.0238     0.1133   0.2064
   3     4      228    0.6260   0.0188    0.1461    0.0271     0.0931   0.1991
   4     5      169    0.6436   0.0190    0.0481    0.0182     0.0125   0.0837
   5     6      122    0.6746   0.0200    0.0909    0.0303     0.0316   0.1502
   6     7       76    0.6890   0.0208    0.0455    0.0262     0.0000   0.0969
   7     8       56    0.6952   0.0213    0.0202    0.0202     0.0000   0.0598
   8     9       43    0.7187   0.0235    0.0800    0.0462     0.0000   0.1705
   9            32    1.0000
We specified the interval() option as we did, and not as interval(1) (or omitting the option altogether), to force the last interval to be open-ended. Had we not, and if we had recorded t as 9.5 for observations in that interval (as we did), ltable would have calculated a hazard rate for the "interval". In this case, the result of that calculation would have been 2 but, no matter what the result, it would have been meaningless since we do not know the width of the interval.

You are not limited to merely examining a column of numbers. With the graph option, you can see the result graphically:

. ltable t d [freq=pop], hazard i(0,1,2,3,4,5,6,7,8,9) graph notab
>     xlab(0,2,4,6,8,10) border
 (figure omitted: graph of the estimated hazard against time (years), with the 95% confidence intervals drawn as vertical lines)
The vertical lines in the graph represent the 95% confidence intervals for the hazard; specifying noconf would have suppressed them. Among the options we did specify, although it is not required, notab suppressed printing the table, saving us some paper. xlab() and border were passed through to the graph command; see [G] graph options.
Example
You can graph the survival function the same way you graph the hazard function: just omit the hazard option.
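For instance, a survival-curve version of the hazard graph shown earlier would be (a sketch; the command is identical except that hazard is omitted):

. ltable t d [freq=pop], i(0,1,2,3,4,5,6,7,8,9) graph notab
>     xlab(0,2,4,6,8,10) border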
Methods and Formulas
ltable is implemented as an ado-file.

Let \tau_i be the individual failure or censoring times. The data are aggregated into intervals given by t_j, j = 1, \ldots, J, and t_{J+1} = \infty, with each interval containing counts for t_j \le \tau < t_{j+1}. Let d_j and m_j be the number of failures and censored observations during the interval and N_j the number alive at the start of the interval. Define n_j = N_j - m_j/2 as the adjusted number at risk at the start of the interval. If the noadjust option is specified, n_j = N_j.

The product-limit estimate of the survivor function is

    S_j = \prod_{k=1}^{j} \frac{n_k - d_k}{n_k}

(Kalbfleisch and Prentice 1980, 12, 15). Greenwood's formula for the asymptotic standard error of S_j is

    s_j = S_j \sqrt{ \sum_{k=1}^{j} \frac{d_k}{n_k (n_k - d_k)} }

(Greenwood 1926; Kalbfleisch and Prentice 1980, 14, 15). s_j is reported as the standard deviation of survival but is not used in generating the confidence intervals since it can produce intervals outside 0 and 1. The "natural" units for the survival function are \log(-\log S_j), and the asymptotic standard error of that quantity is

    \hat{s}_j = \sqrt{ \frac{ \sum_k d_k / \{ n_k (n_k - d_k) \} }{ \bigl[ \sum_k \log\{ (n_k - d_k)/n_k \} \bigr]^2 } }

(Kalbfleisch and Prentice 1980, 15). The corresponding confidence intervals are S_j^{\exp(\pm z_{1-\alpha/2}\,\hat{s}_j)}.

The cumulative failure time is defined as G_j = 1 - S_j, and thus the variance is the same as for S_j and the confidence intervals are 1 - S_j^{\exp(\pm z_{1-\alpha/2}\,\hat{s}_j)}. For purposes of graphing, both S_j and G_j are graphed against t_{j+1}.

Define the within-interval failure rate as f_j = d_j / n_j. The maximum likelihood estimate of the (within-interval) hazard is then

    \lambda_j = \frac{f_j}{(1 - f_j/2)(t_{j+1} - t_j)}

The standard error of \lambda_j is

    s_{\lambda_j} = \lambda_j \sqrt{ \frac{ 1 - \{ (t_{j+1} - t_j)\lambda_j/2 \}^2 }{ d_j } }

from which a confidence interval is calculated. For graphing purposes, \lambda_j is graphed against (t_j + t_{j+1})/2.

If the noadjust option is specified, the estimate of the hazard is

    \lambda_j = \frac{f_j}{t_{j+1} - t_j}

and its standard error is

    s_{\lambda_j} = \frac{\lambda_j}{\sqrt{d_j}}

The confidence interval is

    \left[ \; \frac{\lambda_j}{2 d_j}\,\chi^2_{2 d_j,\,\alpha/2}, \;\; \frac{\lambda_j}{2 d_j}\,\chi^2_{2 d_j,\,1-\alpha/2} \; \right]
where \chi^2_{2 d_j, q} is the qth quantile of the \chi^2 distribution with 2 d_j degrees of freedom (Cox and Oakes 1984, 53-54, 38-40).

For the likelihood-ratio test for homogeneity, let d_g be the total number of deaths in the gth group. Define T_g = \sum_{i \in g} \tau_i, where i indexes the individual failure or censoring times. The \chi^2 value with G - 1 degrees of freedom (where G is the total number of groups) is

    \chi^2 = 2 \left\{ \Bigl( \sum_g d_g \Bigr) \log\Bigl( \frac{\sum_g T_g}{\sum_g d_g} \Bigr) - \sum_g d_g \log\Bigl( \frac{T_g}{d_g} \Bigr) \right\}

(Lawless 1982, 113). The log-rank test for homogeneity is the test presented by sts test; see [R] st sts.
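As a numeric check of these formulas (our own arithmetic, not part of the manual's output), take the first row of the group 1 survival table above, where n_1 = 19 and d_1 = 1:

    S_1 = \frac{19 - 1}{19} = 0.9474, \qquad
    s_1 = S_1 \sqrt{\frac{1}{19 \times 18}} = 0.0512

both of which match the displayed values. Similarly, for the first melanoma interval, n_1 = 913 - (19 + 77)/2 = 865, so f_1 = 312/865 = 0.3607 (the cumulative failure shown) and \lambda_1 = 0.3607/(1 - 0.3607/2) = 0.4401, agreeing with the hazard table.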
Acknowledgments
ltable is based on the lftbl command by Henry Krakauer and John Stewart (1991). We also thank Michel Henry-Amar, Centre Regional Francois Baclesse, Caen, France, for his comments.

References
Armitage, P. and G. Berry. 1994. Statistical Methods in Medical Research. 3d ed. Oxford: Blackwell Scientific Publications.
Chiang, C. L. 1984. The Life Table and Its Applications. Malabar, FL: Krieger.
Cox, D. R. and D. Oakes. 1984. Analysis of Survival Data. London: Chapman and Hall.
Cutler, S. J. and F. Ederer. 1958. Maximum utilization of the life table method in analyzing survival. Journal of Chronic Diseases 8: 699-712.
Greenwood, M. 1926. The natural duration of cancer. Reports on Public Health and Medical Subjects 33: 1-26. London: His Majesty's Stationery Office.
Gross, A. J. and V. A. Clark. 1975. Survival Distributions: Reliability Applications in the Biomedical Sciences. New York: John Wiley & Sons.
Halley, E. 1693. An estimate of the degrees of mortality of mankind, drawn from curious tables of the births and funerals at the city of Breslau; with an attempt to ascertain the price of annuities upon lives. Philosophical Transactions 17: 596-610. London: The Royal Society.
Kahn, H. A. and C. T. Sempos. 1989. Statistical Methods in Epidemiology. New York: Oxford University Press.
Kalbfleisch, J. D. and R. L. Prentice. 1980. The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.
Krakauer, H. and J. Stewart. 1991. ssa1: Actuarial or life-table analysis of time-to-event data. Stata Technical Bulletin 1: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 200-202.
Lawless, J. F. 1982. Statistical Models and Methods for Lifetime Data. New York: John Wiley & Sons.
Pagano, M. and K. Gauvreau. 2000. Principles of Biostatistics. 2d ed. Pacific Grove, CA: Brooks/Cole.
Pike, M. C. 1966. A method of analysis of a certain class of experiments in carcinogenesis. Biometrics 22: 142-161.
Selvin, S. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.
Also See
Related:     [R] cox, [R] st, [R] weibull
Background:  Stata Graphics Manual
Title
    lv -- Letter-value display
Syntax
    lv [varlist] [if exp] [in range] [, generate tail(#) ]

by ... : may be used with lv; see [R] by.
Description
lv shows a letter-value display (Tukey 1977, 44-49; Hoaglin 1983) for each variable in varlist. If no variables are specified, letter-value displays are shown for each numeric variable in the data.
Options
generate adds four new variables to the data: _mid, containing the midsummaries; _spread, containing the spreads; _psigma, containing the pseudosigmas; and _z2, containing the squared values from a N(0,1) corresponding to the particular letter value. If the variables _mid, _spread, _psigma, and _z2 already exist, their contents are replaced. At most, only the first 11 observations of each variable are used; the remaining observations contain missing. If varlist specifies more than one variable, the newly created variables contain results for the last variable specified. The generate option may not be used with by ... :.

tail(#) indicates the inverse of the tail density through which letter values are to be displayed: 2 corresponds to the median (meaning half in each tail), 4 to the fourths (roughly the 25th and 75th percentiles), 8 to the eighths, and so on. # may be specified as 4, 8, 16, 32, 64, 128, 256, 512, or 1,024 and defaults to a value of # that has corresponding depth just greater than 1. The default is taken as 1,024 if the calculation results in a number larger than 1,024. Given the intelligent default, this option is rarely specified.
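For instance, to restrict the display to the median and fourths only, one might type (a sketch, assuming a variable named mpg is in memory):

. lv mpg, tail(4)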
Remarks
Letter-value displays are a collection of observations drawn systematically from the data, focusing especially on the tails rather than the middle of the distribution. The displays are called letter-value displays because letters have been (almost arbitrarily) assigned to tail densities:

    Letter   Tail Area        Letter   Tail Area
      M        1/2              B        1/64
      F        1/4              A        1/128
      E        1/8              Z        1/256
      D        1/16             Y        1/512
      C        1/32             X        1/1024
Example
You have data on the mileage ratings of 74 automobiles. To obtain a letter-value display:

. lv mpg

        #  74        Mileage (mpg)
   -------------------------------------------------------------
   M    37.5               20             spread    pseudosigma
   F    19          18    21.5    25           7       5.216359
   E    10          15    21.5    28          13       5.771728
   D     5.5        14    22.25   30.5      16.5       5.576303
   C     3          14    24.5    35          21       5.831039
   B     2          12    23.5    35          23       5.732448
   A     1.5        12    25      38          26       6.040635
         1          12    26.5    41          29        6.16562

                                          # below        # above
   inner fence       7.5        35.5            0              1
   outer fence      -3          46              0              0
The decimal points can be made to line up, and thus the output made more readable, by specifying a display format for the variable; see [U] 15.5 Formats: controlling how data are displayed.

. format mpg %9.2f
. lv mpg

        #  74        Mileage (mpg)
   -------------------------------------------------------------
   M    37.5              20.00           spread    pseudosigma
   F    19        18.00   21.50   25.00     7.00           5.22
   E    10        15.00   21.50   28.00    13.00           5.77
   D     5.5      14.00   22.25   30.50    16.50           5.58
   C     3        14.00   24.50   35.00    21.00           5.83
   B     2        12.00   23.50   35.00    23.00           5.73
   A     1.5      12.00   25.00   38.00    26.00           6.04
         1        12.00   26.50   41.00    29.00           6.17

                                          # below        # above
   inner fence      7.50       35.50            0              1
   outer fence     -3.00       46.00            0              0
At the top, the number of observations is indicated as 74. The first line shows the statistics associated with M, the letter value that puts half the density in each tail, or the median. The median has depth 37.5 (that is, in the ordered data, M is 37.5 observations in from the extremes) and has value 20. The next line shows the statistics associated with F, or the fourths. The fourths have depth 19 (that is, in the ordered data, the lower fourth is observation 19 and the upper fourth is observation 74 - 19 + 1), and the values of the lower and upper fourths are 18 and 25. The number in the middle is the point halfway between the fourths, called a midsummary. If the distribution were perfectly symmetric, the midsummary would equal the median. The spread is the difference between the lower and upper summaries (25 - 18 = 7). For fourths, half of the data lies within a 7-mpg band. The pseudosigma is a calculation of the standard deviation using only the lower and upper summaries and assuming that the variable is normally distributed. If the data really were normally distributed, all the pseudosigmas would be roughly equal.

After the letter values, the line labeled with depth 1 reports the minimum and maximum values. In this case, the halfway point between the extremes is 26.5, which is greater than the median, indicating that 41 is more extreme than 12, at least relative to the median. Also note that, with each letter value, the midsummaries are increasing: our data are skewed. The pseudosigmas are also increasing, indicating that the data are spreading out relative to a normal distribution although, given the evident skewness, this elongation may be an artifact of the skewness.
At the end is an attempt to identify outliers, although the points so identified are merely outside some predetermined cutoff. Points outside the inner fence are called outside values or mild outliers. Points outside the outer fence are called severe outliers. The inner fence is defined as (3/2)IQR and the outer fence as 3 IQR above and below the F summaries, where the IQR is the spread of the fourths.
Technical Note
The form of the letter-value display has varied slightly with different authors. lv displays appear as described by Hoaglin (1983) but as modified by Emerson and Stoto (1983), where they included the midpoint of each of the spreads. This format was later adopted by Hoaglin (1985). If the distribution is symmetric, the midpoints will all be roughly equal. On the other hand, if the midpoints vary systematically, the distribution is skewed.

The pseudosigmas are obtained from the lower and upper summaries for each letter value. For each letter value, they are the standard deviation a normal distribution would have if its spread for the given letter value were to equal the observed spread. If the pseudosigmas are all roughly equal, the data are said to have neutral elongation. If the pseudosigmas increase systematically, the data are said to be more elongated than a normal, i.e., have thicker tails. If the pseudosigmas decrease systematically, the data are said to be less elongated than a normal, i.e., have thinner tails.

Interpretation of the number of mild and severe outliers is more problematic. The following discussion is drawn from Hamilton (1991):

Obviously, the presence of any such outliers does not rule out that the data have been drawn from a normal; in large datasets, there will most certainly be observations outside (3/2)IQR and 3 IQR. Severe outliers, however, comprise about two per million (.0002%) of a normal population. In samples, they lie far enough out to have substantial effects on means, standard deviations, and other classical statistics. The .0002%, however, should be interpreted carefully: outliers appear more often in small samples than one might expect from population proportions due to sampling variation in estimated quartiles. Monte Carlo simulation by Hoaglin, Iglewicz, and Tukey (1986) obtained these results on the percentages and numbers of outliers in random samples from a normal population:

           percentage of outliers       number of outliers
    n         any       severe            any      severe
    10       2.83        .362            .283       .0362
    20       1.66        .074            .332       .0148
    50       1.15        .011            .575       .0055
    100       .95        .002            .95        .002
    200       .79        .001           1.58        .002
    300       .75        .001           2.25        .003
    inf       .70        .0002           inf         inf

Thus, the presence of any severe outliers in samples of less than 300 is sufficient to reject normality. Hoaglin, Iglewicz, and Tukey (1981) suggested the approximation .00698 + .4/n for the fraction of mild outliers in a sample of size n or, equivalently, .00698n + .4 for the number of outliers.
Example
The generate option adds the variables _mid, _spread, _psigma, and _z2 to your data, making possible many of the diagnostic graphs suggested by Hoaglin (1985).

. lv mpg, generate
 (output omitted)
. list _mid _spread _psigma _z2 in 1/12

         _mid   _spread    _psigma        _z2
  1.       20         .          .          .
  2.     21.5         7   5.216359   .4501955
  3.     21.5        13   5.771728    1.26828
  4.    22.25      16.5   5.576303   2.188846
  5.     24.5        21   5.831039    3.24255
  6.     23.5        23   5.732448   4.024532
  7.       25        26   6.040635   4.631499
  8.        .         .          .          .
  9.        .         .          .          .
 10.        .         .          .          .
 11.     26.5        29    6.16562    5.53073
 12.        .         .          .          .

Observations 12 through the end are missing for these new variables. The definition of the observations is always the same. The first observation contains the M summary, the second the F, the third the E, and so on. Observation 11 always contains the summary for depth 1. Observations 8 through 10, corresponding to letter values Z, Y, and X, contain missing because these statistics were not calculated: we have only 74 observations and their depth would be 1.
Hoaglin (1985) suggests graphing the midsummary against z². If the distribution is not skewed, the points in the resulting graph will be along a horizontal line:

. graph _mid _z2, border ylabel xlabel

 (figure omitted: scatterplot of the midsummaries of mpg against Z squared)

The graph clearly indicates the skewness of the distribution. One might also graph _psigma against _z2 to examine elongation.
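That companion graph, by analogy with the command just shown, would be (a sketch):

. graph _psigma _z2, border ylabel xlabel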
Saved Results
lv saves in r():

Scalars
    r(N)         number of observations     r(u_C)    upper 32nd
    r(min)       minimum                    r(l_B)    lower 64th
    r(max)       maximum                    r(u_B)    upper 64th
    r(median)    median                     r(l_A)    lower 128th
    r(l_F)       lower 4th                  r(u_A)    upper 128th
    r(u_F)       upper 4th                  r(l_Z)    lower 256th
    r(l_E)       lower 8th                  r(u_Z)    upper 256th
    r(u_E)       upper 8th                  r(l_Y)    lower 512th
    r(l_D)       lower 16th                 r(u_Y)    upper 512th
    r(u_D)       upper 16th                 r(l_X)    lower 1024th
    r(l_C)       lower 32nd                 r(u_X)    upper 1024th

The lower/upper 8ths, 16ths, ..., 1024ths will be defined only if there are sufficient data.
Methods and Formulas
lv is implemented as an ado-file.

Let N be the number of (nonmissing) observations on x, and let x_{(i)} refer to the ordered data when i is an integer. Define x_{(i+.5)} = (x_{(i)} + x_{(i+1)})/2; the median is defined as x_{((N+1)/2)}.

Define x[d] as the pair of numbers x_{(d)} and x_{(N+1-d)}, where d is called the depth. Thus, x[1] refers to the minimum and maximum of the data. Define m = (N+1)/2 as the depth of the median, f = (\lfloor m \rfloor + 1)/2 as the depth of the fourths, e = (\lfloor f \rfloor + 1)/2 as the depth of the eighths, and so on. Depths are reported on the far left of the letter-value display. The corresponding fourths of the data are x[f], the eighths x[e], and so on. These values are reported inside the display. The middle value is defined as the corresponding midpoint of x[\cdot]. The spreads are defined as the difference in x[\cdot].

The corresponding point z_i on a standard normal distribution is obtained as (Hoaglin 1985, 456-457)

    z_i = F^{-1}\{ (d_i - 1/3)/(N + 1/3) \}    if d_i > 1
    z_i = F^{-1}\{ 0.695/(N + 0.390) \}        otherwise

where d_i is the depth of the letter value. The corresponding pseudosigma is obtained as the ratio of the spread to -2 z_i (Hoaglin 1985, 431).

Define (F_l, F_u) = x[f]. The inner fence has cutoffs F_l - (3/2)(F_u - F_l) and F_u + (3/2)(F_u - F_l). The outer fence has cutoffs F_l - 3(F_u - F_l) and F_u + 3(F_u - F_l).

The inner-fence values reported by lv are almost exactly equal to those used by graph, box to identify outside points. The only difference is that graph uses a slightly different definition of fourths: namely, the 25th and 75th percentiles as defined by summarize.
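To tie these definitions to the example above (our own arithmetic): with N = 74, m = 75/2 = 37.5, f = (37 + 1)/2 = 19, e = (19 + 1)/2 = 10, and d = (10 + 1)/2 = 5.5, exactly the depths displayed. For the fourths, z_F = F^{-1}\{(19 - 1/3)/(74 + 1/3)\} = F^{-1}(0.2511) \approx -0.671, so the pseudosigma is 7/(2 \times 0.671) \approx 5.216, agreeing with the displayed 5.216359.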
References
Emerson, J. D. and M. A. Stoto. 1983. Transforming data. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 97-128. New York: John Wiley & Sons.
Fox, J. 1990. Describing univariate distributions. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 58-125. Newbury Park, CA: Sage Publications.
Hamilton, L. C. 1991. sed4: Resistant normality check and outlier identification. Stata Technical Bulletin 3: 15-18. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 86-90.
Hoaglin, D. C. 1983. Letter values: a set of selected order statistics. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 33-57. New York: John Wiley & Sons.
--. 1985. Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 417-460. New York: John Wiley & Sons.
Hoaglin, D. C., B. Iglewicz, and J. W. Tukey. 1981. Small-sample performance of a resistant rule for outlier detection. In 1980 Proceedings of the Statistical Computing Section, 144-152. Washington, DC: American Statistical Association.
--. 1986. Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association 81: 991-999.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company.
Also See
Related:  [R] diagplots, [R] stem, [R] summarize
Title
    matsize -- Set the maximum number of variables in a model
Syntax
    set matsize #

where 10 ≤ # ≤ 800
Description
set matsize sets the maximum number of variables that can be included in any of Stata's model-estimation commands. The command may not be used with Small Stata; there, matsize is permanently frozen at 40. For Intercooled Stata, the initial value is 40, but it may be changed upward or downward. The upper limit is 800.
Remarks
set matsize controls the internal size of matrices that Stata uses. The default of 40, for instance, means that linear regression models are limited to 38 independent variables: 38 because the constant uses one position and the dependent variable another, making a total of 40.

Under Stata for Macintosh, there must be no data in memory when you change matsize, and increasing matsize decreases the amount of memory available for data. Under Stata for Windows and Stata for Unix, you may change matsize with data in memory, but increasing matsize increases the amount of memory consumed by Stata, increasing the probability of page faults and thus of making Stata run more slowly.
Example
You wish to estimate a model of y on the variables x1 through x100. Without thinking, you type

. regress y x1-x100
matsize too small; type -help matsize-
r(908);

You realize that you need to increase matsize; you are using Intercooled Stata and type

. set matsize 150
no; data in memory would be lost
r(4);

If you are using Stata for Macintosh, you must

. save mydata
file mydata.dta saved
. drop _all
. set matsize 150
. use mydata
. regress y x1-x100
 (output omitted)

Under Stata for Windows or Stata for Unix, you do not have to go to that trouble:

. regress y x1-x100
matsize too small; type -help matsize-
r(908);
. set matsize 150
. regress y x1-x100
 (output omitted)
Also See
Related:     [R] memory
Background:  [U] 7 Setting the size of memory
Title
    maximize -- Details of iterative maximization
Syntax
    mle_cmd ... [, [no]log trace gradient hessian showstep iterate(#)
        tolerance(#) ltolerance(#) gtolerance(#) difficult
        from(init_specs) ]

where init_specs is one of
    matname [, skip copy]
    { [eqname:]name=# | /eqname=# } [...]
    # [# ...], copy

Description
Stata has two maximum likelihood optimizers: one is used by internally coded commands and the other is the ml command used by estimators implemented as ado-files. Both optimizers use the Newton-Raphson method with step halving (to avoid downhill steps) and special fixups when nonconcave regions of the likelihood are encountered. The two optimizers are similar but differ in the details of their implementation.

For information on programming maximum likelihood estimators in ado-files, see [R] ml and Maximum Likelihood Estimation with Stata (Gould and Sribney 1999).
Options
log and nolog specify whether an iteration log showing the progress of the log likelihood is to be displayed. For most commands, the log is displayed by default and nolog suppresses it. For a few commands (such as the svy maximum likelihood estimators), it is the opposite; you must specify log to see the log.

trace adds to the iteration log a display of the current parameter vector.

gradient (ml-programmed estimators only) adds to the iteration log a display of the current gradient vector.

hessian (ml-programmed estimators only) adds to the iteration log a display of the current negative Hessian matrix.

showstep (ml-programmed estimators only) adds to the iteration log a report on the steps within an iteration. This option was added so that developers at Stata could view the stepping when they were improving the ml optimizer code. At this point, it mainly provides entertainment.

iterate(#) specifies the maximum number of iterations. When the number of iterations equals iterate(), the optimizer stops and presents the current results. If convergence is declared before this threshold is reached, it will stop when convergence is declared. Specifying iterate(0) is useful for viewing results evaluated at the initial value of the coefficient vector. iterate(0) and from() specified together allow one to view results evaluated at a specified coefficient vector; note, however, that only a few commands allow the from() option. iterate(16000) is the default for both estimators programmed internally and estimators programmed with ml.
tolerance(#) specifies the tolerance for the coefficient vector. When the relative change in the coefficient vector from one iteration to the next is less than or equal to tolerance(), estimates are declared to have converged. If this criterion is satisfied, convergence is declared regardless of the status of the likelihood tolerance ltolerance(). tolerance(1e-4) is the default for estimators programmed internally in Stata. tolerance(1e-6) is the default for estimators programmed with ml.

ltolerance(#) specifies the tolerance for the log likelihood. When the relative change in the log likelihood from one iteration to the next is less than or equal to ltolerance(), estimates are declared to have converged. If this criterion is satisfied, convergence is declared regardless of the status of the coefficient-vector tolerance tolerance(). ltolerance(0) is the default for estimators programmed internally in Stata. ltolerance(1e-7) is the default for estimators programmed with ml.

gtolerance(#) (ml-programmed estimators only) specifies an optional tolerance for the gradient relative to the coefficients. When |g_i b_i| < gtolerance() for all parameters b_i and the corresponding elements of the gradient g_i, the gradient tolerance criterion is met. Unlike tolerance() and ltolerance(), the gtolerance() criterion must be met in addition to any other tolerance. That is, convergence is declared when gtolerance() is met and tolerance() or ltolerance() is met. The gtolerance() option is provided for particularly deceptive likelihood functions that may trigger premature declarations of convergence. The option must be specified for gradient checking to be activated; by default, the gradient is not checked.

difficult (ml-programmed estimators only) specifies that the likelihood function is likely to be difficult to maximize due to nonconcave regions. When the message "not concave" appears repeatedly, ml's standard stepping algorithm may not be working well. difficult specifies that a different stepping algorithm is to be used in nonconcave regions. There is no guarantee that difficult will work better than the default; sometimes it is better, sometimes it is worse. The difficult option should only be attempted when the default stepper declares convergence and the last iteration is "not concave", or when the default stepper is repeatedly issuing "not concave" messages and only producing tiny improvements in the log likelihood.

from() specifies initial values for the coefficients. Note that only a few estimators in Stata currently support this option. The initial values can be specified in one of three ways: by specifying the name of a vector containing the initial values (e.g., from(b0), where b0 is a properly labeled vector); by specifying coefficient names with the values (e.g., from(age=2.1 /sigma=7.4)); or by specifying a list of values (e.g., from(2.1 7.4, copy)). from() is intended for use when doing bootstraps (see [R] bstrap) and in other special situations (e.g., used with iterate(0)). Even when the values specified in from() are close to the values that maximize the likelihood, only a few iterations may be saved. Poor values in from() may lead to convergence problems.

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
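For instance, here is a sketch of evaluating, but not maximizing, a likelihood at chosen starting values; mle_cmd stands for any estimator that supports from(), and the vector b0 and its values are hypothetical:

. matrix b0 = (2.1, 7.4)
. mle_cmd y x1, from(b0, copy) iterate(0)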
Remarks
Only in rare circumstances would a user ever need to specify any of these options, with the exception of nolog. nolog is useful for reducing the amount of output appearing in log files.

The following is an example of an iteration log:
Iteration 0:  log likelihood = -3791.0251
Iteration 1:  log likelihood =  -3761.738
Iteration 2:  log likelihood = -3758.0632  (not concave)
Iteration 3:  log likelihood = -3758.0447
Iteration 4:  log likelihood = -3757.5861
Iteration 5:  log likelihood =  -3757.474
Iteration 6:  log likelihood = -3757.4613
Iteration 7:  log likelihood = -3757.4606
Iteration 8:  log likelihood = -3757.4606
 (table of results omitted)
At iteration 8, the model converged. The only notable thing about this iteration log is the message "not concave" at the second iteration. This example was produced using the heckman command; its likelihood is not globally concave, so it is not surprising that this message sometimes appears. The other message that is occasionally seen is "backed up". Neither of these messages should be of any concern unless they appear at the final iteration.

If a "not concave" message appears at the last step, there are two possibilities. One is that it is a "valid" result, but there is collinearity in the model that the command did not catch. Stata checks for obvious collinearity among the independent variables prior to performing the maximization, but strange collinearities or near collinearities can sometimes arise between coefficients and ancillary parameters. The second cause for a "not concave" message at the final step is that the optimizer entered a very flat region of the likelihood and prematurely declared convergence.

If a "backed up" message appears at the last step, there are also two possibilities. One is that it found a perfect maximum and it could not step to a better point; if this is the case, all is fine, but this is a highly unlikely occurrence. The second is that the optimizer worked itself into a bad concave spot where the computed gradient and Hessian gave a bad direction for stepping.

If either of these messages appears at the last step, do the maximization again with the gradient option. If the gradient goes to zero, the optimizer has found a maximum that may not be unique but is a maximum. From the standpoint of maximum likelihood estimation, it is a valid result. If the gradient is not zero, it is not a valid result, and you should try the following: Try tightening up the convergence criterion. Try ltol(0) tol(1e-7) or gtol(0.1) (with the default ltol() tol()) and see if the optimizer can work its way out of the bad region.

If you get repeated "not concave" steps with little progress being made at each step, try specifying the difficult option. Sometimes difficult works wonderfully, reducing the number of iterations and producing convergence at a good (i.e., concave) point. Other times, difficult works poorly, taking much longer to converge than the default stepper.
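A sketch of those diagnostic re-runs, again with mle_cmd standing in for whatever estimator you are using:

. mle_cmd y x1 x2, gradient           (inspect the gradient at the maximum)
. mle_cmd y x1 x2, ltol(0) tol(1e-7)  (tighten the convergence criterion)
. mle_cmd y x1 x2, difficult          (alternative stepper for nonconcave regions)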
Saved Results
Maximum likelihood estimators save in e():

Scalars
    e(N)          number of observations                 always saved
    e(df_m)       model degrees of freedom               always saved
    e(r2_p)       pseudo R-squared                       sometimes saved
    e(ll)         log likelihood                         always saved
    e(ll_0)       log likelihood, constant-only model    usually saved
    e(N_clust)    number of clusters                     saved when cluster() is specified; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(chi2)       chi-squared                            usually saved
    e(ic)         number of iterations                   always saved
    e(rank)       rank of e(V)                           always saved
    e(rank0)      rank of e(V) for constant-only model   saved when constant-only model is estimated

Macros
    e(cmd)        name of command                        always saved
    e(depvar)     name(s) of dependent variable(s)       usually saved
    e(wtype)      weight type                            saved when weights are specified or implied
    e(wexp)       weight expression                      saved when weights are specified or implied
    e(clustvar)   name of cluster variable               saved when cluster() is specified; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(vcetype)    covariance estimation method           saved when robust is specified or implied; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(user)       name of likelihood-evaluator program   always saved
    e(opt)        type of optimization                   always saved
    e(chi2type)   Wald or LR; type of model χ² test      usually saved
    e(predict)    program used to implement predict      usually saved
    e(cnslist)    constraint numbers                     saved when there are constraints

Matrices
    e(b)          coefficient vector                     always saved
    e(ilog)       iteration log (up to 20 iterations)    always saved
    e(V)          variance-covariance matrix of the
                    estimators                           always saved

Functions
    e(sample)     marks estimation sample                always saved

See the Saved Results section in the manual entry for any maximum likelihood estimator for a complete list of returned results.
Methods and Formulas

Let L1 be the log likelihood of the full model (i.e., the log-likelihood value shown on the output), and let L0 be the log likelihood of the "constant-only" model. The likelihood-ratio chi-squared model test is defined as 2(L1 - L0). The pseudo R-squared (Judge et al. 1985) is defined as 1 - L1/L0. This is simply the log likelihood on a scale where 0 corresponds to the "constant-only" model and 1 corresponds to perfect prediction for a discrete model (i.e., the predicted probabilities are all 1 and the overall log likelihood is 0).
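Both formulas can be reproduced by hand from the saved results described above; a minimal sketch, again assuming the automobile dataset, where the first display reproduces the reported LR chi-squared statistic and the second the pseudo R-squared:

    . logit foreign mpg weight
    . display 2*(e(ll) - e(ll_0))
    . display 1 - e(ll)/e(ll_0)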
By default, Stata's maximum likelihood estimators display standard errors based on variance estimates given by the inverse of the negative Hessian (second derivative) matrix. If robust, cluster(), or pweights are specified, standard errors are instead based on the robust variance estimator (see [U] 23.11 Obtaining robust variance estimates); in this case, likelihood-ratio tests are not appropriate (see [U] 30 Overview of survey estimation), and the model chi-squared is a Wald test.

Some maximum likelihood routines can report coefficients in an exponentiated form; e.g., odds ratios in logistic. Let b be the unexponentiated coefficient, s its standard error, and b0 and b1 the reported confidence interval for b. In exponentiated form, the point estimate is exp(b), the standard error exp(b)*s, and the confidence interval [exp(b0), exp(b1)]. The displayed Z statistics and p-values are the same as those for the unexponentiated results. This is justified since exp(b) = 1 and b = 0 are equivalent hypotheses, and normality is more likely to hold in the b metric.
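For example, the relationship between the two metrics can be checked after logit, whose exponentiated coefficients are the odds ratios that logistic reports; a minimal sketch, assuming the automobile dataset. The first display reproduces the odds ratio for mpg and the second its displayed standard error, exp(b)*s:

    . logit foreign mpg weight
    . display exp(_b[mpg])
    . display exp(_b[mpg])*_se[mpg]
    . logistic foreign mpg weight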
References

Gould, W. and W. Sribney. 1999. Maximum Likelihood Estimation with Stata. College Station, TX: Stata Press.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Also See

Complementary:   [R] lrtest, [R] ml

Background:      [U] 23 Estimation and post-estimation commands
Title

means -- Arithmetic, geometric, and harmonic means
Syntax

means [varlist] [weight] [if exp] [in range] [, add(#) only level(#) ]

by ... : may be used with means; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

means computes the arithmetic, geometric, and harmonic means, and corresponding confidence intervals, for each variable in varlist or for all the variables in the data if varlist is not specified. If you simply want arithmetic means and corresponding confidence intervals, see [R] ci.
Options

add(#) adds the value # to each variable in varlist before computing the means and confidence intervals. This is useful when analyzing variables with nonpositive values.

only modifies the action of the add(#) option. If specified, the add(#) option only adds # to variables with at least one nonpositive value.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Remarks

> Example

You have a dataset containing 8 observations on a variable named x. The eight values are 5, 4, -4, -5, 0, 0, missing, and 7.

. means x

    Variable    Type             Obs        Mean    [95% Conf. Interval]
    ---------------------------------------------------------------------
    x           Arithmetic         7           1    -3.204405    5.204405
                Geometric          3    5.192494      2.57899    10.45448
                Harmonic           3    5.060241     3.023008     15.5179

. means x, add(5)

    Variable    Type             Obs        Mean    [95% Conf. Interval]
    ---------------------------------------------------------------------
    x(*)        Arithmetic         7           6     1.795595    10.20441
                Geometric          6    5.477226       2.1096    14.22071
                Harmonic           6    3.540984            .           .

    (*) 5 was added to the variable(s) prior to calculating the results.
    Missing values in the confidence interval for the harmonic mean indicate
    that the confidence interval is undefined for the corresponding
    variable(s). Consult the Reference Manual for details.
The number of observations displayed for the arithmetic mean is the number of nonmissing observations. The number of observations displayed for the geometric and harmonic means is the number of nonmissing, positive observations. Specifying the add(5) option results in 3 additional positive observations. Note that the confidence interval for the harmonic mean is not reported; see Methods and Formulas below.
Saved Results

means saves in r():

Scalars
    r(N)         number of nonmissing observations; used for arithmetic mean
    r(N_pos)     number of nonmissing positive observations; used for
                 geometric and harmonic means
    r(mean)      arithmetic mean
    r(lb)        lower bound of confidence interval for arithmetic mean
    r(ub)        upper bound of confidence interval for arithmetic mean
    r(Var)       variance of untransformed data
    r(mean_g)    geometric mean
    r(lb_g)      lower bound of confidence interval for geometric mean
    r(ub_g)      upper bound of confidence interval for geometric mean
    r(Var_g)     variance of ln x_j
    r(mean_h)    harmonic mean
    r(lb_h)      lower bound of confidence interval for harmonic mean
    r(ub_h)      upper bound of confidence interval for harmonic mean
    r(Var_h)     variance of 1/x_j
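For instance, the geometric-mean results can be picked up immediately after means; a minimal sketch, assuming the automobile dataset (price is strictly positive, so all three means exist):

    . use auto
    (1978 Automobile Data)
    . means price
    . display r(mean_g)
    . display r(lb_g)
    . display r(ub_g)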
Methods and Formulas

means is implemented as an ado-file.

See, for example, Armitage and Berry (1994) or Snedecor and Cochran (1989). For a history of the concept of the mean, see Plackett (1958).

When restricted to the same set of values (i.e., to positive values), the arithmetic mean is greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean. Exact equality holds only if all values within a sample are equal to a positive constant.

The arithmetic mean and its confidence interval are identical to those provided by ci; see [R] ci.

To compute the geometric mean, means first creates u_j = ln(x_j) for all positive x_j. The arithmetic mean of the u_j and its confidence interval are then computed as in ci. Let g be the resulting mean, and let [L, U] be the corresponding confidence interval. The geometric mean is then exp(g), and its confidence interval is [exp(L), exp(U)].

The same procedure is followed for the harmonic mean, except in this case u_j = 1/x_j. The harmonic mean is then 1/g, and its confidence interval is [1/U, 1/L] if L is greater than zero. If L is not greater than zero, this confidence interval is not defined, and missing values are reported.

When weights are specified, means applies the weights to the transformed values, u_j = ln(x_j) and u_j = 1/x_j respectively, when computing the geometric and harmonic means. For details on how the weights are used to compute the mean and variance of the u_j, see [R] summarize. Without weights, the formula for the geometric mean reduces to
    exp{ (1/n) * sum_j ln(x_j) }

Without weights, the formula for the harmonic mean is

    n / sum_j (1/x_j)
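The transform-and-back-transform procedure can be reproduced by hand with ci; a minimal sketch, assuming the automobile dataset. The three displayed values should match the Geometric row that means price reports:

    . use auto, clear
    . generate double u = ln(price)
    . ci u
    . display exp(r(mean))
    . display exp(r(lb))
    . display exp(r(ub))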
Acknowledgments

This improved version of means is based on the gci command (Carlin, Vidmar, and Ramalheira 1998) and was written by John Carlin, University of Melbourne, Australia; Suzanna Vidmar, University of Melbourne, Australia; and Carlos Ramalheira, Coimbra University Hospital, Portugal.
References

Armitage, P. and G. Berry. 1994. Statistical Methods in Medical Research. 3d ed. Oxford: Blackwell Scientific Publications.

Carlin, J., S. Vidmar, and C. Ramalheira. 1998. sg75: Geometric means and confidence intervals. Stata Technical Bulletin 41: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 197-199.

Kotz, S. and N. L. Johnson, ed. 1985. Encyclopedia of Statistical Sciences, vol. 1 and vol. 3. New York: John Wiley & Sons.

Plackett, R. L. 1958. The principle of the arithmetic mean. Biometrika 45: 130-135.

Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Also See

Related:   [R] ci, [R] summarize
Title

memory -- Memory size considerations
Syntax

    set memory #[k|m]

    memory

    set virtual { on | off }

where # is specified in terms of kilobytes or megabytes.
Description

set memory, memory, and set virtual are relevant only if you are using Intercooled Stata.

set memory allows you to increase or decrease the amount of memory allocated to Stata while Stata is running. Increases are obtained from the operating system; decreases are returned to the operating system. set memory can be specified only if you are using Stata for Windows or Stata for Unix. Stata for Macintosh users must instead set the amount of memory when they invoke Stata; see [GSM] A.5 Specifying the amount of memory allocated.

memory displays a report on Stata's memory usage. memory is available on all Intercooled Statas regardless of platform.

set virtual controls whether Stata should perform extra work to arrange its memory to keep objects close together. By default, virtual is set off. set virtual is available on all Intercooled Statas regardless of platform.

Remarks

Remarks are presented under the headings

    Resetting the amount of memory
    Obtaining the memory report and how Stata uses memory
    Using virtual memory

If you use Stata for Macintosh, skip the first heading.
Resetting the amount of memory

If you use Stata for Windows or Stata for Unix, you can change the amount of memory Stata has allocated while Stata is running:
. set memory 4m
no; data in memory would be lost
r(4);

You can change the amount of memory, but only when there are no data in memory:

. drop _all
. set memory 4m
(4096k)

You can increase it

. set memory 32m
(32768k)

or decrease it.

. set memory 1m
(1024k)

If you ask for more than your operating system can provide, you will be told so:

. set memory 128m
op. sys. refuses to provide memory
r(909);

The number you type can be specified in megabytes or kilobytes. When you suffix a number with m, it means megabytes. When you suffix a number with k (or nothing), it means kilobytes.

. set memory 4000k
(4000k)
. set memory 1000
(1000k)
Technical Note

(This note is relevant only if you use Stata for Unix.) There is a detail in the operating system's handling of returned memory that we have glossed over. You probably think that the checking out and returning of memory from the operating system is handled like the checking out and returning of a book at a library. With some operating systems, it is handled that way, but others vary. Operating systems handle returned memory in one of three ways:

1. The instant memory is returned, it is marked as returned and is available for other programs to check out.

2. When memory is returned, it is put in a special bin and, five or ten minutes from now, it will be marked as returned for other programs to check out. In the meantime, you could check it out again if you want, but no other program can.

3. When memory is returned, it is put in the special bin and never moved from there. You can have the memory back, but no other program can ever have that memory.

Windows follows policy 1. The various flavors of Unix differ on which policy they follow, and this has implications.
Let's imagine you are pushing your Unix computer to its limits and have allocated lots of memory to Stata. You suddenly want to jump out of Stata and do something in Unix, so you use Stata's shell command to obtain a new shell:

. shell
op. sys. refuses to start new process
r(702);

This can happen if there is no free memory. This reminds you that Stata has all the memory but you no longer need it, so you return most of it:

. set memory 4m
(4096k)

Now you try the shell command again. What will happen?

1. If your system follows policy 1, shell will work.
2. If your system follows policy 2, shell will not work, but it will work five or ten minutes from now.
3. If your system follows policy 3, shell will not work.

The result hinges on whether, and when, the operating system really takes back the memory Stata returns. If your operating system follows policy 3, you must exit and restart Stata. If your operating system follows policy 2 and you are in a hurry, you can also exit and restart.
Obtaining the memory report and how Stata uses memory

Type memory and Stata will give you a memory report. Below, we just started Stata:

. memory
                                          bytes
Total memory                          1,023,992   100.00%
  overhead (pointers)                         0     0.00%
  data                                        0     0.00%
  data + overhead                             0     0.00%
  programs, saved results, etc.           1,392     0.14%
Total                                     1,392     0.14%
Free                                  1,022,600    99.86%

If you perform this experiment on your computer, you will probably see different numbers. Here is our memory report after we load the automobile dataset that comes with Stata:

. use auto
(1978 Automobile Data)
. memory
                                          bytes
Total memory                          1,023,992   100.00%
  overhead (pointers)                       296     0.03%
  data                                    3,182     0.31%
  data + overhead                         3,478     0.34%
  programs, saved results, etc.           2,368     0.23%
Total                                     5,846     0.57%
Free                                  1,018,146    99.43%
Total memory refers to the total amount of memory Stata has allocated to its data areas, the number that can be specified at start-up time or reset by set memory. Well, almost. If you use Stata for Macintosh, total memory refers to a number somewhat smaller than that because Stata has to carve an area out of the total for another purpose. Stata for Macintosh users: just accept that the number is smaller than the number you specified, and know that the larger the number you specify at start-up time, the larger the total memory will be; see the technical note below.
Overhead, data, and data + overhead refer to the amount of memory necessary to hold the dataset currently in memory. Start with the middle number. 3,182 bytes is the total amount of memory necessary to hold the automobile dataset, and you could work this out for yourself from a describe. The automobile dataset has 74 observations, and each observation requires 43 bytes (called the width), and 74 × 43 = 3,182. 296 bytes is the pointer overhead associated with this dataset. Stata needs something called a pointer to keep track of where each observation is stored in memory. On this computer, pointers are 4 bytes (but that varies), and the dataset has 74 observations, so 4 × 74 = 296.

Data + overhead is just the sum of the two numbers: 296 + 3,182 = 3,478 is the total amount of memory Stata needs to store and keep track of these data.

Programs, saved results, etc., is the total amount of memory Stata has used to store just what it says: Stata's programs (ado-files), macros, matrices, value labels, and all sorts of other things. This is sometimes referred to as Stata's dynamic memory. The report shows 2,368 bytes this instant, but this number changes frequently.

Here is a memory report from another session in which we have loaded a dataset with 69,515 observations on 93 variables and are in the midst of analyzing it using xtgee:
. memory
                                          bytes
Total memory                          6,291,446   100.00%
  overhead (pointers)                   278,060     4.42%
  data                                2,363,510    37.57%
  data + overhead                     2,641,570    41.99%
  programs, saved results, etc.          47,792     0.76%
Total                                 2,689,362    42.75%
Free                                  3,602,086    57.25%
Technical Note

Stata for Macintosh: The total amount of memory shown by memory is less than the amount you tell your Macintosh to allocate because we need to use some of that memory for other purposes. How much we need is given by 88*matsize^2 + 8*matsize + k, where k is a constant for you (but varies slightly across Macintoshes). Thus, you will see total memory rise and fall according to the value to which you set matsize.

Let's compare matsize = 40 with 80. For matsize = 80, we need 88*80^2 + 8*80 + k = 563,840 + k. For matsize = 40, we need 88*40^2 + 8*40 + k = 141,120 + k. The difference is then 422,720. Conclusion: if matsize was 40 and you set matsize 80, memory will report that total memory declines by 422,720 bytes. Since it is in "total memory" that Stata stores your dataset, reducing the value of matsize is one way to reallocate your memory.
Using virtual memory

Virtual memory refers to using more memory than is physically present on your computer. This is a feature provided by the operating system, not Stata, and is one that you as a Stata user may find yourself sometimes using.

Virtual memory is slow. You will be unhappy if you need to use virtual memory on a daily basis. On the other hand, virtual memory can get you out of a bind, and that is the right way to use it with Stata.
You do NOT need to set virtual on for Stata to use virtual memory. All set virtual on does is perhaps make Stata run a little faster when the operating system is paging a lot. set virtual on will not make Stata run fast, just faster.

Virtual memory is most efficient (which is not to say efficient) when the program being executed exhibits something called locality of reference. This is the idea that if the program accesses one location in memory, subsequent memory references will be to a location near that. If you set virtual on, Stata's memory-management routines will go to extra work to arrange things so that the idea is true more often. Hence, Stata will run a little faster. If Stata is not using virtual memory, setting virtual on will make Stata run a little slower because Stata will be going to extra work for no good reason.

You set virtual on by typing the command

. set virtual on
You can check whether virtual is on or off using query:

. query
Status
    type         float           linesize     79
    virtual      off             pagesize     23
    more         off             trace        off
    rmsg         off             textsize     100
    matsize      40              adosize      128
    level        95              graphics     off
Files
    log          (closed)        logtype      smcl
    cmdlog       (closed)

virtual is reported on the second line of the left column. To set virtual off, type

. set virtual off
Saved Results

memory saves in r():

Scalars
    r(N)           number of observations
    r(width)       width of dataset
    r(N_cur)       maximum observations (current partition)
    r(N_curmax)    maximum observations (current partition); see note below
    r(k_cur)       maximum variables (current partition)
    r(w_cur)       maximum width (current partition)
    r(M_total)     total memory allocated to Stata (bytes)
    r(M_data)      total memory available to data (bytes)
    r(M_dyn)       total programs, saved results, etc. (bytes)
    r(size_ptr)    size of pointer (bytes)
    r(k)           number of variables
    r(matsize)     matsize
    r(adosize)     adosize
Note that there are four saved results that refer to the current partition. At any instant, Stata has partitioned the memory into observations and variables. The characteristics of the partition can change at any time, including right in the middle of a command, so the first four numbers are really of little interest in that they do not reflect any real constraint. What they do reflect is efficiency. If something should occur that violates any of those limits, Stata will have to silently work to reform the partition, something it is able to do reasonably efficiently and without any disk accesses. Also note that the description of r(N_curmax) is not a typographical error. It records the maximum number of observations in the current partition if the size of total programs, saved results, etc. (what is recorded in r(M_dyn)) were zero.

When Stata is faced with a request that violates the current partition's limits, it considers the possibility of discarding memory copies of ado-files that have not been used recently. Ado-files are loaded automatically on an as-needed basis, so how long they are kept in memory is only an efficiency issue. Stata considers reducing the memory requirement as an alternative to repartitioning.

The output produced by memory can be calculated from the saved results by

    total memory = r(M_data)
    overhead (pointers) = _N × r(size_ptr)
    data = _N × r(width)
    programs, saved results, etc. = r(M_dyn)
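For example, the data line of the report can be reproduced from the saved results; a minimal sketch, assuming the automobile dataset (74 observations of width 43, so 74 × 43 = 3,182):

    . use auto, clear
    (1978 Automobile Data)
    . quietly memory
    . display _N*r(width)
    3182
    . display _N*r(size_ptr)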
References

Sasieni, P. 1997. ip20: Checking for sufficient memory to add variables. Stata Technical Bulletin 40: 13. Reprinted in Stata Technical Bulletin Reprints, vol. 7, p. 86.
Also See

Complementary:   [R] query

Related:         [R] matsize

Background:      [U] 7 Setting the size of memory
Title

merge -- Merge datasets

Syntax

merge [varlist] using filename [, nolabel update replace nokeep _merge(varname) ]

If filename is specified without an extension, .dta is assumed.
Description

merge joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the Stata-format dataset stored as filename (called the using dataset) into single observations.
Options

nolabel prevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset replace label definitions already in memory.

update varies the action merge takes when an observation is matched. By default, the master dataset is held inviolate; values from the master dataset are retained when variables are found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing.

replace, allowed with update only, specifies that even in the case when the master dataset contains nonmissing values, they are to be replaced with corresponding values from the using dataset when the corresponding values are not equal. A nonmissing value, however, will never be replaced with a missing value.

nokeep causes merge to ignore observations in the using dataset that have no corresponding observation in the master. The default is to add these observations to the merged result and mark such observations with _merge = 2.

_merge(varname) specifies the name of the variable that is to be created that will mark the source of the resulting observation. The default is _merge(_merge); that is, if you do not specify this option, the new variable will be named _merge. See The two kinds of merges below for details.
Remarks

Remarks are presented under the headings

    The two kinds of merges
    One-to-one merge
    Match merge
    Updating data
Distinguish carefully between merging and appending datasets, and the corresponding Stata commands merge and append. Appending refers to the addition of new observations on existing variables. If one thinks of a dataset as a rectangle with observations going down and variables going across, appending increases the dataset's length. Merging adds new variables to existing observations, increasing the dataset's width. See [U] 25 Commands for combining data if this is not clear.

Say you have a dataset in which each observation records the characteristics of a particular automobile, such as the car's price, weight, etc. If you have two such datasets, one for domestic and another for imported cars, and you wish to combine them into a single dataset, you are reading the wrong entry; see [R] append.

On the other hand, if you have two datasets, one recording price and the other weight, mileage, etc., and you wish to combine them into a single dataset, continue reading; merge does this.

In addition to merge, another command, joinby, forms all pairwise combinations of observations within group. Say you have one dataset on mothers and fathers and another on their children. If you wish to combine them so that each parent is matched with every one of their children (each child is matched with both parents), so that a 2-parent, 3-child family results in 2 × 3 = 6 observations, see [R] joinby.
The two kinds of merges

merge joins the observations stored in memory with the observations stored in filename. The disk dataset must be a Stata-format dataset; that is, it must have been created with the save command.

Stata performs two kinds of merges. If no varlist is specified, Stata performs a one-to-one merge. In a one-to-one merge, the first observation of one dataset is joined with the first observation of the other dataset, the second observation is joined with the second, and so on. If a varlist is specified, however, Stata uses those variables to perform a match merge. In a match merge, observations are joined only if the values of the variables in the specified varlist match.

Regardless of the style of merge being performed, merge always adds a new variable, called (by default) _merge, to the dataset. This variable takes on the values 1, 2, or 3 to mark the source of the resulting observation. The coding is

    1.  The observation occurred only in the master dataset.
    2.  The observation occurred only in the using dataset.
    3.  The observation is the result of joining an observation from the
        master dataset with one from the using dataset.

When you use the update option, this coding is extended to include

    4.  Same as 3 except that missing values in the master were updated
        with values from the using.
    5.  Same as 3 except that some values in the master disagree with
        values in the using.
One-to-one merge

In a one-to-one merge, the first observation in the master dataset is joined with the first observation in the using dataset, the second observation is joined with the second, and so on. If variables with the same name occur in both the master and the using datasets, the joined observation retains the original values, that is, the values of the variables in the master dataset. When the master and using datasets contain different numbers of observations, missing values are joined with the remaining observations from the longer dataset.
> Example

You have two datasets stored on disk that you wish to merge into a single dataset. The first dataset, called odd.dta, contains the first five positive odd numbers. The second dataset, called even.dta, contains the fifth through eighth positive even numbers. (Our example is admittedly not realistic, but it does illustrate the concept.) The datasets are

. use odd
(First five odd numbers)
. list

        number    odd
  1.         1      1
  2.         2      3
  3.         3      5
  4.         4      7
  5.         5      9

. use even
(5th through 8th even numbers)
. list

        number   even
  1.         5     10
  2.         6     12
  3.         7     14
  4.         8     16

We will join these two datasets using a one-to-one merge. Since the even dataset is already in memory (we just used it above), we type merge using odd. The result is

. merge using odd
number was int now float
. list

        number   even    odd   _merge
  1.         5     10      1        3
  2.         6     12      3        3
  3.         7     14      5        3
  4.         8     16      7        3
  5.         5      .      9        2

The first thing you will notice is the new variable _merge. Every time Stata merges two datasets, it creates this variable and assigns a value of 1, 2, or 3 to each observation. The value 1 indicates that the resulting observation occurred only in the master dataset, 2 indicates the observation occurred only in the using dataset, and 3 indicates the observation occurred in both datasets and is thus the result of joining an observation from the master dataset with an observation from the using dataset. In this case, the first four observations are marked by _merge equal to 3, and the last observation by _merge equal to 2. The first four observations are the result of joining observations from the two datasets, and the last observation is a result of adding a new observation from the using dataset. These values reflect the fact that the original dataset in memory had four observations, and the odd dataset stored on disk had five observations. The new last observation is from the odd dataset exclusively: number is 5, odd is 9, and even has been filled in with missing.

Notice that number takes on the values 5 through 8 for the first four observations. Those are the values of number from the original dataset in memory, the even dataset, and they conflict with the value of number stored in the first four observations of the odd dataset. number in that dataset took on the values 1 through 4, and those values were lost during the merge process. When Stata joins observations and there is a conflict between the value of a variable in memory and the value stored in the using dataset, Stata by default retains the value stored in memory.
When the command merge using odd was issued, Stata responded with "number was int now float". Let's describe the datasets in this example:

. describe using odd

Contains data
  obs:             5                          First five odd numbers
 vars:             2                          5 Jul 2000 17:03
 size:            60

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          float  %9.0g
odd             float  %9.0g                  Odd numbers
-------------------------------------------------------------------------
Sorted by:

. describe using even

Contains data
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:

Note that number is stored as a float in odd.dta but as an int in even.dta; see [U] 15.2.2 Numeric storage types. When you merge two datasets, Stata engages in automatic variable promotion; that is, if there are conflicts in numeric storage types, the more precise storage type will be used. The resulting dataset, therefore, will have number stored as a float, and Stata told you this when it said "number was int now float".
Match merge

In a match merge, observations are joined if the values of the variables in the varlist are the same. Since the values must be the same, obviously the variables in the varlist must appear in both the master and the using datasets.

A match merge proceeds by taking an observation from the master dataset and one from the using dataset and comparing the values of the variables in the varlist. If the varlist values match, the observations are joined. If the varlist values do not match, the observation from the earlier dataset (the dataset whose varlist value comes first in the sort order) is joined with a pseudo-observation from the later dataset (the other dataset). All the variables in the pseudo-observation contain missing values. The actual observation from the later dataset is retained and compared with the next observation in the earlier dataset, and the process repeats.
> Example

The result is not nearly so incomprehensible as the explanation. Let's return to the datasets used in the previous example and merge the two datasets on the variable number. We first use the even dataset and then type merge number using odd:

. use even
(5th through 8th even numbers)
. merge number using odd
master data not sorted
r(5);

Instead of merging the datasets, Stata reports the error message "master data not sorted". Match merges require that the data be sorted in the order of the varlist, which in this case means ascending order of number. If you look at the previous example, you will observe that the data are in such an order, so the message is more than a little confusing. Before Stata can merge two datasets, however, the data must not only be sorted but Stata must know that they are sorted. The basis of Stata's knowledge is the internal information it keeps on the sort order, and Stata reveals the extent of its knowledge whenever you describe the dataset:

. describe

Contains data from even.dta
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:

The last line of the description shows that the data are "Sorted by:" nothing. We tell Stata to sort the data (or to learn that they are already sorted) with the sort command:

. sort number
. describe

Contains data from even.dta
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40 (99.8% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:  number

Now when we describe the dataset, Stata informs us that the data are sorted by number. Now that Stata knows the data are sorted, let's try again:

. merge number using odd
using data not sorted
r(5);
Stata still refuses to carry out our request, this time complaining that the using data are not sorted. Both datasets, the master and the using, must be in ascending order of number before Stata can perform a merge.

As before, if you look at the previous example, you will discover that odd.dta is in ascending order of number but, as before, Stata does not know this yet. We need to save the data we just sorted, use the odd data, sort it, and re-save it:

. save even, replace
file even.dta saved
. use odd
(First five odd numbers)
. sort number
. save odd, replace
file odd.dta saved
Now we should be able to merge the two datasets:

. use even
(5th through 8th even numbers)
. merge number using odd
number was int now float
. list

        number   even    odd   _merge
  1.         5     10      9        3
  2.         6     12      .        1
  3.         7     14      .        1
  4.         8     16      .        1
  5.         1      .      1        2
  6.         2      .      3        2
  7.         3      .      5        2
  8.         4      .      7        2

It worked! Let's understand what happened. Even though both datasets were sorted by number, we immediately discern that the result is no longer in ascending order of number. It will be easier to understand what happened if we re-sort the data and then list the data again:

. sort number
. list

        number   even    odd   _merge
  1.         1      .      1        2
  2.         2      .      3        2
  3.         3      .      5        2
  4.         4      .      7        2
  5.         5     10      9        3
  6.         6     12      .        1
  7.         7     14      .        1
  8.         8     16      .        1

Notice that number now goes from 1 to 8, with no repeated values and no values left out of the sequence. Recall that the odd dataset defined observations for number between 1 and 5, whereas the even dataset defined observations between 5 and 8. Thus, the variable odd is defined for number equal to 1 through 5, and even is defined for number equal to 5 through 8.

For instance, in the first observation, number is 1, even is missing, and odd is 1. The value of _merge, 2, indicates that this observation came from the using dataset, odd.dta. In the last observation, number is 8, even is 16, and odd is missing. The value of _merge, 1, indicates that this observation came from the master dataset, even.dta.
The fifth observation is worth comment. number is 5, even is 10, and odd is 9. Both even and odd are defined, since both the even and the odd datasets had information for number equal to 5. The value of _merge, 3, also tells us that both datasets contributed to the formation of the observation.
> Example

Although the previous example demonstrated, in glorious detail, how the match-merging process works, it was not a practical example of how you will ordinarily employ it. Here is a more realistic application. You have two datasets containing information on automobiles. The identifying variable in each dataset is make, a string variable containing the manufacturer and the model. By identifying variable, we mean a variable that is unique for every observation in the dataset. Values for make, for instance "Honda Accord", are sufficient for identifying each observation. One dataset, autotech.dta, also contains mpg, weight, and length. The other dataset, autocost.dta, contains price and rep78, the 1978 repair record.

. describe using autotech

Contains data
  obs:            74                          1978 Automobile Data
 vars:             4                          11 Jul 2000 13:55
 size:         2,072

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0g                  Weight (lbs.)
length          int    %8.0g                  Length (in.)
-------------------------------------------------------------------------
Sorted by:  make

. describe using autocost

Contains data
  obs:            74                          1978 Automobile Data
 vars:             3                          11 Jul 2000 13:55
 size:         1,924

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
price           int    %8.0g                  Price
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------
Sorted by:  make
We assume that you want to merge these two datasets into a single dataset:

. use autotech
(Automobile Models)
. merge make using autocost

Let's now examine the result:

. describe

Contains data from autotech.dta
  obs:            74                          1978 Automobile Data
 vars:             7                          11 Jul 2000 13:55
 size:         2,442 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0g                  Weight (lbs.)
length          int    %8.0g                  Length (in.)
price           int    %8.0g                  Price
rep78           int    %8.0g                  Repair Record 1978
_merge          byte   %8.0g
-------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
We have a single dataset containing all the information from the two original datasets, or at least it appears that we do. Before accepting that conclusion, we need to verify the result. We think that we entered data for the same cars in each dataset, so every variable should be defined for every car. Although we know it is unlikely, we recognize the possibility that we made a mistake and accidentally left some cars out of one or the other dataset. We can reassure ourselves of our infallibility by tabulating _merge:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00

We see that _merge is 3 for every observation in the dataset. We made no mistake: for every observation in autocost.dta, there was an observation in autotech.dta, and vice versa.
Now pretend that we have another dataset containing additional information on these automobiles, automore.dta, and we want to merge that dataset as well. Before we can do so, we must sort the data we have in memory by make, since after a merge the sort order may have changed:

. sort make
. merge make using automore
_merge already defined
r(110);

After sorting the data, Stata refused to merge the new dataset, complaining instead that _merge is already defined. Every time Stata merges datasets, it wants to create a variable called _merge (or varname if the _merge(varname) option was specified). In this case, there is a _merge variable left over from the last time we merged. We have three choices: we can rename the variable, we can drop it, or we can specify a different variable name with the _merge() option. In this case, _merge contains no useful information (we already verified that the previous merge went as expected), so we drop it and try again:

. drop _merge
. merge make using automore
Stata performed our request; whatever new variables were contained in automore.dta are now contained in our single, master dataset. Perhaps. One should not jump to conclusions. After a match merge, you should always tabulate _merge to verify that the expected actually happened, as we do below:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        1.33        1.33
          2 |          1        1.33        2.67
          3 |         73       97.33      100.00
------------+-----------------------------------
      Total |         75      100.00
Surprise! In this case, something strange did happen. Some 73 of the observations merged as we anticipated. However, the new dataset automore.dta added one new car to the dataset (identified by _merge equal to 2) and failed to define new variables for another car in our original dataset (identified by _merge equal to 1). Perhaps this is what should happen, but it is more likely that we have a mistake in automore.dta. We probably misidentified one car so that, to Stata, it appeared as data on a new car, resulting in one new observation and missing data on another. If this happened to us, we would figure out why it happened. We would type list make if _merge==1 to learn the identity of the car that did not appear in automore.dta, and we would type list make if _merge==2 to learn the identity of the car that automore.dta added to our data.
Technical Note

It is difficult to overemphasize the importance of tabulating _merge, no matter how sure you are that you have no errors. It takes only a second and can save you hours of grief. Along the same lines, one-to-one merges are a bad idea. In the example above, we could have performed all the merges as one-to-one merges and saved a small amount of typing. Let's examine what would have happened.

We first merged autotech.dta with autocost.dta by typing merge make using autocost. We could have performed a one-to-one merge by typing merge using autocost. The result would be the same; the datasets line up and are in the same sort order, so sequentially matching the observations from the two datasets would have resulted in a perfectly matched dataset.

In the second case, we merged the data in memory with automore.dta by typing merge make using automore. A one-to-one merge would have led to disaster, and we would never have known it! If we typed merge using automore, Stata would sequentially, and blindly, join observations. Since there are the same number of observations in each dataset, everything would appear to merge perfectly.

We speculated in the previous example that we had an error in automore.dta. Remember that automore.dta included data on one new car and lacked data on an existing car. Even if there is no error, things have gone awry. No matter what, the data in memory and automore.dta do not match. For instance, assume that this new car is the first observation of automore.dta and that it is some (perhaps mistaken) model of Ford. Assume that the first observation of the data in memory is a Chevrolet. Stata could and would silently join data on the Chevrolet with data on the Ford, and thereafter data on a Volvo with data on a Saab, and even data on a Volkswagen with data on a Cadillac, and you would never know.

Every dataset should carry a variable or a set of variables that uniquely identifies each observation, and then you should always use those variables when merging data. Ignore this advice at your own peril.
Technical Note

Circumstances may arise when you will merge two datasets knowing there will be mismatches. Say you have an analysis dataset on patients from the cancer ward of a particular hospital, and you have just received another dataset containing their demographic information. Actually, this other dataset contains not just their demographic information but the demographic information on every patient in the hospital during the year. You could

. merge patid using demog
. drop if _merge==2

or

. merge patid using demog, nokeep

The nokeep option tells merge not to store observations from the using data that do not appear in the master. There is an advantage in this. When we merged and dropped, we stored the irrelevant observations and then discarded them, so the data in memory temporarily grew. When we merge with the nokeep option, the data never grow beyond what is absolutely necessary.
In our automobile example, we had a single identifying variable. Sometimes you will have multiple identifying variables, variables that, taken together, are unique for every observation. Let's imagine that, rather than having a single variable called make, we had two variables: manuf and model. manuf contains the manufacturer, and model contains the model. Rather than having a single variable recording, say, "Honda Accord", we have two variables, one recording "Honda" and another recording "Accord". Stata can deal with this type of data. You can go back through our previous example and substitute manuf model everywhere you see make. For instance, rather than typing merge make using autocost, we would have typed merge manuf model using autocost.

Now let's make one more change in our assumptions. Let's assume that manuf and model are not string variables but are instead numerically coded variables. Perhaps the number 15 stands for Honda in the manuf variable, and the number 2 stands for Accord in the model variable. We do not have to remember our numeric codes because we have smartly created value labels telling Stata what number stands for what string of characters. We now go back to the step where we merged autotech.dta with autocost.dta:

. use autotech
(Automobile models)
. merge manuf model using autocost
(label manuf already defined)
(label model already defined)

Stata makes two minor comments but otherwise carries out our request. It notes that the labels manuf and model are already defined. The messages refer to the value labels named manuf and model. Both datasets contain value label definitions that turn the numeric codes for manufacturer and model into words. When Stata merged the two datasets, it already had one set of definitions in memory (obtained when we typed use autotech) and thus ignored the second set of definitions contained in autocost.dta. Stata felt obliged to mention the second set of definitions while otherwise ignoring them, since they might contain different codings. In this case, we know they are the same since we created them. (Hint: You should never give the same name to value labels containing different codings.)
When one is performing a match merge, the master and/or using datasets may have multiple observations with the same varlist value. These multiple observations are joined sequentially, as in a one-to-one merge. If the datasets have an unequal number of observations with the same varlist value, the last such observation in the shorter dataset is replicated until the number of observations is equal.

> Example

The process of replicating the observation from the shorter dataset is known as spreading and can be put to practical use. Suppose you have two datasets. dollars.dta contains the dollar sales and costs of your firm, by region, for the last year:

. use dollars
(Regional Sales & Costs)
. list

       region      sales       cost
  1.       NE    360,523    138,097
  2.  N Cntrl    419,472    227,677
  3.    South    532,399    330,499
  4.     West    310,565    165,348

sforce.dta contains the names of the individuals in your sales force, along with the region in which they operate:

. use sforce
(Sales Force)
. list

       region       name
  1.       NE    Ecklund
  2.       NE     Franks
  3.  N Cntrl     Krantz
  4.  N Cntrl     Phipps
  5.  N Cntrl     Willis
  6.    South   Anderson
  7.    South    Dubnoff
  8.    South        Lee
  9.    South     McNiel
 10.     West    Charles
 11.     West      Grant
 12.     West       Cobb
You now wish to merge these two datasets by region, spreading the sales and cost information across all observations for which it is relevant; that is, you want to add the variables sales and cost to the sales force data. The variable sales will assume the value $360,523 for the first two observations, $419,472 for the next three observations, and so on.
. merge region using dollars
(label region already defined)
. list

       region       name      sales       cost   _merge
  1.       NE    Ecklund    360,523    138,097        3
  2.       NE     Franks    360,523    138,097        3
  3.  N Cntrl     Krantz    419,472    227,677        3
  4.  N Cntrl     Phipps    419,472    227,677        3
  5.  N Cntrl     Willis    419,472    227,677        3
  6.    South   Anderson    532,399    330,499        3
  7.    South    Dubnoff    532,399    330,499        3
  8.    South        Lee    532,399    330,499        3
  9.    South     McNiel    532,399    330,499        3
 10.     West    Charles    310,565    165,348        3
 11.     West      Grant    310,565    165,348        3
 12.     West       Cobb    310,565    165,348        3

Even though there are 12 observations in the sales force data and only 4 observations in the sales and cost data, all the records merged. dollars.dta contained one observation for the NE region; sforce.dta contained two observations for the same region. Thus, the single observation in dollars.dta was matched to both the observations in sforce.dta. In technical jargon, the single record in dollars.dta was replicated, or spread, across the observations in sforce.dta.
Updating data

merge with the update option varies merge's actions when an observation in the master is matched with an observation in the using dataset. Without the update option, merge leaves the values in the master dataset alone and adds the data for the new variables. With the update option, merge adds the new variables, but it also replaces missing values in the master observation with corresponding values from the using. (Missing values mean numeric missing (.) and empty strings ("").) The values for _merge are extended:

    _merge    meaning
    1         obs. from master data
    2         obs. from using data
    3         obs. from both, master agrees with using
    4         obs. from both, missing in master updated
    5         obs. from both, master disagrees with using

In the case of _merge = 5, the master values are retained unless replace is specified, in which case the master values are updated just as if they had been missing.

Pretend dataset 1 contains variables id, a, and b; dataset 2 contains id, a, and z. You merge the two datasets by id, dataset 1 being the master dataset in memory and dataset 2 the using dataset on disk. Consider two observations that match, and call the values from the first dataset id1, etc., and those from the second id2, etc. The resulting dataset will have variables id, a, b, z, and _merge. merge's typical logic is

1. The fact that the observations match means id1 = id2. Set id = id1.
2. Variable a occurs in both datasets. Ignore a2 and set a = a1.
3. Variable b occurs in only dataset 1. Set b = b1.
4. Variable z occurs in only dataset 2. Set z = z2.
5. Set _merge = 3.
With update, the logic is modified:

1. (unchanged.) Since the observations match, id1 = id2. Set id = id1.

2. Variable a occurs in both datasets:

   a. If a1 = a2, set a = a1 and set _merge = 3.

   b. If a1 contains missing and a2 is nonmissing, set a = a2 and set _merge = 4, indicating an update was made.

   c. If a2 contains missing, set a = a1 and set _merge = 3 (indicating no update).

   d. If a1 is not equal to a2 and both contain nonmissing, set a = a1 or, if replace was specified, a = a2 but, regardless, set _merge = 5, indicating a disagreement.

Rules 3 and 4 remain unchanged.
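The modified logic can be traced end to end with two tiny made-up datasets; every filename and value below is hypothetical:

    . clear
    . input id a b
      1 . 5
      2 7 6
      3 9 4
      end
    . sort id
    . save master_demo
    . clear
    . input id a z
      1 4 10
      2 8 11
      3 9 12
      end
    . sort id
    . save using_demo
    . use master_demo, clear
    . merge id using using_demo, update
    . list

Observation 1 is updated (a was missing in the master, so a becomes 4 and _merge = 4); observation 2 is a disagreement (7 versus 8, so a remains 7 and _merge = 5, unless replace is also specified); and observation 3 agrees (_merge = 3).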
> Example

In original.dta you have data on some cars, including the make, price, and mileage rating. In updates.dta you have some updated data on these cars, along with a new variable recording engine displacement. The data contain

. use original, clear
(original data)
. list

                 make    price    mpg
  1.   Chev. Chevette    3,299     29
  2.     Chev. Malibu    4,504      .
  3.       Datsun 510    5,079     24
  4.       Merc. XR-7    6,303      .
  5.     Olds Cutlass    4,733     19
  6.   Renault Le Car    3,895     26
  7.        VW Dasher    7,140     23

. use updates, clear
(updates, mpg and displacement)
. list

                 make    mpg    displac~t
  1.   Chev. Chevette      .          231
  2.     Chev. Malibu     22          200
  3.       Datsun 510     24          119
  4.       Merc. XR-7     14          302
  5.     Olds Cutlass     19          231
  6.   Renault Le Car     25           79
  7.        VW Dasher     23           97
original,
(original • merge
clear
data)
make
using
updates,
update
list make i. Chev.
Chevette
price 3,299
mpg 29
displac~t 231
_merge 3
2. Chev.
Malibu
4,504
22
200
4
3. Datsun
510
5,079
24
119
3
4. Merc. XR-7 5. Olds Cutlass
6,303 4,733
14 19
302 231
4 3
6. Renault
3,895
26
79
5
7,140
23
97
3
Le Car
7. VW Dasher
_,
,
i
_
merge-- Merge datasets
319
All observations merged because all have _merge >= 3. The observations having _merge = 3 have mpg just as it was recorded in the original dataset. In observation 1, mpg is 29 because the updated dataset had mpg = .; in observation 3, mpg remains 24 because the updated dataset also stated that mpg is 24.

The observations having _merge = 4 have had their mpg data updated. The mpg variable was missing in observations 2 and 4, and new values were obtained from the update data.

The observation having _merge = 5 has its mpg just as it was recorded in the original dataset, just as do the _merge = 3 observations, but there is an important difference: there is a disagreement about the value of mpg; the original claims it is 26 and the updated, 25. Had we specified the replace option, mpg would now contain the updated 25, but the observation would still be marked _merge = 5. replace affects only which value is retained in the case of disagreement.
References

Nash, J. D. 1994. dm19: Merging raw data and dictionary files. Stata Technical Bulletin 20: 3-5. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 22-25.

Weesie, J. 2000. dm75: Safe and easy matched merging. Stata Technical Bulletin 53: 6-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 62-77.
Also See

Complementary:   [R] save, [R] sort

Related:         [R] append, [R] cross, [R] joinby

Background:      [U] 25 Commands for combining data
Title

meta -- Meta-analysis
Remarks

Stata should have a meta-analysis command but, as of the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands for performing meta-analysis, many of which have been published in the Stata Technical Bulletin (STB).
Issue     insert     author(s)                                  command
--------------------------------------------------------------------------
STB-38    sbe16      S. Sharp and J. Sterne                     meta
          meta-analysis for an outcome of two exposures or two treatment
          regimens
STB-42    sbe16.1    S. Sharp and J. Sterne                     meta
          update of sbe16
STB-43    sbe16.2    S. Sharp and J. Sterne                     meta
          update; install this version
STB-41    sbe19      T. J. Steichen                             metabias
          performs the Begg and Mazumdar (1994) adjusted rank correlation
          test for publication bias and the Egger et al. (1997) regression
          asymmetry test for publication bias
STB-44    sbe19.1    T. J. Steichen, M. Egger, and J. Sterne    metabias
          update of sbe19
STB-57    sbe19.2    T. J. Steichen                             metabias
          update; install this version
STB-41    sbe20      A. Tobias                                  galbr
          performs the Galbraith plot (1988), which is useful for
          investigating heterogeneity in meta-analysis
STB-56    sbe20.1    A. Tobias                                  galbr
          update; install this version
STB-42    sbe22      J. Sterne                                  metacum
          performs cumulative meta-analysis, using fixed- or random-effects
          models, and graphs the result
STB-42    sbe23      S. Sharp                                   metareg
          extends a random-effects meta-analysis to estimate the extent to
          which one or more covariates, with values defined for each study
          in the analysis, explain heterogeneity in the treatment effects
STB-44    sbe24      M. J. Bradburn, J. J. Deeks, and           metan, funnel,
                     D. G. Altman                               labbe
          meta-analysis of studies with two groups; funnel plot of
          precision versus treatment effect; L'Abbé plot
STB-45    sbe24.1    M. J. Bradburn, J. J. Deeks, and           funnel
                     D. G. Altman
          update; install this version
STB-47    sbe26      A. Tobias                                  metainf
          graphical technique to look for influential studies in the
          meta-analysis estimate
STB-56    sbe26.1    A. Tobias                                  metainf
          update; install this version
STB-49    sbe28      A. Tobias                                  metap
          combines p-values using either Fisher's method or Edgington's
          method
STB-56    sbe28.1    A. Tobias                                  metap
          update; install this version
STB-57    sbe39      T. J. Steichen                             metatrim
          performs the Duval and Tweedie (2000) nonparametric "trim and
          fill" method of accounting for publication bias in meta-analysis

Additional commands may be available; enter Stata and type search meta analysis.
I
meta-- Metaanalysis
f
321
To downl_)adand install from the Interact the Sharp and Stem meta command, for instance. Stata i
yot_cquld
i
i
2. !- Click Pull down on http://www.stat_.com. Help and select STB and Use :-written Programs. I i
3. Click on stb. !
I
4. Click on stM9. 5. Click on she28.
I
6. Clk k on dick here to install i !
or yot1co aid instead do the following: l. Na_igate to the appropriate STBissue: Type net from http://w_, Type net ¢d stb Type net
I
i i
i i
[
cd stb49
or _. Type net from http ://www.sta_a, com/stb/stb49
2. Tyt e net describe sbe28 3.TyFe net installsbe28
References

Begg, C. B. and M. Mazumdar. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics 50: 1088-1101.

Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998a. sbe24: metan--an alternative meta-analysis command. Stata Technical Bulletin 44: 4-15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 86-100.

Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998b. sbe24.1: Correction to funnel plot. Stata Technical Bulletin 45: 21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 100.

Egger, M., G. D. Smith, M. Schneider, and C. Minder. 1997. Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315: 629-634.

Galbraith, R. F. 1988. A note on graphical display of estimated odds ratios from several clinical trials. Statistics in Medicine 7: 889-894.

L'Abbé, K. A., A. S. Detsky, and K. O'Rourke. 1987. Meta-analysis in clinical research. Annals of Internal Medicine 107: 224-233.

Sharp, S. 1998. sbe23: Meta-analysis regression. Stata Technical Bulletin 42: 16-22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 148-155.

Sharp, S. and J. Sterne. 1997. sbe16: Meta-analysis. Stata Technical Bulletin 38: 9-14. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 100-106.

Sharp, S. and J. Sterne. 1998a. sbe16.1: New syntax and output for the meta-analysis command. Stata Technical Bulletin 42: 6-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 106-108.

Sharp, S. and J. Sterne. 1998b. sbe16.2: Corrections to the meta-analysis command. Stata Technical Bulletin 43: 15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 84.

Steichen, T. J. 1998. sbe19: Tests for publication bias in meta-analysis. Stata Technical Bulletin 41: 9-15. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 125-133.

Steichen, T. J. 2000a. sbe19.2: Update of tests for publication bias in meta-analysis. Stata Technical Bulletin 57: 4.

Steichen, T. J. 2000b. sbe39: Nonparametric trim and fill analysis of publication bias in meta-analysis. Stata Technical Bulletin 57: 8-14.

Steichen, T. J., M. Egger, and J. Sterne. 1998. sbe19.1: Tests for publication bias in meta-analysis. Stata Technical Bulletin 44: 3-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 84-85.

Sterne, J. 1998. sbe22: Cumulative meta-analysis. Stata Technical Bulletin 42: 13-16. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 143-147.

Tobias, A. 1998. sbe20: Assessing heterogeneity in meta-analysis: the Galbraith plot. Stata Technical Bulletin 41: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 133-136.

Tobias, A. 1999a. sbe26: Assessing the influence of a single study in the meta-analysis estimate. Stata Technical Bulletin 47: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 108-110.

Tobias, A. 1999b. sbe28: Meta-analysis of p-values. Stata Technical Bulletin 49: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 138-140.

Tobias, A. 2000a. sbe20.1: Update of galbr. Stata Technical Bulletin 56: 14.

Tobias, A. 2000b. sbe26.1: Update of metainf. Stata Technical Bulletin 56: 15.

Tobias, A. 2000c. sbe28.1: Update of metap. Stata Technical Bulletin 56: 15.
Title

mfx -- Obtain marginal effects or elasticities after estimation

Syntax

mfx compute [if exp] [in range] [, dydx eyex dyex eydx at(atlist) eqlist(eqnames) predict(predict_option) nonlinear nodiscrete noesample nowght nose level(#) ]

mfx replay [, level(#)]

where atlist is

        { mean | median | zero [varname=# [, varname=#] [...]] }
        numlist        (for single-equation estimators only)
        matname        (for single-equation estimators only)

Description

mfx numerically calculates the marginal effects or the elasticities and their standard errors after estimation. Exactly what mfx can calculate is determined by the previous estimation command and the predict(predict_option) option. At which points the marginal effects or elasticities are to be evaluated is determined by the at(atlist) option. By default, mfx calculates the marginal effects or the elasticities at the means of the independent variables by using the default prediction option associated with the previous estimation command.

mfx replay replays the results of the previous mfx computation.
Options

dydx specifies that marginal effects are to be calculated. It is the default.

eyex specifies that elasticities are to be calculated in the form of ∂log y/∂log x.

dyex specifies that elasticities are to be calculated in the form of ∂y/∂log x.

eydx specifies that elasticities are to be calculated in the form of ∂log y/∂x.

at(atlist) specifies the points around which the marginal effects or the elasticities are to be estimated. The default is to estimate the effect around the means of the independent variables.

at(mean | median | zero [varname=# [, varname=#] [...]]) specifies that the marginal effects or the elasticities are to be evaluated at the means, at the medians of the independent variables, or at zeros. It also allows users to specify particular values for one or more independent variables, assuming the rest are means, medians, or zeros. For instance,

        . probit foreign mpg weight price
        . mfx compute, at(mean mpg=30)
at(numlist) specifies that the marginal effects or the elasticities are to be evaluated at the numlist. If there is a constant term in the model, add a 1 to the numlist. This option is for single-equation estimators only. For instance,

        . probit foreign mpg weight price
        . mfx compute, at(21 3000 6000 1)

at(matname) specifies the points in a matrix format. A 1 is also needed if there is a constant term in the model. This option is for single-equation estimators only. For instance,

        . probit foreign mpg weight price
        . mat A = (21, 3000, 6000, 1)
        . mfx compute, at(A)
eqlist(eqnames) indirectly specifies the variables for which marginal effects (or elasticities) are to be calculated. Marginal effects (elasticities) will be calculated for all variables in the equations specified. The default is all equations, which is to say, all variables.

predict(predict_option) specifies which function is to be calculated for the marginal effects or the elasticities; i.e., the form of y. The default is the default predict option of the previous estimation command. For instance, since the default prediction for probit is the probability of a positive outcome, the predict() option is not required to calculate the marginal effects of the independent variables for the probability of a positive outcome:

        . probit foreign mpg weight price
        . mfx compute

To calculate the marginal effects for the linear prediction (xb), specify predict(xb):

        . mfx compute, predict(xb)

To see which predict options are available, see help for the particular estimation command.
nonlinear specifies that y, the function to be calculated for the marginal effects or the elasticities, does not meet the linear-form restriction. For the definition of the linear-form restriction, please refer to the Methods and Formulas section. By default, mfx will assume that y meets the linear-form restriction unless one or more independent variables are shared by multiple equations. For instance, predictions after

        . heckman mpg price, sel(for=rep)

meet the linear-form restriction, but those after

        . heckman mpg price, sel(for=rep price)

do not. If y meets the linear-form restriction, specifying nonlinear or not should produce the same results. However, the nonlinear method is generally more time-consuming. Most likely, users do not need to specify nonlinear after a Stata official estimation command. For user-written estimation commands, if you are not sure whether y is of linear form, specifying nonlinear is always a safe choice. Please refer to the Speed and accuracy section for further discussion.

nodiscrete treats dummy variables as continuous ones. If nodiscrete is not specified, the marginal effect of a dummy variable is calculated as the discrete change in the expected value of the dependent variable as the dummy variable changes from 0 to 1. This option is irrelevant to the computation of the elasticities, because all the dummy variables are treated as continuous in computing elasticities.

noesample only affects at(atlist). It specifies that when the means and medians are calculated, the whole dataset is to be considered instead of only those observations marked in the e(sample) defined by the previous estimation command.
nowght only affects at(atlist). It specifies that weights are to be ignored when calculating the means and medians for the atlist.

nose asks mfx to calculate the marginal effects or the elasticities without their standard errors. Calculating standard errors is very time-consuming; specifying nose will reduce the running time of mfx.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
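The options may be combined. For instance, a quick sketch (the probit model is just an illustration) that evaluates elasticities at the medians and skips the time-consuming standard errors:

        . probit foreign mpg weight price
        . mfx compute, eyex at(median) nose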
Remarks

Remarks are presented under the headings

        Obtaining marginal effects after single-equation (SE) estimation
        Obtaining marginal effects after multiple-equation (ME) estimation
        Obtaining three forms of elasticities
        Speed and accuracy

Obtaining marginal effects after single-equation (SE) estimation

Before running mfx, type help estimation_cmd to see what can be predicted after estimation and to see the default prediction.
> Example

We estimate a logit model using the auto dataset:

. logit foreign mpg price
Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -36.694839
Iteration 2:   log likelihood = -36.463894
Iteration 3:   log likelihood =  -36.46219
Iteration 4:   log likelihood = -36.462189

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      17.14
                                                  Prob > chi2     =     0.0002
Log likelihood = -36.462189                       Pseudo R2       =     0.1903

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .2338353   .0671449     3.48   0.000     .1022338    .3654368
       price |    .000266   .0001166     2.28   0.022     .0000375    .0004945
       _cons |  -7.648111   2.043673    -3.74   0.000    -11.65364   -3.642586
------------------------------------------------------------------------------

To determine the marginal effects of mpg and price for the probability of a positive outcome at their mean values, issue the mfx command, because the default prediction after logit is the probability of a positive outcome and the calculation is requested at the mean values.
. mfx compute

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .26347633

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0453773      .01311    3.46   0.001   .019702  .071053   21.2973
   price |   .0000516      .00002    2.31   0.021   7.8e-06  .000095   6165.26
------------------------------------------------------------------------------

The first line of the output indicates that the marginal effects were calculated after a logit estimation. The second line of the output gives the form of y and the predict command that we would type to get y. The third line of the output gives the value of y given X, which are displayed in the last column of the table.
To calculate the marginal effects at particular data points, say, mpg = 20 and price = 6000, specify the at() option:

. mfx compute, at(mpg=20, price=6000)

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .20176601

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0376607      .00961    3.92   0.000   .018834  .056488   20.0000
   price |   .0000428      .00002    2.47   0.014   8.8e-06  .000077   6000.00
------------------------------------------------------------------------------
To calculate the marginal effects for the linear prediction (xb) instead of the probability, specify predict(xb). Note that the marginal effects for the linear prediction are the coefficients themselves.

. mfx compute, predict(xb)

Marginal effects after logit
      y  = Linear prediction (predict, xb)
         = -1.0279779

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .2338353      .06714    3.48   0.000   .102234  .365437   21.2973
   price |    .000266      .00012    2.28   0.022   .000038  .000495   6165.26
------------------------------------------------------------------------------
If there is a dummy variable as an independent variable, mfx will calculate the discrete change as the dummy variable changes from 0 to 1.

. gen record = 0
. replace record = 1 if rep78 > 3
(34 real changes made)

. logit foreign mpg record, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      26.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -31.898321                       Pseudo R2       =     0.2917

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1079219   .0565077     1.91   0.056    -.0028311    .2186749
      record |   2.435068   .7128444     3.42   0.001     1.037918    3.832217
       _cons |  -4.689347   1.326547    -3.54   0.000     -7.28933   -2.089363
------------------------------------------------------------------------------

. mfx compute

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381   21.2973
 record* |   .4272707      .10432    4.09   0.000   .222712   .63163   .459459
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

If nodiscrete is specified, mfx will treat the dummy variable as continuous.

. mfx compute, nodiscrete

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381   21.2973
  record |   .4163552      .10733    3.88   0.000   .205994  .626716   .459459
------------------------------------------------------------------------------
Obtaining marginal effects after multiple-equation (ME) estimation

If you have not read the discussion above on using mfx after SE estimations, please do so. Except for the ability to select specific equations for the calculation of marginal effects, the use of mfx after ME models follows almost exactly the same form as for SE models. The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of mfx after ME commands should first read the documentation of predict for the estimation command. For a general introduction to the ME models, we will demonstrate mfx after heckman and mlogit.
> Example

. heckman mpg weight length, sel(foreign = displ) nolog

Heckman selection model                         Number of obs      =        74
(regression model with sample selection)        Censored obs       =        52
                                                Uncensored obs     =        22
                                                Wald chi2(2)       =      7.27
Log likelihood = -87.58426                      Prob > chi2        =    0.0264

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg          |
      weight |  -.0039923   .0071948    -0.55   0.579    -.0180939    .0101092
      length |  -.1202545   .2093074    -0.57   0.566    -.5304895    .2899805
       _cons |   56.72567   21.68463     2.62   0.009     14.22458    99.22676
-------------+----------------------------------------------------------------
foreign      |
displacement |  -.0250297   .0067241    -3.72   0.000    -.0382088   -.0118506
       _cons |   3.223625   .8757406     3.68   0.000     1.507205    4.940045
-------------+----------------------------------------------------------------
     /athrho |  -.9840858   .8112212    -1.21   0.225     -2.57405    .6058785
    /lnsigma |   1.724306   .2794524     6.17   0.000     1.176589    2.272022
-------------+----------------------------------------------------------------
         rho |  -.7548292    .349014                     -.9884463    .5412193
       sigma |   5.608626   1.567344                      3.243293    9.698997
      lambda |  -4.233555   3.022645                     -10.15783    1.690721
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.37   Prob > chi2 = 0.2413
------------------------------------------------------------------------------
heckman estimated two equations, mpg and foreign; see [R] heckman. Two of the prediction statistics after heckman are the expected value of the dependent variable and the probability of being observed. To obtain the marginal effects of the independent variables of all the equations for the expected value of the dependent variable, specify predict(yexpected) with mfx.

. mfx compute, predict(yexpected)

Marginal effects after heckman
      y  = E(mpg*|Pr(foreign)) (predict, yexpected)
         =  .56522778

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  weight |  -.0001725      .00041   -0.42   0.675  -.000979  .000634   3019.46
  length |  -.0051953      .01002   -0.52   0.604   -.02483   .01444   187.932
displa~t |  -.0340055      .02541   -1.34   0.181  -.083802  .015791   197.297
------------------------------------------------------------------------------

To calculate the marginal effects for the probability of being observed, since only the independent variables in equation foreign affect the probability of being observed, specify eqlist(foreign) to restrict the calculation.

. mfx compute, eqlist(foreign) predict(psel)

Marginal effects after heckman
      y  = Pr(foreign) (predict, psel)
         =  .04320292

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
displa~t |  -.0022958      .00153   -1.50   0.133  -.005287  .000696   197.297
------------------------------------------------------------------------------
> Example

predict after mlogit has a special feature that most other estimation commands do not. It can predict multiple new variables by issuing predict only once; see [R] mlogit. This feature cannot be adopted into mfx. To calculate the marginal effects for the probability of each outcome, run mfx separately for each outcome.

. mlogit rep78 mpg displ, nolog

Multinomial regression                            Number of obs   =         69
                                                  LR chi2(8)      =      22.83
                                                  Prob > chi2     =     0.0036
Log likelihood = -82.27874                        Pseudo R2       =     0.1218

------------------------------------------------------------------------------
       rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1            |
         mpg |  -.0021573   .2104309    -0.01   0.992    -.4145942    .4102796
displacement |  -.0052312   .0126927    -0.41   0.680    -.0301085    .0196461
       _cons |  -1.566574   6.429681    -0.24   0.808    -14.16852    11.03537
-------------+----------------------------------------------------------------
2            |
         mpg |   .0150954   .1235325     0.12   0.903    -.2270239    .2572147
displacement |   .0020254   .0063719     0.32   0.751    -.0104634    .0145142
       _cons |   -2.09099   3.664348    -0.57   0.568    -9.272981    5.091001
-------------+----------------------------------------------------------------
4            |
         mpg |   .0070871   .0883698     0.08   0.936    -.1661146    .1802888
displacement |  -.0066993   .0053435    -1.25   0.210    -.0171723    .0037737
       _cons |   .7047881   2.704785     0.26   0.794    -4.596492    6.006069
-------------+----------------------------------------------------------------
5            |
         mpg |   .0808327   .0983973     0.82   0.411    -.1120224    .2736878
displacement |  -.0231922   .0119692    -1.94   0.053    -.0466514    .0002671
       _cons |    .652801   3.545048     0.18   0.854    -6.295365    7.600967
------------------------------------------------------------------------------
(Outcome rep78==3 is the comparison group)

. mfx compute, predict(outcome(1))

Marginal effects after mlogit
      y  = Pr(rep78==1) (predict, outcome(1))
         =  .03438017

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0003566      .00679   -0.05   0.958  -.013663  .012951   21.2899
displa~t |  -.0000703      .00041   -0.17   0.864  -.000873  .000732   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(2))

Marginal effects after mlogit
      y  = Pr(rep78==2) (predict, outcome(2))
         =  .12361544

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0008507      .01277    0.07   0.947  -.024183  .025885   21.2899
displa~t |   .0006444      .00067    0.96   0.336  -.000668  .001957   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(3))

Marginal effects after mlogit
      y  = Pr(rep78==3) (predict, outcome(3))
         =  .48578012

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0039901      .01922   -0.21   0.836  -.041682  .033682   21.2899
displa~t |   .0015484      .00108    1.43   0.151  -.000567  .003664   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(4))

Marginal effects after mlogit
      y  = Pr(rep78==4) (predict, outcome(4))
         =  .30337619

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0003418      .01707   -0.02   0.984  -.033805  .033122   21.2899
displa~t |  -.0010654      .00106   -1.01   0.313  -.003136  .001005   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(5))

Marginal effects after mlogit
      y  = Pr(rep78==5) (predict, outcome(5))
         =  .05284808

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0038378      .00561    0.68   0.494  -.007167  .014843   21.2899
displa~t |  -.0010572      .00047   -2.24   0.025  -.001984 -.000131   198.000
------------------------------------------------------------------------------
Obtaining three forms of elasticities

mfx can also be used to obtain all three forms of elasticities:

        option      elasticity
        eyex        ∂log y/∂log x
        dyex        ∂y/∂log x
        eydx        ∂log y/∂x

> Example

We estimate a regression model using the auto dataset. The marginal effects for the predicted value y after a regress are the same as the coefficients. To get the elasticities of form ∂log y/∂log x, specify the eyex option:
. regress mpg weight length

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   69.34
       Model |  1616.08062     2  808.040312           Prob > F      =  0.0000
    Residual |  827.378835    71  11.6532230           R-squared     =  0.6614
-------------+------------------------------           Adj R-squared =  0.6519
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4137

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374
------------------------------------------------------------------------------

. mfx compute, eyex

Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

------------------------------------------------------------------------------
variable |      ey/ex    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  weight |  -.5460497      .22509   -2.43   0.015  -.987208 -.104891   3019.46
  length |  -.7023518      .48867   -1.44   0.151  -1.66012  .255414   187.932
------------------------------------------------------------------------------

The first line of the output indicates that the elasticities were calculated after a regress estimation.
The title of the second column of the table gives the form of the elasticities, ∂log y/∂log x: the change in y in percent for a 1 percent change in x.

If the independent variables have been log-transformed already, then we will want the elasticities of the form ∂log y/∂x instead.

. gen lnweight = ln(weight)
. gen lnlength = ln(length)
. regress mpg lnweight lnlength

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   74.00
       Model |  1651.28916     2  825.644581           Prob > F      =  0.0000
    Residual |  792.170298    71  11.1573281           R-squared     =  0.6758
-------------+------------------------------           Adj R-squared =  0.6667
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.3403

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lnweight |   -13.5974   4.692504    -2.90   0.005    -22.95398   -4.240811
    lnlength |  -9.816726   10.40316    -0.94   0.349    -30.56004    10.92659
       _cons |   181.1196   22.18429     8.16   0.000     136.8853    225.3538
------------------------------------------------------------------------------
. mfx compute, eydx

Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

------------------------------------------------------------------------------
variable |      ey/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
lnweight |  -.6384565      .22064   -2.89   0.004   -1.0709 -.206009   7.97875
lnlength |  -.4609376      .48855   -0.94   0.345  -1.41847  .496594   5.22904
------------------------------------------------------------------------------

Note that although the interpretation is the same, the results for eyex and eydx differ since we are estimating different models. If the dependent variable were log-transformed, we would specify dyex instead.
Speed and accuracy

mfx numerically calculates the derivatives and the second derivatives, so half of the digits of the accuracy of the estimation command are expected to be lost. For instance, if the predicted values from an estimation command are of 16 digits accuracy, i.e., they are accurate to 1e-16, the accuracy of the marginal effects calculated by mfx might fall to 1e-8, and the accuracy of the standard errors of the marginal effects might fall to 1e-4 in the worst case.

Users of mfx should also be aware of the speed issue. The linear method is generally much faster than the nonlinear method, but it might still take a while if there are multiple equations and quite a few independent variables. For those cases where y fails to meet the linear-form restriction, such as after mlogit, mfx might take a long time, varying from seconds to hours depending on the number of independent variables. Specifying nose will reduce the running time of mfx considerably.

The table below gives a general idea of the accuracy of the linear and nonlinear methods and how long it takes to produce the results. All the estimations listed in the table were run on the auto dataset, using 9 independent variables.
        Estimation   Method      Speed (sec.)   Accuracy of dydx   Accuracy of stds
        regress      linear          1.32          1.263e-09          1.139e-09
                     nonlinear      13.24          3.289e-10          5.361e-07
        probit       linear          1.35          9.221e-13          1.888e-08
                     nonlinear      65.43          4.336e-12          7.841e-08
        logit        linear          1.31          1.647e-11          1.889e-07
                     nonlinear      85.81          5.921e-12          1.356e-06
        tobit        linear          1.44          2.683e-10          3.059e-10
                     nonlinear      12.39          1.353e-10          9.734e-07
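For the slow nonlinear-form cases, the nose option is the natural economy. A brief sketch using commands that appear in the Remarks above (assuming the auto dataset is in memory):

        . mlogit rep78 mpg displ
        . mfx compute, predict(outcome(1)) nose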
Saved Results

mfx saves in e():

Scalars
        e(Xmfx_y)           value of y given X

Macros
        e(Xmfx_type)        dydx, eyex, eydx, or dyex
        e(Xmfx_discrete)    discrete or nodiscrete option
        e(Xmfx_cmd)         mfx
        e(Xmfx_dummy)       string corresponding to independent variables: 1 means dummy, 0 means continuous
        e(Xmfx_label_p)     label of the predict option
        e(Xmfx_method)      linear or nonlinear

Matrices
        e(Xmfx_dydx)        marginal effects
        e(Xmfx_se_dydx)     standard errors of the marginal effects
        e(Xmfx_eyex)        elasticities of form eyex
        e(Xmfx_se_eyex)     standard errors of elasticities of form eyex
        e(Xmfx_eydx)        elasticities of form eydx
        e(Xmfx_se_eydx)     standard errors of elasticities of form eydx
        e(Xmfx_dyex)        elasticities of form dyex
        e(Xmfx_se_dyex)     standard errors of elasticities of form dyex
        e(Xmfx_X)           points around which the marginal effects or elasticities were estimated
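The saved results can be inspected or reused in programs after mfx runs. A minimal sketch, reusing the logit example from the Remarks:

        . logit foreign mpg price
        . mfx compute
        . matrix list e(Xmfx_dydx)
        . matrix list e(Xmfx_se_dydx)
        . display e(Xmfx_y)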
Methods and Formulas

mfx is implemented as an ado-file.

Nonlinear method

Suppose the function to be calculated for the marginal effects is y = F(X, \beta). The marginal effect of independent variable x_i is

    m_i = \frac{\partial F(X,\beta)}{\partial x_i}

The variance of the marginal effect is calculated as

    \mathrm{Var}(m_i) = \frac{\partial m_i}{\partial \beta}\,\mathrm{Var}(\beta)\,\biggl(\frac{\partial m_i}{\partial \beta}\biggr)'

where

    \frac{\partial m_i}{\partial \beta} = \biggl(\frac{\partial m_i}{\partial \beta_1},\ \frac{\partial m_i}{\partial \beta_2},\ \ldots,\ \frac{\partial m_i}{\partial \beta_n}\biggr)
334
mfx -- Obtain marginal effects or elasticities after estimation
Linear method y meets the linear-form
restriction if
1. y = F(XlOl
, X2_2 ....
, Xs/3s),
2. X1, X2 .....
X8 are mutually exclusive.
Under the linear-form restriction, be calculated as
where s is the number of equations,
tile marginal effects of independent
and
variable j in equation i can
OF(X,_) mij
--
_
"/Jij
The variance of rr,ij is !
var(.',,u) = L-5__I
L o_ J
where
CQ/_
Ornij/c)3kt
\0_11
'
"'"
'
O_Xs
'
"'"
'
Oq_sl
"'"
'
O_ss
]
can be calculated as Omij
02 F
O_kl- o(x_)o(xk_)_7,z where I(i -- k,j
OF
+ Ox_ _r(i= k,j = l)
= l) .- 1 if i = k and j -- l, and = zero otherwise.
Also See

Related:        [R] probit, [R] truncreg

Background:     [U] 23 Estimation and post-estimation commands,
                [R] predict, [P] _predict
Title

mkdir -- Create directory

Syntax

mkdir directory_name [, public]

Double quotes may be used to enclose the directory name, and the quotes must be used if the directory name contains embedded blanks.

Description

mkdir creates a new directory (folder).

Options

public specifies that directory_name is to be readable by everyone; otherwise, the directory will be created according to the default permissions of your operating system.

Remarks

Examples:

Windows
        . mkdir myproj
        . mkdir c:\projects\myproj
        . mkdir "c:\My Projects\Project 1"

Unix
        . mkdir myproj
        . mkdir ~/projects/myproj

Macintosh
        . mkdir myproj
        . mkdir :hdisk:projects:project1
        . mkdir ":Hard Disk:My Projects:Project 1"

Also See

Related:        [R] cd, [R] copy, [R] dir, [R] erase, [R] shell, [R] type

Background:     [U] 14.6 File-naming conventions
Title

mkspline -- Linear spline construction

Syntax

mkspline newvar1 #1 [newvar2 #2 [...]] newvark = oldvar [if exp] [in range] [, marginal ]

mkspline stubname # = oldvar [if exp] [in range] [, marginal pctile ]

Description

mkspline creates variables containing a linear spline of oldvar.

In the first syntax, mkspline creates newvar1, ..., newvark containing a linear spline of oldvar with knots at the specified #1, ..., #k-1.

In the second syntax, mkspline creates # variables named stubname1, ..., stubname# containing a linear spline of oldvar. The knots are equally spaced over the range of oldvar or are placed at the percentiles of oldvar.

Options

marginal specifies that the new variables are to be constructed so that, when used in estimation, the coefficients represent the change in the slope from the preceding interval. The default is to construct the variables so that, when used in estimation, the coefficients measure the slopes for the intervals.

pctile is allowed only with the second syntax. It specifies that the knots are to be placed at percentiles of the data rather than equally spaced based on the range.

Remarks

Linear splines allow estimating the relationship between y and x as a piecewise linear function. A piecewise linear function is just that: a function composed of linear segments, that is, straight lines. One linear segment represents the function for values of x below x0. Another linear segment handles values between x0 and x1, and so on. The linear segments are arranged so that they join at x0, x1, ..., which are called the knots. An example of a piecewise linear function is shown below.
(figure omitted: an example of a piecewise linear function)
> Example

You wish to estimate a model of log income on education and age using a piecewise linear function for age:

    \mathtt{lninc} = b_0 + b_1\,\mathtt{educ} + f(\mathtt{age}) + u

The knots are to be at ten-year intervals: 20, 30, 40, 50, and 60.

        . mkspline age1 20 age2 30 age3 40 age4 50 age5 60 age6 = age, marginal
        . regress lninc educ age1-age6
        (output omitted)

Since you specified the marginal option, you could test whether the age effect is the same in the 30-40 and 40-50 intervals by asking whether the age4 coefficient were zero. With the marginal option, coefficients measure the change in slope from the preceding group. Specifying marginal changes only the interpretation of the coefficients; the same model is estimated in either case. That is, without the marginal option, the interpretation of the coefficients would have been
    \frac{dy}{d\,\mathtt{age}} =
    \begin{cases}
        a_1 & \text{if } \mathtt{age} < 20 \\
        a_2 & \text{if } 20 \le \mathtt{age} < 30 \\
        a_3 & \text{if } 30 \le \mathtt{age} < 40 \\
        a_4 & \text{if } 40 \le \mathtt{age} < 50 \\
        a_5 & \text{if } 50 \le \mathtt{age} < 60 \\
        a_6 & \text{otherwise}
    \end{cases}

With the marginal option specified, the interpretation is

    \frac{dy}{d\,\mathtt{age}} =
    \begin{cases}
        a_1 & \text{if } \mathtt{age} < 20 \\
        a_1 + a_2 & \text{if } 20 \le \mathtt{age} < 30 \\
        a_1 + a_2 + a_3 & \text{if } 30 \le \mathtt{age} < 40 \\
        a_1 + a_2 + a_3 + a_4 & \text{if } 40 \le \mathtt{age} < 50 \\
        a_1 + a_2 + a_3 + a_4 + a_5 & \text{if } 50 \le \mathtt{age} < 60 \\
        a_1 + a_2 + a_3 + a_4 + a_5 + a_6 & \text{otherwise}
    \end{cases}
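A quick way to convince yourself that the two parameterizations fit the same model is to compare fitted values. A sketch, assuming the lninc, educ, and age variables of this example are in memory (m1-m6, s1-s6, fitm, and fits are names we made up):

        . mkspline m1 20 m2 30 m3 40 m4 50 m5 60 m6 = age, marginal
        . quietly regress lninc educ m1-m6
        . predict double fitm
        . mkspline s1 20 s2 30 s3 40 s4 50 s5 60 s6 = age
        . quietly regress lninc educ s1-s6
        . predict double fits
        . assert abs(fitm - fits) < 1e-6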
> Example

As a second example, pretend you have a binary outcome variable called outcome. You are beginning an analysis and wish to parameterize the effect of dosage on outcome. You wish to divide the data into five equal-width groups of dosage for the piecewise linear function.

        . mkspline dose 5 = dosage
        . logistic outcome dose1-dose5
        (output omitted)

mkspline dose 5 = dosage creates five variables, dose1, dose2, ..., dose5, equally spacing the knots over the range of dosage. If dosage varied between 0 and 100, mkspline dose 5 = dosage has the same effect as typing

        . mkspline dose1 20 dose2 40 dose3 60 dose4 80 dose5 = dosage

The pctile option sets the knots to divide the data into five equal sample-size groups rather than five equal-width ranges. Typing

        . mkspline dose 5 = dosage, pctile

places the knots at the 20th, 40th, 60th, and 80th percentiles of the data.
Methods and Formulas

mkspline is implemented as an ado-file.

Let V_i, i = 1, \ldots, n, be the variables to be created; k_i, i = 1, \ldots, n-1, be the corresponding knots; and V be the original variable (the command is mkspline V1 k1 V2 k2 ... Vn = V). Then

    V_1 = \min(V, k_1)
    V_i = \max\{\min(V, k_i),\, k_{i-1}\} - k_{i-1}, \qquad i = 2, \ldots, n

where k_n is taken to be +\infty. If the marginal option is specified, the definitions are

    V_1 = V
    V_i = \max(0,\, V - k_{i-1}), \qquad i = 2, \ldots, n

In the second syntax, mkspline stubname # = V, let m and M be the minimum and maximum of V. Without the pctile option, knots are set at m + (M - m)(i/n) for i = 1, \ldots, n-1. If pctile is specified, knots are set at the 100(i/n) percentiles, for i = 1, \ldots, n-1. Percentiles are calculated by egen's pctile() function.
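The first definition is easy to verify by hand. A small sketch with one knot at 3000, using the auto dataset's weight variable (w1, w2, and the *chk variables are hypothetical names; the tolerance allows for mkspline's float storage):

        . mkspline w1 3000 w2 = weight
        * recompute the spline pieces directly from the definitions
        . generate double w1chk = min(weight, 3000)
        . generate double w2chk = max(weight, 3000) - 3000
        . assert abs(w1 - w1chk) < .01 & abs(w2 - w2chk) < .01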
References

Gould, W. W. 1993. sg19: Linear splines and piecewise linear functions. Stata Technical Bulletin 15: 13-17. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 98-104.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Newson, R. 2000. B-splines parameterized by their values at reference points on the x-axis. Stata Technical Bulletin 57: 20-27.

Panis, C. 1994. sg24: The piecewise linear spline transformation. Stata Technical Bulletin 18: 27-29. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 146-149.

Also See

Related:        [R] fracpoly
Title

ml -- Maximum likelihood estimation

Syntax

ml model method progname eq [eq ...] [weight] [if exp] [in range] [, robust cluster(varname) title(string) nopreserve collinear missing lf0(#k #ll) continue waldtest(#) constraints(numlist) obs(#) noscvars ]

ml clear

ml query

ml check

ml search [ [/]eqname[:] #lb #ub ] [...] [, repeat(#) nolog trace restart norescale ]

ml plot [eqname:]name [# [# [#]]] [, saving(filename[, replace]) ]

ml init { [eqname:]name=# | /eqname=# } [...]

ml init # [# ...], copy

ml init matname [, skip copy ]

ml report

ml trace { on | off }

ml count [ clear | on | off ]

ml maximize [, difficult nolog trace gradient hessian showstep iterate(#) ltolerance(#) tolerance(#) nowarning novce score(newvarnames) nooutput level(#) eform(string) noclear ]

ml graph [#] [, saving(filename[, replace]) ]

ml display [, noheader eform(string) first neq(#) plus level(#) ]

where method is { lf | d0 | d1 | d1debug | d2 | d2debug }

and eq is the equation to be estimated, enclosed in parentheses, and optionally with a name to be given to the equation, preceded by a colon:

        ( [eqname:] [varnames =] [varnames] [, eq_options] )

or eq is the name of a parameter, such as sigma, with a slash in front:

        /eqname        which is equivalent to        (eqname:)

and eq_options are

        noconstant
        offset(varname)
        exposure(varname)

fweights, pweights, aweights, and iweights are allowed; see [U] 14.1.6 weight. With all but method lf, you must write your likelihood-evaluation program a certain way if pweights are to be specified, and pweights may not be specified with method d0.

ml shares features of all estimation commands; see [U] 23 Estimation and post-estimation commands. To redisplay results, type ml display.
Syntax of ml model in noninteractive mode

ml model method progname eq [eq ...] [weight] [if exp] [in range], maximize [ robust cluster(varname) title(string) nopreserve collinear missing lf0(#k #ll) continue waldtest(#) constraints(numlist) obs(#) noscvars init(ml_init_args) search({on|quietly|off}) repeat(#) bounds(ml_search_bounds) difficult nolog trace gradient hessian showstep iterate(#) ltolerance(#) tolerance(#) nowarning novce score(newvarlist) ]

Noninteractive mode is invoked by specifying option maximize. Use maximize when ml is to be used as a subroutine of another ado-file or program and you want to carry forth the problem, from definition to posting of final results, in one command.
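For instance, a program might issue the entire estimation in one command. A sketch, where myprog and the equation are placeholders for a real evaluator and model:

        . ml model lf myprog (foreign = mpg weight), maximize search(quietly)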
Syntax of subroutines for use by method d0, d1, and d2 evaluators

mleval newvarname = vecname [, eq(#)]

mleval scalarname = vecname, scalar [eq(#)]

mlsum scalarname_lnf = exp [if exp] [, noweight]

mlvecsum scalarname_lnf rowvecname = exp [if exp] [, eq(#)]

mlmatsum scalarname_lnf matrixname = exp [if exp] [, eq(#[,#])]
Syntax of user-written evaluator

Summary of notation

The log-likelihood function is ln L(θ1j, θ2j, ..., θEj), where θij = xij bi, j = 1, ..., N indexes observations, and i = 1, ..., E indexes the linear equations defined by ml model. If the likelihood satisfies the linear-form restrictions, it can be decomposed as ln L = Σ(j=1,...,N) ln ℓ(θ1j, θ2j, ..., θEj).

Method lf evaluators:

        program define progname
                version 7
                args lnf theta1 [theta2 ...]
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                quietly gen double `tmp1' = ...
                ...
                quietly replace `lnf' = ...
        end

where

        `lnf'       variable to be filled in with observation-by-observation values of ln ℓj
        `theta1'    variable containing evaluation of 1st equation θ1j = x1j b1
        `theta2'    variable containing evaluation of 2nd equation θ2j = x2j b2
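As a concrete illustration of the skeleton (our own example, not part of the original skeleton), here is a minimal method lf evaluator for a probit model; myprobit is a name we made up, and norm() is Stata's cumulative standard normal:

        program define myprobit
                version 7
                args lnf theta1
                * ln L_j = ln Phi(theta) if y==1, ln Phi(-theta) if y==0
                quietly replace `lnf' = ln(norm( `theta1')) if $ML_y1 == 1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1 == 0
        end

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize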
Method d0 evaluators:

        program define progname
                version 7
                args todo b lnf
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                ...
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
        end

where

        `todo'    always contains 0 (may be ignored)
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
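Continuing the probit illustration in method d0 style (again a sketch of our own; mlsum handles the weighting and the summation):

        program define myprobit_d0
                version 7
                args todo b lnf
                tempvar theta1
                mleval `theta1' = `b', eq(1)
                * overall ln L = sum of ln Phi(theta) or ln Phi(-theta)
                mlsum `lnf' = ln(norm(cond($ML_y1==1, `theta1', -`theta1')))
        end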
Method d1 evaluators:

        program define progname
                version 7
                args todo b lnf g negH [g1 [g2 ... ]]
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2 ...
                mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
                mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
                matrix `g' = (`d1',`d2', ... )
        end

where

        `todo'    contains 0 or 1; 0 means `lnf' is to be filled in; 1 means `lnf' and `g' are to be filled in
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
        `g'       row vector to be filled in with overall g=∂ln L/∂b
        `negH'    argument to be ignored
        `g1'      variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'      variable optionally to be filled in with ∂ln ℓj/∂b2
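The same probit model in method d1 style must also fill in the gradient. For probit, the equation-level score is φ(θ)/Φ(θ) for positive outcomes and -φ(θ)/Φ(-θ) otherwise. A sketch (normd() is the standard normal density):

        program define myprobit_d1
                version 7
                args todo b lnf g negH g1
                tempvar theta1
                mleval `theta1' = `b', eq(1)
                mlsum `lnf' = ln(norm(cond($ML_y1==1, `theta1', -`theta1')))
                if `todo'==0 | `lnf'==. { exit }
                tempname d1
                * equation-level score of the probit log likelihood
                mlvecsum `lnf' `d1' = cond($ML_y1==1, normd(`theta1')/norm(`theta1'), -normd(`theta1')/norm(-`theta1')), eq(1)
                matrix `g' = (`d1')
        end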
Method d2 evaluators:

        program define progname
                version 7
                args todo b lnf g negH [g1 [g2 ... ]]
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2 ...
                mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
                mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
                matrix `g' = (`d1',`d2', ... )
                if `todo'==1 | `lnf'==. { exit }
                tempname d11 d12 d22 ...
                mlmatsum `lnf' `d11' = formula for -∂²ln ℓj/∂θ1j², eq(1)
                mlmatsum `lnf' `d12' = formula for -∂²ln ℓj/∂θ1j∂θ2j, eq(1,2)
                mlmatsum `lnf' `d22' = formula for -∂²ln ℓj/∂θ2j², eq(2)
                matrix `negH' = (`d11',`d12', ... \ `d12'',`d22', ... )
        end

where

        `todo'    contains 0, 1, or 2; 0 means `lnf' is to be filled in; 1 means `lnf' and `g' are to be
                  filled in; 2 means `lnf', `g', and `negH' are to be filled in
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
        `g'       row vector to be filled in with overall g=∂ln L/∂b
        `negH'    matrix to be filled in with overall negative Hessian -H=-∂²ln L/∂b∂b'
        `g1'      variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'      variable optionally to be filled in with ∂ln ℓj/∂b2
Global macros for use by all evaluators

        $ML_y1      name of first dependent variable
        $ML_y2      name of second dependent variable, if any
        $ML_samp    variable containing 1 if observation is to be used; 0 otherwise
        $ML_w       variable containing weight associated with observation or 1 if no weights specified

Method lf evaluators can ignore $ML_samp, but restricting calculations to the $ML_samp==1 subsample will speed execution. Method lf evaluators must ignore $ML_w; application of weights is handled by the method itself.

Method d0, d1, and d2 evaluators can ignore $ML_samp as long as ml model's nopreserve option is not specified. Method d0, d1, and d2 evaluators will run more quickly if nopreserve is specified. Method d0, d1, and d2 evaluators can ignore $ML_w only if they use mlsum, mlvecsum, and mlmatsum to produce final results.
Description

ml clear clears the current problem definition. This command is rarely, if ever, used because when you type ml model, any previous problem is automatically cleared.

ml model defines the current problem.

ml query displays a description of the current problem.

ml check verifies that the log-likelihood evaluator you have written seems to work. We strongly recommend using this command.

ml search searches for (better) initial values. We recommend using this command.

ml plot provides a graphical way of searching for (better) initial values.

ml init provides a way of setting initial values to user-specified values.

ml report reports the values of ln L, its gradient, and its negative Hessian at the initial values or current parameter estimates b0.

ml trace traces the execution of the user-defined log-likelihood evaluation program.

ml count counts the number of times the user-defined log-likelihood evaluation program is called. It was intended as a debugging tool for those developing ml, and it now serves little use besides entertainment. ml count clear clears the counter, ml count on turns on the counter, ml count without arguments reports the current values of the counters, and ml count off stops counting calls.

ml maximize maximizes the likelihood function and reports final results. Once ml maximize has successfully completed, the previously mentioned ml commands may no longer be used; ml graph and ml display may be used.

ml graph graphs the log-likelihood values against the iteration number.

ml display redisplays final results.

progname is the name of a program you write to evaluate the log-likelihood function. In this documentation, it is referred to as the user-written evaluator or sometimes simply as the evaluator. The program you write is written in the style required by the method you choose. The methods are lf, d0, d1, and d2. Thus, if you choose to use method lf, your program is called a method lf evaluator.

Method lf evaluators are required to evaluate the observation-by-observation log likelihood ln ℓj, j = 1, ..., N.

Method d0 evaluators are required to evaluate the overall log likelihood ln L.

Method d1 evaluators are required to evaluate the overall log likelihood and its gradient vector g = ∂ln L/∂b.

Method d2 evaluators are required to evaluate the overall log likelihood, its gradient, and its negative Hessian matrix -H = -∂²ln L/∂b∂b'.

mleval is a subroutine used by method d0, d1, and d2 evaluators to evaluate the coefficient vector b that they are passed.

mlsum is a subroutine used by method d0, d1, and d2 evaluators to define the value ln L that is to be returned.

mlvecsum is a subroutine used by method d1 and d2 evaluators to define the gradient vector g that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.

mlmatsum is a subroutine used by method d2 evaluators to define the negative Hessian matrix -H that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.
Options for use with ml model in interactive or noninteractive mode

robust and cluster(varname) specify the robust variance estimator, as does specifying pweights.

If you have written a method lf evaluator, robust, cluster(), and pweights will work. There is nothing to do except specify the options.

If you have written a method d0 evaluator, robust, cluster(), and pweights will not work. Specifying these options will result in an error message.

If you have written a method d1 or d2 evaluator and the likelihood function satisfies the linear-form restrictions, robust, cluster(), and pweights will work only if you fill in the equation scores; otherwise, specifying these options will result in an error message.

title(string) specifies the title to be placed on the estimation output when results are complete.
in the case of method If because the user-written values and it is ml itself that sums the components.
ml goes through these machinations if and only if the estimation sample is a subsample of the data in memory. If the estimation sample includes every observation in memory, ml does not preserve the original dataset. Thus. programmers must not damage the original dataset unless they preserve the data themselves. We recommend interactive users of ml not specify nopreserve; chances of incorrect results.
the speed gain is not worth the
We recommend programmers do specify nopreserve, but only after verifying that their evaluator really does restrict its attentions solely to the $ML_samp==l subsample. collinear bpecifies that ml is not to remove the collinear variables within equations. There is no reason one would want to leave coltinear variables in place, but this option is of interest to programmers who. in their code. have already removed collinear variables and thus do not want ml to waste computer time checking again.
+ {
I ,
_
ml -- Maximum likelihoodestimation
345
missings])ecifies that observations containing variables with missing values are not to be eliminated from th_ estimation somple. There are two reasons one might want to specify missing:
! _
Prograrr _ers may wi_h to specify" missingbecause, in other parts of their code, they have already eliminat_ observatio+hs with missing values and thus do not want ml to waste computer time
i ! + •
looking again. All user_ may wish tO specify missingif their model explicitly deals with missing values. Stata's heckmaa command i_ a good example of this. In such cases, there will be observations where
_
missing values are allbwed and other observations where they are not--where their presence should cause the observationl to be eliminated. If you specify missing,it is your responsibility to specify an if e rp that elimidates the irrelevant obserVations.
_; !
If0(#k #u_ is typically _sed by programmers. It Specifies the number of parameters and log-likelihood value o_the constant-only model so that ml++ can report a likelihood-ratio test rather than a Wald
i
i } + + i
+_
test. Th_se values w_re, or they may have been determined by t + perhaps, analytically_idetermined, • • a previous estimation! of the constant-only m6del on the estimation sample. "
1
Also so the continueoption directly below: If you specify IfO(),it must be safe for you to specify the missing option, too, else how did you cal
even if specified, is ignored if robus,t,
cluster(),
or pweights
is specified because in
that casf a likelihood-ratio test would be inappropriate. continue is typically specified by programmers. It does two things:
!
First, it specifies tha_ a model has just bee_ estimated, by either ml or some other estimation
i !
commmtd such as l_git,and that the likelihood value stored in e(ll) and the number of parameters stored in ]e(b) as of this instant are the relevant values of the constant-only model.
i
The cmrent value of!the log likelihood is used to present a likelihood-ratio test unless robust. cluster(), or pwe_ghts is specified because, in that case. a likelihood-ratio test would be inappro _riate. +
+ + ;
_
Second. continue e(b)
sets the initial values b0 for the model about to be estimated according to the
c_trrently sto
re+ ,
The cot relents madel about specifying missing
with lIO () apply equally well in this case.
wa!dtest #) is typically specified by programmers. By default, ml presents a Wald test. but that is overridclen if option_ lf0() or continue are specified, and that is overridden again _so we are back tolthe Wald testi if robust, watdt
cluster(),
or pweights
are specified.
st (0) prevet_ts even the Wald test frOm being reported. !
waldte_st (-1)
"
!
"
"
"
"
"
t
+stht_ default. It specifies that a Wald test _s to be performed, ff tt _s performed, b}
! ;
constra@ng all coef_cients except for the intercept to 0 in the first equation. Remaining equations are to b_ unconstrai@d. The logic as to whet_er a Wald test is performed is the standard: perform
! +
the Wald test if nei!_er lf0() nor continue cluster, or pweig_ts were specified.
+ i
were specified, but force a Wald test if robust.
watdtest(k) for < --I specifies that a Wald test is to be performed, if it is performed, by constraining all coefficients except for intercepts to 0 in the first hi equations: remaining equations are to e unconstrair_ed The logic as to whether a Wald tes_ is performed is the standard.
+
_ l:J
o,m ml m Maximum iiKellrlOOdestimation waldtest (k) for k > 1 works like the above except that it forces a Wald test to be reported even if the infommtion to perform the likelihood-ratio test is available and even if none of robust, cluster, or pweights were specified, waldtest(k), k > 1, may not be specified with lf0(). constraints (numlist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined using the constraint command and are numbered: see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. obs (#) is used mostly by pro_mers. It specifies that ultimately stored in e (N), is to be #. Ordinarily, ml Programmers may want to specify th_s option when, in for N observations, they first had to modify the dataset observations.
the number of observations reported, and works that out for itself, and correctly. order for the likelihood-evaluator to work so that it contained a different number of
noscvars is used mostly by programmers. It specifies that method dO, dl, or d2 is being used but that the likelihood evaluation program does not calculate nor use arguments "gl ", "g2", etc., which are the score vectors. Thus, m! can save a little time by not generating and passing those arguments.
Options for use with ml model in noninteractive mode

In addition to the above options, the following options are for use with ml model in noninteractive mode. Noninteractive mode is for programmers who use ml as a subroutine and want to issue a single command that will carry forth the estimation from start to finish.

maximize is not optional. It specifies noninteractive mode.

init(ml_init_args) sets the initial values b0. ml_init_args are whatever you would type after the ml init command.

search({on|quietly|off}) specifies whether ml search is to be used to improve the initial values. search(on) is the default and is equivalent to running separately ml search, repeat(0). search(quietly) is the same as search(on) except that it suppresses ml search's output. search(off) prevents the calling of ml search altogether.

repeat(#) is ml search's repeat() option and is relevant only if search(off) is not specified. repeat(0) is the default.

bounds(ml_search_bounds) is relevant only if search(off) is not specified. The command ml model issues is 'ml search ml_search_bounds, repeat(#)'. Specifying search bounds is optional.

difficult, nolog, trace, gradient, hessian, showstep, iterate(), ltolerance(), tolerance(), nowarning, novce, and score() are ml maximize's equivalent options.

Options for use when specifying equations

noconstant specifies that the equation is not to include an intercept.

offset(varname) specifies that the equation is to be xb + varname; that is, the equation is to include varname with coefficient constrained to be 1.

exposure(varname) is an alternative to offset(varname); it specifies that the equation is to be xb + ln(varname). The equation is to include ln(varname) with coefficient constrained to be 1.
Options for use with ml search

repeat(#) specifies the number of random attempts that are to be made to find a better initial-value vector. The default is repeat(10).

repeat(0) specifies that no random attempts are to be made. More correctly, repeat(0) specifies that no random attempts are to be made if the initial initial-value vector is a feasible starting point. If it is not, ml search will make random attempts even if you specify repeat(0), because it has no alternative. The repeat() option refers to the number of random attempts to be made to improve the initial values. When the initial starting-value vector is not feasible, ml search will make up to 1,000 random attempts to find starting values. It stops the instant it finds one set of values that works and then moves into its improve-initial-values logic.

repeat(k), k > 0, specifies the number of random attempts to be made to improve the initial values.

nolog specifies that no output is to appear while ml search looks for better starting values. If you specify nolog and the initial starting-value vector is not feasible, ml search will ignore the fact that you specified the nolog option. If ml search must take drastic action to find starting values, it feels you should know about this even if you attempted to suppress its usual output.

trace specifies that you want more detailed output about ml search's actions than it would usually provide. This is more entertaining than useful. ml search prints a period each time it evaluates the likelihood function without obtaining a better result and a plus when it does.

restart specifies that random actions are to be taken to obtain starting values and that the resulting starting values are not to be a deterministic function of the current values. Users should not specify this option, mainly because, with restart, ml search intentionally does not produce as good a set of starting values as it could. restart is included for use by the optimizer when it gets into serious trouble. The random actions are to ensure that the actions of the optimizer and ml search, working together, do not result in a long, endless loop.

restart implies norescale, which is why we recommend you do not specify restart. In testing, cases were discovered where rescale worked so well that, even after randomization, the rescaler would bring the starting values right back to where they had been the first time and so defeated the intended randomization.

norescale specifies that ml search is not to engage in its rescaling actions to improve the parameter vector. We do not recommend specifying this option because rescaling tends to work so well.
Options for use with ml plot

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml init

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
348
ml -- Maximum likelihood estimation
Options for use with mi maximize difficult specifies that the likelihood function is likely to be difficult to maximize. In particular, difficult states that there may be regions where -H is not invertible and that, in those regions, ml's standard fixup may not work well. difficult specifies that a different fixup requiring substantially more computer time is to be used. For the majority of likelihood functions, difficult is likely to increase execution times unnecessarily. For other likelihood functions, specifying difficult is of great importance. nolog,
trace,
gradient,
hessian,
and showstep
control the display of the iteration log.
nolog
suppresses reporting
trace
adds to the iteration log a report on the current parameter
gradient
vector,
adds to the iteration log a report on the current gradient vector.
hessian
adds to the iteration log a report on the current negative Hessian matrix.
showstep iterate
of the iteration log.
adds to the iteration log a report on the steps within iteration.
(#), ltolerance
iterate(16000) Convergence
(#), and tolerance
tolerance(le-6)
(#) specify the definition of convergence.
ltolerance(le-7)
is the default.
is declared when mreldif(bi+l,bi) _< tolerance () or
reldif{lnL(bi+l),InL(bi)}< Itolerance()
In addition, iteration stops when i -- iterate(); in that case. results along with the message "convergence not achieved" are presented. The return code is still set to 0. nowarning is allowed only with iterate (0). nowarning suppresses the "convergence not achieved" message. Programmers might specify iterate (0) nowarning when they have a vector b already containing what are the final estimate,; and want ml to calculate the variance matrix and post final estimation results. In that case, specify 'init(b) search(off) iterate(0) nowarning notog'. novce is allowed only with iterate (0). novce substitutes the zero matrix for the variance matrix which in effect posts estimation results as fixed constants. score (newvarlist) specifies that the equation scores are to be stored in the specified new variables. Either specify one new variable name per equation or specify a shorl name suffixed with a *. E.g., score(sc*) would be taken as specifying scl if there were one equation and scl and sc2 if there were two equations. In order to specify score(), either you must be using method If, or the estimation subsample must be the entire dataset in memory, or you must have specified the nopreserve option. nooutput quietly
suppresses display of the final results. This is different from prefixing ral maximize in that the iteration log is still displayed (assuming nolog is not specified).
with
level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is leveI(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals, eform(string)
is ml display's
eform()
option.
noclear specifies that after the model has converged, the ml problem definition is not to be cleared. Perhaps you are having convergence problems and intend to run the model to convergence. If so, use ml search to see if those values can be improved, and then start the estimation again.
ml -- Maximumlikelihood estimation
349
!
i
__
! ! l ,
Options for use with ml graph

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.
Options for use with ml display

noheader suppresses the display of the header above the coefficient table that displays the final log-likelihood value, the number of observations, and the model significance test.

eform(string) displays the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. Display of the intercept, if any, is suppressed. string is the table header that will be displayed above the transformed coefficients and must be 11 characters or fewer in length, for example, eform("Odds ratio").

first displays a coefficient table reporting results for the first equation only, and the report makes it appear that the first equation is the only equation. This is used by programmers who estimate ancillary parameters in the second and subsequent equations and will report the values of such parameters themselves.

neq(#) is an alternative to first. neq(#) displays a coefficient table reporting results for the first # equations. This is used by programmers who estimate ancillary parameters in the #+1st and subsequent equations and will report the values of such parameters themselves.

plus displays the coefficient table just as it would be ordinarily, but then, rather than ending the table in a line of dashes, ends it in dashes-plus-sign-dashes. This is so that programmers can write additional display code to add more results to the table and make it appear as if the combined result is one table. Programmers typically specify plus with options first or neq().

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
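For instance, after estimating a model whose coefficients are log odds-ratios, one might redisplay the table in exponentiated form (a sketch using the eform() string suggested above):

        . ml display, eform("Odds ratio")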
Options for use with mleval

eq(#) specifies the equation number i for which θij = xij bi is to be evaluated. eq(1) is assumed if eq() is not specified.

scalar asserts that the ith equation is known to evaluate to a constant; the equation was specified as (), (name:), or /name on the ml model statement. If you specify this option, the new variable created is created as a scalar. If the ith equation does not evaluate to a scalar, an error message is issued.

Options for use with mlsum

noweight specifies that weights ($ML_w) are to be ignored when summing the likelihood function.

Options for use with mlvecsum

eq(#) specifies the equation for which a gradient vector ∂ln L/∂bi is to be constructed. The default is eq(1).

Options for use with mlmatsum

eq(#[,#]) specifies the equations for which the negative Hessian matrix is to be constructed. The default is eq(1), which means the same as eq(1,1), which means -∂²ln L/∂b1∂b1'. Specifying eq(i,j) results in -∂²ln L/∂bi∂bj'.
Remarks

For a thorough discussion of ml, see Maximum Likelihood Estimation with Stata (Gould and Sribney 1999). The book provides a tutorial introduction to ml, notes on advanced programming issues, and a discourse on maximum likelihood estimation from both theoretical and practical standpoints.

ml requires that you write a program that evaluates the log-likelihood function and, possibly, its first and second derivatives. The style of the program you write depends upon the method chosen: methods lf and d0 require that your program evaluate the log likelihood only; method d1 requires that your program evaluate the log likelihood and gradient; method d2 requires that your program evaluate the log likelihood, gradient, and negative Hessian. Methods lf and d0 differ from each other in that, with method lf, your program is required to produce observation-by-observation log-likelihood values ln ℓ_j, and it is assumed that ln L = Σ_j ln ℓ_j; with method d0, your program is required to produce the overall value ln L.

Once you have written the program--called an evaluator--you define a model to be estimated using ml model and obtain estimates using ml maximize. You might type

        . ml model ...
        . ml maximize
but we recommend that you type

        . ml model ...
        . ml check
        . ml search
        . ml maximize
ml check will verify that your evaluator has no obvious errors, and ml search will find better initial values.

You fill in the ml model statement with (1) the method you are using, (2) the name of your program, and (3) the "equations". You write your evaluator in terms of θ_1, θ_2, ..., each of which has a linear equation associated with it. That linear equation might be as simple as θ_i = b_0, or it might be θ_i = b_1 mpg + b_2 weight + b_3, or it might omit the intercept b_3. The equations are specified in parentheses on the ml model line.

Suppose you are using method lf and the name of your evaluator program is myprog. The following statement

        . ml model lf myprog (mpg weight)

would specify a single equation with θ_1 = b_1 mpg + b_2 weight + b_3. If you wanted to omit b_3, you would type

        . ml model lf myprog (mpg weight, nocons)
and if all you wanted was θ_1 = b_0, you would type

        . ml model lf myprog ()

With multiple equations, you list the equations one after the other; so if you typed

        . ml model lf myprog (mpg weight) ()
you would be specifying θ_1 = b_1 mpg + b_2 weight + b_3 and θ_2 = b_4. You would write your likelihood in terms of θ_1 and θ_2. If the model were linear regression, θ_1 might be the xb part and θ_2 the variance of the residuals.

When you specify the equations, you also specify any dependent variables. If you type

        . ml model lf myprog (price = mpg weight) ()

price would be the one and only dependent variable, and that would be passed to your program in $ML_y1. If your model had two dependent variables, you could type

        . ml model lf myprog (price displ = mpg weight) ()

and then $ML_y1 would be price and $ML_y2 would be displ. You can specify however many dependent variables are necessary and specify them on any equation. It does not matter on which equation you specify them; the first one specified is placed in $ML_y1, the second in $ML_y2, and so on.
Example

Using method lf, we are to produce observation-by-observation values of the log likelihood. The probit log-likelihood function is

        ln ℓ_j = ln Φ(θ_1j)         if y_j = 1
               = ln Φ(-θ_1j)        if y_j = 0

        θ_1j = x_j b
The following is the method lf evaluator for this likelihood function:

        program define myprobit
                version 7.0
                args lnf theta1
                quietly replace `lnf' = ln(norm(`theta1'))  if $ML_y1==1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1==0
        end
If we wanted to estimate a model of foreign on mpg and weight, we would type

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

The 'foreign =' part specifies that y is foreign. The 'mpg weight' part specifies that θ_1j = b_1 mpg_j + b_2 weight_j + b_3. The result of running this is
        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize
        initial:       log likelihood =  -51.29289
        alternative:   log likelihood = -45.055271
        rescale:       log likelihood = -45.055271
        Iteration 0:   log likelihood = -45.055271
        Iteration 1:   log likelihood = -27.904114
        Iteration 2:   log likelihood = -26.857800
        Iteration 3:   log likelihood = -26.844191
        Iteration 4:   log likelihood = -26.844189
        Iteration 5:   log likelihood = -26.844189

                                                        Number of obs   =         74
                                                        Wald chi2(2)    =      20.75
        Log likelihood = -26.844189                     Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
             foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
              weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
               _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
        ------------------------------------------------------------------------------
Example

A two-equation, two-dependent-variable model is little different. Rather than receiving one theta, our program will receive two. Rather than there being one dependent variable in $ML_y1, there will be dependent variables in $ML_y1 and $ML_y2. For instance, the Weibull regression log-likelihood function is

        ln ℓ_j = -(t_j e^(-θ_1j))^exp(θ_2j) + d_j{θ_2j - θ_1j + (e^(θ_2j) - 1)(ln t_j - θ_1j)}

        θ_1j = x_j b_1
        θ_2j = s

where t_j is the time of failure or censoring and d_j = 1 if failure and 0 if censored. We can make the log likelihood a little easier to program by introducing some extra variables:

        p_j = exp(θ_2j)
        M_j = {t_j exp(-θ_1j)}^(p_j)
        R_j = ln t_j - θ_1j

        ln ℓ_j = -M_j + d_j{θ_2j - θ_1j + (p_j - 1)R_j}
The method lf evaluator for this is

        program define myweib
                version 7.0
                args lnf theta1 theta2
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = ($ML_y1*exp(-`theta1'))^`p'
                quietly gen double `R' = ln($ML_y1) - `theta1'
                quietly replace `lnf' = -`M' + $ML_y2*(`theta2'-`theta1' + (`p'-1)*`R')
        end
We can estimate a model by typing

        . ml model lf myweib (studytime died = drug2 drug3 age) ()
        . ml maximize
Note that we specified '()' for the second equation. The second equation corresponds to the Weibull shape parameter s, and the linear combination we want for s contains just an intercept. Alternatively, we could type

        . ml model lf myweib (studytime died = drug2 drug3 age) /s

Typing /s means the same thing as typing (s:), and both really mean the same thing as (). The s, either after a slash or in parentheses before a colon, labels the equation. It makes the output look prettier, and that is all:
        . ml model lf myweib (studytime died = drug2 drug3 age) /s
        . ml maximize
        initial:       log likelihood =       -744
        alternative:   log likelihood = -356.14276
        rescale:       log likelihood = -200.80201
        rescale eq:    log likelihood = -136.69234
        Iteration 0:   log likelihood = -136.69234
        Iteration 1:   log likelihood = -124.11726  (not concave)
        Iteration 2:   log likelihood = -113.88918
        Iteration 3:   log likelihood = -110.30372
        Iteration 4:   log likelihood = -110.26747
        Iteration 5:   log likelihood = -110.26736
        Iteration 6:   log likelihood = -110.26736

                                                        Number of obs   =         48
                                                        Wald chi2(3)    =      35.25
        Log likelihood = -110.26736                     Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        eq1          |
               drug2 |   1.012966   .2903917    3.488   0.000     .4438086    1.582123
               drug3 |    1.45917   .2821195    5.172   0.000     .9062261    2.012114
                 age |  -.0671728   .0205688   -3.266   0.001    -.1074868   -.0268587
               _cons |   6.060723   1.152845    5.257   0.000     3.801188    8.320269
        -------------+----------------------------------------------------------------
        s            |
               _cons |   .5573333   .1402154    3.975   0.000     .2825162    .8321504
        ------------------------------------------------------------------------------
Example

Method d0 evaluators receive b = (b_1, b_2, ..., b_E), the coefficient vector, rather than the already evaluated θ_1, θ_2, ..., θ_E, and they are required to evaluate the overall log likelihood ln L rather than ln ℓ_j.

Use mleval to produce the thetas from the coefficient vector. Use mlsum to sum the components that enter into ln L. In the case of Weibull, ln L = Σ_j ln ℓ_j, and our method d0 evaluator is

        program define weib0
                version 7.0
                args todo b lnf
                tempvar theta1 theta2
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)
                local t "$ML_y1"        /* this is just for readability */
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = (`t'*exp(-`theta1'))^`p'
                quietly gen double `R' = ln(`t') - `theta1'
                mlsum `lnf' = -`M' + `d'*(`theta2'-`theta1' + (`p'-1)*`R')
        end

To estimate our model using this evaluator, we would type

        . ml model d0 weib0 (studytime died = drug2 drug3 age) /s
        . ml maximize
Technical Note

Method d0 does not require ln L = Σ_j ln ℓ_j, j = 1, ..., N, as method lf does. Your likelihood function might have independent components only for groups of observations. Panel-data estimators have a log-likelihood value ln L = Σ_i ln L_i, where i indexes the panels, each of which contains multiple observations. Conditional logistic regression has ln L = Σ_k ln L_k, where k indexes the risk pools. Cox regression has ln L = Σ_(t) ln L_(t), where (t) denotes the ordered failure times.
To evaluate such likelihood functions, first calculate the within-group log-likelihood contributions. This usually involves generate and replace statements prefixed with by, as in

        tempvar sumd
        by group: gen double `sumd' = sum($ML_y1)
Structure your code so that the log-likelihood contributions are recorded in the last observation of each group. Let's pretend that variable is named `cont'. To sum the contributions, code

        tempvar last
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'
It is of great importance that you inform mlsum as to the observations that contain log-likelihood values to be summed. First, you do not want to include intermediate results in the sum. Second, mlsum does not skip missing values. Rather, if mlsum sees a missing value among the contributions, it sets the overall result `lnf' to missing. That is how ml maximize is informed that the likelihood function could not be evaluated at the particular value of b. ml maximize will then take action to escape from what it thinks is an infeasible area of the likelihood function.

When the likelihood function violates the linear-form restriction ln L = Σ_j ln ℓ_j, j = 1, ..., N, with ln ℓ_j being a function solely of values within the jth observation, use method d0. In the following examples we will demonstrate methods d1 and d2 with likelihood functions that meet this linear-form restriction. The d1 and d2 methods themselves do not require the linear-form restriction, but the utility routines mlvecsum and mlmatsum do. Using method d1 or d2 when the restriction is violated is a difficult programming exercise.
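As an illustration of the grouped-likelihood mechanics just described, here is a minimal sketch of a method d0 evaluator for a Poisson likelihood summed within groups. The variable group, which identifies the panels, is an assumption of this sketch, and because Poisson contributions are in fact separable across observations, the grouping here serves only to demonstrate the by-group bookkeeping:

        program define mypois0
                version 7.0
                args todo b lnf
                tempvar theta cont last
                mleval `theta' = `b', eq(1)
                sort group              /* group identifies the panels (assumed) */
                /* running within-group sum of observation-level terms */
                quietly by group: gen double `cont' = sum($ML_y1*`theta' - exp(`theta') - lngamma($ML_y1+1))
                /* the group total sits in the last observation of each group */
                quietly by group: gen byte `last' = (_n==_N)
                mlsum `lnf' = `cont' if `last'
        end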
Example

Method d1 evaluators are required to produce the gradient vector g = ∂ln L/∂b as well as the overall log-likelihood value. Using mlvecsum, we can obtain ∂ln L/∂b from ∂ln L/∂θ_i, i = 1, ..., E. The derivatives of the Weibull log-likelihood function are

        ∂ln ℓ_j/∂θ_1j = p_j(M_j - d_j)

        ∂ln ℓ_j/∂θ_2j = d_j - R_j p_j (M_j - d_j)

The method d1 evaluator for this is
        program define weib1
                version 7.0
                args todo b lnf g                                       /* g is new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }                        /* <-- new */
                tempname d1 d2                                          /* <-- new */
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)              /* <-- new */
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)    /* <-- new */
                matrix `g' = (`d1',`d2')                                /* <-- new */
        end
We obtained this code by starting with our method d0 evaluator and then adding the extra lines that method d1 requires. To estimate our model using this evaluator, we could type

        . ml model d1 weib1 (studytime died = drug2 drug3 age) /s
        . ml maximize

but we recommend that you first substitute method d1debug for method d1 and type

        . ml model d1debug weib1 (studytime died = drug2 drug3 age) /s
        . ml maximize

Method d1debug will compare the derivatives we calculate with numerical derivatives and thus verify that our program is correct.

Example

Method d2 evaluators are required to produce negH = -∂²ln L/∂b∂b', the negative Hessian matrix, as well as the gradient and log-likelihood value. mlmatsum will help calculate ∂²ln L/∂b∂b' from the negative second derivatives with respect to theta. For the Weibull model, these negative second derivatives are

        -∂²ln ℓ_j/∂θ_1j²     = p_j² M_j

        -∂²ln ℓ_j/∂θ_1j∂θ_2j = -p_j(M_j - d_j + R_j p_j M_j)

        -∂²ln ℓ_j/∂θ_2j²     = p_j R_j (R_j p_j M_j + M_j - d_j)
The method d2 evaluator is

        program define weib2
                version 7.0
                args todo b lnf g negH                  /* negH added */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)
                matrix `g' = (`d1',`d2')
                if `todo'==1 | `lnf'==. { exit }        /* new from here down */
                tempname d11 d12 d22
                mlmatsum `lnf' `d11' = `p'^2*`M', eq(1)
                mlmatsum `lnf' `d12' = -`p'*(`M'-`d' + `R'*`p'*`M'), eq(1,2)
                mlmatsum `lnf' `d22' = `p'*`R'*(`R'*`p'*`M' + `M' - `d'), eq(2)
                matrix `negH' = (`d11',`d12' \ `d12'',`d22')
        end
We started with our previous method d1 evaluator and added the lines that method d2 requires. We could now estimate a model by typing

        . ml model d2 weib2 (studytime died = drug2 drug3 age) /s
        . ml maximize

but we would recommend that you first substitute method d2debug for method d2 and type

        . ml model d2debug weib2 (studytime died = drug2 drug3 age) /s
        . ml maximize
Method d2debug will compare the first and second derivatives we calculate with numerical derivatives and thus verify that our program is correct.

As we stated earlier, to produce the robust variance estimator with method lf, there is nothing to do except specify robust, cluster(), and/or pweights. For method d0, these options do not work. For methods d1 and d2, these options will work if your likelihood function meets the linear-form restrictions and you fill in the equation scores. The equation scores are defined as

        ∂ln ℓ_j/∂θ_1j ,   ∂ln ℓ_j/∂θ_2j ,   ...
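With method lf, for instance, nothing about the evaluator changes at all; using the myprobit program from the first example, a robust-variance fit is simply (a minimal sketch)

        . ml model lf myprobit (foreign = mpg weight), robust
        . ml maximize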
Your evaluator will be passed variables, one for each equation, which you fill in with the equation scores. For both methods d1 and d2, these variables are passed in the sixth and subsequent positions of the argument list. That is, you must process the arguments as

        args todo b lnf g negH g1 g2 ...

Note that for method d1, the `negH' argument is not used; it is merely a placeholder.
Example

If you have used mlvecsum in your method d1 or d2 evaluator, it is easy to turn it into a program that allows the computation of the robust variance estimator. The expression that you specified on the right-hand side of mlvecsum is the equation score. Here we turn the program that we gave earlier in the method d1 example into one that allows robust, cluster(), and/or pweights.
        program define weib1
                version 7.0
                args todo b lnf g negH g1 g2            /* negH, g1, and g2 are new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                quietly replace `g1' = `p'*(`M'-`d')                /* <-- new */
                quietly replace `g2' = `d' - `R'*`p'*(`M'-`d')      /* <-- new */
                mlvecsum `lnf' `d1' = `g1', eq(1)                   /* <-- changed */
                mlvecsum `lnf' `d2' = `g2', eq(2)                   /* <-- changed */
                matrix `g' = (`d1',`d2')
        end
To estimate our model and get the robust variance estimates, we type

        . ml model d1 weib1 (studytime died = drug2 drug3 age) /s, robust
        . ml maximize

Saved Results

ml saves in e():

Scalars
    e(N)           number of observations
    e(k)           number of parameters
    e(k_eq)        number of equations
    e(k_dv)        number of dependent variables
    e(df_m)        model degrees of freedom
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model (if LR saved in e(chi2type))
    e(N_clust)     number of clusters
    e(rc)          return code
    e(chi2)        chi-squared
    e(ic)          number of iterations
    e(rank)        rank of e(V)
    e(rank0)       rank of e(V) for constant-only model

Macros
    e(cmd)         ml
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(user)        name of likelihood-evaluator program
    e(opt)         type of optimization
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(cnslist)     constraint numbers

Matrices
    e(b)           coefficient vector
    e(ilog)        iteration log (up to 20 iterations)
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample

References

Gould, W. and W. Sribney. 1999. Maximum Likelihood Estimation with Stata. College Station, TX: Stata Press.

Also See

Complementary:    [R] maximize, [R] nl, [P] estimates, [P] matrix
Title

mlogit -- Maximum-likelihood multinomial (polytomous) logistic regression

Syntax

        mlogit depvar [indepvars] [weight] [if exp] [in range] [, basecategory(#)
                constraints(clist) rrr noconstant robust cluster(varname)
                score(newvarlist) level(#) maximize_options ]

where clist is of the form #[-#][, #[-#] ...]

by ... : may be used with mlogit; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
mlogit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp | stddp }
                outcome(outcome) ]

Note that you specify one new variable with xb, stdp, and stddp and specify either one or k new variables with p.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

mlogit estimates maximum-likelihood multinomial logit models, also known as polytomous logistic regression. Constraints may be defined to perform constrained estimation. Some people refer to conditional logistic regression as multinomial logit. If you are one of them, see [R] clogit.

See [R] logistic for a list of related estimation commands.

A maximum of 50 categories can be estimated with Intercooled Stata; 20 categories with Small Stata.

Options

basecategory(#) specifies the value of depvar that is to be treated as the base category. The default is to choose the most frequent category.

constraints(clist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined with the constraint command; see [R] constraint. constraints(1) specifies that the model is to be constrained according to constraint 1; constraints(1-4) specifies constraints 1 through 4; constraints(1-4,8) specifies constraints 1 through 4 and 8. It is not considered an error to specify nonexistent constraints so long as some of the constraints exist. Thus, constraints(1-999) would specify that all defined constraints be applied.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svymlogit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates k-1 new variables, where k is the number of observed outcomes. The first variable contains ∂ln L_j/∂(x_j b_1); the second variable contains ∂ln L_j/∂(x_j b_2); and so on. Note that if you were to specify the option score(sc*), Stata would create the appropriate number of new variables, and they would be named sc1, sc2, ..., sc(k-1).

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

rrr reports the estimated coefficients transformed to relative risk ratios, i.e., e^b rather than b; see Description of the model below for an explanation of this concept. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. rrr may be specified at estimation or when replaying previously estimated results.

noconstant suppresses the constant term in the model.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for predict

p, the default, calculates the predicted probability of each outcome. If you do not also specify the outcome(outcome) option, you must specify k new variables. For instance, say you estimated your model by typing mlogit insure age male and that insure takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. It does not matter which category mlogit chose as the base category; predict will calculate all three probabilities correctly.

If you also specify the outcome(outcome) option, then you specify one new variable. Say that insure took on values 1, 2, and 3. Typing predict p1, outcome(1) would produce the same p1 as above, predict p2, outcome(2) the same p2 as above, etc. If insure took on values 7, 22, and 93, you would specify outcome(7), outcome(22), and outcome(93).

xb calculates the linear prediction. You must also specify the outcome(outcome) option.

stdp calculates the standard error of the linear prediction. You must also specify the outcome(outcome) option.

stddp calculates the standard error of the difference in two linear predictions. You must specify the outcome(outcome) option, and in this case you specify the two particular outcomes of interest inside the parentheses; for example, predict sed, stddp outcome(1,3).

outcome(outcome) specifies the outcome for which the statistic is to be calculated. equation() is a synonym for outcome(); it does not matter which you use, and the standard rules for specifying an equation() apply.
Remarks

Remarks are presented under the headings

        Description of the model
        Estimating unconstrained models
        Obtaining predicted values
        Testing hypotheses about coefficients
        Estimating constrained models
mlogit performs maximum likelihood estimation of models with discrete dependent (left-hand-side) variables. It is intended for use when the dependent variable takes on more than two outcomes and the outcomes have no natural ordering. If the dependent variable takes on only two outcomes, estimates are identical to those produced by logistic or logit; see [R] logistic and [R] logit. If the outcomes are ordered, see [R] ologit.

Description of the model

For an introduction to multinomial logit models, see, for instance, Aldrich and Nelson (1984, 73-77), Greene (2000, chapter 19), Hosmer and Lemeshow (1989, 216-238), and Long (1997, chapter 6). For a description with an emphasis on the difference in assumptions and data requirements for conditional and multinomial logit, see Judge et al. (1985, 768-772).

Consider the outcomes 1, 2, 3, ..., m recorded in y, and the explanatory variables X. For expositional purposes, assume there are m = 3 outcomes. Think of these three outcomes as "buy an American car", "buy a Japanese car", and "buy a European car". The values of y are then said to be "unordered". Even though the outcomes are coded 1, 2, and 3, the numerical values are arbitrary in the sense that 1 < 2 < 3 does not imply that outcome 1 (buy American) is less than outcome 2 (buy Japanese) is less than outcome 3 (buy European). It is this unordered categorical property of y that distinguishes the use of mlogit from regress (which is appropriate for a continuous dependent variable), from ologit (which is appropriate for ordered categorical data), and from logit (which is appropriate for two outcomes and which can therefore be thought of as ordered).

In the multinomial logit model, we estimate a set of coefficients β(1), β(2), and β(3) corresponding to each outcome category:

        Pr(y = 1) = e^(Xβ(1)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 2) = e^(Xβ(2)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 3) = e^(Xβ(3)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))
The model, however, is unidentified in the sense that there is more than one solution to β(1), β(2), and β(3) that leads to the same probabilities for y = 1, y = 2, and y = 3. To identify the model, one of β(1), β(2), or β(3) is arbitrarily set to 0; it does not matter which. That is, if we arbitrarily set β(1) = 0, the remaining coefficients β(2) and β(3) would measure the change relative to the y = 1 group. If we instead set β(2) = 0, the remaining coefficients β(1) and β(3) would measure the change relative to the y = 2 group. The coefficients would differ because they have different interpretations, but the predicted probabilities for y = 1, 2, and 3 would still be the same. Thus, either parameterization would be a solution to the same underlying model.

Setting β(1) = 0, the equations become

        Pr(y = 1) = 1 / (1 + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 2) = e^(Xβ(2)) / (1 + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 3) = e^(Xβ(3)) / (1 + e^(Xβ(2)) + e^(Xβ(3)))
The relative probability of y = 2 to the base category is

        Pr(y = 2) / Pr(y = 1) = e^(Xβ(2))

Let us call this ratio the relative risk, and let us further assume that X and β(2) are vectors equal to (x_1, x_2, ..., x_k) and (β_1(2), β_2(2), ..., β_k(2)), respectively. The ratio of the relative risk for a one-unit change in x_i is then

        e^(β_1(2)x_1 + ... + β_i(2)(x_i+1) + ... + β_k(2)x_k) / e^(β_1(2)x_1 + ... + β_i(2)x_i + ... + β_k(2)x_k)  =  e^(β_i(2))

Thus, the exponentiated value of a coefficient is the relative risk ratio for a one-unit change in the corresponding variable, it being understood that risk is measured as the risk of the category relative to the base category.

Estimating unconstrained models

Example

You have data on the type of health insurance available to 616 psychologically depressed subjects in the U.S. (Tarlov et al. 1989; Wells et al. 1989). The insurance is categorized as being either an indemnity plan (i.e., regular fee-for-service insurance, which may have a deductible or coinsurance rate) or a prepaid plan (a fixed up-front payment allowing subsequent unlimited use as provided, for instance, by an HMO). The third possibility is that the subject has no insurance whatsoever. You wish to explore the demographic factors associated with each subject's insurance choice. As an introduction to the data, one of the demographic factors is the race of the participant, coded as white or nonwhite:

        . tabulate insure nonwhite, chi2 col

                   |      nonwhite
            insure |         0          1 |     Total
        -----------+----------------------+----------
         Indemnity |       251         43 |       294
                   |     50.71      35.54 |     47.73
        -----------+----------------------+----------
           Prepaid |       208         69 |       277
                   |     42.02      57.02 |     44.97
        -----------+----------------------+----------
          Uninsure |        36          9 |        45
                   |      7.27       7.44 |      7.31
        -----------+----------------------+----------
             Total |       495        121 |       616
                   |    100.00     100.00 |    100.00

                 Pearson chi2(2) =   9.5599   Pr = 0.008
Although insure appears to take on the values Indemnity, Prepaid, and Uninsure, it actually takes on the values 1, 2, and 3. The words appear because a value label has been associated with the numeric variable insure; see [U] 15.6.3 Value labels.

When you estimate a multinomial logit model, you can tell mlogit which group to use as the base category, or you can let mlogit choose. To estimate a model of insure on nonwhite, letting mlogit choose the base category, we type

        . mlogit insure nonwhite
        Iteration 0:   log likelihood = -556.59502
        Iteration 1:   log likelihood = -551.78935
        Iteration 2:   log likelihood = -551.78348
        Iteration 3:   log likelihood = -551.78348

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
            nonwhite |   .6608212   .2157321     3.06   0.002     .2379942    1.083648
               _cons |  -.1879149   .0937644    -2.00   0.045    -.3716896   -.0041401
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |   .3779585    .407589     0.93   0.354    -.4209012    1.176818
               _cons |  -1.941934   .1782185   -10.90   0.000    -2.291236   -1.592632
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
mlogit chose the indemnity group as the base or comparison group and presented coefficients for the outcomes prepaid and uninsured. According to the model, the probability of prepaid for whites (nonwhite = 0) is

        Pr(insure = Prepaid) = e^(-.188) / (1 + e^(-.188) + e^(-1.942)) = 0.420

Similarly, for nonwhites, the probability of prepaid is

        Pr(insure = Prepaid) = e^(-.188+.661) / (1 + e^(-.188+.661) + e^(-1.942+.378)) = 0.570
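These hand calculations can be reproduced with Stata's display command; the following lines, which use the constants and coefficients from the output above, return approximately .420 and .570:

        . display exp(-.1879149)/(1 + exp(-.1879149) + exp(-1.941934))
        . display exp(-.1879149+.6608212)/(1 + exp(-.1879149+.6608212) + exp(-1.941934+.3779585))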
These results agree with the column percentages presented by tabulate since the mlogit model is fully saturated. That is, there are enough terms in the model to fully explain the column percentage in each cell. Note that the model chi-squared and the tabulate chi-squared are in almost perfect agreement; both are testing that the column percentages of insure are the same for both values of nonwhite.
Example

By specifying the basecategory() option, you can control which category of the outcome variable is treated as the comparison group. Left to its own, mlogit chose to make category 1, indemnity, the base category. If we wanted to make category 2, prepaid, the base, we would type

        . mlogit insure nonwhite, base(2)
        Iteration 0:   log likelihood = -556.59502
        Iteration 1:   log likelihood = -551.78936
        Iteration 2:   log likelihood = -551.78348
        Iteration 3:   log likelihood = -551.78348

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Indemnity    |
            nonwhite |  -.6608212   .2157321    -3.06   0.002    -1.083648   -.2379942
               _cons |   .1879149   .0937644     2.00   0.045     .0041401    .3716896
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |  -.2828628   .3877302    -0.71   0.477      -1.0624    .4966741
               _cons |  -1.754019   .1805145    -9.72   0.000    -2.107821   -1.400217
        ------------------------------------------------------------------------------
        (Outcome insure==Prepaid is the comparison group)
The basecategory() option requires that we specify the numeric value of the category, so we could not type base(Prepaid).

Although the coefficients now appear to be different, note that the summary statistics reported at the top are identical. With this parameterization, the probability of prepaid insurance for whites is

        Pr(insure = Prepaid) = 1 / (1 + e^(.188) + e^(-1.754)) = 0.420

This is the same answer we obtained previously.
Example

By specifying rrr, which we can do at estimation time or when we redisplay results, we see the model in terms of relative risk ratios:

        . mlogit, rrr

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Indemnity    |
            nonwhite |    .516427   .1114099    -3.06   0.002     .3383588    .7882073
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |   .7536232   .2997387    -0.71   0.477     .3456254    1.643247
        ------------------------------------------------------------------------------
        (Outcome insure==Prepaid is the comparison group)

Looked at this way, the relative risk of choosing an indemnity over a prepaid plan is 0.52 for nonwhites relative to whites.
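The displayed ratio is just the exponentiated coefficient from the earlier output; one can confirm this by typing

        . display exp(-.6608212)

which reproduces (up to rounding) the .516427 shown for nonwhite in the Indemnity equation.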
Example

One of the advantages of mlogit over tabulate is that continuous variables can be included in the model, and you can include multiple categorical variables. In examining the data on insurance choice, you decide you want to control for age, gender, and site of study (the study was conducted in three sites):

        . mlogit insure age male nonwhite site2 site3
        Iteration 0:   log likelihood = -555.85446
        Iteration 1:   log likelihood = -534.72983
        Iteration 2:   log likelihood = -534.36536
        Iteration 3:   log likelihood = -534.36165
        Iteration 4:   log likelihood = -534.36165

        Multinomial regression                          Number of obs   =        615
                                                        LR chi2(10)     =      42.99
                                                        Prob > chi2     =     0.0000
        Log likelihood = -534.36165                     Pseudo R2       =     0.0387

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
                 age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
                male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
            nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
               site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
               site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
               _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
        -------------+----------------------------------------------------------------
        Uninsure     |
                 age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
                male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
            nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
               site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
               site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
               _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
These results suggest that the inclination of nonwhites to choose prepaid care is even stronger than it was without controlling. We also see that subjects in site 2 are less likely to be uninsured.

Obtaining predicted values
Example

After estimation, predict can be used to obtain predicted probabilities, index values, and standard errors of the index, or differences in the index. For instance, in the preceding example we estimated a model of insurance choice on various characteristics. We can obtain the predicted probabilities for outcome 1 by typing

        . predict p1 if e(sample), outcome(1)
        (option p assumed; predicted probability)
        (29 missing values generated)

        . summarize p1

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                  p1 |     615    .4764228   .1082279   .1698142     .71939

Note that we included if e(sample) to restrict the calculation to the estimation sample. If you look back at the previous example, the multinomial logit model was estimated on 615 observations; there must be missing values in our dataset.
Although we typed outcome(1), specifying 1 for the indemnity category, we could have typed outcome(Indemnity). For instance, to obtain the probabilities for prepaid, we could type

        . predict p2 if e(sample), outcome(prepaid)
        (option p assumed; predicted probability)
        equation prepaid not found
        r(303);

        . predict p2 if e(sample), outcome(Prepaid)
        (option p assumed; predicted probability)
        (29 missing values generated)

        . summarize p2

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                  p2 |     615    .4504065   .1125962   .1964103   .7885724

When specifying the label, it must be specified exactly as it appears in the underlying value label (or as it appears in the mlogit output), and that includes capitalization.

Here, we have used predict to obtain probabilities for the same sample on which we estimated. That is not necessary. We could use another dataset that had the independent variables defined (in our example, age, male, nonwhite, site2, and site3) and use predict to obtain predicted probabilities; in this case, we would not specify if e(sample).
predict can also be used to obtain the "index" values--the x_i β̂(j)--as well as the probabilities:

        . predict idx1, outcome(Indemnity) xb
        (1 missing value generated)

        . summarize idx1

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                idx1 |     643           0          0          0          0
The indemnity category was our base category--the category for which all the coefficients were set to 0--and so the index is always 0. For the prepaid and uninsured categories:

        . predict idx2, outcome(Prepaid) xb
        (1 missing value generated)

        . predict idx3, outcome(Uninsure) xb
        (1 missing value generated)

        . summarize idx2 idx3

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                idx2 |     643   -.0566113   .4962973  -1.298198   1.700719
                idx3 |     643   -1.980747   .6018139  -3.112741  -.8258458
We can obtain the standard error of the index by specifying the stdp option:

        . predict se2, outcome(Prepaid) stdp
        (1 missing value generated)

        . list p2 idx2 se2 in 1/5

                   p2        idx2        se2
         1.  .3709022   -.4831167   .2437772
         2.  .4977667     .055111   .1694686
         3.  .4113073   -.1712106   .1793498
         4.  .5424927    .3788345   .2513701
         5.         .   -.0925817   .1452616
stddp
list idx2 idx3 se_2_3 in 1/5 idx2 I. -.4831167 2. .055111 3. -.1712106 4. .3788345 5. -.0925817
idx3 -3.073253 -2.715986 -1.579621 -1.462007 -2.814022
se_2_3 .5469354 .4331917 .3053815 .4492552 .4024784
In the first observation, the difference in the indexes is -.483 - (-3.073) of that difference is .547.
= 2.59. The standard error
Example

It is more difficult to interpret the results from mlogit than those from clogit or logit because there are multiple equations. For example, suppose one of the independent variables in your model takes on the values 0 and 1, and you are attempting to understand the effect of this variable. Assume the coefficient on this variable for the second outcome, β(2), is positive. You might then be tempted to reason that the probability of the second outcome is higher if the variable is 1 rather than 0. Most of the time, that will be true, but occasionally you will be surprised. It could be that the probability of some other category increases even more (say β(3) > β(2)), and thus the probability of outcome 2 actually falls relative to that outcome. Prediction can be used to aid interpretation.
Continuing with our previously estimated insurance-choice model, we wish to describe the model's predictions by race. For this purpose, we can use the method of recycled predictions, in which we vary characteristics of interest across the whole dataset and average the predictions. That is, we have data on both whites and nonwhites, and our individuals have other characteristics as well. We will first pretend that all the people in our data are white but hold their other characteristics constant. We then calculate the probabilities of each outcome. Next we will pretend that all the people in our data are nonwhite, still holding their other characteristics constant. Again we calculate the probabilities of each outcome. The difference in those two sets of calculated probabilities, then, is the difference due to race, holding other characteristics constant.
        . gen byte nonwhold = nonwhite          /* save real race */

        . replace nonwhite = 0                  /* make everyone white */
        (126 real changes made)

        . predict wpind, outcome(Indemnity)     /* predict probabilities */
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict wpp, outcome(Prepaid)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict wpnoi, outcome(Uninsure)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . replace nonwhite = 1                  /* make everyone nonwhite */
        (518 real changes made)

        . predict nwpind, outcome(Indemnity)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict nwpp, outcome(Prepaid)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict nwpnoi, outcome(Uninsure)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . replace nonwhite = nonwhold           /* restore real race */
        (518 real changes made)

        . summarize wp* nwp*

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
               wpind |     643    .5141673   .0872679   .3092903     .71939
                 wpp |     643    .4082052   .0993286   .1964103   .6502247
               wpnoi |     643    .0776275   .0360283   .0273596   .1302816
              nwpind |     643    .3112809   .0817693   .1511329    .535021
                nwpp |     643     .630078   .0959976   .3871782   .8278881
              nwpnoi |     643    .0586411   .0287185   .0209648   .0933874
Earlier in this entry we presented a cross-tabulation of insurance type and race. Those values were unadjusted. The means reported above are the values adjusted for age, sex, and site. Combining the results gives

                        Unadjusted            Adjusted
                     white   nonwhite     white   nonwhite
        Indemnity     .51       .36        .52       .31
        Prepaid       .42       .57        .41       .63
        Uninsured     .07       .07        .08       .06

We find, for instance, that while 57% of nonwhites in our data had prepaid plans, after adjusting for age, sex, and site, 63% of nonwhites choose prepaid plans.
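The adjusted difference itself can be computed directly; a small sketch (the variable name pdiff is our own) in which the mean of pdiff--about .22, per the means above--is the adjusted nonwhite-white difference in the probability of choosing a prepaid plan:

        . gen double pdiff = nwpp - wpp
        . summarize pdiff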
Technical Note

Classification of predicted values followed by comparison of the classifications with the observed outcomes is a second way predicted values can help interpret a multinomial logit model. This is a variation on the notions of sensitivity and specificity for logistic regression. Here, we will adopt a three-part classification with respect to indemnity and prepaid: definitely predicting indemnity, definitely predicting prepaid, and ambiguous.
        . predict indem, outcome(Indemnity) xb                  /* obtain indexes */
        (1 missing value generated)

        . predict prepaid, outcome(Prepaid) xb
        (1 missing value generated)

        . gen diff = prepaid - indem                            /* obtain difference */
        (1 missing value generated)

        . predict sediff, outcome(Indemnity,Prepaid) stddp      /* & its standard error */
        (1 missing value generated)

        . gen type = 1 if diff/sediff < -1.96                   /* definitely indemnity */
        (504 missing values generated)

        . replace type = 3 if diff/sediff > 1.96 & diff/sediff!=.   /* definitely prepaid */
        (100 real changes made)

        . replace type = 2 if type==. & diff/sediff!=.          /* ambiguous */
        (404 real changes made)

        . label define type 1 "Def Ind" 2 "Ambiguous" 3 "Def Prep"  /* label results */
        . label values type type

        . tabulate insure type
                   |                type
            insure |   Def Ind  Ambiguous   Def Prep |     Total
        -----------+---------------------------------+----------
         Indemnity |        78        183         33 |       294
           Prepaid |        44        177         56 |       277
          Uninsure |        12         28          5 |        45
        -----------+---------------------------------+----------
             Total |       134        388         94 |       616

One substantive point learned by this exercise is that the predictive power of this model is modest. There are a substantial number of misclassifications in both directions, though there are more correctly classified observations than misclassified observations. A second interesting point is that the uninsured look overwhelmingly as though they might have come from the indemnity system rather than the prepaid system.
Testing hypotheses about coefficients

Example

Hypotheses about the coefficients are tested with test just as they are after any estimation command; see [R] test. The only important point to note is test's syntax for dealing with multiple-equation models. You are warned that test bases its results on the estimated covariance matrix, and a likelihood-ratio test may be preferred; see Estimating constrained models below for an example.

If one simply lists variables after the test command, one is testing that the corresponding coefficients are zero across all equations:

        . test site2 site3

         ( 1)  [Prepaid]site2 = 0.0
         ( 2)  [Uninsure]site2 = 0.0
         ( 3)  [Prepaid]site3 = 0.0
         ( 4)  [Uninsure]site3 = 0.0

                   chi2(  4) =   19.74
                 Prob > chi2 =    0.0006
One can test that all the coefficients (except the constant) in a single equation are zero by simply typing the outcome in square brackets:

        . test [Uninsure]

         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

                   chi2(  5) =    9.31
                 Prob > chi2 =    0.0973
Specification of the outcome is just as with predict; you can specify the label if the outcome variable is labeled, or you can specify the numeric value of the outcome. We would have obtained the same test as above had we typed test [3], since 3 is the value of insure for the outcome uninsured.

The two syntaxes can be combined. To test that the coefficients on the site variables are 0 in the equation corresponding to the outcome prepaid, we can type

        . test [Prepaid]: site2 site3

         ( 1)  [Prepaid]site2 = 0.0
         ( 2)  [Prepaid]site3 = 0.0

                   chi2(  2) =   10.78
                 Prob > chi2 =    0.0046
We specified the outcome and then followed that with a colon and the variables we wanted to test.

We can also test that coefficients are equal across equations. To test that all coefficients except the constant are equal for the prepaid and uninsured outcomes:

        . test [Prepaid=Uninsure]

         ( 1)  [Prepaid]age - [Uninsure]age = 0.0
         ( 2)  [Prepaid]male - [Uninsure]male = 0.0
         ( 3)  [Prepaid]nonwhite - [Uninsure]nonwhite = 0.0
         ( 4)  [Prepaid]site2 - [Uninsure]site2 = 0.0
         ( 5)  [Prepaid]site3 - [Uninsure]site3 = 0.0

                   chi2(  5) =   13.80
                 Prob > chi2 =    0.0169
To test that only the site variables are equal:

        . test [Prepaid=Uninsure]: site2 site3

         ( 1)  [Prepaid]site2 - [Uninsure]site2 = 0.0
         ( 2)  [Prepaid]site3 - [Uninsure]site3 = 0.0

                   chi2(  2) =   12.68
                 Prob > chi2 =    0.0018
Finally, we can test any arbitrary constraint by simply entering the equation, specifying the coefficients as described in [U] 16.5 Accessing coefficients and standard errors. The following hypothesis is senseless but illustrates the point:

        . test ([Prepaid]age+[Uninsure]site2)/2 = 2-[Uninsure]nonwhite

         ( 1)  .5 [Prepaid]age + [Uninsure]nonwhite + .5 [Uninsure]site2 = 2.0

                   chi2(  1) =   22.45
                 Prob > chi2 =    0.0000
Please see [R] test for more information on test. All that is said there about combining hypotheses across test commands (the accum option) is relevant after mlogit.
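For instance, a joint hypothesis can be accumulated one piece at a time; a minimal sketch:

        . test [Prepaid]: site2
        . test [Prepaid]: site3, accum

The second test reports the joint chi-squared for both restrictions, matching the two-degree-of-freedom test shown above.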
Estimating constrained models

mlogit can estimate models with subsets of coefficients constrained to be zero, with subsets of coefficients constrained to be equal both within and across equations, and with subsets of coefficients arbitrarily constrained to equal linear combinations of other estimated coefficients.

Before estimating a constrained model, you define the constraints using the constraint command; see [R] constraint. Constraints are numbered, and the syntax for specifying a constraint is exactly the same as the syntax for testing constraints; see Testing hypotheses about coefficients above. Once the constraints are defined, you estimate using mlogit, specifying the constraints() option. Typing constraints(4) would use the constraint you previously saved as 4. Typing constraints(1,4,6) would use the previously stored constraints 1, 4, and 6. Typing constraints(1-4,6) would use the previously stored constraints 1, 2, 3, 4, and 6.

Sometimes, you will not be able to specify the constraints without knowledge of the omitted group. In such cases, assume the omitted group is whatever group is convenient for you and include the basecategory() option when you type the mlogit command.
Example

Among other things, constraints can be used as a means of hypothesis testing. In our insurance-choice model, we tested the hypothesis that there is no distinction between having indemnity insurance and being uninsured. We did this with the test command. Indemnity-style insurance was the omitted group, so we typed

        . test [Uninsure]

         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

                   chi2(  5) =    9.31
                 Prob > chi2 =    0.0973
(Had indemnity not been the omitted group, we would have typed test [Uninsure=Indemnity].)

The results produced by test are based on the estimated covariance matrix of the coefficients and are an approximation. Since the probability of being uninsured is quite low, the log likelihood may be nonlinear for the uninsured. Conventional statistical wisdom is not to trust the asymptotic answer under these circumstances, but to perform a likelihood-ratio test instead.

Stata has a likelihood-ratio test command, lrtest; to use it, we must estimate both the unconstrained and the constrained models. The unconstrained model is what we have previously estimated. Following the instructions in [R] lrtest, we first save the unconstrained model results:

        . lrtest, saving(0)

To estimate the constrained model, we must re-estimate our model with all the coefficients except the constant set to 0 in the Uninsure equation. We define the constraint and then re-estimate:
        . constraint define 1 [Uninsure]

        . mlogit insure age male nonwhite site2 site3, constr(1)
         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

        Iteration 0:   log likelihood = -555.85446
        Iteration 1:   log likelihood = -539.80523
        Iteration 2:   log likelihood = -539.75644
        Iteration 3:   log likelihood = -539.75643

        Multinomial regression                          Number of obs   =        615
                                                        LR chi2(5)      =      32.20
                                                        Prob > chi2     =     0.0000
        Log likelihood = -539.75643                     Pseudo R2       =     0.0290

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
                 age |  -.0107025   .0060039    -1.78   0.075    -.0224699    .0010649
                male |   .4963616   .1939683     2.56   0.010     .1161908    .8765324
            nonwhite |    .942137   .2252094     4.18   0.000     .5007347    1.383539
               site2 |   .2530912   .2029465     1.25   0.212    -.1446767    .6508591
               site3 |  -.5521774   .2187237    -2.52   0.012    -.9808678   -.1234869
               _cons |   .1792752   .3171372     0.57   0.572    -.4423023    .8008527
        -------------+----------------------------------------------------------------
        Uninsure     |
                 age |  (dropped)
                male |  (dropped)
            nonwhite |  (dropped)
               site2 |  (dropped)
               site3 |  (dropped)
               _cons |   -1.87351   .1601099   -11.70   0.000     -2.18732     -1.5597
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
We can now perform the likelihood-ratio test:

        . lrtest
        Mlogit:  likelihood-ratio test              chi2(5)     =      10.79
                                                    Prob > chi2 =     0.0557

The likelihood-ratio chi-squared is 10.79 with 5 degrees of freedom; the significance level, .0557, is just slightly greater than the magic .05 level, so we should not call this difference significant.
Technical Note

In certain circumstances, a multinomial logit model should be estimated with conditional logit; see [R] clogit. With substantial data manipulation, clogit is capable of handling the same class of models with some interesting additions. For example, if we had available the price and deductible of the most competitive insurance plan of each type, this information could not be used by mlogit but could be incorporated by clogit.
Saved Results

mlogit saves in e():

Scalars
    e(N)           number of observations
    e(k_cat)       number of categories
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(ibasecat)    base category number
    e(basecat)     the value of depvar treated as the base category

Macros
    e(cmd)         mlogit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(cat)         category values
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

The model for multinomial logit is

        Pr(Y_i = k) = e^(X_i β(k)) / Σ_{j=1}^{m} e^(X_i β(j))

This model is described in Greene (2000, chapter 19).

Newton-Raphson maximum likelihood is used; see [R] maximize.

In the case of constrained equations, the set of constraints is orthogonalized, and a subset of maximizable parameters is selected. For example, a parameter that is constrained to zero is not a maximizable parameter. If two parameters are constrained to be equal to each other, only one is a maximizable parameter.

Let r be the vector of maximizable parameters. Note that r is physically a subset of the solution parameters b. A matrix T and a vector m are defined as

        b = Tr + m

with the consequence that

        ∂f/∂r = (∂f/∂b) T        ∂²f/∂r² = T' (∂²f/∂b²) T

T consists of a block form in which one part is a permutation of the identity matrix and the other part describes how to calculate the constrained parameters from the maximizable parameters.
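For instance (a small illustration of the idea, not the general algorithm), if b = (b_1, b_2)' and the single constraint is b_1 = b_2, then r is the scalar common value, T = (1, 1)', and m = (0, 0)'; a parameter constrained to a nonzero constant would instead enter through m.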
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Hamilton, L. C. 1993. sqv8: Interpreting multinomial logistic regression. Stata Technical Bulletin 13: 24-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 176-181.
Hendrickx, J. 2000. sbe37: Special restrictions in multinomial logistic regression. Stata Technical Bulletin 56: 18-26.
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.
Wells, K. E., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See

Complementary:    [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] xi

Related:          [R] clogit, [R] logistic, [R] nlogit, [R] ologit, [R] svy estimators

Background:       [U] 16.5 Accessing coefficients and standard errors,
                  [U] 23 Estimation and post-estimation commands,
                  [U] 23.11 Obtaining robust variance estimates,
                  [U] 23.12 Obtaining scores,
                  [R] maximize
Title

more -- The --more-- message

Syntax

        set more { on | off }

        set pagesize #

Description

set more on, which is the default, tells Stata to wait until a key is pressed before continuing when a --more-- message is displayed.

set more off tells Stata not to pause or display the --more-- message.

set pagesize # sets the number of lines between --more-- messages.

Remarks

When you see --more-- at the bottom of the screen,

        Press ...                       and Stata ...
        letter l or Enter               displays the next line
        letter q                        acts as if you pressed Break
        space bar or any other key      displays the next screen

In addition, you can press the More button, or click on --more--, to display the next screen.

--more-- is Stata's way of telling you it has something more to show you, but that showing you that something more will cause the information on the screen to scroll off.

If you type set more off, --more-- conditions will never arise and Stata's output will scroll by at full speed. If you type set more on, --more-- conditions will be restored at the appropriate places.

Programmers should see [P] more for information on the more programming command.
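For instance, a do-file that produces a long listing might temporarily disable the pause; a minimal sketch, assuming the automobile dataset is in memory:

        . set more off
        . list make mpg weight
        . set more on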
Also See

Complementary:    [R] query, [P] more

Background:       [U] 10 --more-- conditions
Title

mvencode -- Change missing to coded missing value and vice versa

Syntax

        mvencode varlist [if exp] [in range], mv(#) [override]

        mvdecode varlist [if exp] [in range], mv(#)

Description

mvencode changes all occurrences of missing to # in the specified varlist.

mvdecode changes all occurrences of # to missing in the specified varlist.

Options

mv(#) specifies the numeric value to which, or from which, missing is to be changed and is not optional.

override specifies that the protection provided by mvencode is to be overridden. Without this option, mvencode refuses to make the requested change if # is already used in the data.
Remarks

One occasionally reads data where missing (e.g., failed to answer a survey question, or the data were not collected, or whatever) is coded with a special numeric value. Popular codings are 9, 99, 999, -99, and the like. If missing were encoded as -99, then

        . mvdecode _all, mv(-99)

would translate the special code to the Stata missing value '.'. Use this command cautiously since, even if -99 were not a special code, all -99's in the data would be changed to missing.

Conversely, one occasionally needs to export data to software that does not understand that '.' is Stata's missing value, so one codes missing with a special numeric value. To change all missings to -99, type

        . mvencode _all, mv(-99)

mvencode is smart: it will automatically recast variables upward if necessary, so even if a variable is stored as a byte, its missing values can be recoded to, say, 999. In addition, mvencode refuses to make the change if # (-99 in this case) is already used in the data, so you can be certain that your coding is unique. You can override this feature by including the override option.

Example

Our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets) contains 74 observations and 12 variables. Let us first attempt to translate the missing values in the data to 1:
. mvencode _all, mv(1)
      make:  string variable ignored
     rep78:  already 1 in 2 observations
   foreign:  already 1 in 22 observations
no action taken
r(9);

Our attempt failed. mvencode first informed us that make is a string variable--this is not a problem but is reported merely for our information. String variables are ignored by mvencode. It next informed us that rep78 already was coded 1 in 2 observations and that foreign was already coded 1 in 22 observations. Thus, 1 would be a poor choice for encoding missing values because, after encoding, you could not tell a real 1 from a coded missing value. We could force mvencode to encode the data with 1 anyway by typing mvencode _all, mv(1) override, and that would be appropriate if the 1s in our data already represented missing data. They do not, however, and we will code missing as 999:

. mvencode _all, mv(999)
      make:  string variable ignored
     rep78:  5 missing values

This worked, and we are informed that the only changes necessary were to 5 observations of rep78.
> Example
Let us now pretend that we just read in the automobile data from some raw dataset where all the missing values were coded 999. We can convert the 999's to real missings by typing

. mvdecode _all, mv(999)
      make:  string variable ignored
     rep78:  5 missing values

We are informed that make is a string variable and so was ignored, and that rep78 contained 5 observations with 999. Those observations have now been changed to contain missing.
Methods and Formulas
mvencode and mvdecode are implemented as ado-files.

Also See
Related:  [R] generate, [R] recode
Title
mvreg -- Multivariate regression
Syntax
    mvreg depvarlist = varlist [weight] [if exp] [in range] [, noconstant corr noheader notable level(#) ]

by ... : may be used with mvreg; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
mvreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict
    predict [type] newvarname [if exp] [in range] [, { xb | stdp | residuals | difference | stddp } equation(eqno[,eqno]) ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
mvreg estimates multivariate regression models.
Options
noconstant omits the constant term from the estimation.

corr displays the correlation matrix of the residuals between the equations.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict
equation(eqno[,eqno]) specifies to which equation(s) you are referring. equation() is filled in with one eqno for options xb, stdp, and residuals. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names; equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).
xb, the default, calculates the fitted values--the prediction of x_j b for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

residuals calculates the residuals.

difference calculates the difference between the linear predictions of two equations in the system.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_1j b - x_2j b) between equations 1 and 2 is calculated.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
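As a brief sketch of these options--the model shown is the one estimated in the example below, and the new variable names are arbitrary--you might type

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight
. predict hr, equation(headroom)                     /* xb, the default */
. predict dht, equation(headroom, trunk) difference  /* xb difference   */
. predict sdht, equation(headroom, trunk) stddp      /* its std. error  */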
Remarks
Multivariate regression differs from multiple regression in that several dependent variables are jointly regressed on the same independent variables. Multivariate regression is related to Zellner's seemingly unrelated regression (see [R] sureg) but, since the same set of independent variables is used for each dependent variable, the syntax is simpler and the calculations faster.

The individual coefficients and standard errors produced by mvreg are identical to those that would be produced by regress estimating each equation separately. The difference is that mvreg, being a joint estimator, also estimates the between-equation covariances, so you can test coefficients across equations and, in fact, the test syntax makes such tests more convenient.
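The equivalence claimed above is easy to verify; a minimal check using the automobile data (any pair of dependent variables would do):

. use auto, clear
. mvreg headroom trunk = mpg weight
. regress headroom mpg weight
* the coefficients and standard errors reported by regress match those of
* the headroom equation reported by mvreg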
> Example
Using the automobile data, we estimate a multivariate regression for "space" variables (headroom, trunk, and turn) in terms of a set of other variables including three "performance" variables (displacement, gear_ratio, and mpg):

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
headroom           74      7    .7390205    0.2996   4.777213   0.0004
trunk              74      7    3.052314    0.5328    12.7265   0.0000
turn               74      7    2.132377    0.7844   40.62042   0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
headroom     |
       price |  -.0000528    .000038    -1.39   0.168    -.0001286    .0000229
         mpg |  -.0093774   .0260463    -0.36   0.720     -.061366    .0426112
displacement |   .0031025   .0024999     1.24   0.219    -.0018873    .0080922
  gear_ratio |   .2108071   .3539588     0.60   0.553    -.4956976    .9173118
      length |    .015886    .012944     1.23   0.224    -.0099504    .0417223
      weight |  -.0000868   .0004724    -0.18   0.855    -.0010296    .0008561
       _cons |  -.4525117   2.170073    -0.21   0.835    -4.783995    3.878972
-------------+----------------------------------------------------------------
trunk        |
       price |   .0000445   .0001567     0.28   0.778    -.0002684    .0003573
         mpg |  -.0220919   .1075767    -0.21   0.838    -.2368159    .1926322
displacement |   .0032118   .0103251     0.31   0.757    -.0173971    .0238207
  gear_ratio |  -.2271321   1.461926    -0.16   0.877    -3.145149    2.690885
      length |    .170811   .0534615     3.20   0.002     .0641014    .2775206
      weight |  -.0015944    .001951    -0.82   0.417    -.0054885    .0022997
       _cons |  -13.28253   8.962868    -1.48   0.143    -31.17249    4.607429
-------------+----------------------------------------------------------------
turn         |
       price |  -.0002647   .0001095    -2.42   0.018    -.0004833   -.0000462
         mpg |  -.0492948   .0751542    -0.66   0.514    -.1993031    .1007136
displacement |   .0036977   .0072132     0.51   0.610    -.0106999    .0180953
  gear_ratio |  -.1048432   1.021316    -0.10   0.919    -2.143399    1.933712
      length |    .072128   .0373487     1.93   0.058    -.0024204    .1466764
       _cons |   20.19157   6.261549     3.22   0.002     7.693467    32.68967
------------------------------------------------------------------------------
We should have specified the corr option so that we would also see the correlations between the residuals of the equations. We can correct our omission because mvreg--like all estimation commands--typed without arguments redisplays results. The noheader and notable (read no-table) options suppress redisplaying the output we have already seen:

. mvreg, notable noheader corr

Correlation matrix of residuals:

            headroom    trunk     turn
headroom      1.0000
trunk         0.4986   1.0000
turn         -0.1090  -0.0628   1.0000

Breusch-Pagan test of independence: chi2(3) = 19.566, Pr = 0.0002

The Breusch-Pagan test is significant, so the residuals of these three space variables are not independent of each other.
The three performance variables among our independent variables are mpg, displacement, and gear_ratio. We can jointly test the significance of these three variables, in all the equations, by typing

. test mpg displacement gear_ratio

 ( 1)  [headroom]mpg = 0.0
 ( 2)  [trunk]mpg = 0.0
 ( 3)  [turn]mpg = 0.0
 ( 4)  [headroom]displacement = 0.0
 ( 5)  [trunk]displacement = 0.0
 ( 6)  [turn]displacement = 0.0
 ( 7)  [headroom]gear_ratio = 0.0
 ( 8)  [trunk]gear_ratio = 0.0
 ( 9)  [turn]gear_ratio = 0.0

       F(  9,    67) =    0.33
            Prob > F =    0.9622
These three variables are not, as a group, significant. We might have suspected this from their individual significance in the individual regressions, but this multivariate test provides an overall assessment with a single p-value.

We can also perform a test for the joint significance of all three equations:

. test [headroom]
 (output omitted)
. test [trunk], accum
 (output omitted)
. test [turn], accum

 ( 1)  [headroom]price = 0.0
 ( 2)  [headroom]mpg = 0.0
 ( 3)  [headroom]displacement = 0.0
 ( 4)  [headroom]gear_ratio = 0.0
 ( 5)  [headroom]length = 0.0
 ( 6)  [headroom]weight = 0.0
 ( 7)  [trunk]price = 0.0
 ( 8)  [trunk]mpg = 0.0
 ( 9)  [trunk]displacement = 0.0
 (10)  [trunk]gear_ratio = 0.0
 (11)  [trunk]length = 0.0
 (12)  [trunk]weight = 0.0
 (13)  [turn]price = 0.0
 (14)  [turn]mpg = 0.0
 (15)  [turn]displacement = 0.0
 (16)  [turn]gear_ratio = 0.0
 (17)  [turn]length = 0.0
 (18)  [turn]weight = 0.0

       F( 18,    67) =   19.34
            Prob > F =    0.0000

The set of variables as a whole is strongly significant. We might have suspected this, too, from the individual equations.
Technical Note
The mvreg command provides a good way to deal with multiple comparisons. If we wanted to assess the effect of length, we might be dissuaded from interpreting any of its coefficients except that in the trunk equation. [trunk]length--the coefficient on length in the trunk equation--has a p-value of .002, but in the remaining two equations it has p-values of only .224 and .058.

A conservative statistician might argue that there are 18 tests of significance in mvreg's output (not counting those for the intercepts), so p-values above .05/18 = .0028 should be declared insignificant at the 5% level. A more aggressive but, in our opinion, reasonable approach would be to first note that the three equations are jointly significant, so we are justified in making some interpretation. Then we would work through the individual variables using test, possibly using .05/6 = .0083 (6 because there are 6 independent variables) for the 5% significance level. For instance, examining length:

. test length

 ( 1)  [headroom]length = 0.0
 ( 2)  [trunk]length = 0.0
 ( 3)  [turn]length = 0.0

       F(  3,    67) =    4.94
            Prob > F =    0.0037

The reported significance level of .0037 is less than .0083, so we will declare this variable significant. [trunk]length is certainly significant with its p-value of .002, but what about the remaining two equations with p-values .224 and .058? Performing a joint test:

. test [headroom]length [turn]length

 ( 1)  [headroom]length = 0.0
 ( 2)  [turn]length = 0.0

       F(  2,    67) =    2.91
            Prob > F =    0.0613

At this point, reasonable statisticians could disagree. The .06 significance value suggests no interpretation, but these were the two least-significant values out of three, so one would expect the p-value to be a little high. Perhaps an equivocal statement is warranted: there seems to be an effect, but chance cannot be excluded.
Saved Results
mvreg saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of parameters (including constant)
    e(k_eq)       number of equations
    e(df_r)       residual degrees of freedom
    e(chi2)       Breusch-Pagan chi-squared (corr only)
    e(df_chi2)    degrees of freedom for Breusch-Pagan chi-squared (corr only)

Macros
    e(cmd)        mvreg
    e(eqnames)    names of equations
    e(r2)         R-squared for each equation
    e(rmse)       RMSE for each equation
    e(F)          F statistic for each equation
    e(p_F)        significance of F for each equation
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(Sigma)      Sigma matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas
mvreg is implemented as an ado-file.

Given q equations and p independent variables (including the constant), the parameter estimates are given by the p x q matrix

    B = (X'WX)^(-1) X'WY

where Y is an n x q matrix of dependent variables and X is an n x p matrix of independent variables. W is a weighting matrix equal to I if no weights are specified. If weights are specified, let v: 1 x n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, W = diag{v(1'1)/(1'v)}, which is to say, the weights are normalized to sum to the number of observations.

The residual covariance matrix is

    R = {Y'WY - B'(X'WX)B} / (n - p)

The estimated covariance matrix of the estimates is R ⊗ (X'WX)^(-1). These results are identical to those produced by sureg when the same list of independent variables is specified repeatedly; see [R] sureg.

The Breusch and Pagan (1980) chi-squared statistic--a Lagrange multiplier statistic--is given by

    lambda = n * sum_{i=1}^{q} sum_{j=1}^{i-1} r_ij^2

where r_ij is the estimated correlation between the residuals of the ith and jth equations and n is the number of observations. It is distributed as chi-squared with q(q-1)/2 degrees of freedom.
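The formula for B is easy to check by hand with Stata's matrix commands; the following minimal sketch assumes the automobile data, no weights (so W = I), and an arbitrary choice of two dependent variables and two regressors:

. use auto, clear
. matrix accum A = headroom trunk mpg weight
                  /* A = cross products of (headroom trunk mpg weight _cons) */
. matrix XX = A[3..5, 3..5]     /* X'X, where X = (mpg weight _cons)         */
. matrix XY = A[3..5, 1..2]     /* X'Y, one column per dependent variable    */
. matrix B  = syminv(XX) * XY   /* B = (X'X)^(-1)X'Y                         */
. matrix list B                 /* matches mvreg headroom trunk = mpg weight */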
References
Breusch, T. and A. Pagan. 1980. The LM test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-254.
Also See
Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce
Related:        [R] reg3, [R] regress, [R] regression diagnostics, [R] sureg
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands
Title
nbreg -- Negative binomial regression
Syntax
    nbreg depvar [indepvars] [weight] [if exp] [in range] [, dispersion({mean|constant}) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvars) noconstant constraints(numlist) nolrtest nolog maximize_options ]

    gnbreg depvar [indepvars] [weight] [if exp] [in range] [, lnalpha(varlist) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvars) noconstant constraints(numlist) nolog maximize_options ]

by ... : may be used with nbreg and gnbreg; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
nbreg may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict
    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    n          predicted number of events (the default)
    ir         incidence rate (equivalent to predict ..., n nooffset)
    xb         linear prediction
    stdp       standard error of the prediction

In addition, relevant only after gnbreg, are

    alpha      predicted values of alpha
    lnalpha    predicted values of ln(alpha)
    stdplna    standard error of predicted ln(alpha)

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
nbreg estimates a negative binomial maximum-likelihood regression of depvar on indepvars, where depvar is a nonnegative count variable. In this model, the count variable is believed to be generated by a Poisson-like process, except that the variation is greater than that of a true Poisson. This extra variation is referred to as overdispersion. See [R] poisson before reading this entry.

gnbreg estimates a generalized negative binomial regression; the shape parameter alpha may also be parameterized. Persons who have panel data should see [R] xtnbreg.
Options
dispersion({mean|constant}) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + alpha*exp(x_i b + offset_i); that is, the dispersion is a function of the expected mean: exp(x_i b + offset_i). dispersion(constant) has dispersion equal to 1 + delta; that is, it is a constant for all observations.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.

exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1, so exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvars) creates newvar containing u_j = dlnL_j/d(x_j b) for each observation j in the sample. The score vector is dlnL/db = sum_j u_j x_j; i.e., the product of newvar with each covariate summed over observations. If two newvars are specified, then the score from the ancillary parameter equation is also saved. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

nolrtest suppresses fitting the Poisson model. Without this option, the Poisson model is fit and its likelihood is used in a likelihood-ratio test of the alpha parameter. This option is only valid for nbreg; gnbreg has no likelihood-ratio comparison test (see the Technical Note in the section on gnbreg within this entry).

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

lnalpha(varlist) is allowed only with gnbreg. If this option is not specified, gnbreg and nbreg will produce the same results because the shape parameter will be parameterized as a constant. lnalpha() allows specifying a linear equation for ln(alpha). Specifying lnalpha(male old) means ln(alpha) = a0 + a1*male + a2*old, where a0, a1, and a2 are parameters to be fitted along with the other model coefficients.
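As a minimal sketch of the irr and exposure() options, assuming the deaths data used in the Remarks below (where exposure records exposure time):

. nbreg deaths coh2 coh3, exposure(exposure) irr
* exposure(exposure) enters ln(exposure) with its coefficient constrained
* to 1, just like offset(logexp) after gen logexp = ln(exposure); irr
* displays exp(b) with correspondingly transformed confidence intervals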
Options for predict
n, the default, calculates the predicted number of events, which is exp(x_j b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j b + offset_j) if offset(varname) was specified; or exp(x_j b)*exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j b), which is the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

alpha, lnalpha, and stdplna are relevant after gnbreg estimation only; they produce the predicted values of alpha, ln(alpha), and the standard error of the predicted ln(alpha), respectively.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable: the linear prediction is treated as x_j b rather than x_j b + offset_j, and specifying predict ..., n nooffset is equivalent to specifying predict ..., ir.
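The relationship between n and xb is easy to verify by hand; a minimal check, assuming the deaths model estimated in the Remarks below:

. nbreg deaths coh2 coh3, offset(logexp)
. predict nhat                    /* option n assumed */
. predict xbhat, xb               /* linear prediction, offset included */
. gen double check = exp(xbhat)
. summarize nhat check            /* the two columns should agree */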
Remarks
See Long (1997, chapter 8) for an introduction to the negative binomial regression model and for a discussion of other regression models for count data.

Negative binomial regression is used to estimate models of the number of occurrences (counts) of an event when the event has extra-Poisson variation; that is, it has overdispersion. The Poisson regression model is

    y_i ~ Poisson(mu_i)    where    mu_i = exp(x_i b + offset_i)

for observed counts y_i with covariates x_i for the ith observation. One derivation of the negative binomial is that individual units follow a Poisson regression model, but there is an omitted variable u_i, such that e^{u_i} follows a gamma distribution with mean 1 and variance alpha:

    y_i ~ Poisson(mu_i*)    where    mu_i* = exp(x_i b + offset_i + u_i)

and e^{u_i} ~ gamma(1/alpha, 1/alpha). (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; see the Methods and Formulas section for the explicit definition of the distribution.)

We refer to alpha as the overdispersion parameter. The larger alpha is, the greater the overdispersion. The Poisson model corresponds to alpha = 0. nbreg parameterizes alpha as ln(alpha). gnbreg allows ln(alpha) to be modeled as ln(alpha_i) = z_i*gamma, a linear combination of covariates z_i.

nbreg will estimate two different parameterizations of the negative binomial model. The default, described above and also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + alpha*exp(x_i b + offset_i). The alternative parameterization, given by the option dispersion(constant), has dispersion equal to 1 + delta; i.e., it is constant for all observations. The Poisson model corresponds to delta = 0.
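The gamma-mixture derivation above can be illustrated by simulation. This minimal sketch assumes the rgamma() and rpoisson() random-number functions, which belong to a later release of Stata than the one documented here; the parameter values are arbitrary:

. clear
. set obs 10000
. set seed 12345
. gen x = uniform()
. scalar a = 0.5                  /* true overdispersion parameter */
. gen nu = rgamma(1/a, a)         /* gamma with mean 1, variance a */
. gen mu = exp(0.5 + x)           /* Poisson mean given x          */
. gen y = rpoisson(mu*nu)         /* gamma-mixed Poisson counts    */
. nbreg y x                       /* estimated alpha should be near 0.5 */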
It is not uncommon to pose a Poisson regression model and observe a lack of model fit. The following data appeared in Rodriguez (1993):

. list

       cohort   age_mos   deaths   exposure
  1.        1       0.5      168      278.4
  2.        1       2.0       48      538.8
  3.        1       4.5       63      794.4
  4.        1       9.0       89    1,550.8
  5.        1      18.0      102    3,006.0
  6.        1      42.0       81    8,743.5
  7.        1      90.0       40   14,270.0
  8.        2       0.5      197      403.2
  9.        2       2.0       48      786.0
 10.        2       4.5       62    1,165.3
 11.        2       9.0       81    2,294.8
 12.        2      18.0       97    4,500.5
 13.        2      42.0      103   13,201.5
 14.        2      90.0       39   19,525.0
 15.        3       0.5      195      495.3
 16.        3       2.0       55      956.7
 17.        3       4.5       58    1,381.4
 18.        3       9.0       85    2,604.5
 19.        3      18.0       87    4,618.5
 20.        3      42.0       70    9,814.5
 21.        3      90.0       10    5,802.5
. gen logexp = ln(exposure)
. quietly tab cohort, gen(coh)
. poisson deaths coh2 coh3, offset(logexp)

Iteration 0:   log likelihood = -2160.0544
Iteration 1:   log likelihood = -2159.5182
Iteration 2:   log likelihood = -2159.5159
Iteration 3:   log likelihood = -2159.5159

Poisson regression                            Number of obs   =         21
                                              LR chi2(2)      =      49.16
                                              Prob > chi2     =     0.0000
Log likelihood = -2159.5159                   Pseudo R2       =     0.0113

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |  -.3020405   .0573319    -5.27   0.000    -.4144089   -.1896721
        coh3 |   .0742143   .0589726     1.26   0.208    -.0413698    .1897983
       _cons |  -3.899488   .0411345   -94.80   0.000     -3.98011   -3.818866
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

Goodness-of-fit chi2  =  4190.689
Prob > chi2(18)       =    0.0000
The extreme significance of the goodness-of-fit chi-squared indicates that the Poisson regression model is inappropriate, suggesting to us that we should try a negative binomial model:

. nbreg deaths coh2 coh3, offset(logexp) nolog

Negative binomial regression                  Number of obs   =         21
                                              LR chi2(2)      =       0.40
                                              Prob > chi2     =     0.8171
Log likelihood = -131.3799                    Pseudo R2       =     0.0015

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |  -.2676187   .7237203    -0.37   0.712    -1.686084    1.150847
        coh3 |  -.4575957   .7236651    -0.63   0.527    -1.875753    .9609618
       _cons |  -2.086731    .511856    -4.08   0.000     -3.08995   -1.083511
      logexp |   (offset)
-------------+----------------------------------------------------------------
    /lnalpha |   .5939963   .2583615                       .0876171    1.100376
-------------+----------------------------------------------------------------
       alpha |   1.811212   .4679475                        1.09157    3.005295
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 4056.27  Prob>=chibar2 = 0.000

Our original Poisson model is a special case of the negative binomial--it corresponds to alpha = 0. nbreg, however, estimates alpha indirectly, estimating instead ln(alpha). In our model, ln(alpha) = 0.594, meaning that alpha = 1.81 (nbreg undoes the transformation for us at the bottom of the output).

In order to test alpha = 0 (equivalent to ln(alpha) = -infinity), nbreg performs a likelihood-ratio test. The staggering chi-squared value of 4056.27 asserts that the probability that we would observe these data conditional on alpha = 0 is virtually zero, i.e., conditional on the process being Poisson. The data are not Poisson. It is not accidental that this chi-squared value is quite close to the goodness-of-fit statistic from the Poisson regression itself.
Technical Note
The usual Gaussian test of alpha = 0 is omitted since this test occurs on the boundary, invalidating the usual theory associated with such tests. However, the likelihood-ratio test of alpha = 0 has been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual chi2(1) but rather a 50:50 mixture of a chi2(0) (point mass at zero) and a chi2(1), denoted as chibar2(01). See Gutierrez et al. (2001) for more details.
Technical Note
The negative binomial model deals with cases where there is more variation than would be expected were the process Poisson. The negative binomial model is not helpful if there is less than Poisson variation--if the variance of the count variable is less than its mean. But underdispersion is uncommon. Poisson models arise because of independently generated events. Overdispersion comes about if some of the parameters (causes) of the Poisson processes are unknown. To obtain underdispersion, the sequence of events would have to somehow be regulated; that is, events would not be independent, but controlled based on past occurrences.
gnbreg
gnbreg is a generalization of nbreg. Whereas in nbreg a single ln(alpha) is estimated, gnbreg allows ln(alpha) to vary observation by observation as a linear combination of another set of covariates: ln(alpha_i) = z_i*gamma.

We will assume that the number of deaths is a function of age, whereas the ln(alpha) parameter is a function of cohort. To estimate the model, we type

. gnbreg deaths age_mos, lnalpha(coh2 coh3) offset(logexp)

Fitting constant-only model:
Iteration 0:   log likelihood =    -187.067
Iteration 1:   log likelihood = -148.64462
Iteration 2:   log likelihood = -132.49595
Iteration 3:   log likelihood = -131.59338
Iteration 4:   log likelihood = -131.57949
Iteration 5:   log likelihood = -131.57948

Fitting full model:
Iteration 0:   log likelihood = -124.34327
Iteration 1:   log likelihood = -117.72418
Iteration 2:   log likelihood = -117.56349
Iteration 3:   log likelihood = -117.56164
Iteration 4:   log likelihood = -117.56164

Generalized negative binomial regression      Number of obs   =         21
                                              LR chi2(1)      =      28.04
                                              Prob > chi2     =     0.0000
Log likelihood = -117.56164                   Pseudo R2       =     0.1065

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
deaths       |
     age_mos |  -.0516657   .0051747    -9.98   0.000     -.061808   -.0415233
       _cons |  -1.867225   .2227944    -8.38   0.000    -2.303894   -1.430556
      logexp |   (offset)
-------------+----------------------------------------------------------------
lnalpha      |
        coh2 |   .0939546   .7187747     0.13   0.896    -1.314818    1.502727
        coh3 |   .0815279   .7365477     0.11   0.912    -1.362079    1.525135
       _cons |  -.4759581   .5156502    -0.92   0.356    -1.486614    .5346978
------------------------------------------------------------------------------

We find that age is a significant determinant of the number of deaths. The standard errors for the variables in the ln(alpha) equation suggest that the overdispersion parameter does not vary across cohorts. We can test this by typing
. test coh2 coh3

 ( 1)  [lnalpha]coh2 = 0.0
 ( 2)  [lnalpha]coh3 = 0.0

           chi2(  2) =    0.02
         Prob > chi2 =    0.9904

There is no evidence of variation by cohort in these data.
Technical Note
Note the intentional absence of a likelihood-ratio test for alpha = 0 in gnbreg. The test is affected by the same boundary condition that affects the comparison test in nbreg; however, when alpha is parameterized by more than a constant term, the null distribution becomes intractable. For this reason we recommend using nbreg to test for overdispersion and, if overdispersion exists, only then modeling the overdispersion using gnbreg.
Predicted values
After nbreg and gnbreg, predict returns the predicted number of events:

. nbreg deaths coh2 coh3, nolog

Negative binomial regression                  Number of obs   =         21
                                              LR chi2(2)      =       0.14
                                              Prob > chi2     =     0.9307
Log likelihood = -108.48841                   Pseudo R2       =     0.0007

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |   .0591305   .2978419     0.20   0.843    -.5246289     .64289
        coh3 |  -.0538792   .2981621    -0.18   0.857    -.6382662    .5305077
       _cons |   4.435906   .2107213    21.05   0.000       4.0229    4.848912
-------------+----------------------------------------------------------------
    /lnalpha |  -1.207379   .3108622                      -1.816657   -.5980999
-------------+----------------------------------------------------------------
       alpha |     .29898   .0929416                       .1625683    .5498555
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 434.62  Prob>=chibar2 = 0.000

. predict count
(option n assumed; predicted number of events)
. summarize deaths count

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
      deaths |      21    84.66667   48.84192         10        197
       count |      21    84.66667    4.00773         80   89.57143
Saved Results
nbreg and gnbreg save in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, comparison model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          significance
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(alpha)      the value of alpha

Macros
    e(cmd)        nbreg or gnbreg
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(offset#)    offset for equation #
    e(dispers)    mean or constant
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(opt)        type of optimization
    e(user)       name of likelihood-evaluator program
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas
nbreg and gnbreg are implemented as ado-files.

See [R] poisson and Feller (1968, 156-164) for an introduction to the Poisson distribution.

A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times something occurs, y_i, is distributed as Poisson(nu_i mu_i). That is, its conditional likelihood is

    f(y_i | nu_i) = (nu_i mu_i)^{y_i} e^{-nu_i mu_i} / Gamma(y_i + 1)

where mu_i = exp(x_i b + offset_i) and nu_i is an unobserved parameter with a gamma(1/alpha, 1/alpha) density:

    g(nu) = nu^{(1-alpha)/alpha} e^{-nu/alpha} / { alpha^{1/alpha} Gamma(1/alpha) }

This gamma distribution has mean 1 and variance alpha, where alpha is our ancillary parameter. (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; the above density defines how it has been parameterized here.)

The unconditional likelihood for the ith observation is therefore

    f(y_i) = integral from 0 to infinity of f(y_i | nu) g(nu) dnu
           = { Gamma(m + y_i) / [ Gamma(y_i + 1) Gamma(m) ] } p_i^m (1 - p_i)^{y_i}

where p_i = 1/(1 + alpha mu_i) and m = 1/alpha. Solutions for alpha are handled by searching for ln(alpha) since alpha is required to be greater than zero.

The scores and log likelihood (with weights w_i and offsets) are given by

    psi(z)  = digamma function evaluated at z
    psi'(z) = trigamma function evaluated at z
    alpha = exp(tau)        m = 1/alpha
    p_i = 1/(1 + alpha mu_i)        mu_i = exp(x_i b + offset_i)

    lnL = sum_i w_i [ ln Gamma(m + y_i) - ln Gamma(y_i + 1) - ln Gamma(m)
                      + m ln(p_i) + y_i ln(1 - p_i) ]

    score(b)_i   = p_i (y_i - mu_i)
    score(tau)_i = -m [ alpha(mu_i - y_i)/(1 + alpha mu_i)
                        - ln(1 + alpha mu_i) + psi(y_i + m) - psi(m) ]

In the case of gnbreg, alpha is allowed to vary across the observations according to the parameterization ln(alpha_i) = z_i*gamma.

Maximization for gnbreg is via the lf linear-form method and for nbreg is via the d2 method described in [R] ml.
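The unconditional likelihood can be evaluated by hand with lngamma(), which provides a useful check on the formulas; a minimal sketch, assuming the deaths model from the Remarks above and no weights:

. nbreg deaths coh2 coh3, offset(logexp)
. predict double xbh, xb                 /* x_j b + offset_j */
. scalar a = e(alpha)
. gen double mu = exp(xbh)
. gen double p  = 1/(1 + a*mu)
. gen double ll = lngamma(1/a + deaths) - lngamma(deaths + 1) - lngamma(1/a) + (1/a)*ln(p) + deaths*ln(1 - p)
. quietly summarize ll
. display r(sum) "  compare with  " e(ll)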
References
Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.
Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. On boundary-value likelihood-ratio tests. Stata Technical Bulletin, forthcoming.
Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.
------. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.
Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.
------. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82-84.
Also See
Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] glm, [R] poisson, [R] xtnbreg, [R] zip
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
Title
net -- Install and manage user-written additions from the net
Syntax
    net from directory_or_url
    net cd path_or_url
    net link linkname
    net search keywords            (see [R] net search)
    net describe pkgname
    net set ado dirname
    net set other dirname
    net query
    net install pkgname [, all replace]
    net get pkgname [, all replace]

    ado [, find(string) from(dirname)]
    ado dir [pkgid] [, find(string) from(dirname)]
    ado describe [pkgid] [, find(string) from(dirname)]
    ado uninstall pkgid [, from(dirname)]

where
    pkgname is the name of a package
    pkgid   is the name of a package
            or a number in square brackets: [#]
    dirname is a directory name
            or PERSONAL
            or STBPLUS (default)
            or SITE
Description
net fetches and installs additions to Stata. The additions can be obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. For instance, the package named zz49 might add the xyz command to Stata. At a minimum, such a package would contain xyz.ado, the code to implement the new command, and xyz.hlp, the on-line help to describe it. That the package contains two files is a detail: you use net to fetch the package zz49 regardless of how many files there are.

ado manages the packages you have installed using net. The ado command allows you to list packages you have previously installed and to uninstall them.

Users can also access the net and ado features by pulling down Help and selecting STB and User-written Programs.
Options
all is used with net install and net get. Typing it with either one makes the command equivalent to typing net install followed by net get.

replace is for use with net install and net get. It specifies that the fetched files are to replace existing files if any of the files already exist.

find(string) is for use with ado, ado dir, and ado describe. It specifies that the descriptions of the packages installed on your computer are to be searched and that the package descriptions containing string are to be listed.

from(dirname) is for use with ado. It specifies where the packages are installed. The default is from(STBPLUS). STBPLUS is a codeword that Stata understands to correspond to a particular directory on your computer that was set at installation time. On Windows computers, STBPLUS probably means the directory c:\ado\stbplus, but it might mean something else. You can find out what it means by typing sysdir, but this is irrelevant if you use the defaults.
Remarks
For an introduction to using net and ado, see [U] 32 Using the Internet to keep up to date. The purpose of this documentation is

1. To briefly but accurately describe net and ado and all their features.
2. To provide documentation to those who wish to set up their own sites to distribute additions to Stata.

Remarks are presented under the headings

    Definition of a package
    The purpose of the net and ado commands
    Content pages
    Package-description pages
    Where packages are installed
    A summary of the net command
    A summary of the ado command
    Relationship of net and ado to the point-and-click interface
    Creating your own site
    Format of content and package-description files
    Example 1
    Example 2
    Metacharacters in content and package-description files
    Error-free file delivery

Definition of a package
A package is a collection of files--typically .ado and .hlp files--that provides a new feature in Stata. Packages contain additions that you wish were part of Stata at the outset. We write such additions and so do other users.

One source of these additions is the Stata Technical Bulletin (STB). The STB is a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for free from our web site. If you do not have Internet access, you may purchase the STB media from StataCorp.
The purpose of the net and ado commands
The purpose of the net command is to make distribution and installation of packages easy. The goal is to get you quickly to a package-description page that summarizes the addition:

. net describe rte_stat

package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer/
---------------------------------------------------------------------
TITLE
      rte_stat.  The robust-to-everything statistic; update.

DESCRIPTION/AUTHOR(S)
      S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
      Aleph-0 100% confidence intervals proved too conservative for some
      applications; Aleph-1 confidence intervals have been substituted.
      The new robust-to-everything supplants the previous robust-to-
      everything-conceivable statistic.  See "Inference in the absence
      of data" (forthcoming).  After installation, see help rte.

INSTALLATION FILES                       (type net install rte_stat)
      rte.ado
      rte.hlp
      nullset.ado
      random.ado
---------------------------------------------------------------------

Should you decide the addition might prove useful, net makes the installation easy:

. net install rte_stat
checking rte_stat consistency and verifying not already installed...
installing into c:\ado\stbplus\...
installation complete.
The purpose of the ado command is to help you manage packages installed with net. Perhaps you remember that you installed a package that calculates the robust-to-everything statistic but cannot remember the command's name. You could use ado to search what you have previously installed for the rte command:

. ado

[1]  package sg146 from http://www.stata.com/stb/stb56
     STB-56 sg146. Scalar measures of fit for regression models.
 (output omitted)
[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.
 (output omitted)
[21] package sg121 from http://www.stata.com/stb/stb52
     STB-52 sg121: Seemingly unrelated est. & cluster-adjusted sandwich est.

or you might type

. ado, find("robust-to-everything")

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.

Perhaps you decide that rte, despite the author's claims, is not worth the disk space it occupies. You can use ado to erase it:

. ado uninstall rte_stat

package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
    rte_stat. The robust-to-everything statistic; update.
(package uninstalled)

ado uninstall is easier than erasing the files by hand because ado uninstall will erase every file associated with the package and, moreover, ado knows where on your computer rte_stat is installed; you would have to hunt for these files.
Content pages
There are two types of pages displayed by net: content pages and package-description pages. When you type net from, net cd, net link, or net without arguments, Stata goes to the specified place and displays the content page:

. net from http://www.stata.com

http://www.stata.com/
STB and other user-written additions for use with Stata

Welcome to Stata Corporation. Below we provide user-written additions to Stata that were published in the STB or mentioned on Statalist. These are NOT THE OFFICIAL UPDATES; you fetch the official updates by typing -update-.

DIRECTORIES you could -net cd- to:
    stb        materials published in the Stata Technical Bulletin
    users      materials by various people including StataCorp employees
    meetings   Stata user group meetings
    quest      StataQuest additions
    links      other locations providing additions to Stata

A content page tells you about other content pages and/or package-description pages. The above example lists other content pages only. Below we follow one of the links:

. net cd stb

http://www.stata.com/stb/
The Stata Technical Bulletin

PLACES you could -net link- to:
    stata      StataCorp web site
    portugal   STB mirror site

DIRECTORIES you could -net cd- to:
 (output omitted)
    stb57      STB-57, September 2000
    stb56      STB-56, July 2000
    stb55      STB-55, May 2000
    stb54      STB-54, March 2000
    stb53      STB-53, January 2000
    stb52      STB-52, November 1999
 (output omitted)

. net cd stb54

http://www.stata.com/stb/stb54/
STB-54  March 2000

DIRECTORIES you could -net cd- to:
    ..         Other STBs
I { I
i i
_, i
;
• ;
i _
397
-mot dlscribe-:
dm73_I
Contrasts
dm76 dm77
ICD-9 diagnostic and procedure code utility Kemoving duplicate observations in a dataset
for categorical
to drawing
variables:
update
gr34_3
An update
gr43 ip29_I
Overlaying graphs Metadata for usercwritten
she32
Automated
sg120_2 sgll6_l
Correction to roccomp command Update to hotdeck imputation
sg132 sgl30 sg133
Analysis of variance from summary statistics Box-Cox regressidn models Sequential and drop one term likelihood-ratio
sg134 sg84_2 sxdl_2
Model selection using the Akaike information criterion Concordance corr61ation coefficient: update for Stata 6 Random allocatio_ of treatments bal. in blocks: update
outbreak
Venn diagrams
det.
contributions for pub.
health
to Stata surveillance
data
tests
dm73_1, dm76, ..., sxd1_2 are links to package-description pages.

1. When you type net from, you follow that with a location to display the location's content page.
   a. The location could be a URL such as http://www.stata.com. The content page at that location would then be listed.
   b. The location could be a: on a Windows computer or :diskette: on a Macintosh computer. The content page on that source would be listed. That would work if you had special media obtained from StataCorp (STB media) or special media prepared by another user.
   c. The location could even be a directory on your computer, but that would work only if that directory contained the right kind of files.

2. Once you have specified a location, typing net cd will take you into subdirectories of that location, if there are any. Typing

   . net cd stb

   is equivalent to typing

   . net from http://www.stata.com/stb

   Typing net cd displays the content page from that location.

3. Typing net without arguments redisplays the current content page, which is the content page last displayed.

4. net link is similar in effect to net cd in that the result is to change the location, but rather than changing to subdirectories of the current location, net link jumps to another location:
. net from http://www.xk8.net

http://www.xk8.net/
Welcome to www.xk8.net.

No, we don't use statistical software, but one of the employees at StataCorp asked us to put four files on our web site so you could see how a user site might look. We rather like the one Stata page and see that it does not interfere with the other HTML files.

PLACES you could -net link- to:
    stata      StataCorp's main web site

PACKAGES you could -net describe-:
    xsample    A sample package

Typing net link stata would jump to http://www.stata.com:

. net link stata

http://www.stata.com/
STB and other user-written additions for use with Stata

Welcome to Stata Corporation.
 (output omitted)
Package-description pages
Package-description pages describe what would be installed:

. net from http://www.stata.com/stb/stb54

http://www.stata.com/stb/stb54/
 (output omitted)

. net describe sg132

package sg132 from http://www.stata.com/stb/stb54
--------------------------------------------------
TITLE
      STB-54 sg132.pkg.  Analysis of variance from summary statistics.

DESCRIPTION/AUTHOR(S)
      STB insert by John R. Gleason, Syracuse University.
      Support:  loesljrg@accucom.net
      After installation, see help aovsum.

INSTALLATION FILES                       (type net install sg132)
      sg132/aovsum.ado
      sg132/aovsum.hlp

ANCILLARY FILES                          (type net get sg132)
      sg132/absences.dta
      sg132/demo.do
      sg132/ldose.dta
--------------------------------------------------

A package-description page describes the package and tells you how to install the component files. Package-description pages potentially describe two types of files:

1. Installation files: files you type net install to install and which are required to make the addition work.

2. Ancillary files: additional files you might want to install--you type net get to install them--but which you can ignore. Ancillary files are typically datasets that are useful for demonstration purposes. Ancillary files are not really installed in the sense of being copied to an official place for use by Stata itself. They are merely copied into the current directory so that you may use them if you wish.

You install the official files by typing net install followed by the package name:

. net install sg132
checking sg132 consistency and verifying not already installed...
installing into c:\ado\stbplus\...
installation complete.

You get the ancillary files--if there are any and if you want them--by typing net get followed by the package name:

. net get sg132
checking sg132 consistency and verifying not already installed...
copying into current directory...
      copying absences.dta
      copying demo.do
      copying ldose.dta
ancillary files successfully copied.
i] package
net
install
use ado to redisplay the package-
sg132
sg132
from http://www.stata.com/stb/stb54
Tn_ :
STB-54
sg132.pkg.
Analysis
of variance
DSS(_IPTION/AUT_It($) : STB insert by John R. Gleason, Syracuse i Support : loesljrg©accucom, net After _NS_ALLATION i a\ao_s_, a\aovs_,
installation,
from summary
statistics
University
see help aovsum
FILES ado hip
iNStALLED ON ! 25 Jul 2000
• I
Note that he package-description page shown _ ado includes where we got the package and when we install_d it. Also note that it does not mention the ancillary files that were originally par_ of thi_ package b_cause they are not tracked by ado.
Where packages are installed
Packages should be installed in STBPLUS or SITE. STBPLUS and SITE are codewords that Stata understands and that correspond to some real directory on your computer. Typing sysdir will tell you where these are, if you care:

. sysdir
   STATA:  C:\STATA\
 UPDATES:  C:\STATA\ado\updates\
    BASE:  C:\STATA\ado\base\
    SITE:  C:\STATA\ado\site\
 STBPLUS:  c:\ado\stbplus\
PERSONAL:  c:\ado\personal\
OLDPLACE:  c:\ado\

If you type sysdir, you may obtain different results.

By default, net installs in the STBPLUS directory, and ado tells you about what is installed there. If you are on a multiple-user system, you may wish to install some packages in the SITE directory, where they will be available to other Stata users. To do that, before using net install, type

. net set ado SITE

and when reviewing what is installed or removing packages, redirect ado to that directory:

. ado ..., from(SITE)

In both cases, you literally type "SITE" because Stata will understand that SITE means the site ado-directory as defined by sysdir. To install into SITE, you must have write access to that directory.

If you reset where net installs and then, in the same session, wish to install into your private ado-directory, type

. net set ado STBPLUS

That is how things were originally. If you are confused as to where you are, type net query.
A summary of the net command
The purpose of the net command is to display content pages and package-description pages. Such pages are provided over the Internet, and most users get them there. We recommend that you start at http://www.stata.com and work out from there. We also recommend using net search to find packages of interest to you; see [R] net search.

You do not need Internet access to use net. The additions published in the STB are also available on media and can be obtained from StataCorp; see [R] stb. There is a charge for the media.

net from is how you jump to a location. The location's content page is displayed.

net cd and net link change from there to other locations. net cd enters subdirectories of the original location; net link jumps from one place to another, the destination being determined by what the content provider coded on their content page.

net describe pkgname lists a package-description page. Packages are named, and you type net describe pkgname.

net install installs a package into your copy of Stata. net get copies any additional files (ancillary files) to your current directory.
A summary of the ado command
The purpose of the ado command is to list the package descriptions of previously installed packages.

Typing ado without arguments is the same as typing ado dir. Both list the names and titles of the packages you have installed.

ado describe lists full package-description pages.

ado uninstall removes packages from your computer.

Since you can install packages from a variety of sources, there is no guarantee that the package names are unique. Thus, the packages installed on your computer are numbered sequentially, and you may refer to them by name or by number. For instance, say you wanted to get rid of the robust-to-everything statistic command you installed:

. ado, find("robust-to-everything")

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.

You could type

. ado uninstall rte_stat

or

. ado uninstall [15]

Typing ado uninstall rte_stat would work only if the name rte_stat were unique; otherwise, ado would refuse, and you would have to type the number.

The find() option is allowed with ado dir and ado describe. It searches the package descriptions for the word or phrase you specify, ignoring case (alpha matches Alpha). The complete package description is searched, including the author's name and the names of the files. Thus, if rte was the name of a command you wanted to eliminate but you could not remember the name of the package, you could type

. ado, find(rte)

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.
Relationship of net and ado to the point-and-click interface
Users may instead pull down Help and select STB and User-written Programs. There are advantages and disadvantages:

1. Flipping through content and package-description pages is easier; it is much like a browser. See Chapter 20 in the Getting Started manual.

2. When browsing a package-description page, note that the .hlp files are highlighted. You may click on .hlp files to review them before installing the package.

3. You may not redirect from where ado searches for files.
Creating your own site
The rest of this entry concerns how to create your own site to distribute additions to Stata. The idea is that you have written additions for use with Stata--say xyz.ado and xyz.hlp--and you wish to put them out so that coworkers or researchers at other institutions can easily install them. Or, perhaps you just have a dataset that you and others want to share.

In any case, all you need is a homepage. You place the files you want to distribute on your homepage (or in a subdirectory), you add two more files--a content file and a package-description file--and you are done.

Format of content and package-description files
The content file describes the content page. It must be named stata.toc:

    ---------------------------------------- top of stata.toc -----
    OFF                    (to make site unavailable temporarily)
    * lines starting with * are comments; they are ignored
    * blank lines are ignored, too
    * v indicates version--specify v 2; old-style .toc files
    * do not have this
    v 2
    * d lines display description text
    * the first d line is the title and the remaining ones are text.
    * blank d lines display a blank line
    d title
    d text
    d text
    * l lines display links
    l word-to-show path-or-url [description]
    l word-to-show path-or-url [description]
    * t lines display other directories within the site
    t path [description]
    t path [description]
    * p lines display packages
    p pkgname [description]
    p pkgname [description]
    ---------------------------------------- end of stata.toc -----

Package files describe packages and are named pkgname.pkg:

    ---------------------------------------- top of pkgname.pkg -----
    * lines starting with * are comments; they are ignored
    * blank lines are ignored, too
    * v indicates version--specify v 2; old-style pkg files
    * do not have this
    v 2
    * d lines display package description text
    * the first d line is the title and the remaining ones are text.
    * blank d lines display a blank line
    d title
    d text
    d text
    * f identifies the component files
    f [path/]filename [description]
    f [path/]filename [description]
    * e line is optional; it means stop reading
    e
    ---------------------------------------- end of pkgname.pkg -----
Example 1
Say we want the user to see the following:

. net from http://www.university.edu/~me

http://www.university.edu/~me
Chris Farrar, Uni University

PACKAGES you could -net describe-:
    xyz        interval-truncated survival

. net describe xyz

package xyz from http://www.university.edu/~me
-----------------------------------------------
TITLE
      xyz.  interval-truncated survival.

DESCRIPTION/AUTHOR(S)
      C. Farrar, Uni University.

INSTALLATION FILES                       (type net install xyz)
      xyz.ado
      xyz.hlp

ANCILLARY FILES                          (type net get xyz)
      sample.dta
-----------------------------------------------

The files to do this would be

    ------------------------------ top of stata.toc -----
    v 2
    d Chris Farrar, Uni University
    p xyz interval-truncated survival
    ------------------------------ end of stata.toc -----

    ------------------------------ top of xyz.pkg -----
    v 2
    d xyz.  interval-truncated survival.
    d C. Farrar, Uni University.
    f xyz.ado
    f xyz.hlp
    f sample.dta
    ------------------------------ end of xyz.pkg -----

On his homepage, Chris would place the following files:

    stata.toc      (shown above)
    xyz.pkg        (shown above)
    xyz.ado        file to be delivered (for use by net install)
    xyz.hlp        file to be delivered (for use by net install)
    sample.dta     file to be delivered (for use by net get)

Note that Chris does nothing to distinguish ancillary files from installation files.
Example

S. Gazer wants to create a more complex site:

        . net from http://www.wemakeitupaswego.edu/faculty/sgazer
        http://www.wemakeitupaswego.edu/faculty/sgazer
        Data-free inference materials
        S. Gazer, Department of Applied Theoretical Mathematics
        Also see my homepage for the preprint of "Irrefutable inference".

        PLACES you could -net link- to:
            stata            StataCorp

        DIRECTORIES you could -net cd- to:
            ir               irrefutable inference programs (work in progress)

        PACKAGES you could -net describe-:
            rtec             Robust-to-everything-conceivable statistic
            rte              Robust-to-everything statistic

        . net describe rte
        package rte from http://www.wemakeitupaswego.edu/faculty/sgazer/

        TITLE
            rte.  The robust-to-everything statistic; update.

        DESCRIPTION/AUTHOR(S)
            S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
            Aleph-0 100% confidence intervals proved too conservative for some
            applications; Aleph-1 confidence intervals have been substituted.
            The new robust-to-everything supplants the previous robust-to-
            everything-conceivable statistic.  See "Inference in the absence
            of data" (forthcoming).  After installation, see help rte.
            Support:  email [email protected]

        INSTALLATION FILES                  (type net install rte)
            rte.ado
            rte.hlp
            nullset.ado
            random.ado
        ANCILLARY FILES                     (type net get rte)
            empty.dta

The files to do this would be

        ------------------------- top of stata.toc -------------------------
        v 2
        d Data-free inference materials
        d S. Gazer, Department of Applied Theoretical Mathematics
        d Also see my homepage for the preprint of "Irrefutable inference".
        l stata http://www.stata.com
        t ir irrefutable inference programs (work in progress)
        p rtec Robust-to-everything-conceivable statistic
        p rte Robust-to-everything statistic
        ------------------------- end of stata.toc -------------------------
"
net -- Install and mdnage user-written additions from the net
405
        -------------------------- top of rte.pkg --------------------------
        v 2
        d rte.  The robust-to-everything statistic; update.
        d {bf:S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.}
        d Aleph-0 100% confidence intervals proved too conservative for some
        d applications; Aleph-1 confidence intervals have been substituted.
        d The new robust-to-everything supplants the previous robust-to-
        d everything-conceivable statistic.  See "Inference in the absence
        d of data" (forthcoming).  After installation, see help {bf:rte}.
        d Support:  email [email protected]
        f rte.ado
        f rte.hlp
        f nullset.ado
        f random.ado
        f empty.dta
        -------------------------- end of rte.pkg --------------------------

On his homepage, Mr. Gazer would place the following files:

        stata.toc        (shown above)
        rte.pkg          (shown above)
        rte.ado          (file to be delivered)
        rte.hlp          (file to be delivered)
        nullset.ado      (file to be delivered)
        random.ado       (file to be delivered)
        empty.dta        (file to be delivered)
        rtec.pkg         the other package referred to in stata.toc
        rtec.ado         the corresponding files to be delivered
        rtec.hlp
        ir/stata.toc     the contents file for when the user types net cd ir
        ir/*.pkg         whatever other .pkg files are referred to
        ir/*             whatever other files are to be delivered

For complex sites, a different structure may prove more convenient:

        stata.toc        (shown above)
        rte.pkg          (shown above)
        rtec.pkg         the other package referred to in stata.toc
        rte/             directory containing rte files to be delivered:
        rte/rte.ado      (file to be delivered)
        rte/rte.hlp      (file to be delivered)
        rte/nullset.ado  (file to be delivered)
        rte/random.ado   (file to be delivered)
        rte/empty.dta    (file to be delivered)
        rtec/            directory containing rtec files to be delivered:
        rtec/...         (files to be delivered)
        ir/stata.toc     the contents file for when the user types net cd ir
        ir/*.pkg         whatever other package files are referred to
        ir/*             whatever other files are to be delivered
If you prefer this structure, it is simply a matter of changing the bottom of the rte.pkg file from

        f rte.ado
        f rte.hlp
        f nullset.ado
        f random.ado
        f empty.dta

to

        f rte/rte.ado
        f rte/rte.hlp
        f rte/nullset.ado
        f rte/random.ado
        f rte/empty.dta
Note that in writing paths and files, the directory separator forward slash (/) is used regardless of operating system, because this is what the Internet uses.

Also note that it does not matter whether the files you put out are in DOS/Windows, Macintosh, or Unix format (how lines end is recorded differently). When Stata reads the files over the Internet, it will figure out the file format on its own and automatically translate the files to what is appropriate for the receiver.
SMCL in content and package-description files

The text listed on the second and subsequent d lines in both stata.toc and pkgname.pkg may contain SMCL as long as you include v 2; see [P] smcl. Thus, in rte.pkg, note that S. Gazer coded the third line as

        d {bf:S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.}
Error-free file delivery

Most people transport files over the Internet and never worry about the files being corrupted in the process because corruption rarely occurs. If, however, it is of great importance to you that the files be delivered perfectly or not at all, you can include checksum files in the directory.

For instance, say that included in your package is big.dta and that it is of great importance that big.dta be sent perfectly. First, use Stata to make the checksum file for big.dta:

        . checksum big.dta, save

That creates a small file called big.sum; see [R] checksum. Then copy both big.dta and big.sum to your homepage.

Whenever Stata reads filename.whatever over the net, it also looks for filename.sum. If it finds such a file, it uses the information recorded in it to verify that what was copied was error free.

If you do this, be cautious. If you put big.dta and big.sum on your homepage and then later change big.dta without changing big.sum, people will think there are transmission errors when they try to download big.dta.
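A minimal maintainer-side sketch of the whole sequence, assuming the files live in the current directory:

        . checksum big.dta, save
        (big.sum is created; now copy big.dta and big.sum together to the homepage)

The essential discipline is that the pair must always be updated together: whenever big.dta changes, rerun checksum with the save option and re-copy both files.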
References

Baum, C. F. and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121-124.

Cox, N. J. and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language: extensions. Stata Technical Bulletin 54: 21-22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124-126.
Also See

Complementary:   [R] checksum, [R] net search, [R] stb

Related:         [R] update, [P] smcl

Background:      [GSM] 20 Using the Internet,
                 [GSU] 20 Using the Internet,
                 [GSW] 20 Using the Internet,
                 [U] 32 Using the Internet to keep up to date
Title

net search -- Search Internet for installable packages
Syntax

        net search keywords [, or nostb tocpkg toc pkg everywhere filenames errnone ]
Description

net search searches the Internet for user-written additions to Stata, including but not limited to user-written additions published in the STB. net search lists the available additions that contain the specified keywords.

The user-written materials found are available for immediate download by using the net command or by clicking on the link.

In addition to typing net search, users may pull down Help and select STB and User-written Programs.
Options

or is relevant only when multiple keywords are specified. By default, only packages that include all the keywords are listed. or changes this to list packages that contain any of the keywords.

nostb restricts the search to non-STB sources or, said differently, causes net search not to list matches that were published in the STB.

tocpkg, toc, and pkg determine what is searched. tocpkg is the default, meaning that both tables of contents (tocs) and packages (pkgs) are searched. toc restricts the search to tables of contents only. pkg restricts the search to packages only.

everywhere and filenames determine where in packages net search looks for keywords. The default is everywhere. filenames restricts net search to search for matches only in the filenames associated with a package. Specifying everywhere implies pkg.

errnone is a programmer's option. It causes the return code to be 111 instead of 0 when no matches are found.
Remarks

net search searches the Internet for user-written additions to Stata. If you want to search the Stata documentation for a particular topic, command, or author, see [R] search.
Topic searches

Example: find what is available about "random effects"

        . net search random effect

Comments:

1. It is best to search for the singular. net search random effect will find both "random effect" and "random effects".

2. net search random effect will also find "random-effect" because net search performs a string search and not a word search.

3. net search random effect lists all packages containing the words "random" and "effect", not necessarily used together.

4. If you wanted all packages containing the word "random" or the word "effect", you would type net search random effect, or.

Author searches

Example: find what is available by author Jeroen Weesie

        . net search weesie

Comments:

1. You could type net search jeroen weesie, but that might list less, because sometimes the last name is used without the first.

2. You could type net search Weesie, but it would not matter. Capitalization is ignored in the search.

Example: find what is available by Jeroen Weesie excluding STB materials

        . net search weesie, nostb

Comments:

1. The STB tends to dominate search results because so much has been published in the STB. If you know what you are looking for is not in the STB, specifying the nostb option will narrow the search.

2. net search weesie lists everything net search weesie, nostb lists, and more. If you just type net search weesie, STB materials are listed first and non-STB materials are listed last.

Command searches

Example: find the user-written command kursus

        . net search kursus, file

Comments:

1. You could just type net search kursus, and that will list everything net search kursus, file lists, and more. Since you know kursus is a command, however, there must be a kursus.ado file associated with the package. Typing net search kursus, file narrows the search.

2. You could also type net search kursus.ado, file to narrow the search even more.
Where does net search look?

net search looks everywhere, not just at www.stata.com.

net search begins by looking at www.stata.com, but then follows every link, which takes it to other places, and then follows every link again, which takes it to even more places, and so on.

Authors: please let us know if you have a site that we should include in our search by sending an email to [email protected]. We will then link to your site from ours to ensure that net search finds your materials. That is not strictly necessary, however, as long as your site is directly or indirectly linked from some site that is linked to ours.
How does net search really work?

[diagram omitted: your computer contacts www.stata.com, which in turn is fed by the crawler]

www.stata.com maintains a database of Stata resources. When you use net search, it contacts www.stata.com with your request, www.stata.com searches its database, and Stata returns the result(s) to you.

Another part of the system is called the crawler: it searches the web for new Stata resources to add to the net search database, and it verifies that the resources already found are still available. Given how the crawler works, when a new resource becomes available, the crawler takes about two days to notice it and, similarly, if a resource disappears, the crawler takes roughly two days to remove it from the database.
References

Baum, C. F. and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121-124.

Cox, N. J. and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language: extensions. Stata Technical Bulletin 54: 21-22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124-126.

Gould, W. and A. Riley. 2000. stata55: Search web for installable packages. Stata Technical Bulletin 54: 4-6. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 10-13.
Also See

Complementary:   [R] net, [R] stb

Related:         [R] search, [R] update
Title

newey -- Regression with Newey-West standard errors
Syntax

        newey depvar [varlist] [weight] [if exp] [in range], lag(#)
                [ t(varnamet) force noconstant level(#) ]

aweights are allowed; see [U] 14.1.6 weight.

newey shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

        predict [type] newvarname [if exp] [in range] [, { xb | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

newey produces Newey-West standard errors for coefficients estimated by OLS regression. The error structure is assumed to be heteroskedastic and possibly autocorrelated up to some lag.

Note that if lag(0) is specified, the variance estimates produced by newey are simply the Huber/White/sandwich robust variance estimates calculated by regress, robust; see [R] regress.
Options

lag(#) is not optional; it specifies the maximum lag to consider in the autocorrelation structure. If you specify # > 0, then you must also specify option t(), described below. If you specify lag(0), the output is exactly the same as regress, robust.
t(varnamet) specifies the variable recording the time of each observation. You must specify t() if lag() > 0. varnamet must record values indicating that the observations are equally spaced in time, or newey will refuse to estimate the model. If observations are not equally spaced but you wish to treat them as if they were, you must specify the force option. You need only specify t() the first time you estimate a model with a particular dataset. After that, it need not be specified again except to change the variable's identity; newey remembers your previous t() option.

force specifies that estimation is to be forced even though t() shows the data not to be equally spaced. newey requires observations to be equally spaced so that calculations based on lags correspond to a constant time change. If you specify a t() variable indicating that observations are not equally spaced, newey will refuse to estimate the model. If you also specify force, newey will estimate the model and assume that the lags based on the data ordered by t() are appropriate.

noconstant specifies that the estimated regression should not include an intercept term.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict
xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.
Remarks

The Huber/White/sandwich robust variance estimator (see, for example, White 1980) produces consistent standard errors for OLS regression coefficient estimates in the presence of heteroskedasticity. The Newey-West (1987) variance estimator is an extension that produces consistent estimates when there is autocorrelation in addition to possible heteroskedasticity.

The Newey-West variance estimator handles autocorrelation up to and including a lag of m, where m is specified by stipulating a lag(m) option. Thus, it assumes that any autocorrelation at lags greater than m can be ignored.
Example

newey, lag(0) is equivalent to regress, robust:

        . regress price weight displ, robust

        Regression with robust standard errors          Number of obs =      74
                                                        F(  2,    71) =   14.44
                                                        Prob > F      =  0.0000
                                                        R-squared     =  0.2909
                                                        Root MSE      =  2518.4

        ------------------------------------------------------------------------------
                     |               Robust
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
        displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
               _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
        ------------------------------------------------------------------------------

        . newey price weight displ, lag(0)

        Regression with Newey-West standard errors      Number of obs =      74
        maximum lag: 0                                  F(  2,    71) =   14.44
                                                        Prob > F      =  0.0000

        ------------------------------------------------------------------------------
                     |             Newey-West
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
        displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
               _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
        ------------------------------------------------------------------------------
Example

We have time-series measurements on variables usr and idle and now wish to estimate an OLS model but obtain Newey-West standard errors allowing for a lag of up to 3:

        . newey usr idle, lag(3) t(time)

        Regression with Newey-West standard errors      Number of obs =      30
        maximum lag: 3                                  F(  1,    28) =   10.90
                                                        Prob > F      =  0.0026

        ------------------------------------------------------------------------------
                     |             Newey-West
                 usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                idle |  -.2281501   .0690927    -3.30   0.003    -.3696801      -.08662
               _cons |   23.13483   6.327031     3.66   0.001     10.17449    36.09516
        ------------------------------------------------------------------------------
q
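Had the measurements in this example been unequally spaced in time, newey would have refused to estimate the model. A sketch of how one might proceed anyway, treating the observations as if they were equally spaced (see the force option under Options above; interpret such results cautiously):

        . newey usr idle, lag(3) t(time) force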
Saved Results

newey saves in e():

Scalars
        e(N)         number of observations        e(F)         F statistic
        e(df_m)      model degrees of freedom      e(lag)       maximum lag
        e(df_r)      residual degrees of freedom

Macros
        e(cmd)       newey                         e(wexp)      weight expression
        e(depvar)    name of dependent variable    e(vcetype)   covariance estimation method
        e(wtype)     weight type                   e(predict)   program used to implement
                                                                predict

Matrices
        e(b)         coefficient vector            e(V)         variance-covariance matrix of
                                                                the estimators

Functions
        e(sample)    marks estimation sample
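As a quick sketch of how these saved results can be inspected after estimation (the model is the one from the first example above; display and matrix list are standard Stata commands):

        . newey price weight displ, lag(0)
        . display e(lag)
        . matrix list e(b)

display e(lag) shows the maximum lag used (here 0), and matrix list e(b) lists the coefficient vector.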
Methods and Formulas

newey is implemented as an ado-file.

newey calculates the estimates

$$\hat{\boldsymbol\beta}_{\rm OLS} = (X'X)^{-1}X'y$$

$$\widehat{\rm Var}(\hat{\boldsymbol\beta}) = (X'X)^{-1}\, X'\widehat\Omega X\, (X'X)^{-1}$$

That is, the coefficient estimates are simply those of OLS linear regression.

For the case of lag(0) (no autocorrelation), the variance estimates are calculated using the White formulation:

$$X'\widehat\Omega X = X'\widehat\Omega_0 X = \frac{n}{n-k} \sum_i \hat e_i^{\,2}\, x_i'x_i$$

Here $\hat e_i = y_i - x_i\hat{\boldsymbol\beta}_{\rm OLS}$, where $x_i$ is the ith row of the $X$ matrix, $n$ is the number of observations, and $k$ is the number of predictors in the model, including the constant if there is one. Note that the above formula is exactly the same as that used by regress, robust with the regression-like formula (the default) for the multiplier $q_c$; see the Methods and Formulas section of [R] regress.
For the case of lag(m), m > 0, the variance estimates are calculated using the Newey-West (1987) formulation

$$X'\widehat\Omega X = X'\widehat\Omega_0 X + \frac{n}{n-k} \sum_{l=1}^{m} \left(1 - \frac{l}{m+1}\right) \sum_{t=l+1}^{n} \hat e_t\, \hat e_{t-l}\, (x_t' x_{t-l} + x_{t-l}' x_t)$$

where m is the maximum lag.
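The factor $1 - l/(m+1)$ is the Bartlett kernel, which downweights higher-lag terms linearly and guarantees a positive semidefinite variance estimate. For example, with lag(3) the cross products at lags 1, 2, and 3 receive weights 3/4, 1/2, and 1/4, respectively, and lags beyond 3 are ignored.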
References

Hardin, J. W. 1997. sg72: Newey-West standard errors for probit, logit, and poisson models. Stata Technical Bulletin 39: 32-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 182-186.

Newey, W. K. and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703-708.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:   [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] test, [R] testnl, [R] vce

Related:         [R] regress, [R] svy estimators, [R] xtgls, [R] xtpcse

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title

news -- Report Stata news
Syntax

        news
Description

news displays a brief listing of recent news and information of interest to Stata users. It obtains this information directly from Stata's web site.

The news command requires your computer to be connected to the Internet. An error message will be displayed if the connection to Stata's web site is unsuccessful.

You may also execute the news command by selecting News from the Help menu.
Remarks

news provides an easy way of displaying a brief list of the latest Stata news. More details and other items of interest are available at Stata's web site; point your browser to http://www.stata.com. Here is an example of what news produces:

        . news

        --- StataCorp News ---

        * Intercooled Stata for Windows 2001 is now available
        * STB-68 (July 2002) is now available; use the net command to download
        * NetCourse 151: "Introduction to Stata Programming" begins July 23, 2002
        * Proceedings of the 8th London Stata User Group Meeting will be
          available next month (projected to be available Aug 1, 2002)

        For information on these and additional topics, point your web browser to:
        http://www.stata.com
In this case, news indicates that there is a new STB available. Users can click on STB and User-written Programs from the Help menu to download STB files. Alternatively, the net command (see [R] net) can be used.
Also See

Related:         [R] net

Background:      [U] 32 Using the Internet to keep up to date
Title

nl -- Nonlinear least squares

Syntax

        nl fcn depvar [varlist] [weight] [if exp] [in range] [, level(#)
                init(...) lnlsq(#) leave eps(#) nolog trace iterate(#)
                delta(#) fcn_options ]

        nlinit # parameter_list

by ...: may be used with nl; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

nl shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname [if exp] [in range] [, { yhat | residuals } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. You provide the function itself in a separate program with a name of your choosing, except that the first two letters of the name must be nl. fcn refers to the name of the function without the first two letters. For example, you type nl nexpgr ... to estimate with the function defined in the program nlnexpgr.

nlinit is useful when writing nlfcns.
Options

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

init(...) specifies initial values for parameters that are to be used to override the default initial values. Examples are provided below.

lnlsq(#) fits the model defined by nlfcn using "log least squares", defined as least squares with shifted lognormal errors. In other words, ln(depvar - #) is assumed normally distributed. Sums of squares and deviance are adjusted to the same scale as depvar.

leave leaves behind after estimation a set of new variables with the same names as the estimated parameters, containing the derivative of E(y) with respect to the parameter.

eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares. eps(1e-5) is the default.
nolog suppresses the iteration log.

trace expands the iteration log to provide more details, including values of the parameters at each step of the process.

iterate(#) specifies the maximum number of iterations before giving up and defaults to 100.

delta(#) specifies the relative change in a parameter to be used in computing the numeric derivatives. The derivative for parameter $\beta_i$ is computed as $\{f(X, \beta_1, \beta_2, \ldots, \beta_i + d, \beta_{i+1}, \ldots) - f(X, \beta_1, \beta_2, \ldots, \beta_i, \beta_{i+1}, \ldots)\}/d$, where $d = \delta(|\beta_i| + \delta)$. The default $\delta$ is 4e-7.

fcn_options refer to any options allowed by nlfcn.
Options for predict

yhat, the default, calculates the predicted value of depvar.

residuals calculates the residuals.
Remarks

Remarks are presented under the headings

        nlfcns
        Some common nlfcns
        Log-normal errors
        Weights
        Errors
        General comments on fitting nonlinear models
        More on nlfcns

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. The specific function is specified by writing an nlfcn, described below. The values to be fitted in the function are called the parameters.

The fitting process is iterative (modified Gauss-Newton). It starts with a set of initial values for the parameters (guesses as to what the values will be and which you also supply) and finds another set of values that fit the function even better. Those are then used as a starting point and another improvement is found, and the process continues until no further improvement is possible.
nlfcns

nl uses the function defined by nlfcn. nlfcn has two purposes: to identify the parameters of the problem and set default initial values, and to evaluate the function for a given set of parameter estimates.
Example

You have variables y and x in your data and wish to fit a negative-exponential growth curve with parameters B0 and B1:

$$Y = B_0\,(1 - e^{-B_1 x})$$

First, you write a program to calculate the predicted values:
        program define nlnexpgr
                version 7.0
                if "`1'" == "?" {                   /* if query call          */
                        global S_1 "B0 B1"          /* declare parameters     */
                        global B0=1                 /* and initialize them    */
                        global B1=.1
                        exit
                }
                replace `1'=$B0*(1-exp(-$B1*x))     /* otherwise, calculate
                                                       function               */
        end
To estimate the model, you type nl nexpgr y. nl's first argument specifies the name of the function, although you do not type the nl prefix. You type nexpgr, meaning the function is nlnexpgr. nl's second argument specifies the name of the dependent variable. Replicating the example in the SAS manual (1985, 588-590):

        . use sasxmpl1

        . nl nexpgr y
        (obs = 20)

        Iteration 0:  residual SS =  .1999027
        Iteration 1:  residual SS =  .0026064
        Iteration 2:  residual SS =  .0005769
        Iteration 3:  residual SS =  .0005768

              Source |       SS       df       MS        Number of obs =        20
        -------------+------------------------------    F(  2,    18) = 275732.74
               Model |  17.6717234     2  8.83586172    Prob > F      =    0.0000
            Residual |  .000576831    18  .000032045    R-squared     =    1.0000
        -------------+------------------------------    Adj R-squared =    1.0000
               Total |  17.6723003    20  .883615013    Root MSE      =  .0056608
                                                        Res. dev.     =  -152.317
        (nexpgr)
        ------------------------------------------------------------------------------
                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                  B0 |   .9961886   .0016138   617.29   0.000     .9927981    .9995791
                  B1 |   .0419539   .0003982   105.35   0.000     .0411172    .0427905
        ------------------------------------------------------------------------------
        (SEs, P values, CIs, and correlations are asymptotic approximations)
Notice that the initial values of the parameters were provided in the nlnexpgr program. You can, however, override these initial values on the nl command line. To estimate the model using .5 for the initial value of B0 rather than 1, you can type nl nexpgr y, init(B0=.5). To also change the initial value of B1 from .1 to .2, you type nl nexpgr y, init(B0=.5 B1=.2).

The outline of all nlfcns is the same:

        program define nlfcn
                version 7.0
                if "`1'" == "?" {
                        global S_1 "parameter names"
                        (initialize parameters)
                        exit
                }
                replace `1' = ...
        end
!
!_ .!
420
nl -- Nonlinear least squares
that the parameters are A, B, and C (via global S_l "A B C"), it must then place initial values in the corresponding parameter macros A, B, and C (via global A=O, global B=I, etc.). After initializing the parameter macros, it is done. On a calculation call, "1" does not contain "?"; it instead contains the name of a variable that is to be filled in with the predicted values. The current values of the parameters are stored in the macros previously declared on the query call (e.g., $A, SB, and $C).
1>Example You wish to fit the CES production
functions defined by
lnq = Bo + Aln{Dl
R + (1 - D)k 2}
where the parameters to be estimated are B0, A, D, and R. q, l, and k refer to total output and labor and capital inputs. In your data, you have the variables lnq, labor, and capital. The ntfcn is program
define nlees version 7.0 "'1""
if
==
"7"
{
global
8_i
global
BO = 1
"BOA
global
A = -1
global
D =
global exit
R = -1
D R"
.5
} " I'=$BO
replace
+ SA*in($D*labor'$R
* (l-$D)*eapitai'$R)
end
Again using data from the SAS manual (1985, 591-592):

        . use sasxmpl2

        . nl ces lnq
        (obs = 30)

        Iteration 0:   residual SS =  37.09651
        Iteration 1:   residual SS =  35.48655
        Iteration 2:   residual SS =  22.69058
        Iteration 3:   residual SS =  1.845468
         (output omitted)
        Iteration 20:  residual SS =  1.761039
        Iteration 21:  residual SS =  1.761039

              Source |       SS       df       MS        Number of obs =        30
        -------------+------------------------------    F(  3,    26) =    292.96
               Model |  59.5286148     3  19.8428718    Prob > F      =    0.0000
            Residual |  1.76103929    26   .06773228    R-squared     =    0.9713
        -------------+------------------------------    Adj R-squared =    0.9680
               Total |  61.2896541    29  2.11343635    Root MSE      =  .2602543
                                                        Res. dev.     =  .0775147
        (ces)
        ------------------------------------------------------------------------------
                 lnq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 B0* |   .1244892   .0783443     1.59   0.124    -.0365497    .2855282
                   A |  -.3362823   .2721671    -1.24   0.228    -.8957298    .2231652
                   D |   .3366722   .1361172     2.47   0.020     .0568793    .6164652
                   R |  -3.011121   2.323575    -1.30   0.206    -7.787297    1.765055
        ------------------------------------------------------------------------------
        * Parameter B0 taken as constant term in model & ANOVA table
        (SEs, P values, CIs, and correlations are asymptotic approximations)
If the nonlinear model contains a constant term, nl will find it and indicate its presence by placing an asterisk next to the parameter name when displaying results. In the output above, B0 is a constant. (nl determines that a parameter B0 is a constant term because the partial derivative f = dE(y)/dB0 has a coefficient of variation (s.d./mean) less than eps(). Usually, f = 1 for a constant, as it does in this case.)

The output of nl closely mimics that of regress; see [R] regress. The model F test, R-squared, sums of squares, etc., are calculated as regress calculates them, which means that they are corrected for the mean. If no "constant" is present, as was the case in the negative-exponential growth example previously, the usual caveats apply to the interpretation of the F and R-squared statistics; see comments and references in Goldstein (1992).

When making its calculations, nl creates the partial derivative variables for all the parameters, giving each the same name as the corresponding parameter. Unless you specify leave, these are discarded when nl completes the estimation. Therefore, your data must not have variables that have the same names as parameters. We recommend using uppercased names for parameters and lowercased names (as is common) for variables.

After estimating with nl, typing nl by itself will redisplay previous estimates. Typing correlate, _coef will show the asymptotic correlation matrix of the parameters, and typing predict myvar will create new variable myvar containing the predicted values. Typing predict res, resid will create res containing the residuals.

nlfcns have a number of additional features that are described in More on nlfcns below.
Some common nlfcns

An important feature of nl, in addition to estimating arbitrary nonlinear regressions, is the facility for adding prewritten common fcns.

Three fcns are provided for exponential regression with one asymptote:

        exp3     Y = b0 + b1*b2^X
        exp2     Y = b1*b2^X
        exp2a    Y = b1*(1 - b2^X)

For instance, typing nl exp3 ras dvl estimates the three-parameter exponential model (parameters b0, b1, and b2) using Y = ras and X = dvl.

Two fcns are provided for the logistic function (symmetric sigmoid shape; not to be confused with logistic regression):

        log4     Y = b0 + b1/[1 + exp{-b2(X - b3)}]
        log3     Y = b1/[1 + exp{-b2(X - b3)}]

Finally, two fcns are provided for the Gompertz function (asymmetric sigmoid shape):

        gom4     Y = b0 + b1*exp[-exp{-b2(X - b3)}]
        gom3     Y = b1*exp[-exp{-b2(X - b3)}]
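As a sketch of how these prewritten fcns are invoked (the variable names here are hypothetical), fitting the four-parameter logistic to a response y and regressor x would be

        . nl log4 y x

Like nlexp2 in the technical note below, the prewritten fcns supply their own starting values, so init() is usually unnecessary, although it may still be specified to override them.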
Technical Note

You may find the functions above useful, but the important thing to note is that, if there is a nonlinear function you use often, you can package the function once and for all. Consider the function we packaged called exp2, which estimates the model Y = b1*b2^X. The code for the function is

        program define nlexp2
                version 7.0
                if "`1'"=="?" {
                        global S_2 "2-param. exp. growth curve, `e(depvar)'=b1*b2^`2'"
                        global S_1 "b1 b2"
                        * Approximate initial values by regression of log Y on X
                        local exp "[`e(wtype)'`e(wexp)']"
                        tempvar Y
                        quietly {
                                gen `Y' = log(`e(depvar)') if e(sample)
                                regress `Y' `2' `exp' if e(sample)
                                global b1 = exp(_b[_cons])
                                global b2 = exp(_b[`2'])
                        }
                        exit
                }
                replace `1'=$b1*$b2^`2'
        end
Because we were packaging this function for repeated use, we went to the trouble of obtaining good initial values, which in this case we could obtain by taking the log of both sides,

$$Y = b_1 b_2^X$$

$$\ln(Y) = \ln(b_1 b_2^X) = \ln(b_1) + \ln(b_2)\, X$$
and then using linear regression to estimate ln(b1) and ln(b2). If this had been a quick-and-dirty implementation, we probably would not have bothered (initializing b1 and b2 to 1, say) and so forced ourselves to specify better initial values with nl's init() option when they were not good enough.

The only other thing we did to complete the packaging was store nlexp2 as an ado-file called nlexp2.ado. The alternatives would have been to type the code into Stata interactively or to place the code in a do-file. Those approaches are adequate for occasional use, but we wanted to be able to type nl exp2 without having to worry whether the program nlexp2 was defined. When nl attempts to execute nlexp2, if the program is not in Stata's memory, Stata will search the disk(s) for an ado-file of the same name and, if found, automatically load it. All we had to do was name the file with the .ado suffix and then place it in a directory where Stata could find it. In our case, we put nlexp2.ado in Stata's system directory for StataCorp-written ado-files. In your case, you should put the file in the directory Stata reserves for user-written ado-files, which is to say, c:\ado\personal (Windows), ~/ado/personal (Unix), or ~:ado:personal (Macintosh). See [U] 20 Ado-files.
q
Log-normal errors

A nonlinear model with identically normally distributed errors may be written

$$y_i = f(\mathbf{x}_i, \boldsymbol\beta) + u_i, \qquad u_i \sim N(0, \sigma^2) \qquad (1)$$

for i = 1, ..., n. If the $y_i$ are thought to have a k-shifted lognormal instead of a normal distribution, that is, $\ln(y_i - k) \sim N(\zeta_i, \tau^2)$, and the systematic part $f(\mathbf{x}_i, \boldsymbol\beta)$ of the original model is still thought appropriate, the model becomes

$$\ln(y_i - k) = \zeta_i + v_i = \ln\{f(\mathbf{x}_i, \boldsymbol\beta) - k\} + v_i, \qquad v_i \sim N(0, \tau^2) \qquad (2)$$

This model is estimated if lnlsq(k) is specified.
is specifidd.
If ntod,_l (2)is correct, the variance of (Yi- _)is proportional to {f(xi,/3)k} 2. Probably the most corn non case is k = 0, sometimes called :"proportional errors" since the standard error of Yi is proport anal to its expectation, f(xi,/3). Assuming the value of k is known. (2) is just another nonlinear nodel in/3 and it may be fitted as usual. However, we may wish to compare the fit of (1)
i i
with that ( f (2) using the residual sum of square i or the deviance D, D = -2 x log-likelihood, from each mo& I. To do so, we must alloy, for the ctjange in scale introduced by the log transformation. Assuming, then, the y, to be normally distributed, Atkinson (1985, 85-87, 184), by considering the Jacobi_n IJ ]0 ln(yi - k)/Oy{I, showed that multiplying both sides of (2) by the geometric mean :,:
of Yi - k.1!1, gives residuals on the same scale as those of Yi- The geometric mean is given by
which is aiconstant for a given dataset. The residual deviance for (1)imd for (2) may be expressed as ,
':
D(_)
=
l+ln(2rr_
2) n
(3)
i
where _ i the maximum likelihood estimate (MLE) of/3 for each model and n_ 2 is the RSS from
i
(1), or tha1 from (2) mtfltiplied by b2. i Since (_) and (2) are models with different !error structures but the same functional form, the
! {
_ _
arithmetic _lifference in their RSS or deviances is _ot easily tested for statistical significance. However, if the devtance difference is large" (> 4, say), one would naturally prefer the model with the smaller de_,iance. Of course, the residuals for e_ch model should be examined for departures from
i_ '_
assumptiots (nonconstant variance, nonnormalit3_, serial correlations, etc.) in the usual way. Consider alternatively modeling E(yi) = I_(C + Ae Bx') E(1/yi) = E(y_) = C + Ae Bx'
i ,
(4)
(5)
where C.._, and 13 are parameters to be estimated. We will use the data (y, x) = (.04, 5). (.06, 12), (.08.25). (I.1.35), (1_ 42). (.2,48), (.25,60), (,3,75), and (.5,120)(Danuso 1991). Model C IA B RSS Deviance I
(4) (4) with l_lsq(0)
1.781 1.799
25.74 2545
-.03926-.001640 -.04051 -.001431
t!
(5) (5) with lnlsq(0)
1.781 1.799
25)4 2745
-.03926 -.04051
! i
,
! I
24.70 17.42
There is lit@ to choose between the two versions ;f the logistic model (4), whereas for the exponential model (5) _the fit using inlsq(O) is much betier (a deviance difference of 7.28). The reciprocal •
i
8.197 3.65t
-51.95 -53.18
I
¢
.
transformation has introduced heteroskedasticity into '_liwhich is countered by the propomonal errors property o_ the lognorrfial distribution implicit :in lnlsq(0). The deviances are not comparable between th_ logistic and}exponential models because the change of scale has not been allowed for. althcmgh inl principle it d°uld be"
•_ 'i
,
Weights

Weights are specified the usual way--analytic and frequency weights are supported; see [U] 23.13 Weighted estimation. Use of analytic weights implies that the $y_i$ have different variances. Therefore, model (1) may be rewritten as

$$y_i = f(\mathbf{x}_i, \boldsymbol\beta) + u_i, \qquad u_i \sim N(0, \sigma^2/w_i) \qquad (1a)$$

where the $w_i$ are (positive) weights, assumed known and normalized such that their sum equals the number of observations. The residual deviance for (1a) is

$$D(\hat{\boldsymbol\beta}) = \{1 + \ln(2\pi\hat\sigma^2)\}\, n - \sum_i \ln(w_i) \qquad (3a)$$

(compare with equation 3), where $n\hat\sigma^2$ is now the weighted residual sum of squares.

Defining and fitting a model equivalent to (2) when weights have been specified as in (1a) is not straightforward and has not been attempted. Thus, deviances using and not using the lnlsq() option may not be strictly comparable when analytic weights (other than 0 and 1) are used.
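As a sketch of weighted estimation with the same model as before (assuming an analytic-weight variable w exists in the data):

        . nl nexpgr y [aweight=w]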
Errors

nl will stop with error 196 if an error occurs in your nlfcn program, and it will report the error code raised by nlfcn.

nl is reasonably robust to the inability of nlfcn to calculate predicted values for certain parameter values. nl assumes that predicted values can be calculated at the initial values of the parameters. If this is not so, an error message is issued with return code 480. Thereafter, as nl changes the parameter values, it monitors nlfcn's returned predictions for unexpected missing values. If detected, nl backs up. That is, nl finds a linear combination of the previous, known-to-be-good parameter vector and the new, known-to-be-bad vector, a combination where the function can be evaluated, and continues its iterations from that point.

nl does require, however, that once a parameter vector is found where the predictions can be calculated, small changes to the parameter vector can be made in order to calculate numeric derivatives. If a boundary is encountered at this point, an error message is issued with return code 481.

When specifying lnlsq(), an attempt to take logarithms of $y_i - k$ when $y_i \le k$ results in an error message with return code 482.

If iterate() iterations are performed and estimates still have not converged, results are presented with a warning and the return code is set to 430.
General comments on fitting nonlinear models

In many cases, achieving convergence is problematic. For example, a unique maximum likelihood (minimum-RSS) solution may not exist. A large literature exists on different algorithms that have been used, on strategies for obtaining good initial parameter values, and on tricks for parameterizing the model to make its behavior as "linear-like" as possible. Selected references are Kennedy and Gentle (1980, ch. 10) for computational matters, and Ross (1990) and Ratkowsky (1983) for all three aspects. Much of Ross's considerable experience is enshrined in the computer package MLP (Ross 1987), an invaluable resource. Ratkowsky's book is particularly clear and approachable, with useful discussion on the meaning and practical implications of "intrinsic" and "parameter-effects" nonlinearity. An excellent general text, though (in places) not for the mathematically faint-hearted, is Gallant (1987). Also see Davidson and MacKinnon (1993, Chapters 2, 3, and 5).
"_
The success of nl will be enhanced if care is paid to the form of the model fitted, along the lines of Ratkowsky and Ross. For example, Ratkowsky (1983, 49-59) analyzes three possible 3-parameter "yield-density" models for plant growth:

$$E(y_i) = (\alpha + \beta x_i)^{-1/\theta}$$

$$E(y_i) = (\alpha + \beta x_i + \gamma x_i^2)^{-1}$$

$$E(y_i) = (\alpha + \beta x_i^\theta)^{-1}$$

All three models give similar fits. However, he shows that the second formulation is dramatically more linear-like than the other two and therefore has better convergence properties. In addition, the parameter estimates are virtually unbiased and normally distributed, and the asymptotic approximation to the standard errors, correlations, and confidence intervals is much more accurate than for the other models. Even within a given model, the way the parameters are expressed (e.g., $\theta$ or $e^\theta$) affects the degree of linear-like behavior.

Our advice is that even if you cannot get a particular model to converge, don't give up. Experiment with different ways of writing it or with slightly different alternative models that also fit well.
More on nlfcns

Note that the syntax for nl is

        nl fcn depvar [varlist] [...] [, ... fcn_options ]

The syntax for an nlfcn is

        nlfcn { varname | ? } [varlist] [, fcn_options ]

The varlist, if specified with nl, will be passed to nlfcn along with any options not intended for nl. Thus, it is possible to write nlfcns that are quite general.

When nlfcn is called with a ?, the varlist and fcn_options, if any, are still passed. In addition, e(depvar) contains the identity of the dependent variable; e(sample) contains the estimation sample according to the if exp and in range specified on the nl command line; and e(wtype) and e(wexp) contain the weight type and weight expression.

nlfcn is required to post the names of the parameters to $S_1 and to provide default initial values for all the parameters. In addition, it may post up to two titles in $S_2 and $S_3 that will be subsequently used to title the output. The e() returned results provide useful information for filling in titles and generating initial parameter estimates.

When nlfcn is called without a ?, it is required to calculate the predicted values conditional on the current values of the parameters. Note that nlfcn is not required to process if exp or in range. Restriction to the estimation sample will be handled by nl.
:
version
;
if
"'1""
Y. 0 == "?"
{
global global
S_l "BO BI" BO=I
global
BI=.
1
global
S_2
"negative-ex_.
_lobal exit
S_3
"'e(depvar)"
growth" = BO*(1-exp(-Bl*'2"))"
} replace end
"_1"=$BO* (l-exp(-$Bl*" 2" ))
•-,_u _ii
_- --
r_unnnear least squares
This versionline:would command
title the output
and
allow
the independent
variable
to be specified
on the nl
nl nexpgr y xval
1t
An even more sophisticated version of nlnexpgr might use e (depvar), to generate more reasonable starting values of BO and B1. nlinit
is intended nlinit
nlinit
for use by nlfcns.
Its syntax
"2 ". and if
e (sample)
is
# parameterJist
initializes each parameter nlinit 0 A B C nlini't: 1 D E
in parameter_list
to contain
#. For example,
Saved Results

nl saves in e():

Scalars
        e(N)          number of observations       e(r2_a)      adjusted R-squared
        e(k)          number of parameters         e(F)         F statistic
        e(mss)        model sum of squares         e(rmse)      root mean square error
        e(tss)        total sum of squares         e(converge)  0 if convergence failed;
        e(df_m)       model degrees of freedom                  otherwise 1
        e(rss)        residual sum of squares      e(df_t)      total degrees of freedom
        e(df_r)       residual degrees of freedom  e(dev)       residual deviance
        e(msm)        model mean square            e(lnlsq)     1 if specified; otherwise 0
        e(msr)        residual mean square         e(gm_2)      geometric mean (y-k)^2 if
        e(r2)         R-squared                                 lnlsq(); otherwise 1

Macros
        e(cmd)        nl                           e(function)  name of function
        e(depvar)     name of dependent variable   e(params)    names of parameters
        e(title)      title in estimation output   e(predict)   program used to implement
        e(title2)     secondary title in                        predict
                      estimation output

Matrices
        e(b)          coefficient vector           e(V)         variance-covariance matrix of
                                                                the estimators

Functions
        e(sample)     marks estimation sample

The final parameter estimates are available in the parameter macros defined by nlfcn. The standard errors of the parameters are available through _se[parameter]; see [U] 16.5 Accessing coefficients and standard errors.
Methods and Formulas

nl is implemented as an ado-file.
Acknowledgments

nl was written by Patrick Royston of the MRC Clinical Trials Unit, London. The original version of this routine was published in Royston (1992). Francesco Danuso's menu-driven nonlinear regression program (1991) provided the inspiration.
References

Atkinson, A. C. 1985. Plots, Transformations and Regression. Oxford: Oxford University Press.

Danuso, F. 1991. sg1: Nonlinear regression command. Stata Technical Bulletin 1: 17-19. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 96-98.

Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Gallant, A. R. 1987. Nonlinear Statistical Models. New York: John Wiley & Sons.

Goldstein, R. 1992. srd7: Adjusted summary statistics for logarithmic regressions. Stata Technical Bulletin 5: 17-21. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 178-183.

Kennedy, W. J., Jr., and J. E. Gentle. 1980. Statistical Computing. New York: Marcel Dekker.

Ratkowsky, D. A. 1983. Nonlinear Regression Modeling. New York: Marcel Dekker.

Ross, G. J. S. 1987. MLP User Manual, release 3.08. Oxford: Numerical Algorithms Group.

------. 1990. Nonlinear Estimation. New York: Springer-Verlag.

Royston, P. 1992. sg1.2: Nonlinear regression command. Stata Technical Bulletin 7: 11-18. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 112-120.

------. 1993. sg1.4: Standard nonlinear curve fits. Stata Technical Bulletin 11: 17. Reprinted in Stata Technical Bulletin Reprints, vol. 2, p. 121.

SAS Institute Inc. 1985. SAS User's Guide: Statistics, Version 5 Edition. Cary, NC.
Also See

Complementary:   [R] ml, [R] predict, [R] test, [R] xi

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title "_
3't ".1
nlogit
--
Maximum-likelihood
nested
legit estimation I
I
I
Syntax

        nlogit depvar (altsetvar1 = indepvars1) [ (altsetvar2 = indepvars2) ... ]
                (altsetvarB = indepvarsB) [weight] [if exp] [in range],
                group(varname) [ notree nolabel clogit level(#) nolog robust
                ivconstraints(string) constraints(numlist) dl maximize_options ]
where

depvar        is a dichotomous variable coded as 0 for not-selected alternatives and 1 for the selected alternative.

altsetvar1    is a categorical variable that identifies the top- or first-level set of alternatives--these alternatives must be mutually exclusive groups of the second-level alternatives.

indepvars1    are the attributes of the first-level alternatives--either of an alternative alone (absolute) or as the alternative is perceived by the chooser (perceived)--and possibly interactions of individual attributes with the first-level alternatives.

altsetvar2    is a categorical variable that identifies the second-level set of alternatives--these must be mutually exclusive groups of the third-level alternatives.

indepvars2    are the attributes of the second-level alternatives (absolute or perceived) and possibly interactions of individual attributes with the second-level alternatives.

altsetvarB    is a categorical variable that identifies the bottom, or final, set of alternatives.

indepvarsB    are the attributes of the bottom-level alternatives (absolute or perceived) and possibly interactions of individual attributes with the bottom-level alternatives.
        nlogitgen newvar = varname (branchlist) [, nolog ]

where

branchlist    is   branch [, branch [, branch ...]]
branch        is   [label:] outcome [| outcome [| outcome ...]]
outcome       is   value or value label

        nlogittree varlist [, nolabel ]
by ...: may be used with nlogit; see [R] by.

fweights and iweights are allowed, but are interpreted to apply to groups as a whole and not to individual observations; see [U] 14.1.6 weight.

nlogit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

        predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

        pb        predicted probability of choosing bottom-level, or choice-set,
                  alternatives; each alternative identified by altsetvarB; the default
        p1        predicted probability of choosing first-level alternatives;
                  each alternative identified by altsetvar1
        p2        predicted probability of choosing second-level alternatives;
                  each alternative identified by altsetvar2
        ...
        p#        predicted probability of choosing #-level alternatives;
                  each alternative identified by altsetvar#
        xbb       linear prediction for the bottom-level alternatives
        xb1       linear prediction for the first-level alternatives
        xb2       linear prediction for the second-level alternatives
        ...
        xb#       linear prediction for the #-level alternatives
        condpb    Pr(each bottom alternative | alternative is available after all
                  earlier choices)
        condp1    Pr(each level 1 alternative) = p1
        condp2    Pr(each level 2 alternative | alternative is available after
                  level 1 decision)
        ...
        condp#    Pr(each level # alternative | alternative is available after all
                  previous stage decisions)
        ivb       inclusive value for the bottom-level alternatives
        iv2       inclusive value for the second-level alternatives
        ...
        iv#       inclusive value for the #-level alternatives

The inclusive value for the first-level alternatives is not used in the estimation of the model; therefore, it is not calculated.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Descripti6n nlogi_ estimates a nested logit model using full maximum likelihood. The model may contain one or m_re levels. Fdr a single-level model, nlogit estimates the same model as c!ogit: see IN] ciogit.i n!ogi_gen
generates a new categorical variable based on the specification of the branches. For
Instance,!
: I
. nlogitgen
is equi alent to
type
= restaurant(fast:
I I 2, family:
3 [ 4 I 5, fancy:
6 [ 71
_rr-
4_u
nJoglz_ Maximum-liKelihood nested Iogit estimation gen
i
type
= I
if
restaurant
== 1
] restaurant
replace
type
= 2 if restaurant
== 3
• replace
type
= 3 if restaurant
==
label • label
define value
Ib_type type
1 fast
== 2
I restaurant
== 4
6 I restaurant
== 7
2 family
I restaurant
== 5
3 fancy
ib_type
nlogittree displays the tree structure based on the varlist. Note that the bottom level should be specified first. For instance, • nlogittree
restaurant
type
Options group(varname) notree
is not optional; it specifies the identifier variable for the groups.
specifies that the tree structure of the nested logit model is not to be displayed.
nolabeI causes the numeric codes rather than the label values to be displayed in the tree structure of the nested logit model, clogit
specifies that the initial values obtained from clogit
are to be displayed.
leve 1 (#) specifies the confidence level, in percent, for confidence intervals. The default is level or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. nolog
suppresses
(95)
the iteration log.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [u] 23.11 Obtaining robust variance estimates. ivconstraints(string) specifies the linear constraints of the inclusive value parameters. One can constrain inclusive value parameters to be equal to each other, equal to fixed values, etc. Inclusive value parameters are referred to by the corresponding level labels; for instance, ivconstraints (fast = family) or ivconstraints(fast=l). constraints (numtist) specifies the linear constraints to be applied during estimation. Constraints are defined using the constraint command and are numbered: see [R] constraint. The default is to perform unconstrained estimation. dl specifies that method dl is to be used in estimating the mt model instead of the default method rdul. rdul is faster than dl; however, in some cases rdul is not as stable as dl. If the model has four or higher levels, method dO is used instead. maximize_options control the maximization process; to specify any of the maximize options except for iteration log shows many "not concave" messages may want to use the difficult option to help it
see [R] maximize. iterate (0). and and is taking many converge in fewer
You will likely never need possibly difficult. If the iterations to converge, you steps.
Optionsfor predict Consider a nested logit model with 3 levels: Pr(ijk) pb, the default, calculates the probability pl, calculate the probability p2, calculates the probability
= Pr(k
of choosing bottom-level
of choosing first-level alternatives, of choosing second-level
xbb. calculates the linear prediction
) ij)Pr(j
for the bottom-level
[ i)Pr(i)
alternatives,
pb -- Pr(ijk).
pl = Pr(i).
alternatives, alternatives.
p2 -- Pr(ij)
= Pr(j
I i)Pr(i).
nlogit -- Maximum-likelihoodnested legit estimation
i !
xbI, c .lculates the gnear prediction for the first-level alternatives. xb2, c lculates the linear prediction for the second-level alternatives.
i
condpl condpb= Pr(klij). cor_Ip:condpl=
431
Pr(i).
¢ondp!,condp2= Pr(j li). ivb, c_dcutates the i_clusive, value for the bottom-level alternatives: ivb = in {_k where xbb is the linear prediction for the bottom-level alternatives. I
,,j exp(xbb)},
iv2, c_lculates the inclusive value for the second-level alternatives:
_ i
iv2 = ln{y_j _ exp(zb2 + rjivb)},, where zb2 is the linear prediction for the second-le_,el altmnatives, ivb is the inclusive value for the bottom-level alternatives, and r_ are the parameters for he inclusive Value.
Remarks i
nlo :it performs 'full maximum likelihood esumation of nested legit models. These are models of a decis on process th_atis made in levels and where the decisions in later levels are limited by those
!
made i_ earlier levelS. In particular, the decision in each level partitions the choice set into more and
i
more sl_ecific alternative sets or groupings of choices. Let"t look at an _xample and clarify somd terminology. The tree structure of a family's decision about Where to eat r/light look something like this:
i
1
I
Christophers
First the family decides whether to eat fast food. eat at a family restaurant, or eat at a fancy i
i
restaura _t. This first-level decision limits their second-level decision to the alternatives available within hos_n fast food. their second-level decision is between the _he seled_tedrestaurant type. If they have c Mamas tPizza and Freei_irds; if they haxe chosen a family restaurant, the second-level decision is betweenl Care Eccell,: Los Nortenos. and Wings 'N More. If they decide on a fancy restaurant, then toe second-level decision is between Mad Cows and Christopher's. To b_ clear, we will use the following terms to use to describe these models.
i i }
i
level orldecision lev61 is the level or stage at which a decision is made. First-level decisions are made! first, foltowed by second-level decisions, and so on. In the example above there are on_y lwo l_vels. In the first level a type of restaurant is chosen, fast food. family', or fancy', and in the seco d level a specific restaurant is chosen.
b°tt°mllvelistheleqelwherethefinaldecisi°nismade'spec c restaurant.: In our example, thisis when wechoose a a/ternat_'e xet is the ._et of all possible alternatives at any given decision level. {
._,:qT
.......
_" _ ,-ax--um-.Kellnooa nested Iogit estimation
,,F_:_; _:.
bottom alternative set is the set of all possible alternatives at the bottom level. This is often referred to as the choice set in the economics choice literature. In our example, the bottom alternative set is all seven of the specific restaurants. alternative is a specific alternative within an alternative set. In the first level of our example, "fast food" is an alternative. In the second or bottom level, "MadCows" is an alternative. Not all ahematives within an alternative set are available to someone making a choice at a specific stage, only those that are nested within all prior decisions. chosen alternative is the alternative from an alternative set that we observe someone having chosen. A one-level nested logit model is the same as a conditional logit model. The conditional logit models assume the independence of irrelevant alternatives (tlA). that the relative probabilities of various alternatives remain constant regardless are included in the model. When this assumption is violated, the idea behind a to _oup the alternatives into subgroups such that the (IIA) assumption is valid
multinomial logit and Basically this means of which alternatives nested logit model is within each group.
McFadden (1977, 1981) showed how this model can be derived from a rational choice framework. Amemiya (1985, Chapter 9) contains a very nice discussion of how this model can be derived under the assumption of utility maximization. For a two-level nested logit model, we index the first-level alternative as i and the bottom-level ahemative as j.respectively. Let X# and refer to the vectors of explanatory variables specific to categories (i,j) and (i), WeY/write
Prij = Prjl i Pri The conditional probability Prjl i will involve only the parameters/3: e _ 'x _ PrJri = _n e_'Xi_ We define the inclusive values for category (i) as Ii = ln
(X e_ _") _"
TL
then
Pr,: =
ea' Yi+ril, _m ea'Y'_ +r'*I"
Remarks are presented under the headings Datasetupand the m_estructure Testof the independenceof irrelevantalternatives(IIA) Mode1estimation Inclusivevalueparameters Obtainingpredictedvalues
) -
l
'i
nlogit --Maximum-likelihood nestedIogitestimation 433 ............................................................................
I
i
Data setU}pand the tree structure
nlog_tgenand n!ogittree are designedto the nested logit model.
help users specify and display the tree structure of
l> Exampte ! ;
Usin_ fictional dat_, we have data on 300 families and their choice of 7 local restaurants. The restaurar(ts are Freebi_ds, MamasPizza, CafeEccell, LosNortenos, WingsNmore, Cluistophers and
)
MadCm_Is._We want t_ explore the relationship of the decision about where to eat to the household income variable income in t000s of dollars),!the number of kids in the household (variable kids), _he ratin of the restaurant according to the local restaurant guide (variable rating 0 to 5), the average meal co per person '(variable cost), and the distance between the household and the restaurant (variable distance iri miles), incomeand kids are attributes of the family, rating is an attribute of the al ernative (the _restaurant) alone, and cost and distance are attributes of the alternative as perceivec bv the families--that is, each family has its own cost and distance for each restaurant.
[ i } I ) ! i
_se restaurant
Co_ains
clear
data from
restaurant.din 8
v_rs: 75,600
si e:
i_s : variable
i
i
[
i0 Sop (99.0_
of memory
2000
00:41
free)
storage
display
value
2,100 type %
format
label
variable
names
family id choices of restaurants
name
label
fam_ly_id restaurant
float float
_,9.0g _,12.0g
cos£ income
float
7,9.0g _/,9. Og
average meal cost per perso_ hollsehold income
kid_ ratlng
'float float
Y,9.0g Y,9.0g
number of kids in the household ratings in local restaurant
distance
float
_,9.0g
distance between restaurant
chosen
ifloat
_9.0g
0 no 1 yes
guide
Sor#ed
i i
i
I
by :
home
and
fami_y_id
l_st family_i4 family_id
restaurant restaurant
IJ
1
Freebirds
3 2 j_
1 I
CafeEccell MamasPizza
4J 5
11
6J 7
i
chosen
kids rating
chosen
distance kids
in 1/21 rating
distance
1
l
0
I.245553
[) 0
1 I
2 1
4.21293 2.82493
Los.ortenos Wing_Wmore
0_
11
23
_hrisZophers
0
1
4
10. 19829
4.167634 6.330531
1
MadCows
0
1
5
5. 601388
8 9 ,; ,! 1O. i 11 .i
2 2 2
Freebirds MamasPizza CafeEcceil
0 0
3 3
0 1
4.162657 2. 865081
2
LosNortenos
0 1
3 3
2 3
5. 337799 4. 282864
t2 "i 13.,
2 2
WingsNmore ¢hristophers
O, 0
3 3
2 4
8.133914 8.664631
.,
14._
2
0
3
5
9.119597
!
15. I
3
Freebirds
MadCows
1
3
0
2.112586
i
17.1 16.' 18._ )
3 3 3
Cafegccell MamasPizza LosNortenos
0 0 0
3 33
2 t3
6. 978715 2.215329 5. 117877
434 i:t
nlogit -- Maximum-likelihood nested Iogit estimation 19. 20.
3 3
21.
3
WingsNmore Christophers MadCows
0 0
3 3
2 4
5.312941 9. 551273
0
3
5
5. 539806
Suppose that for each family, the decision about where to eat is a decision of two steps. First, i
the family decides whether to eat fast food, eat at a family restaurant, or eat at a fancy restaurant. This first-level decision limits their second-level decision to the alternatives available within the selected restaurant type. If they have chosen fast food, their second-level decision is between the MamasPizza and Freebirds" if they have chosen a family restaurant then the second-level decision is between CafeEccell, LosNortenos, and WingsNmore; if they have chosen a fancy restaurant then the second-level decision is between Christophers and MadCows. To run nlogit, we need to generate a categorical variable that identifies the first-level set of alternatives, fast food, family restaurants, or fancy restaurants. This can be easily accomplished by using nlogitgen. . nlogitgen type = restaurant(fast: ] WingsNmore, fmlcy: Ckristophers new
variable
Ib_type
type
is generated
with
Freebirds I MadCows)
I MamasPizza,
family:
CafeEccell
I LosNortenos
3 groups
: 1 fast 2 family 3 fancy
nlogittree tree
structure
restaurant specified
type for
the nested
logit
model
top-->bottom type fast
restaurant Freebirds MamasPizza
family
CafeEccell LosNorte-s WingsNmore
fancy
Christop~s MadCows
The new categorical variable is type, which takes value l (fast) if restaurant is Freebirds or MamasPizza, value 2 (family) if restaurant is CafeEccell, LosNortenos or WingsNmore, and value 3 (fancy) otherwise,
nlogittree displays the tree structure.
<1
•3 Technical Note We could also use values instead of value labels of restaurant in nlogitgen. The value labels for the newvar, type are optional, and the default value labels for type are typel, type2, and type3. The vertical bar is also optional.
(Continued
on next page)
nlogit-- Maximum-likelihoodnestedlogitestimation r_
.
435
...................................................
:logitgen type = restauraut(1 2, 3 4 5, 6 7) nm' variable type is generated with 3 groups lb.type: i typel
2 type2 3 types
, i
[
)
) !
[ } ! [
tr _e structure Specified for the nested :logittree restaurant type
logit
model
top--> ottoz
type
restaurant
type I
Preebirds NamasPizza
type2
C_feEccell Lk)sNorte~s WtingsNmore
type3
Christ op-s MadCows
_]
Test,oftNe indeperidenceof irrelevant alternatives (IIA) i The I:roperty of th_ multinomial logit model and conditional ]ogit model where odds ratios are independent of the other alternatives is referred to as the independence of irrelevant alternatives (IIA). Hausraan and McPadden (1984) suggest that if a subset of the choice set truly is irrelevant with respect t the other alternatives, omitting it frbm the model will not lead to inconsistent estimate_. Ttierefor Hausman's:_1978) specification test can be used to test for IIA.
'3 ExampleI Supp(,se we want to run ctogit on our choice of restaurants dataset. We also want to test IIA between the alternatives of family restaurants and the alternatives of fast food places and fancy restaurants. To do so, we need to use Stata's hausman command: see [R] hausman. We fi "st run the e_timation on the full bottom alternative set: save the results using hausman, save; ard then run th_ estimation on the bottom alternative set, excluding the alternatives of family restaurarts. We then mn the hausman test. w_th the less option indicating the order in which our models ,_ere fit. 1
en incFast _en
incFancy
en kidFast
(type
== I) *
income
_ (type == 3) * income _ (type
== I) * kids
en kidFancy = (type == 3) * kids logit chose_ cost rating distance Iteration
O:
log
It(ration It_ration
2: I:
_og likelihood likelihood _og
It( ration It_ ration
3: 4:
_og likelihood i _og likelihood
It_ ration
5:
Col,ditional t Lo
_og likelihood
(fiied-effects) _
likelihood
i
likelihood
_ -488.90834
i incFast "_
incFancy
kidFast
kidFancy,
group(family_id)
= -564._7856 = -489.$5097 -496.41546 = -488. _1205 -488.90854 -488.g0834 logistic
regression
Number of obs LR chi2(7)
: =
2100 189.73
Pseudo R2 Prob > chi2
= =
0.1625 0.0000
_
....................
,,,vuu
chosen
Coef.
cost rating
IOgl!
Std. Err.
z
estimation
P>IzI
[95X Conf. Interval]
-.1367799 .3066622
.0358479 .1418291
-3.82 2.16
0.000 0.031
-.2070404 .0286823
-.0665193 .584642
t
-.1977505
.0471653
-4.19
0.000
-.2901927
-.1053082
incFancy incFast I kid_Past kidFancy__[
.0407053 -.0390183 -.2398757 -.3893862
.0080405 .0094018 .1063674 .1143797
5.06 -4.15 -2.26 -3.40
0.000 0.000 0.024 0.001
.0249462 -.0574455 -.448352 -.6135662
.0564644 -.0205911 -.0313994 -.1652061
distance i
-u3zeO
• hausman, save clogit chosen cost ratine distance incFast incFancy kidFast kidFancy if type group(family_id)
l= 2,
note: 222 groups (888 obs) dropped due to all positive or all negative outcomes. Iteration Iteration Iteration Iteration Iteration Iteration
O: 1: 2: 3: 4: 5:
Conditional
log log log log log log
likelihood likelihood likelihood likelihood likelihood likelihood
= = = = = =
-104.85538 -88.077817 -86.094611 -85.956423 -85.955324 -85.955324
(fixed-effects) logistic regression
Log likelihood = -85.955324
chosen
Coef.
cost rating distance
-.0616621 .1659001 -.244396 -.0737506 .4105386
incFastI kidFast __
Std. Err.
Number of obs LR chi2(7) Prob > chi2 Pseudo R2
= = = =
312 44,35 0.0000 0.2051
z
P>JzJ
[95X Conf. Interval]
.067852 .2832041 .0995056
-0.91 0.59 -2.46
0.363 0.558 0,014
-.1946496 -.3891698 -.4394234
.0713254 .72097 -.0493687
.0177444 .2137051
-4.16 1.92
0.000 0.055
-.108529 -.0083157
-.0389721.8293928
• hausman, less Coefficients--j'
cost d kidFast_
Test:
Ho:
i
(b) Current
(B) Prior
-.0616621
-.1367799
-.244396 -.0737506 .4105386
-.1977505 -.0390183 -.2398757
(b-B) sqrt (diag(V_b-V B)) Difference S.E. .0751178 -.0466456 -.0347323 .6504143
.0576092 .0876173 .015049 .1853533
b = less efficient estimates obtained from clogit B = fully efficient estimates obtained previously from clogit difference in coefficients not systematic chi2(5)
= (b-B)'[(V_b-V_B)-(_I)](b_B) = 10.70 Prob>chi2
=
0.0577
The small p-value indicates that the IIA assumption between the alternatives of family restaurants and the bealternatives should utilized. of other restaurants is weak, hinting that the more complex nested logit model
t
• _
_
...................................................................
_
;
nlogit --
/laximum-likelihoodnested Iogit estimation
437
Model! ea timation Exampt¢ In tl_is example, _e want to examine how alternative-specific attributes apply to the bottom altemati,i_.eset (all se_.,en of the specific restaurants), and how family-specific attributes apply to the altema@e set at the Ifirstdecision level (all ttiree types of restaurants). Inlogitchoseh (restaurant = cost ra_ing distance ) (type = incFast incFancy > kidFast kidF_ncy), group(family_id)Inolog tzee structure specified for the nestbd logit model top--_bottom type fast
_estaurant !Freebirds
_asPizza family
fancy
_afeR.ccell _osNort;e-s WingsNmore _ristop~s MadCows
Ne _ted logit Le'rels = De')endentvariable = Lo likelihood =
2 chosen -483,9584
Number of obs LR chi2(10) Prob > chi2
= = =
2100 199.6293 0,0000
i' Coef.
z
P>Jz[
[95X Conf. Interval]
re:_taurant cost
-,0944352
-2.78
O.006
-.1611131
-.0277572
rating distance
.1793759 -.1745797
.126895 1.41 .0433352 , -4,03
O,157 0.000
-,0693338 -,2595152
,4280855 -.0896443
.0116242
-2.47
0.013
I incFancy
-.0287502 . 0458373
5.14
O. 000
-.0515332 0283722 .
-, 0059672 0633024 .
I kidFancy , kidFast
-.0704164 -,3626381
.1394359 ' .1171277
-0.51 -3.10
0.614 O.002
-.3437058 -.5922041
-.1530721 .2028729
2.45 1.49 3.52
0.014 0 135 0.000
1,143415 -.5366608 .6494642
10.2881 3.979105 2.283711
t_e i l
Std. Err.
incFast
.03402
.0089109
(Ii params) /fast /family i /fancy
5,715758 1.721222 1.466588
2,332871 1 152002 .4169075
I
i
LR _est of homo$cedastlclty (iv = 1): 1 •
I
In thi_ model.
"
[
' Ji
chi2(3)=
9.90
Prob
> chi2 = 0.0194
:
Pr(restdurant I tyPe)= !
_
I
Pr(tvpe)!-
Pr(_cost cost + 3rati_ rating + 3dist distance) i
Pr(a, iva_ incFast +
_ Tfast
IVfast
+ 7family
aiFancy
ineFancy +
!V 'I family
+ Tfancy
CtkFast
"
kidFast +
O kFast
kidFast
IVfancy)
T he [J_ test against t_e ' constant-only model iMicates_ that the model is significant (p-value = 0.000). and t.466588. The inclul}ive value, part,meters for Iast, famil 'iy,and import are 5.......... 715758.1 -7o_o-_o
-..... _
,,,..,_,.-- m.x,,,,u.,-..e.noo,
nesteaIOglt estimation
respectively. The LRtest reported at the bottom of the table is a test for the nesting (heteroskedasticity) against the null hypothesis of homoskedasticity. Computationally, it is the comparison of the likelihood of a nonnested clogit model against the nested legit model likelihood. The X2 value of 9.90 clearly supports the use of the nested legit model with these data, <1
Inclusivevalue parameters nlogit allows the user to apply linear constraints of the inclusive value parameters. One can constrain inclusive value parameters to, say, equal to each other, or specify fixed values rather than allowing these parameters to be freely estimated. > Example Continuing with the above example, we fix all the three inclusive value parameters to be 1 to recover the model estimated by clogit. . nlogit
chosen
> kidFast User
defined
I000:
(restaurant
kidFancy),
[fast]_cons
distance
nolog
) (type
ivc(fast
=I,
= incFast
family=l,
incFancy fancy=l)
notree
= 1
[family]_eons
998:
[fancy]_cons
= 1 = 1
legit =
Dependent Log
rating
constraint(s):
999:
Nested Levels
= cost
group(family_id)
variable
2
=
likelihood
Number
chosen
LR
= -488.90834
Coef.
Prob
Std.
Err.
z
of obs
=
chi2(7) > chi2
P>lzl
2100
=
189.T294
=
0.0000
[95_ Conf.
Interval]
restaurant -.1367799
.0358479
-3.82
0.000
-,2070404
-.0665193
rating distance
.3066622 -.1977505
.1418291 .047i653
2.16 -4.19
0.031 0,000
.0286823 -.2901927
.584642 -.I053082
incFast
cost
type -.0390183
.0094018
-4.15
0.000
-.0574455
-.0205911
incFancy kidFast
.0407053 -.2398757
.0080405 ,1063674
5.06 -2.26
0.000 0.024
.0249462 -.448352
.0564644 -.0313994
kidFancy
-.3893862
.I143797
-3.40
0.001
-.6135662
-.t652061
(IV params) type 1
/fast /family
1
/fancy
I
LR test
clogit
of homoscedasticity
chosen
cost
rating
(iv = I):
distance
chi2(O)=
incFast
> group(family_id) Iteration Iteration
O: 1:
log likelihood io g likelihood
= -564.57856 = -496.41546
Iteration
2:
log
likelihood
= -489.35097
Iteration
B:
log
likelihood
= -488.91205
0.00
incFancy
Prob
kidFast
> chi2
kidFancy,
=
i
•
! I
_
_ Itezation
ii
4:
nlogit-- Mlaximum-likelihood Iogitestimation 439 _ nested .......... i....
l_g
likelihood
Number of obs LR chi2(7)
= =
2100 189,73
Prob
=
0.0000
Log likelihood
Pseudo
=
0.1625
[95Y, Conf.
Interval]
= -488.90834
,
l
i
Coef.
5
Std.
Err.
z
> chi2
P>Izl
1{2
cost
, .1367799
,0358479
-3.82
O.000
-. 2070404
-.0665193
r rating
i "3066622
1418291
2.16
0.031
.0286823
.584642
distance , incFast
_- 1977505 _ .0390183
.0471653 .0094018
-4.19 -4,15
0.000 0.000
-.2901927 -.0574455
-. 1053082 - :0205911
5.06
O. 000
.0249462
.0564644
-2,26 -3.40
O.024 O.001
-.448352 -• 6135662
-. 0313994 -. 1652061
lincFancy
. 0407053
i kidFast !kidFancy
.0080405
_. 2398757 _. 3893862
i
= -488.90834
Itez ation 5: (fixed-effects) lag. likelihoodlogistic = -488. re_ression 90834 Concitional ! } i _
chosen
i zl
.1063674 •1143797
'
i i
i
Obtainingredicted!values predictmay be use_lafter nlogitto obtain the predicted values of the probabilities, the conditional
!
probabili@s, the linear predictions, and the inclusive values for each level of the nested logit model Predicted _robabilities _or nlogitmust be inte_reted carefully. Probabilities are estimated for each group as _whole and dot for indi'_idual observations. ?
Example i 1
Contim _ingwith our Nodel with no constraintsl we can predict pb = Pr(restaurant); pl = Pr(type); condpb = Pr(restaura_t I type); xbb, the line_ prediction for the bottom-level altemativesi xbl, the linear ?rediction fo_ the first-level alternatives; and the inclusive values ivb for the bottom,level alternative _. • q_i nlogit
i
i
l
i
chosen
(restaurant
k±dFancy), group [family_id) . pzedict pb (opt ion
pb
assum,,,d;
distance
) (type
nolog
= incFast
incFancy
kidFast
i
Pr (mode))
. pxedict
p1,
• pzodict
condpb
• predict
xbb,
x!>b
. predict
xbl,
xl_)l !
pli condpb
• list predict id chosenlpb ivb, i'rb
i
pl condpb
in 1/14
pb .0831245
pl ; 1534534
condpb .5416919
.070329 ,2763391 .284375
11534534 ', 7266538 _,7266538
.4583081 .3802899 .3913486
0
.1659397
! 7266538
.2283615
0
.03992 t5
11198928
.3329766
I 2
0 0
.0799713 . Ol i76
_ 1198928 10286579
.6670234 •4103599
2 2
0 0
• 0168978 .2942401
i0286579 _7521651
. 5896401 .3911909
t ._
id 1
2. ! 3.1 4.1
1 1 1
_
0 0 0
5.:
1
i
6. i
1
7, i 8. 9 105
= cost _ating
chosen 1
i
j7
11. 12. 13. 14.
iF
,
2 2 2 2
.........
1 0 0 0
.2975767 .1603483 .1277234 ..vv_vv .0914536
.7521651 .3956268 -7521651 .2i31824 Iw_mt .219177 _OtllllQ||_ll .582741 .219177 .417259
. list id chosen xbb xbl ivb in 1/14 id chosen xbb xbl 1. 1 1 -.731619 -1.191674 2. 1 0 -.8987747 -1.191674 3. 1 0 -1.149417 0 4. 1 0 -1.120752 0 5. 1 0 -1.659421 0 6. 1 0 -3.514237 1.425016 7. 1 0 -2.819484 1.425016 8. 2 0 -1.22427 -1.878761 9. 2 0 -.8617923 -1.878761 10. 2 0 -1.239346 0 11. 2 1 -1.22807 0 12, 2 0 -1.846394 0 13. 2 0 -2.804756 1.570648 14. 2 0 -3.138791 1.570648
i
ivb -.1185611 -.1185611 -.1825957 -.1825957 -.1825957 -2.414554 -2.414554 -.3335493 -.3335493 -.3007865 -.3007865 -,3007865 -2.264743 -2.264743
Saved Results nlogitsaves in e O: Scalars e(N) e (k_eq)
number number
of observations of equations
e(tevels) e (re)
depth of the model return code
e(N_g)
number
of groups
e(chi2)
x2
e(df._m)
model
degrees
of freedom
e(df...me)
model
degrees
of freedom
e(ll) e(ll_O)
log likelihood log likelihood,
constant-only
log likelihood,
clogit
e(ll_c) Macros
for clogit model
model
e(chi2_c)
X 2 for comparison
e(p_c)
p-value
for comparison
test
e(p) e(ic)
p-value numoer
for X 2 test of iterations
e(rank)
rank of e(V)
e (cmd)
nlogit
e (vcetype)
covariance
e (level#)
altsetvar#
e (user)
]ikelihood-evaluator
e(depvar)
name of dependent
e(opt)
type of optimization
e(title)
title in estimation
output
e(chi2type)
LK: type of model
e(group) e(wtype)
name of group() weight type
variable
e(predict) e(cnslist)
program used to implement constraint numbers
e(iv._names)
parameters
e(V)
variance-covariance estimators
variable
e (wexp) Matrices
weight
e (b) e(ilog) Matrices
coefficient vector iteration log (up to 20 iterations)
e (sample)
marks
expression
estimation
sample
estimation
test
method program
X 2 test
for inclusive
predict
values
matrix
of the
nlogit -- Maximum-likelihoodnested togit estimation
441
Methods and For.mlas
!
provide ltroductions t the nested logit model nlogit is implem,,'nted as an ado-file. Greene (2000, 865-871) and Maddala (1983, 67-70) We _ 11present the! methods and formulas for a three-level nested logit model. The extension of this mo& to cases m_olvmg more levels of a tree is apparent, but is more complicated.
!
Using !he same not_tion as Greene (2000), we index the first-level alternative as i, the second-level
! !
ahemativ_ as j, and tte bottom-level alternative as k. Let Xijk, }_j and Zi refer to the vectors ot explanato_; variables _ecific to categories (i,j, k), (i,j), and (i), respectively• We write
i
:
--Prkli j Prjl i Pr_
The cond fional probability Prkto will involve only the parameters/3: eff Xij_ Prklij = Y'_ne'°'X_'_ We define the inclusiw values for category (i,d) as
1"1
and
easyij +ri5Iij PrJli = }-_,mea'V"_+ri'_h'_ Define in(lusive values! for category (i) as
m
/
then
e'f'
Zi-b_i
Ji
Pri = -= El eYrZt+rlJl If we r_strict all the
I
form:
l
where
i
Prijk
i
_ij
and 6i to be 1, we then recover the conditional logit model of the following
eVijk
= fl'X, k + c,'Y j+
,_ il i_ '
,,,,,_ mogrzm Maxlmum-iiKellllOOOnested logit estimation There are two ways to estimate the nested logit model: sequential estimation and the full information maximum likelihood estimation, nlogit estimates the model using the full information maximum likelihood method. If g = 1, 2,..., G denotes the groups, and Pr_j k is the probability of category (i, j, k) being a positive outcome in group 9, the log likelihood of the nested logit model is In L = E
ln(Pr;jk)
9
=
In Pr_lij + InPr_i
+ tnPrf
/
)
References Amemiya.
T. 1985. Advanced
Econometrics.
Greene. W. H. 2000. Econometric
Analysis.
Cambridge,
Hausman.
J. 1978. Specification
tests in econometrics.
Hausman,
J. and D. McFadden.
1984. Specification
Maddala. G. S. 1983. Limited-dependent Press.
McFadden, D. EconomeNc
1981. Econometric models Applications, pp, 198-272.
Saddle
University
46: 125t-I27t.
tests in econometrics.
for analyzing
Press°
River, NJ: Prentice-HalL
Economerrica
and Qualitative
McFadden, D. 1977. Quantitative methods Foundation Discussion Paper no. 474.
MA: Harvard
4th ed. Upper
Econometrica
Variables
in Econometrics.
behavior
of individuals:
of probabilistic choice. In Smacturat Cambridge, MA: MIT Press.
52: 1219-1240.
Cambridge: some recent Analysis
Cambridge developments. of
Also See Complementary:
[R] lincom, [R] lrtest, [R] predict, [R] test, [R] testnl, [R] xi
Related:
[R] elogit, [R] logistic, [R] logit, JR] mlogit
Background:
[u] 16.5 Accessing coefficients and standard errors. [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] maximize
Discrete
University CoMes Data
with
[ ie !
notes
i
i
Place
in data
Syntax vama,ne] notes
t_xt : !
_otes
notes drop evarlisf [in #[/#]] where eva list is a varl:_'t but may also contain _theword _dta and # is a number or the letter 1. If text incl ides the letters TS surrounded by blanks, the TS is removed and a time stamp is substituted in its p ace.
Descripti(,n notes
attaches note: to the dataset in memory. These notes become a part of the dataset and are
attached generically to :he dataset or specifically to a variable within the dataset.
i
Remarksi saved when
the dataset is saved and retrieved When the dataset is used: see [R] save, notes can be
j [
A not_ is nothing formal; it is merely a string of text--probably words in your native language Treminding you to do something, cautioning you against something, or anything else you might] feel like jotti lg down. People who work with real data invariably end up with paper notes plastered ground their tlerminal saying things like "Send the new sales data to Bob" or "Check the
I
income salary95; I don't believe if' or "The gender was significant!" would be betterv_iable jf theseinnotesi were attached to the dataset. Attached to dummy the terminal, they tend toItfall off
l
and get lost. Addin_ a note to y ur dataset requires typing note or notes (they are synonyms), a colon (:L and whatever _ou wan_ to remember. The note is added to the dataset currently in memory.
4
. n_te:
I
i
Send co_y to Bob once verified.
nite s
i
You can +play your n_tes by typing II
notes
(or note)
by itself.
!
Send copy ,_oBob once verified.
!
Once youi resave your _ata, you can replay the note in the future, too. You add more notes just as
i
you did tl_e first:
[
. nSte:
Ii 2i
i
Mary war_ts a copy, tOO.
Send copy to Bob once verified. Mary ,,rants a copy, too.
443
You can place time stamps on your notes by placing the word TS (in capitals) in the text of your note: • note: TS merged • notes
updates
from
JJ_F
_dta: I. 2. 3.
Send copy to Bob once verified. Mary wants a copy, too. 19 Jul 1000 15:38 merged updates
from JJ&F
The notes we have added so far are attached to the dataset generically, which is why Stats prefixes them with _dta when it lists them. You can attach notes to variables: • note mpg: is the 44 a mistake?
Ask Bob.
note mpg: what about the two missing values7 • notes _dta: i. 2. 3. mpg: i. 2.
Send copy to Bob once verified. Mary wants a copy, too. 19 Jul 2000 15:38 merged updates from JJ_F is the 44 a mistake? Ask Bob. what about the two missing values?
Up to 9,999 generic notes can be attached to _dta and another 9,999 notes can be attached to each variable.
Selectively listing notes notes by itself lists all the notes. In full syntax, notes is equivalent to typing notes _all in 1/1. Here are some variations: notes notes notes notes notes notes notes
_dta mpg _dta mpg _dta in 3 _dta in 3/5 mpg in 3/5 _dta in 3/1
list list list list list list list
all generic notes all notes for variable mpg all generic notes and mpg notes generic note 3 generic notes 3 through 5 mpg notes 3 through 5 generic notes 3 through last
Deletingnotes notes drop works much like listing notes except that typing notes all notes; type notes drop _a11. Some variations: notes notes notes notes notes
drop drop drop drop drop
_dta _dta in 3 ._dta in 3/5 _dta in 3/i mpg in 4
delete delete delete delete delete
drop by itself does not delete
all generic notes generic note 3 generic notes 3 through generic notes 3 through mpg note a
5 last
"
_
T
.......
!
............................
._ .......................................................
_
-_ .i ¸
i
_:
notes -- Place notes in data
" 445
Warningsi 1. Notes _re stored wit_ the data and, as with _her updates you make to the data, the _additions and deletions are not pei_nanent until you save the data; see JR] save, i i
I
2. The m_ximum lengt_ of a single note is 1,000 characters with Small Stata and 67,784 characters
+
with I ercooled Stala.
Methods it nd Forrrtulas ! '
noteaiis
implemen_d
as an ado-file.
1
i
References Gleason,
J, R. I998.
in Stata
Technical
i dm571
A notes
Butlqtin
editor
Reprints,
vol.
for Window 8, pp.
i
Also See
i
i
Complenenta_v:
[_] describe, [R] save
!
Related:
_] codebook
i
Backgrou nd:
L_]15,8 Characteristics
i
and Macintosh.
10_13.
1
i
J
Stata
Technical
Bulletin
43: 6-9.
Reprinted
f"f .."
! !
Title I nptrend, , -- Testfor trend across,°rderedgroups ,,,
I
Syntax nptrend
varname [if exp] [in range], by(groupvar) [ nodetail s_core(scorevar)]
Description nptrend
performs a nonparametric test for trend across ordered groups.
Options by(groupvar) is not optional; it specifies the group on which the data is to be ordered. nodetail
suppresses the listing of group rank sums.
score (scorevar) defines scores for groups. When not specified, the values of groupvar are used for the scores.
Remarks nptrend performs the nonparametric test for trend across ordered groups developed by Cuzick (1985), which is an extension of the Wilcoxon rank-sum test (rar,_ksum:see [R]signrank). A correction for ties is incorporated into the test. nptrend is a useful adjunct to the Kruskal-Wallis test; see [R] kwallis.
In addition to nptrend, for nongrouped data the signtest and spearman commands can be useful: see [R] signrank and [R] spearman. The Cox and Stuart test, for instance, applies the sign test to differences between equally spaced observations of varname. The Daniels test calculates Spearman's rank correlation of varname with a time index• Under appropriate conditions, the Daniels test is more powerful than the Cox and Stuart test. See Conover (1999) for a discussion of these tests and their asymptotic relative efficiency. > Example The following data (Altman 1991, 217) show ocular exposure to u]traviolet radiation for 32 pairs of sunglasses classified into 3 groups according to the amount of visible light transmitted. Group
Transmission of visible light
I 2
< 25% 25 to 35%
3
> 35%
Ocular exposure to ultraviolet radiation 1.4 0.9 2.6 0.8
1.4 1.0 2.8 1.7
1.4 1.1 2.8 1.7
Entering these data into Stata, we have 446
1.6 1.1 3.2 1.7
2.3 t.2 3.5 3.4
2.3 1.2 1.5 1.9 2.2 2.6 2.6 4.3 5.t 7.1 8.9 13.5
I
|
i _
V
......................................
_
............... i¸ 4
nptrend!--Test for trend across ordered groups
|
447
, li,t _xposmte 1.4
2.
1
1.4
3._
i
1.4
1
2.3
1
2.3
(o 7; ut omitted) 2 31 "i 3
s2.1
i
.9 8,9
s
ls.s
]
We use nt_trend to tes for a trend of (increasing) exposure across the 3 groups by typing . nl_rend exposure, by(group)
l
group 1
2z
=
sum of ranks 76
score t
obs 6
3
8
162
..522
18
290
3 i
!
i > Izl = i,13 Pr?b When the l_rou ps are g{iven any equally saced scores (such as -1, O, 1), we will obtain the same p , answer as !above. To illustrate the effect of changing scores, an analysis of these data with scores 1,
i
2, and 5 (_dmittedh' no! very sensible in this c_se) produces
ii
[
geb mysc = con_(group==3,5,group) nl_rend exposul_e,by(group) score(mys_)
I
group
s4ore
1 2 3 z
i _i
2 5 1
=
.46
Pr_b> Izl :
_.14
obs
sum of ranks
18 8 6
290 i62 76
"
This example suggests ihat the analysis is not all that sensitive to the scores chosen.
q
! i 3 Technical _lote
_
! I
The grc_uping variabt_ may be either a string v_able or a numeric variable. If it is a string variable and no scdre variable id specified, the natural nfimbers 1, 2, 3, .. are assigned to the goups in the son order !of the string!variable. This may not always be what you expect. For example, the sort raer olttle strings one, two, three _s one, three, two.
l
a
]
i
4 group 1
1.
i
SavedReSults nptrer@ii saves in r ): _
i
_calars r(N) r(p)
nuNber of observitions
r(z)
z statistic
two,sided p-value
r(T)
test statistic
i
-
- _
[,Ii!
..-
,,_.._,,,.
--
,_ot ,u, .u.u
across oruere(! groups
Methods and Formulas nptrend
is implemented as an ado-file.
nptrend is based on a method in Cuzick (1985). The following description of the statistic is from Altman (1991, 215-217). We have k groups of sample sizes ni (i = 1,..., k). The groups are given scores, li, which reflect their ordering, such as 1, 2, and 3. The scores do not have to be equally spaced, but they usually are. The total set of N = _ n_ observations are ranked from 1 to N and the sums of the ranks in each group, R/, are obtained. L, the weighted sum of all the group scores, is k L = E lini i=1
The statistic T is calculated as k T = E liRi i-=1
Under the null hypothesis, the expected value of T is E(T) = .5(N + l)L. and its standard error is
se--'(T)
=
I
(
n + 1 --_ N
k i=I
li2ni -- L 2
)
\ /
so that the test statistic, z, is given by z = { T - E(T) }/se(T), which has an approximately standard Normal distribution when the null hypothesis of no trend is true. The correction for ties affects the standard error of T. Let 2_"be the number of unique values of the variable being tested (N _
a--
The corrected standard error of T is _(T)
N(X - 1) = v/1 - a se(T).
Acknowledgments nptrendwas written by K_ A. Stepniewska and D. G. Altman (t992) of the Imperial Cancer Research Fund, London.
References Altman, D. G. 1991. Practical Statistics for MedicaJ Research. London: Chapman & Hall, Conover, W. 3. 1999. Practical Nonparametric Statistics_ 3d ed. New York: John Wiley & Sons. Cuzick, J. 1985. A Wilcoxon-type test for trend. Statistics in Medicine 4: 87-90. Sasieni, E 1996. snpl2: Stratified test for trend across ordered groups. Stata Technical Bulletin 33: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 196-201). 7"
i
i i
i
1
i Ii I
I +
i_
i
i
_ nptrend:+- Test for trend across o_ groups 449 Sasieni,P., L A. Stepniew_&a,and D. G. Altman+1996_.snpll: Test for trend across orderedgroups revisited. Stata Technic_ Bulletin 32: 7-29. Reprintedin Stata TechnicalBulletin Reprints,vol. 6, pp. 193-196. Stepniewsk K.A. and D. 3. Altman.1992.snp4:Nonparametric test for trend across orderedgroupS.Stata Technical i Bulletinl9 21-22. Reitr_ntedin StataTechnicalBultetinReprints,vol. 2. p. 169.
!
i
AlsoSee Related:
[R] epitah
[R] kwallis, [R] signrahk, [R] spearman, IN] symmetry
+ i i
i
I i
i
+ i t i
I I i
J
i
i
i
!
l
+
I
i i
i+ +
i i
_
' !',
I ILI_
I
obs
--
Increase
_e I I Inu]_ber
Ul
°f
°bse]_ati°ns
in
dataset
I
I
I
I
i
Syntax set obs #
Description sel; obs changes the number of observations in the current dataset. # must be at least as large as the current number of observations. If there are variables in memory, the values of all new observations are set to missing.
Remarks Example set obs can be useful for concocting artificial datasets. For instance, if you wanted to graph the function 9 = z e over the range 1 to 100, you could .
drop
_all
• set
obs
obs was
I00
O, now
100
generate
x = _n
generate
y = x-2
graph
y x
(graph notshown)
q
> Example If in a program you want to add an ex_a data point, local set
npl obs
= _N + 1 "npl"
Also See Related:
[R] describe
450
ologit
Maximum-likelihood
_
ordered logit stimation
i
il,
i
li
. i
i
iii
i1!ii I
I , ill
1
l !
Syntwx ologi_
depvar
[varlist]
[weight] [if
exp] [in range][
table robust
cl___uSter (varnam.z) s_£ore (newvarlist) !eve1 (#) offset (varname)
i I
i by .,.
i
maximize_options
]
_
i
: m_y be used with ologit; see [R] by,
I I
fweights, exghts, and l_we:tghts are allowed: see [U} /4,1.6 weight ologit _ha+s the features bf all estimation commands: see [U] 23 Estimation
i
ologit ma3i be used with i_ to perform stepwise estimation: see [R] sw.
i
and post-estimation commands,
i
l l
Syntax for tpredict outCome(outcome) '_ i
nooffset
]J
"
I !
Note that wilh the p option,!vou specify ei{her one or k new _afiables depending upon wbether tbe outcome() option is also s_cified (where/t" is the number of categories of dem'ar). With xb and strip, one new variable is specified.
i
These statistics are availabk the estimation sample.
i
t
both in and out of sample! type predict
...
if
e(sample)
...
if wanted only for
Descliptin
I
ologi_ estimates ortdered ]ogit models of ordinal variable depvar on the independent variables
! !
va,llst. Th_ actual valuds taken on by the dependent variable are irrelevant except that larger values are assumdd to correspohd to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata: 20 outcombs in Small _tata.
-_
!
See [R] logistic for _ list of related estimation commands.
Options
i
!
tablereqhests a table howing how the probabilities for the categories are computed from the fitted ; equatio_a.
I
robust s[ ecifies that tl_e Huber/White/sandwich estimator of variance is to be used in place of the traditio_}aI calculatioh: see [U] 23.11 Obtaiding robust variance estimates, robust combined with cluster () allc_,s observations which a_'enot independent within cluster (although they must be inde )endent between clusters)i
l
i
If you _?ecify pue5._ his, robust
is impliedi see [U] 23.13 Weighted estimation.
!
I
451
_:
:+ i
,; "
';
eluszer(varname) specifies that ......... the observations are independent across groups (clusters) but "+'== 'uSlIL estimation not necessarily within groups, varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individua/s, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates, cluster() can be used with pweighes to produce estimates for unstratified cluster-sampled data, but see the svyologit command in [R] svy estimators for a command designed especially for survey data. byClUster()itself, implies robust;
specifying
robust
score (newvartist) creates k new variables, where first variable contains OlnLi/O(xib); the second contmns OlnLj/O(_cut2j); and so on Not ,ho, . e ,,,+ ..Stata. sc(kW°Uld1).create the appropriate number of new
cluster()
is equivalent
to typing cluster()
k is the number of observed outcomes. The variable co ' . . . . ;¢ ...... ntmns O_Lg/O(_cutl,), the third ,, yuu were to speclry the option score (sc,), variables and they would be named me0, scl,
level (#) specifies the confidence level, in percent, for confidence intervals+ The default is Zleve] (95) or as set by set level;see [uJ 23.5 Specifying the width of confidence intervals. offsetto be(Varname)l, specifies that varname is to be included in the mode/with maximize_optiOnSspecify them. control the maximization
process: see [R] maximize.
coefficient constrained
You should never have to
Optionsfor predict p, the default, calculates the predicted probabilities. If you do not also specify the outcome () option, you must specify k new variables, where k is the number of categories of the dependent variable. Say you estimated a model by typing ologit result xl x2, and result takes on three values. Then you could type predict pl p2 p3 to obtain all three predicted probabilities. If you specify the outcome () option, then you specify one new variable. Say that result takes on the values I, 2, and 3. Then typing predict p1, outcome(I) would produce the same pl. xb calculates the linear prediction. You specify one new variable; for example, predict linear, xb. The linear prediction is defined ignoring the contribution of the estimated cut points. stdppredictCalculateSse,thestdp.Standard error of the linear prediction.
You specify one new variable; for example,
outcorae(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome () should contain either a single value of the dependent variable, or one of #1, #2 ..... with #1 meaning the first category of the dependent variable, #2 the second category, etc. nooffse_z
is relevant only if you specified offset
(varname)
for ologit.
I! modifies the calculations
made by rather thanpredict x 3b + offsetj. so that they ignore the offset variable; the linear prediction
is treated as xjb
Remarks Ordered Iogit models are used to estimate relationships between an ordinal dependent variable and a set of independent instance, "poor", "good", variables. and " An ordinal ,, variable is a variable that is categorical and ordered, for excellent , which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] legit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be
ologit-- MaXimum-likelihood orderediogitestimation
453
: !
ordered (e. _,.,residency in the north, east, south and west), see [R] mlogit. This entry is concerned only with nodels in whi ch the outcomes can be ordered.
! !
In order ed logit, an u _dedying score is estimated as a linear function of the independent variables and a set ,f cut points, rhe probability of observing outcome i corresponds to the probability that the estimad'd linear fum tion. plus random error, is within the range of the cut points estimated for
i
the outcomi.':
i
_ r(outcomej = i) = Pr(_,-1 < /31xlj +-" _ 3kZkj + uj _< t
i i t
t
Example
!
You wisi_ to analyze [the 1977 repair records of 66 foreign and domestic cars. The data are a variation o( the automobipe dataset described in [U] 9 Stata's on-line tutorials and sample datasets,
Io
The n:pair records, 't,like those in 1978. take on values poor, fair. average, good, and excellent. Here 1977 is a c_oss-tabulation of the data:
i
tab
rep77
forei+,
¢hi2
R_pair R_cord !
_1977
! Foreign Dome_tlc
iFair iPoor Average !Good
t
_o al ! Pearson
Total
! I
tO 2 20
1 l 7
3 27
t
13
7
20
,I 0
Excellent,
i
Foreign
5
_ 45 i 1 _hi2(4)
Although it lappears that
ll
oreign
5 66
21 =
13.8619
• Pr = 0.008
takes on the _alues "Domestic"
and "Foreigff',
it is actually
a numeric vhriable takin_on ..--.the, values 0 and I.: Similarly, rep77 takes on the values 1, 2, 3, 4, and 5, correiponding to 'l_oor , Fair , and so o0. The more meaningful words appear because we attached val_e labels to t_e data" see [U] 15.6.3 Value labels. ° t •
! I
i
Since theI chl-squaredb, alue is significant, you could claim that there is a relationship between :foreign an_t rep77. Lite_rally,however, you can only claim that the distributions are different; the ch_-squared lest _s not dt t_onal. One way to model these data is to model the categorization that took place Qhen the data _ere created, Cars havea true frequency-of-repair, which we will assume is given by Nj = 3 :fOr_ig'Xlj + ttj and a car is categorized as "poor" if Sj < _o, as "'fair" if _o < Sj < 4i, and so onI • olog!t
1
i
foreign,
table
Iterat
on O:
log !Likelihood
= -89,895098
Iterat
on t:
log
Likelihood
= -85.95176S
]terat
on 2:
log
Likelihood
= -85.908227
!terat
on 3:
log
likelihood
= -85.908161
Ordere
log_t 1
i_ Log
I
rep77
II elihood i
estimates " = -85.908161
Number
of obs
66
=
7.97
iR chi2(1)
=
PSeudo
R2
=
0.0444
Prob
_h_9-
=
0.0047
>
......
[,
u'"
.,.-^,,.u.m-.n¢..ooa
rep77
Coef.
foreign
oroerea
SCd. Err.
1.455878
.5308946
j
_cut1 _cut2
-2. 765562 -. 9963603
.5988207 .3217704
I
_cut3 _cut4
3.123351 .9426153
.3136396 .5423237
rep77
ioglt estimation
z 2.74
[95% Conf.
O. 006
.4153436
Interval] 2.496412
(Ancillary parameters)
Probability
Poor Fair Average Good Excellent
P>[zl
Observed
Pr( xb+u<_cutl) Pr(_cutl<xb+u<_cut2) Pr(_cut2<xb+u<_cut3) Pr(_cut3<xb+u<_cut4) Pr(_cut4<xb+u)
0.0455 0.1667 0.409i 0.3030 0.0758
Our model is Sj = 1.46 foreignj -+-uj; the expected value for foreign cars is 1.46 and, for domestic cars, 0; foreign cars have better repair records. The "ancillary parameters" _cut 1, _cut2, _cut3, and _cut4 correspond to the t_'s in our previous notation--they model the categorization. For instance, the probability that a foreign car is categorized as having a poor repair record is given by the probability that 1.46 i < -2.77 or, equivalently, uj _< -4.23. The estimated cut points tell us how to interpret the score and the estimates--produced because we specified the option table--reminds us of A car is estimated as having a poor repair record if the score is less than the (Actually, the table could say less than or equal, but since the logistic distribution probability of any particular value is zero so it does not matter.)
table below the the interpretation. estimated _cut1. is continuous_ the
For a foreign car, the probability of a poor record is the probability that 1.46 + uj < -2.77 or, equivalently, uj < -4.23. Making this calculation requires familiarity with the logistic distribution: the probability is 1/(1 4- e 42z) = .014. On the other hand, for domestic cars, the probability of a poor record is the probability _Lj _< -2.77, which is .059. This, it seems to us, is a far more reasonable prediction than we would have made based on the table alone. The table showed that 2 out of 45 domestic cars had poor records while 1 out of 21 foreign cars had poor records--corresponding to probabilities 2/45 -- .044 and 1/2t = .048. The predictions from our model imposed a smoothness assumption_foreign cars should not, overall, have better repair records without the difference revealing itseIf in each category. The fact that, in our data. the fractions of foreign and domestic cars in the poor category are virtually identical is due only to the randomness associated with small samples. Thus, if we were asked to predict the true fractions of foreign and domestic cars that would be classified in the various categories, we would choose the numbers implied by the ordered legit model:
tabulate Domestic Foreign
legit Domestic
Foreign
Poor Fair
.044 .222
.048 .048
.059 .210
.014 .065
Average Good Excellen|
.444 .289 .000
.333 .333 .238
.450 .238 .043
.295 .467 .159
ologit -- Makimum-likelihoodordered Iogit estimation
I
455
See H_pothesis test'._and predictions below for a more complete explanation of how to generate prediction_ from an ore ered logit model,
E3TechnicalNote ! i !
an arbitra_-y dichotomi/_ation, which might otherwise have been tempting, We could, for instance, have sum_narized thesd data by converting the '5-outcome rep77 variable to a 2-outcome variable. combinin_ cars in the _verage, fair. and poor categories to make one outcome and combinin_ cars in
;
the good _nd excellent stIcategorie to make the second. l t , Anoth!r even less _lpp,ealing,atternati, e would have been to use ordinau' regression, arbitrarily labeling _xcellent" as! good as 4, and so on. The problem is that with different but equally valid labelings (say I0 for _xcellent ), we would obtain different esnmates. We would have no way of choosin_ bne metric over another. That is not, however, true of ologit. The actual values used to label the i:ategones _ " make no difference other tl_an through the order they imply.
i
In facti our labeling was 5 for "excellent". 4 for "good". and so on. The words "excellent" and
! _
"good'-'at_pear in our o Jtput because we attached a value label to the variables' see [U] 15.6.3 Value labels. If!we were to n3w go back and type replace rep77=10 if rep77==5, changing all the 5s to 10s. w_ would still _btain exactly the same results when we re-estimated our model.
I
! i
i
i
i
{
i
Example
; I
In the !example abo_e, we used ordered lo_t as a way to model a table. We are not. however. limited to!including only' a single explanatory ._-,_ _fiable nor to including only categorical variables, We can explo_e the relatio_ship of rep77 with any of the variables in our data. We might, for instance. model re'_77 not only tn terms of the origin of manufacture, but also including length (a proxy for size) and_pg: . ologit rep77 f reign length mpg Ite_'ationO: Ite_'ationi:
log likelihood = -89.895098 log likelihood = -78.775147
Ite_'ation2: Ite_'ation3:
log log
Iteration 4:
log likelihood = -78.250719
likelihood = -78.25_299 likelihood = -78.25_722
i i i Ord+red logit estimates Logilikelihood _ -78.2507_9
LR chi2(3) Prob > chi2 Number of obs Pseudo R2
|
rep77
Cool.
P>lz[
[95% Conf. Interval]
! }
_ foreign _ length f i mpg
2.896807 .0828275
.7906411 .02272
3.66 3.65
0.000 0.000
1.347179 .0382972
4.446435 .1273579
.2307677
.0704548
3.28
0.001
.0926788
.3688566
i :
curl _cut2
17.92748 19.86506
5.551191 5.59648
(Ancillary
parameters)
i
_cut3 _cut4
22.10331 24.69213
5.708935 5.890754
i
z
23.29 0.0000 66 0.1295
!
!
Std. Err.
= = = =
......
_--.
v_.
iw_ii
liilUlllil|llill
foreign ptays a role, asand even larger role than previously. We find that larger cars tend to have betterstill repair records, doan cars with better mileage ratings.
,1::
I :i_; :! ;,
Hypothesis tests and predictions See [u] 23 Estimation and post-estimation commands for instructions on obtaining the variancecovariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
• i
_>Example , In a previous example, we estimated the model ologit rep77 predict command can be used to obtain the predicted probabilities.
foreign
length
mpg. The
You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is "poor" and the highest "excellent". We have five categories and so must type five names following predict; the choice of name is up to us: predict poor fair avg good exc (option p assumed; predicted probabilities) • list exc good make model rep78 if repT7 ==. 3. I0. 32.
exc .0033341 .0098392 .0023406
good .0393056 .1070041 .0279487
44. 53. 56. 57. 63.
.015697 .065272 .005187 .0281461 .0294961
.1594413 .4165188 .059727 .2371826 .2585825
make AMC Buick Ford Merc. Peugeot Plym. Plym. Pont.
model Spirit Opel Fiesta
rep78
Monarch 604 Horizon Sapporo Phoenix
Average
Good
Average
The eight cars listed were introduced after 1977 and so do not have 1977 repair records in our data. We predicted what their 1977 repair records might have been using the fitted equation. We see that, based on its characteristics, the Peugeot 604 had about a 41.65 + 6.53 _ 48.2 percent chance of a good or excellent repair record. The Ford Fiesta, which had only a 3 percent chance of a good or excellent repair record, in fact had a good record when it was introduced in the following year. <1
Q TechnicalNote For ordered legit, predict, xb produces Sj - Xlj;31 + X2j/_2 4-- -- q- Zkj/Pk. The ordered-legit predictions are then the probability that Sj + uj lies between a pair of cut points _,i-1 and i,;/. Some handy formulas are Pr(Sj + ,,j < n):
1/(1 + e s'-*)
Pr(Sj 4-uj > n) = 1 Pr(nl
1/(1 + e s'-*)
< Sj + uj < _;2) = 1/ (1.4 eS'-*2 ) - 1/(1+
e
")
i
_
_
!
;
ologit -- Maximum-likelihoodordered Ioglt estimation
! Rather t_an using pr+±ct
457
--i ............... directly, we coul_ cak'ulate the predicted probabilities by hand. tf we
wighed tc obtain the predicted probability that the repair record is excellent and the probability that it is good, ,_e look back fat ologit s output to obtain the c,ut points. We find that "good" corresponds to the int _rvat _cut3 _ Sj + u < _cut4 and i excellent to the interval Sj + u > _cut4:
i I
" predict score! xb ;n probgood _ l/(l+exp(score-_b[_cu,4]))
!
,
1
i
i
- 1/(I+exp(score-_b[_cut3]))
• g_n probexc = _I - £/(i+exp(score- b[,cut4]))
1
The resul s of our cal ulation will be exactly ihe same as that produced in the previous example.
!_
Note that}we refer to the estimated cut points just as we would any coefficient, so _b[,_cut3] refers to the valhe of the _c@3 coefficient; see [U] i6.5 Accessing coefficients and standant errors,
SavedR,ults ologil , Scalars
r
i
saves in e(I:
e(N_ e(k__cat)
nut bet of observations nun ber of categories
e(ll) e(ll_O)
log likelihood Iog likelihood,
e(d__.m)
moc el degrees of freedom
e(ch±2)
X2
e (r__p)
pse_tdo R-squared
e (N_clust)
number of clusters
constant-only
model
i
Macros i
! ! i
e(c+d) e(d@var) e(w_ype)
o].0 21; narr of dependent variable weJ _t type
e(vcet_e) e(chi2type) e(offset)
covariance estimation method Wald or Lit: type of model x 2 test offset
e(w_xp) e(c]ustvar)
weight expression nam of cluster variable
e(predict)
program used to implement predict
i
coef ]cient vector care _ory values
e (V)
vafiance-covariance estimators
i
Matrices '
l
e (b) e (c_ t)
matrix of the
Functions i
e(sa_ple)
marl:s estimation sample
ethods,nd For M ' i _ I
i i {
las
A straightforward textbook description of the model fit by ologit, as well as the models tit by oprobiit, clogit, mlogit, can be found in Greene (2000, chapter 19). When you have a qualitati_ie dependentk,ariable, several estimation procedures are available. A popular choice is muttinomia_ logistic reglession (see JR] miogit)_ but if you use this procedure when the response V ' • i .... , .... anable ts ,ordmal. youiare d|scardmg mformatmn because multmomml logtt ignores the ordered aspect of t_e outcome Ordered logit and probii models provide a means to exploit the ordering information[ '
4nO
There isimore than ol e "ordered logic model, The model fit by ologit,which we will call the ordered lo@ model, is a] ;o known as the proportional odds model. Another popular choice, not fitted by ologitlis known as he _tereotype model. AIi ordered logit models have been derived by s_arting with a bindS, logit/probi model and generalizing I it to allow for more than two outcomes.
;_"_r_
'
_
oioglt -- Maximum-likelihood ordered Iogit estimation
The proportional odds ordered logit model is so called because, if one considers the odds odds(k) = P(Y < k)/P(Y > k), then odds(k1) and odds(k2) have the same ratio for all independent variable combinations. The model is based on the principle that the only effect of combining adjoining categories in ordered categorical regression problems should be a loss of efficiency in the estimation of the regression parameters (McCullagh 1980). This model was also described by Zavoina and McKelvey (1975), and previously by Aitchison and Silvey (1957) in a different algebraic form. Brant (1990) offers a set of diagnostics for the model. Peterson and Harretl (1990) suggest a model that allows nonproportional explanatory variables, Fu (1998).
odds for a subset of the
ologit does not allow this, but a model similar to this was implemented
by
The stereotype model rejects the principle on which the ordered logit model is based. Anderson (1984) argues that there are two distinct types of ordered categorical variables: "grouped continuous", like income, where the "type a" model applies; and "'assessed", like extent of pain relief, where the stereotype model applies. Greenland (1985) independently developed the same model. The stereotype mode/starts with a multinomial logistic regression model and imposes constraints on this model. Goodness of fit for ologi'l;
can be evaluated by comparing
the likelihood value with that obtained
by estimating the model with mlogit. Let Lj. be the log-likelihood value reported by ologit and let L0 be the log-likelihood value reported by mlogit, if there are p independent variables (excluding the constant) and c categories, mlogit will fit p(c - 1) additional parameters. One can then perform a "likelihood-ratio test", i.e., calculate -2(L1 - L0), and compare it to )C2{p(c2)}. This test is only suggestive because the ordered logit model is not nested within the multinomial logit model. A large value of -2(L1 - L0) should, however, be taken as evidence of poorness of fit. Marginally large values, on the other hand, should not be taken too seriously. The coefficients and cut points are estimated using maximum-likelihood as described in [R] maximize. In our parameterization, no constant appears as the effect is absorbed into the cut points. ologit and oprobit begin by tabulating the dependent variable. Category i = 1 is defined as •"the minimum value of the variable, i = 2 as the next ordered value, and so on, for the empirically determined [ categories. The probability
of observing an observation
Pr(outcome
= i) = Pr
t_i-1
<
in the case of ordered logit is
_jxj
J
n-
U <
t_i,
1
1
l +exp(-t_i+ Note that _;0 is defined as -vo
_-_./3jzj)
+ _ /3jzj)
and t_I as +co.
In the case of ordered probit_ the probability
Pr(outcome
l +exp(-__l
= i) -- Pr(_i_l
/ "
where @() is the standard normal cumulative
of observing an observation
< Z/3jxj J
3
;i]jl
distribution
+ u < _)
_
¢_(I_-"
function.
1
_32j_j) j
is
-
otogit -- MaXimum-likelihood ordered Iogit estimation
459
References
Aitchison, J. and S. D. Sih)y. 1957. The generalization Of probit anah'sis to the case of multiple response,_. Biometrika
:
1
Anderson. i" A. 1984. Re ession and ordered categorical variables (with discussion)_ Journal of the Royal Statistical
44:131 140,
! l
Societyj Series B 46: 1_30. Brant, R. 1_990.Assessing,proportionality in the propot'tional odds model for ordinal lo_isticLregession. Biometrics 46: I17_-t178.
}
Fu, V.K.
_-_ ,!
Stata T_chnicat Bulletir Reprints, vol. 8. pp. 160-164. Goldstein, R. 1997. sg59: Index of ordinal variation arid Norman-Barton GOF. Stata Technical Bulletin 33: 10-!2.
i _ I
Reprini__ in Stata Teclnical Bulletin Reprints. vol. 6. pp. 145-147. Greene, _'_'iH. 2000. Econgmetric Anah,sis, 4th ed. Upper Saddle River. NJ: Prentice-Halt. Greenland,!S. 1985, An a_plication of 10_istic models to the analysis of ordinal response, giometrical Journal 27: 189-19_. i
!
Lor_g,J. SI' 1997. Regresston Models for Categorical and l,imited Dependent Variables Thousand Oaks, CA: Sa_e i Publications. i
[ _i
McCulla_h! R 1977. A logistic model for paired comp_'isons with ordered categorical data. Biometrika 64: 440-4'_ . _98d, Regression m_l)delsfor ordinal data (with discussion). Journal of the Royal Statistical Society, Series B
98. sg88: Estimating generalized ordered t0git models. Stata Technical Bulletin 44: 27-30. Reprinted tn
McCullagh,!R and J.A.
elder, t989. Generalized Linear Models, 2d ed, London: Chapman & Hall.
Peterson. Ii. and E E H_rrelt, Jr, 1990, Partial proportional odds models for ordinal response variables. Applied I
StatistiCs39:205-217., Woife, R. 1_98. sg86: Continuation-ratio models for ordil_alresponse data, Stata Technical 13ultetin44: 18-2t in Stat_ Technical Bull'.tin Reprints. vol, 8, pp, 149-153
!
Wolfe, R. iand W. W. GoJld. 1998. sg76: An approxlmate likelihood-ratio test for ordinal restxmse models. 5tata Technic_l Bulletin 42: ?4- 27. Reprinted in Stata Technical Bulletin Reprints. vol. 7. pp. t99-204.
_'
Zavoina, W. and R. D. _ cKetvey. 1975, A statistical model for the anah'sis of ordinal level dependent variables Journaliof Mathematic_ Sociology 4: 103-120.
!!
_:
AisoSee =
Complen_entary: Related: BackgroUnd:
iR]R] w,adjust'[R] test,[R] [R]linktest'vce lrtest, [R] [R] svy mfx.estimators [R] predict. R] slogistic, [P,]lincom.[R] logit,testril.[R] [R]: mlogit, [R][R] oprobit, u] 16.5 Accessing u] 23 Estimation
coefficients
and standard
and _st-estimation
U] 23.11 Oblaining
robust variance
U] 23.12 Obtaining
scores,
R] maximize
errors.
commands, estimates.
Reprinted
_lt_
i ILli_
oneway
-- One-way analysis of variance
Syntax {i,
on_eway response_varfactor_var
[weight]
[if exp] [in range]
[, noa-nova nolabel
missing wrap tabulate [no]means trio]standard[no]freq Ino]obs bonferroni s__ccheffe sidak I
by ...
: may be used with oneway;
aweights
and freights
see JR] by.
are allowed;
see [U] 14.1.6
weight.
Description The oneway comparison tests.command reports one-way analysis-of-variance If you wish to estimate more complicated (ANOCOVA)models, see [R] anova.
ANOVAlayouts or wish to estimate analysis-of-covariance
See [R] encode for examples of estimating See [R] loneway
(ANOVA)models and performs multiple-
ANOVAmodels on string variables.
for an alternative oneway command with slightly different features.
Options noanova
suppresses
the display of the ANOVAtable.
nolabel causes the numeric codes to be displayed rather than the value labels in the ANOVA and multiple-comparison test tables. rni s sing requests that missing values offactor_var to be omitted from the analysis.
be treated as a category rather than as observations
wrap requests that Stata take no action on wide tables to make them readable. Unless wrap specified, wide tables are broken into pieces to enhance readability.
is
tabulate produces a table of summary statistics of the response_var by levels of the factor_var. The table includes the mean, standard deviation, frequency, and. if ihe data are weighted, the number of observations. Individual elements of the table may be included or suppressed by the ino]means,[no]standard,[no]freq,and [no]obsoptions. Forexample,typing ,
oneway
response
factor,
tabulate
means
standard
produces a summary table that contains only the means and standard deviations. the same result by typing oneway
response
factor,
tabulate
You could achieve
nofreq
[no]means includesabove. or suppresses only the means from the table produced by the tabulate See tabulate 460
option.
)
I [' , !
,,,,e I .... Ii
i
Elapse --
nerate pharmacokinetic measurement dataset
Syntax
!
t stlat (measure) I keep(varlist)
t
I
force
_odots
]
where treasure is any of
_ ! !
au¢ aucli to aucex auclo
area area area area
i
half ke cmax
half life _f the drug elimination rate maximun_ concentration
tmax
time concentration time at of l_st haximum concentration
I
und und und und
,r the ',rthe ',r the ,.rthe
concentration-timel curve (AUCo,_) concentration-time curve from 0 to vc using a linear extension concentration-time curve from 0 to vc using an exponential extension log concentration-time curve extended with a linear fit
t omc
i
Description t
•
pkco_lapse
is on_ of the pk commands. If you have not read [R] pk, please do so before reading
pkeo:_lapse
generhtes new variables with the pharmacokinetic summary measures of interest.
Options
' . I
!
id(id_v(r), which _sinot optional, specifies the variable that contains the subject id over which pk¢o_ lapse is to +perate.
! !
fit(#) .__ecifies the r_umber of points to use ita estimating the AUC0.oo.The default is the last three points fit(3), which should be viewed as a minimum. The appropriate number of points will
I !
depen_ on your dat_. trapezoid tells Statal to use the trapezoidal rule when calculating the AUC. The default is cubic splinel, which give_ better results for most ftinctions. When the curve is very irregular, trapezoid may g_ve better res!alts. stat(melm, re) specifies which measures should be generated, The default _s to generate all the measures, i
1 I
i i
keep(va_tist) specifie_ which variables should be kept during the collapse, Variables not specified with t_e keep () op!ion will be dropped. When keep ()is specified, the keep variables are checked _o ensure that all v_lues of the variables are the same within id_var. force
folces, the collapse, even in cases where ihe values of the keep()
nodots s_ppresses the dots during calculation. 514
variables are different within
pk --
-_-
2 2 2 2 2 2 2 2 2 2 (output omitted ) This
format
for two
1 1 1 1 1 1 1 1 1 1
is similar
drugs
at each
to the second time
id i 2 3 4 5 7 8 9 10 12 13 !4 15 18 19 20 For that
this
we
the dataset administered. The
data
observation
produced
was the
applied study.
is only the
subject
during We
the
need
WE see
that
in the
Similarly,
and
id I0 i0
study
first
expansion
1 was
and
treatment
into need
the
two
two
period
in sequence and
had
10 (the
calls
we have data
measurements
to the
first
of
following:
study
Note
to the
AUCs,
each
treatment
was
format.
That and
is.
one,
when
treatment
A
period
of
second
the
outcome
the
the
subject
need
[R] pkshape.
means
indicating
is one we
in the
so that one
see
which
applied
there
pkequiv,
pkshape:
variables,
subject
measure.
addition
when
using
B was
the
In
pkcross
observations new
pharmacokinetic
wide
is in sequence
measure treatment
received
that
to be
period A B
auc 137.0627" 139.7389
these
indicating
To use
treat
study,
of subject
subject
Stata
subject
of the
auc 150.9643 218.5551
of the
sequence 2 2
recording
the
number
expansion
we
now
pkexamine.
be accomplished
observation
In addition, another
that
as our
by
outcomes.
can
of the
expect
period
This
period the
except
drugs
each
in what
This
1 1
first
are
511
/
two
for
or more
format.
goal
computed
dataset.
sequence
subject
is to transform
in the
variable.
might
the
first
to split
received One
id 1 1
applied
subject
The
variable
pkcollapse
to long
above,
AUC for
data
8.710549 10. 90552 8. 429898 5. 573152 6. 32341 .5251224 7.415988 6.323938 1,133553 5. 759489
described
measures
two
(biopharmaceutical)
auc_concB 218. 5551 133. 3201 126. 0635 96.17461 188.9038 223.6922 104.0139 237.8962 139.7382 202.3942 136.7848 104.5191 165.8654 139. 235 166. 2391 158. 5146
the
the
containing
data
in a single
treatment.
by
the first
to use of
a sequence
subject
these
subject.
auc_concA 150. 9643 146. 7606 160.6548 157. 8622 133.6957 160.639 131.2604 168.5186 137.0627 153.4038 163.4593 146.0462 158.1457 147. 1977 164. 9988 145. 3823
any
contains
per
to transform Consider
used
7.253442 5. 849345 6. 761085 4. 33839 5. 04199 4. 25128 6. 205004 5. 566165 3. 689007 3.644063
format
each
we chose
have
also
for
seq 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
example,
could
1.5 2 3 4 6 8 12 16 24 32
Pharmacokinetic
l, had
1 2 an AUC of
an AUC of 218.5551 first
treat B A
subject
in the period I 2
150.9643 when
sequence
when
treatment
treatment
B was
2) would
be
A was applied.
[i i I_ _I i
pk_ P k,o.c
in the first dataset, vdu will need to use reshape to change it to the second form; see [R] reshal_. Because the data in th_ second (or long) format containsinformation for one drug on severalsubjects, pksummzan be used t)pproduce summary statistics of the pharmacokinetic measurements. The output .
ksunm id t
S_ary i
e concA
statistics
for
star,
i
_
Mean
the
pharmacoktnetic
Median
measures N_mber of observations =
_ariance
Kurtosis
p-value
127,58
-0.34
2,07
O.55
auc
, L51.63
152.18,,
aucline
397.09
219,83
1_8276.59
2.69
9.61
O,O0
aucexp auclog half
_68.60 _65,95 90.68
302,96 298.03 29,12
720356.98 752573.34 17750.70
2.67 2.71 2,36
9.54 9.70 7.92
0.00 0,00 O,O0
! 0.02 i 7,37
0.02 7.42
0.00 0.40
0.88 -0.64
3.87 2,75
0.08 0,68
3,38 32.00 !
3.00 32,00
7.25 0.00
2,27
7,70
0,00
ke cmax tomc tmax
I
T
16
Skewness
Until aow, we hav_ been concerned with the profile of only one drug. We have characterized the profil '. of that dru_by individual subjects using pkexamine and by a group of subjects using pkm_mm.['he goal of hn equivalence trial, however, is to compare two drugs, which we will do in the remai_der of this e_ample. In the case of equi,_alencetrials, the study design most often used is the crossover design. For a complete discussion of!crossover designs, see Ratkowsky,Evans, and Attdredge (1993).
!
Briefly_crossover d_signs require that each sr_bjectbe given both treatments at two different times. The orde_in which thd treatments are applied ihanges between groups. For example, if we had 20 subjects n_mbered I through 20, the first I0 would receive treatment A during the first period of the study, ther_they would l_egiven treatment B. Thesecond I0 subjectswould be given treatment B during the first p_riod of the study, then they would be given treatment A. Each subject in the study will have four vari@les that describe the observation: a _ubject identifier, a sequence identifier that indicates the order bf treatment, and two outcome variables, one outcome for each treatment. The outcome vari_ables_or each sub _:tare the pharmacokinetic measures. The data must be transformed from a series of rbeasurements on individual subjects to data containing the pharmacokinetic measures for each subj@t. In Stata _lance. this is referred to as a collapse, which can be done with pkcollapse:
I
see [R] pl_coilapse. To illustrate pkcoll:_pse,
!
i i
i i
I
1
assume that we have the following dataset:
1 id1 1 I t I I 1 i 1 1 1 1 2
1 Iseq1 1 1 I 1 1 I 1 1 1 i 1 1
0 ttme.5 1 1.5 2 3 4 6 8 12 16 24 32 0
2
1
•5
2
1
1
0 3.073_03 concA 5.188444 5. 898577 5.096378 6. 0941_85 5. 158172 5.7065 5. 272_67 4. 4576 5. 146423 4.947427 1,920421 0
0 3.712592 concB 6. 230602 7. 885944 9.241735 13.10507 .169429 8.759894 7.985409 7. 740126 7.607208 7,588428 2.791115 0
2.48_62
. 9209593
4,883_9
5.925818
pk -- Pharmacokinetic (biopharmaceutical) data
509
. pkexamine time conc :
Maximum concentration Time of maximum concentration Tmax Elimination rate Half life
= = = = =
4.7 3 32 0.0279 24.8503
Area under the curve r
AUC [0, Tmax]
.....
AUC [0, inf.) Linear of log conc.
85.24
AUC [0, inf.) i AUC [0, inf.) Linear fit i ExponenZial fit
142.603
107.759
142.603
Fit based on last 3 points.
Clinical trials, however, require that data be collected on more than one subject. There are several ways to enter raw measured data collected on several subjects. It would be reasonable to enter for each subject the drug concentration value at specific points in time. Such data could be id 1 2 3
concl 0 0 0
conc2 1 2 1
conc3 4 6 2
conc4 7 5 3
conc5 5 4 5
conc6 3 3 4
conc7 1 2 1
where concl is the concentration at the first measured time, conc2 is the concentration at the second measured time, etc. This format requires that each drug concentration measurement be made at the same time on each subject. Another more flexible way to enter the "data is to have an observation with three variables for each time measurement on a subject. Each observation would have a subject id, the time at which the measure.merit was made, and the corresponding drug concentration at that time. The data would be id 1 1 1 1 t 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
concA 0 3.073403 5.188444 5.898577 5.096378 6.094085 5.158772 5.7065 5.272467 4.4576 5.148423 4.947427 1.920421 0 2.48462 4.883569 7.253442 5.849345 6,761085 4.33839 5.04199 4.25128 6.205004 5.566165 3.689007 3.644063
time 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 0 .5 1 1.5 2 3 4 6 8 12 16 24 32
(ou_utomitted)
Stata expects the data to be organized in the second form. If your data are organized as described
...................
!
I t
[
' "_ !
....................................
pkex mine will co pute and report all the t_harmacokineticmeasures that Stata produces including PkuPhB_fl? of tl_ area Okir_ticunder_iOpharmB_O_icam) dSta fc/urcak ulations the time-vdrsus-concentrationcurve. The standard area under the curve from 0 to the _aximum observed time (AUCo,tm.) is computed using cubic splines or the area curve tO pkexamine compute trapezok:aI rule. Additionally, will also the under the from 0 infinity t,v extending t_e standard time-versusqzoncentrationcurve from the maximum observed time _
l_i
i !
t
usin_ thr_ different rrWethods.The first method%implyextends the standard curve using a least squares linear fit through the iast few data points. The second method extends the standard curve by fitting '
a decreadn_ exponenlial curve through the l_tstfew data points. Lastly, the third method extends the curv,_by fitting a least squares linear regrdssion line Onthe log concentration. The mathematical details o_these extensions are described in th_ Methods and Formulas section of [R] pkexamine. Data [rom an equikalence trial may also bd analyzed using methods appropriate to the particular study detign. When ybu have a crossover design, pkcross can be used to fit an appropriate ANOV_. model. As an aside, _ crossover design is simply a restricted Latin square; therefore, pkeross can also be lsed to analyie any Latin square design. Therelare some pNctical concerns when de_ling with data from equivalence trials. Primarily, the da_anee_ to be organiied in a manner that Stat_ can use. The pk commands include pkeollapse and pkshap_, which are +signed to help transform data from a common format to one that is suitable for analysis with Stat_. In thd following example, we illustrate severWdifferent data formats that are frequently encountered in pharr_aceutical research and describe how ihese formats can be transformed to formats that can
bea. y ed S,.,+ 1
)
[ [
i
,>Example
i !
Assu__ethat we ha,_eone subject and are interested in determining the drug profile for that subject. A reasonable, experiment would be to give, thei subject the drug and then measure the concentration • _ .
}
of the d4g m the subject s blood over a t,me period. For example, here is a dataset from Chow and --
time
1
o
.g
[
[ l
i'on
o 0
1.5 2 3 1 4
4.4 4.4 4,7 2.8 4.1
8 12
3.6 3
24 32 16
1.62 2.5
°
1
concentrat
,
)
Examining these d ta, we notice that the concentration quickly increases, plateaus for a short period, a_d then slowh' decreases over time. pkexamine is used to calculate the pharmacokinetic
i
measuresi°f interest" li_examine is explained !n detail in [R] pkexamine The °utpul is
le I pk-
Pharmacokinetic
(biopharmaceutical)
data
[
I
I
i
Description The term pk refers to pharmacokinetic
data and the commands,
all of which begin with the letters
pk, designed to do some of the analyses commonly performed in the pharmaceutical industry. The system is intended for the analysis of pharmacokinetic data, although some of the commands are of general use. The pk commands pkexamino pkst__mm pkshape pkcross pkequiv pkcollapse
are [R] [R] [R] [R] [R] [R]
pkexamine pksumm pkshape pkeross pkequiv pkeollapse
Calculate pharmacokinetic measures Summarize pharrnacokinetic data Reshape (pharmacokinetic) Latin square data Analyze crossover experiments Perform bioequivalence tests Generate pharmacokinetm measurement dataset
Remarks Several types of clinical trials are commonly performed in the pharmaceutical industry. Examples include combination trials, multicenter trials, equivalence trials, and active control trials. For each type of trial, there is an optimal study design for estimating the effects of interest. Currently, the pk system can be used to analyze equivalence trials. These trials are usually conducted using a crossover design; however, it is possible to use a parallel design and still draw conclusions about equivalence. The goal of an equivalence trial is the assessment of bioequivalence between two drugs. While it is impossible to prove two drugs behave exactly the same, the United States Food and Drug Administration believes that if the absorption properties of two drugs are similar, then the two drugs will produce similar effects and have similar safety profiles. Generally, the goal of an equivalence trial is to assess the equivalence of a generic drug with an existing drug. This is commonly accomplished by comparing a confidence interval about the difference between a pharrnacokinetic measurement of two drugs with a confidence limit constructed from U.S. federal regulations. If the confidence interval is entirely within the confidence limit, the drugs are declared bioequivalent. An alternative approach to the assessment of bioequivalence is to use the method of interval hypotheses testing, pkequiv is used to conduct these tests of bioequivalence. There are several pharmacokinetic measures that can be used to ascertain how available a drug is for cellular absorption. The most corn mort measure is the area under the time-versus-concentration curve (AUG). Another common measure of drug availability is the maximum concentration (Cmax) achieved by the drug during the follow-up period. Stata reports these and other less common measures of drug availability, including the time at which the maximum drug concentration was observed and the duration of the period during which the subject was being measured. Stata also reports the elimination rate, that is, the rate at which the drug is metabolized, and the drug's half-life, that is. the time it takes for thc drug concentration to fall to one-half of its maximum concentration. 507
1 l
_: ...........
.
i
506
.............
.........
pergram-- IPeriodogram
Also See C0mple[ _enta_:
IR] tsset
Related:
IR] corrgram, JR] cumsp, JR]wntestb
Baekgro_rod:
_tata Graphics Manual
pergram-- Periodogram
505
Methodsand Formulas <_.'_
"-_=
pergramis implemented as an ado-file.
_-._
We use the notation of Newton (1988) in the following.
= _:
A time series of interest is decomposed into a unique set of sinusoids of various frequencies and amplitudes.
_
A plot of the sinusoidal amplitudes (ordinates) versus the frequencies for the sinusoidal decomposition of a time series gives us the spectral density of the time series. If we calculate the sinusoidal amplitudes for a discrete set of "natural" frequencies (I/n, 2/n .... , q/n), then we obtain the periodogram. Let x(1),..., k = I,...,(n/2)
z(n) be a time series and let wk -- (k - 1)/n denote the natural frequencies for + 1. Define
=
t=l
A plot of n C k2 versus ,_k is then called the periodogram. The sample spectral density is defined for a continuous frequency w as
1
n x(t) e2_i(t-l)_
r
f(1 - co)
ifwe[0,.5]
if coC [.5,1]
Note that the periodogram (and sample spectral density) is symmetric about w -- .5. Further standardize the periodogram such that
n
k=2
82
= 1
(where 82 is the sample variance of the time series) so that the average value of the ordinate is one. Once the amplitudes are standardized, we may then take the natural log of the values and produce the log-periodogram. In doing so, we truncate the graph at 4-6. Note that one frequently drops the prefix "log-" and simply refers to the "log-periodogram" as the "periodogram" in text.
References Box. G. E. P. and G. M. Jenkins.
1976. Time Series Analysis:
Box, G E. R, Jenkins. G. M. and G, (2. Reinsel. Englewood Cliffs, N J: Prentice-Hall. Chatfield. Hamilton, Newton.
C. 1996. The Analysis J. 1994. Time Series H. J. 1988,
TL¥1ESLAB:
of Time Series: Analysis,
Forecasting
1994, Time
An introduction.
Princeton:
Princeton
A Time gerie._ Analysi_
Series
and Control. Analysis:
5th ed, London:
University
Laboratory,
Oakland,
Forecasting Chapman
CA: Holden-Day.
and
Control.
3d ed.
& Hall.
Press.
Pacific Grove,
CA: Wadsworth
& Brooks/Cole.
r'
graphsumac ime, xlab ylab s(o) c(1)
=
20"
l
i
i
t
-11]
o V
]i
t
•
i
-2o 5
.
rgra_
I
100
1 0
sturfc, gen(ordinate) Sample spectrsl denstt_ functmn evaluated af the nalura_ frequencies l t 'i| |
• •
rl
I
6.00
" 6.00
4.00
- 4.00
0.00
" 0.00
I
_. •
-2.00
i
-4.00
- -4.00
-6.00 1
_ :
0.00
The periodigram
_'---: I 0,10
..... _------: ............ --:-::--_: : ..... l!! _ ' | 4 ' ' f 0.2_ 0,30 _requency
clearly shows the four contributions
can see tha l the periods
?f the summands
0.40
to the original
-6,00
O.SO
time series. From the plot, we
were 3, 6, 12, and 36, although
you can confirmthis
by
, ge doubleomen.= (.n-1)/144 using • genldouble ni peric d = i/omega
i
(1 missingvalue __nerated) . lie_ period [ 5. 13, 25.
!_ I_ I 1
49.
ome_
period 36 12 6
3
if ordinate>O
k omega<=.5
omega ,02777778 .08333333 I'16666667
i 33333333 <1
l
1
pergram -- Periodogram graph
lynx
time,
.... :
--_
xlab
8000
-
6000
"
4000
-
ylab
s(o)
503
c(1)
z
2000 0 4
_.
j]
V I
Time
• pergram
lynx Sample spectral density function evaluated at tMe natural frequencies i r ,,l ....
I
= - 6,00
6,00 "
4,00 -
_E
2.00 -
>.o
0.00
mc_
.¢:t o E .J
j
4.00
2.00
-
0,00
?
-2.00
-2.00
z
=
¢,
T"
-4.00
-6.00
-4,00
T-0,00
t O. 10
l 0.20
_-- -6.00 0.30
0.40
0,50
Frequency
The periodogram indicates that there may be a periodicity at 15 years for these data, but is otherwise random in nature. In [R] eorrgrarn, we see evidence of the ARMA (autoregressive moving average) nature of this time series,
q
_' Example In order to more clearly highlight what the periodogram depicts, we present the result of analyzing a time series of the sum of four sinusoids (of different periods). The periodogram should be able to decompose the time series into four different sinusoids whose periods may be determined from the plot.
' r
graph spot
_
ime,
xlab
ylab
s(o)
cCt)
200 "
-ca° 15o-
t
¢
too E Z
1
I
1700
i • Ipergra_
1800
°
i
1900
.i
Year
2000
,
,
spo_ ev_fualed at 1he nat_ra frequencies Sample spectral de_sily i _ i functlo_ 1,
I
t
......
6,00
"_ E
" 6.00
4,00
" 4,00
2.00
" 2.00
-2.00
-2.00
-400
-4.00
I=o
-6.00
I 0.00
i 0.,0
0 20
0.30
0.40
-6 O0
0 50
Frequency
The eriodogram _dicatesa peak frequency between 10 and 12 years. ,1
i
- Example Here _veexamine tl:e number of trapped Cat_adianlynx (Newton t988. 587). The raw series and
the' log-plriodogram ar given as
pergram -- Periodogram
501
graph air time, xlab ylab s(o) c(1)
600
. o i
ii
i
1
400
g ,al ea 200
"
0 t t950 Time
1 1955 (in months)
t9
0
pergram air Sampie spectral density evaluated at the natural t
tunction lreq uencies I I
I
,
6.09
600
4.00
I
- 4 O0
E
2.00
2.00
¢' o
O.OO
- C 00
cr_
tOl_
°°
I!
.,:::
0ooo
-6,00
0.00
" -6.00 O, 10
0.20 0.30 Frequency
0,40
0.50
The periodogram clearly indicates the annual cycle together with the harmonics. The similarity in shape of each group of twelve observations reveals the annual cycle. The magnitude of the cycle is increasing, resulting in the peaks in the periodogram at the harmonics of the principal annual cycle.
<1
_' Example In this example, the data consist of 215 observations on the annual number of sunspots from 1749 to 1963 (Box and Jenkins 1976, Series E). The graph of the raw series and the log-periodogram for these data are given as
i
I
lm--
Title Syntax
i
pergra_ is for use with e-series data: see [R] tsset You must tsset your data before using pergram. In addition. the tilae series must _ dense (nonmissing and nd gaps in the time variable) in the specified sample,
!
varname may contain ti_e-series operators: see [U] N.4.3 Time-series varlists.
{
DesCdplion
, i
per ram plots the. tog standardized periodogram for a (dense) time series.
Options ! !
genera':e (newvarna_ne) specifies a new variable to contain the raw periodogram values. Note that the g_merated grapl_log-transforms and scales the values by the sample variance and then truncates
i
them to the [-6, I] interval prior to grapliing them. graph_c,r_tionsare an!, of the options allowed with graph, t_ovay; see [G] graph options.
Ji Remark., A go )d discussior] of
.
the periodogram is provided in Chatfield (1996), Hamilton (1994), and
classic .terence is Btx. Jenkins. and Reinsel (1994). Teehnid Note Newton !1988). Chatt_eldis also a very good introductoryreference for time-series analysis. Another
i I
perg _m produces a scatterplot where the points of the scatterplot are connected. The points themseh es represent the log-standardized periodogram, and the connections between points represent the (conlinuous) log-slandardized sample spectral density. Although the periodogram is asymptotically unbiased for the specttat density, it is not cons!stent, and many analysts witl obtain the raw ordinates from thi command _ith the gen() option and smooth them prior to plotting.
¢ I !
1 1
main features of the lots.
Exampl_ In t following e_amples, we present the periodograms together with an interpretation of the We h{avetime-serie_data consis+tingof 144 observations on the monthly number (in thousands) of intemati_malairline p_,ssengersbetween 1949,and 1960 (Box. Jenkins. and Reinsel 1994. Series G_. We can Faph the ra_ series and the ]og-periodogram for these data by typing
i i !. !
I
+
pctile m Create variable containing percentiles
499
numbered, respectively, 1,2,..., m, based on the m quantiles given by the p_:-th percentiles, Pk = 100k/m for k = 1,2,...,mI. Note that if x[pk_i] - x[pd, then the kth category is empty. All elements are put in the (k - 1)th category:: (xb_k_2],x[pk_ll ]. If xt:i.le
is used with the eutpoints
(-cc,
(varname)
(YO),Y(2)],
and they are numbered, respectively, Y(1), Y(2), .... Y(m).
1, 2,...,
"--,
option, then the categories (Y(m-1),Y(m)],
x -- x__l
where
] = x[pk]
are
(Y(m), +_)
m 4- 1, based on the m nonmissing
values of varname:
Acknowledgment xtile is based on a command originally posted on Statalist (see [u] 2.3 The Stata listserver) Philip Ryan of the University of Adelaide, Australia.
AlsoSee
r
Related:
[R] centile, [R] summarize
Background:
[U] 21.8 Accessing results calculated by other programs
bv
"i
[3 Tech ical Note s,!mmarize, d,:tail
,,
will compute the 1st, 5th, lOth, 25th, 50th (median), 75th, 90th, 95th, and
99th 3ercentiles.q_hereis no real advantage in using _pcZile to compute these percentiles. Both sumlI _rize, detail and __pet ile use the same internal code. _petite is slightly faster because summarize, detail computes a few additional things. The value of _pctile is its ability to completepercentilei other than these standard ones.
_,
Saved esults pc
/
Ale and _p_tile
save in r() Scalars r(r#)
value_f #-requested percentile
Metho¢ ; andFo mulas pet
le and xti:
e are
implemented as ago-files.
We irst give the _efault formula for percentiles. Let (j)refer to t_e x in ascending order [or j = 1,2 .... ,n_nLet w(j) refer to the corresponding weightstof x(j): if t_re are no weights, wo. i = 1. Let N j=l 'w(j). To o_tain the pth 9ercentile,which we will denote as zip], let P = Np/lO0 and let i
W(O = E wO) j=l
Fihd th_ first index i ,uch that DV(_)> P. The pth percentile is then
x[pl = !
x(_-l) +ix(i) 2 x (_)
if 1,9}i_1)= P otherwise
Whenlthe option a Ltde_ is specified, the _:followingalternative definition is used. In this case, weights _e not allowe:l. Lel i e integer flo,_rof (n _ l)p/lO0: i.e., i is largest integer i _ _ a. t)p/lO0. Let h be the remain& h = (n + llp/lO0 - i. The pth percentile is then |
where x j
x[p] = (1 - h)xii ) + hz(i+1)
is taken to _e x(i) and _(n+l) is taken to be x(n). /
xtile)roduces thelcategories
-i
pctile -- Create variable containing percentiles
497
_pctile _pctile is a programmer's command. It computes percentiles [U] 21.8 Accessing results calculated by other programs, You can use _pctile . _pctile . ret
and stores them in r();
see
to compute quantiles just as you can with pctile:
weight,
nq(lO)
list
scalars :
_pctile results.
r(rl)
=
2020
r(r2)
=
2160
r(r3)
=
2520
r (r4)
=
2730
r(r5)
=
3190
r(r6)
=
3310
r(rT)
=
3420
r (r8)
=
3700
r(r9)
=
4060
is, however,
The percentiles wish: _pctile ret
weight,
limited to computing () option (abbreviation p(10,
33.333,
45,
21 quantiles since there are only 20 r()s p()) 50,
55,
to hold the
can be used to compute any odd percentile 66.667,
you
90)
list
scalars : r(rl)
=
2020
r(r2)
=
2640
r(r3)
=
2830
r(r4)
=
3190
r(r5)
=
3250
r(r6)
=
3400
r(r7)
=
4060
_pctile, pctile, and xtile each have an option that uses an alternative definition of percentiles, based on an interpolation scheme; see Methods and Fom_ulas below. _pctile • ret
weight,
p(10,
33.333,
45,
50,
55,
66.667,
903
altdef
list
scalars : r(rl)
=
2005
r(r2)
=
2639. 985
r(r3)
=
2830
r(r4) r(rS)
= =
3190 3252.5
r(r6)
=
3400. 005
r(r7)
=
4060
The default formula inverts the empirical distribution function. We believe that the default formula is more commonly used. although some consider the "alternative" formula to be the standard definition. One drawback of the alternative formula is that it does not have an obvious generalization to noninteger weights.
_ i
Ii• i
rp !
496
jl pctile - create vadablecontaini_ percentiles i 120
1
3
18. 19.
17.
120 125
12o
1 1
1
3 4
:[I.
132
1
4
to,
13o
i2,
1
93
l
94 131 94 (o_qmtornitted)
lo
1_o. i
o
3
4 i
1 1
o
4
136
00
0 TechnicalNote
I_
In th!. iztite' last examplb.catego_y if=webp i:E°nlycase==l ,wanted cut!(pct)t° categorize cases, we could have issued the command
i * _ ! i
Mos_ Stata commahds follow the logic that Using an if exp is equivalent to dropping observations not satisfyi_2 the expressi on and running the command. This is not true of xtile when the cutpoints () option i_Jsed. (_qaer_ the eutpoints () option' is not used, the standard logic is true.) This is because xtile _ill use all no,missing values of the cutpoint () variable regardless of whether these values belon_ io observation that satisfy the if expression,
!
If yoh do not wan: to use all the values i. the cutpoint () variable as cutpoints, simply set the ones that you do not _eed to missing, xtile does not care about the order of the values or whether
I
they are separated by! missing values.
!
I
i i
_ TechnicalNote
,
Note!that quantile_are not always unique. If we categorize our blood pressure data on the basis of quinttles rather tha_ quartiles, we get t _ctile pet = 4 bp, _tile quint bp, nq(5) nq(5) genp(percent) _ist percent
I
bp
quint
pct
!
98
1
104
20
100
1
120
40
lo4
1
_25
80
1
_i
_.
!
_. 5. _.
! I
9. _. 1_. 1t._
i
i
_.
I_o 120 12o t2o
12o 13o la2 125
2 2 2
12o
60
2
2 s s 4
The 40t_ and 60th percentile are the same; t_ey are both 120. When two (or more) percentiles are the samd, they are gixen the lower category nhmber.
i i i i
pctile -- Create variable containing percentiles
495
• xtile category = bp, cut(class) list bp class category 1, 2. 3. 4. 5. 6. 7, 8, 9. 10. 11.
bp 98 100 104 110 120 120 120 120 125 130 132
class 100 110 120 130
category 1 1 2 2 3 3 3 3 4 4 5
The cutpoints can, of course, come from anywhere. They can be the quantiles of another variable or the quantiles of a subgroup of the variable. For example, suppose we had a variable case that indicated whether an observation represented a case (case = 1) or control (case -- 0). . list bp 98 IO0 104 ii0 120 120 120 120 125 130 132 116 93 115
case 1 1 1 1 1 1 1 1 1 1 1 0 0 0
(outputomi_ed) 110. 113
0
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
We can categorize the cases based on the quanfiles of the controls. To do this. we first generate a variable pet containing the percentiles of the consols' blood pressure data pctile pet = bp if case==O, nq(4) list pet in i/4 I. 2. 3. 4.
pet 104 117 124
and then use these percentiles
as outpoints to classify bp for all subjects.
xtile category = bp, cut(pet) gsort -case bp • list bp case category 1. 2. 3. 4. 5.
bp 98 lOO 104 110 120
case 1 1 1 1 1
category 1 1 1 2 3
494
pctile -- Cr,_atevariable containingpercentiles
xtil_ can be used to create a variable quart
i
• tile
quart
•
= bp,
98
i
nq(4)
I
I
i
1oo
i
_. ,
Ii0 t20 104
2 2 1
_.
12o
2
b_ i q"_I
10. :
I
i
11.
i
that indicates the quartiles of bp.
130 125
! !
132 1
4 3 4
The categories created i_are
I
(+_,x[2_l] ' (xi2_,xis_] ' (Xi_oi, X[7_l],(x[75_,+oo) where z_5, Ziso] an ZiTsi are, respectwely, the 25th, 50th (me&an), and 75th percentiles of bp We coul use the pc le command to genera[e these percentiles:
!
I i
-_
1
ictile pet = _p, nq(4) genp(percent) _ist bp quart ipercent pet i bp quart percent
pet
_.
98
I
25
104
_. _.
104 100
1 I
75 50
125 120
4. d._
llo 12o
2 2
_
_.
120
2
i
_.
12o
2
I_.
t20
2
1I!.
130 132
4 4
!
i
xtil(_ can categori_e a variable based on _y set of cutpoints, not just percentiles. Suppose that we wish iocreate the _ollowing categories for blood pressure:
i
(-_.,!_oo],(too, ! t_ot (U0.120] (i2o,_3o].(_3o,+o0)
To do thi_, we simph', ,create a variable contairiing the cu'lpoints
i:
i
class ihput class I!. I00
23i.i. t20 :io 5i. end
and then iuse xtile with the cutpoints()o_tion. |
{
: ]
i
i
pctite -- Create variable containing percentiles
493
Note that summarize, detail calculates standard percentiles. • summarize mpg, detail Mileage (mpg) Percentiles
Smallest
1_ 5_ 10Z 25_
12 14 14 18
12 12 14 14
50_
20
75X 90X 95X 99_
25 29 34 41
Larges¢ 34 35 35 41
0bs Sum of Wgt.
74 74
Mean Std. Dev.
21.2973 5.785503
Variance Skewness Kurtosis
33.47205 .9487176 3.975005
can onlycalculate thesepa_icular percentiles. The commands let you compute any percentile. But s_Immarize,
detail
Weights can be used with petile,
_pctile,
pctile
and
_pctile
and xtile:
. pctile pet = mpg [w=weight], nq(10) genp(percent) (analytic weights assumed) . list percent pet in I/I0 I. 2. 3. 4. 5. 6. 7. 8. 9. 10.
percent i0 20 30 40 50 60 70 80 90
pet 14 16 17 18 19 20 22 24 28
/
The result is the same no matter which weight type you specify--aweighz,
fweight,or pweight.
xtile xtile will create a categorical variable that contains categories corresponding to quantiles. We illustrate this with a simple example. Suppose we have a variable bp containing blood pressure measurements: list
i. 2. 3. 4. 5. 6. 7, 8. 9. 10. 11.
bp 98 I00 104 II0 120 120 120 120 125 130 t32
+
!_'+
I I ,i !
492
pctile -- Ci .=atevariable containin+ percentiles
cutpoi+tts(vamame, requests that xtile ise the values of varname, rather than quantiles, as cut ints for the c legories. Note that all v_lues of vamame are used, regardless of any if or in restri,:tion; see the technical note in the xt_le section below. percentiles(m+mtist) requests percentiles Corresponding to the specified percentages. Percentiles
+ i
are p!aced in r(r]t), r(r2) ..... etc. For example, percentiles(10(20)90) 10th.130th, 50th. 7Dth, and 90th percentilei be computed and placed into r(rl),
}
r(r4_,
i
detail_ on ho,a to _pecify a numIist.
and r!r5)I
Up to 20 (inclusive)p_rcentiles
requests that the r(r2), r(r3),
can be requested. See [u] 14.1.8 numlist for
Remark.. pctile pctil,ecreates a _ew. variable containing percentiles. You specify the number of quantiles that you wan(. and pctil_ computes the corresponding percentiles. Here we use Stata's auto dataset and
!
compute the deciles of mpg: t se auto
+
+
• _ctile pct= _ist pet
i
i
•
pet
14
_, _, _.
20 22 24
_. _.
25 29
!
'_
i
earner to oistinguish be tween the percentiles.
! !
I
'V_
apg, nq(lO)
in I tl0
_illthe
.en.
_ _
. p_tile pet
option
= _pg, t
1_st
percent
!
percent
_ct
.enerate
nq(10) in 1/10 pet
2_ 11. 31
20 10 30
17 14 18
4! si
40 so
19 20
+o .o 80
+
;+
I
:oi°°
.
to
/
22 2+ 25
anot]
genp(percent)
,e_
v_d_ab'e
_vJth
the
co_[espondi_.
_erce_]ta.e..
,, ,.
tie [ petile -- Create variable contlfining percentiles
]
i
Syntax pctile
genp(newvarp) T
xtile newvar
'
altdef = exp
{ nquantiles(#) _pctile
= exp [weight]
[type] newvar
varname
[if
exp]
[in
range]
[, _nquantiles(#)
]
[weight 1 [if
exp]
[in range]
I c_utpoints(varname) [weight]
[if
exp]
[,
} a!tdef
[in range]
t
[,
{ nquantiles(#) I p_ercentiles(numlist) } altdef ] fweights, and pweights are allowed (see [U] 14.1.6 weight) except when the altdef in which case no weights are allowed.
aweights,
option is specified,
Description pctile creates a new variable containing typically just another variable. xtile
the percentiles
creates a new variable that categorizes
of exp. where the expression
exp by its quantiles. If the cutpoints
option is specified, it categorizes exp using the values of vamame as category cutpoints. varname might contain percentiles, generated by pctile, of another variable.
exp is
(varname) For example,
_pct ile is a programmer's command. It computes up to 20 percentiles and places the results tn r(); see [U] 21.8 Accessing results calculated by other programs. Note that summarize, detail will compute some percentiles (1, 5, 10, 25, 50, 75, 90, 95, and 99th); see [R] summarize.
Options nquantiles (#) specifies the number of quantiles. The command computes percentiles corresponding to percentages 100k/m for k = t,2,..., m- 1, where m = #. For example, nquantiles(lO) requests that the 10th. 20th, ..., 90th percentiles be computed. The default is nquantiles(2): i.e., the median is computed. genp(newvarp) specifies a new variable to be generated containing to the percentiles. altdef uses an alternative
formula for calculating percentiles.
the percentages
corresponding
The default method is to invert the
empirical distribution function using averages ((zi + z_+x)/2) where the function is flat (the default is the same method used by summarize; see [R] summarize). The alternative formula uses an interpolation method. See Methods and Formulas at the end of this entry. Weights cannot be used when altdef is specified. 491
1
I"
• i !
_
4_J i I pc°rr _3Technic Note -- Pztrial c°rrelati°n c°efficients Some caution is in order when interpreting the above results. As we said at the omset, the partial corretati )n coefficient is an attempt to estimate the correlation that would be observed if the other variable., were held cc_nstant. The emphasis is on attempt, pcorr makes it too easy to ignore the fact that you are fitting a aodel. In the above example, the model is _price= fl0+ fllmpg+ _2wei_t + _3foreign+ e which is_ in all honestk a rather silly model. Even if we accept the implied economic assumptions of the moddl--that consumers value mpg, weight, and foreign--do we really believe that consumers
i i _. i i !
place equal value on _very extra l,O00 pounds of weight? That is, have we correctly parameterized the mod_l? If we hav_ not, then the estimated :partial correlation coefficients may not represent what the3' clai_ to represen I. Partial correlation coet_icients area reasonable way to summarize data after one is cobvinced that the underlying model is reasonable. One should not, however, pretend that there is no underlying model and that the partial correlation coefficients are unaffected by the assumptions and parai aeterization.
Methodsand Fornlulas pcor_ is implemen :ed as an ado-file. Result t are obtaine, by estimating a linear regression of varnamel on varlist; see [R] regress. The partial correlation coefficient between varnamel and each variable in varlist is then defined as
!
(Theil 19)1, 174). wh_re t is the t statistic, n the number of observations, and k the number of mdependdnt variables i_cludmg the constant but excluding any dropped variables, The significance .
is _iven _y 2, trail
_n - k, abs (t))
References Thei.I.H. 1_7!. Principles_)[Econometrics.New York John Witey& Sons.
i
AlsoSee
i
Related: !
} I
JR] eorrel te, [R] spearman
pcorr -- Partial correlation l coefficients I I
i I
Syntax ;
pcorr varnamel
vartist
[weight]
[if exp] [inrange]
by ... : may be used with pcorr; see [R] by. aweights and fweights are allowed: see [U] 14.1.6 weight.
'
Description pcorr displays the partial correlation coefficient of varnamel the other variables in varlist constant.
with each variable in varlist, holding
z
Remarks Assume that y is determined by xl, x2, ..., xk. The partial correlation between 5' and xl is an attempt to estimate the correlation that would be observed by y and xt if the other x s did not vary.
> Example Using our automobile dataset (described in [U] 9 Stata's on-line'tutorials and sample datasek_), the simple correlations between price, mpg, weight, and foreign can be obtained from correlate (see [R] correlate): • corr price (obs=74)
mpg
weight
foreign
price
mpg
weight
price
i. 0000
mpg
-0. 4686
weight
0.5386
-0.8072
1.0000
foreign
0.0487
0.3934
-0.5928
foreign
I.0000 1.0000
Although correlate gave us the full correlation matrix, our interest is in just the first column. We find, for instance, that the higher the mpg, the lower the price. We obtain the partial correlation coefficients using pcorr: pcorr price (obs=74) Partial
mpg
weight
correlation
Variable mpg
foreign
of price
with
Corr.
Sig.
O. 0352
O. 769
weight
O. 5488
O. 000
foreign
O, 5402
O. 000
We now find that. holding weight and foreign constant, the partial correlation of price with mpg is virtuallv zero. Similarly, in the simple correlations, we found that price and foreign were virtually uncorrelated. In the partial correlations holding mpg and weight constant we find that price and foreign are positively correlated. q
489
!
...............
o!_,_u_,,uet-_le
aataset
)
!
t
I[ i
i
.
1
our:sheet copi_ the data currently loaded in memory into the specified file. About all that can go wrbng is the fil_ you specify already extsts: outsheet
u_ing
[ile tosend._ut
r(602) ;
tosenfl already
exists
)
In thai case, you ca_l erase the file (see [R_ erase), specify outsheet's differe at filenarne, _aen all goes well, out,beet is silent:
replace
option, or use a
i I outsheet
us .ng tosend,
replace
-
tf you are copying tl e data to a program othtr than a spreadsheet, remember to sl_ify option: ) •i outsheet
us_ ng feral,
nenames
"!-
q
i
[
Also See Compl_ mentary:
[R] insheet
Related
[R] outffle
Backgr
[U] 24 Commands to i_put data
[ t
i
the nonces
"
)
)
)
:le I outsheet-
II
Write spreadsheet-style
dataset
l
i
Syntax outsheet [varlist] using filename[if exp] [in range] [, nonames nolabel noquote comma replace ] ?
If filename is specified without a suffix, .out is assumed.
Description outsheet writes data in tab- or comma-separated most spreadsheet programs prefer.
ASCII format into a file. This is the format that
Options nonames specifies that variable names are not to be written in the first line of the file; the file is to contain data values only. nolabel specifies that the numeric values of labeled variables are to be written into the file rather than the label associated with each value. noquote
specifies that string variables are not to be enclosed in double quotes.
comma specifies comma-separated replace
format rather than the default tab-separated
specifies that it is okay to overwrite filename
format.
if it already exists.
Remarks If you wish to move ),our data into another program, you have the following 1. The use of an external data-transfer
program; see [r.j] 24.4 Transfer
alternatives:
programs.
2. Cutting and pasting from Stata's data editor; see Getting Started, chapter 6. 3. Using outsheet. 4. Using outfite;
see [R] outfile.
Concerning alternatives 3 and 4. outsheet is typically preferred for moving the data to a spreadsheet and outfile is probably better for moving data to another statistical program. If your goal is to send data to another Stata user, you could use outsheet or outfile, but it is easiest to send the .dta dataset. This will work even if you use Stata for Windows and your cohort uses Stata for Macintosh. All Statas can read each other's .dta files.
487
2 t
,,
S ,meprogram,, _referdata that are separatedby commas rather than by blanks. Stata wilt produce such a dataset if you specify the commao_tion:
0
. outfite I
u_ ing employee,
1
comma
. _ype empl,tee.raw Carl Marks , 57213,24000, male "Irene Adlez" ,47229,27000, "female" I"Adam Smith", 57323,24000, "reale" "David Walli
,57401,24590,"male"
i
!"MaryRogers',57802,27000, "female"
_.
i"Carolyn Fr_ k", 57805,24000, "remade" "Robert Laws in",57824,22500,"male"
Example t
Fin lly, outfil_can create data dictionaries that infilecan read. Dictionaries ate perhaps the best w_y to organb : your raw data. A dicl!onary describes your data so that yoa do not have to remem_er the order _f the variables, the number of v_ables, the variable names, or anything eisel The fill in which y( _store your data becorrles self-doCumentingso that when you come back to it at somi future date, 'ou can understand it. See JR]infile (fixed format) for a full description of data When you speci . :ioutfile
usi
the dictionary, employee,
dict
Stata writes a dot file: [
. itype employeb.dct
i
i s_rl5 float
i
iname _empno
"Employee "Employee
float salary I
d_ictionary { float
isex
"carl
name" number"
"Arm_aisalary" :sexlbl
"Sex _
572 3
24000 " ,ale"
"Irene Adler" "Adam Smit t"
47229 5732.3
270(_0 240_0
"female" "male"
1
"David Walli ;" I "Mary Rogerl;"
5740_1 57802
2450D 270011
"male" "female"
i
" obert Lawsoii" '_arolyn Fran] :"
57824 57805
225ob 24000
"male" "female"
i
!
q
<
AlsoSee i
I
Complementary:
[ ] infile
Related: !
[ ] outshee_
Backgrou/_d:
[_]24 Commands to input data
I i
1
!
_
outfile -- Write ASCII-format dataset
485
[3 Technical Note The nolabel option prevents Stata from substituting value label strings for the underlying numeric value; see [U] 15.6.3 Value labels. As we just said, the last variable in our data is really a numeric variable: • outfile
using
employ2,
nolabel
• type employ2.raw "Carl Marks"
57213
24000
"Irene
Adler"
47229
27000
0 1
"Adam
Smith"
57323
24000
0
"David
Wallis"
57401
24500
0
"Mary
Rogers"
57802
27000
1
"Carolyn Frank" "Robert Lawson"
57805 57824
24000 22500
1 0
0
[3 Technical Note If you do not want Stata to place double quotes around the contents of string variables, the noquote option: . outfile • type
using
employ3,
employ3.raw Carl Marks
Irene Adam
specify
noquote
57213
24000
male
Adler
47229
27000
female male
Smith
57323
24000
David
Wallis
57401
24500
male
Mary
Rogers
57802
27000
female
Carolyn Frank Robert Lawson
57805 57824
24000 22500
female male
0
I> Example Stata never writes over an ex:{sting file unless explicitly told to do so. For instance, if the file employee, raw already exists and. you attempt to overwrite it by typing outfile using employee, here is what would happen: • outfile
using
file employee.raw r(602) ;
employee already
exists
You can tell Stata that it is okay to overwrite a file by specifying using employee, replace.
(Continued
on next page)
the replace option:
outfile
> Exampl_ Youlhaveentered nto Statasome data on s ven employeesin your firm. The data contain employee r_ame.!mployee identification number, salar_i and sex: •!list !
i
,
i
name
empno
salary
sex
Ii. Carl Mark_
i
57213
24,,000
male
i2. Irene Adl_r i3. Adam Smit_
47229 57323
127,000 24,000
female male
!4. David
57401
24,500
male
i5. Mary Rogers
57802
27,000
female
!76:Carolyn F_ank , Robert La#son
57805 57824
24,000 22,500
female male
Wal_is
i
If yo_ now wish tc use a program other thin Stata with these data, you must somehow get the data over to l_at other prol "am. The standard Stata_format dataset created by save will not dothe job--it is writte_ in a special _ormat that only Stata uhderstands, Most programs, however, understand ASCII datasetsg-standard te datasets that are like ihose produced by a text editor. You can tell Stata to produceisuch a datas_ using outfile. Typi8g outfile using employee createsa dataset called employee,raw that c,)ntains all the data. We Can use the Stata typecommand to review the resulting
i
file:
_utfile
using
employee
i "Carl Marks" _ype employee.raw
57213
24000
i
i "Irene : "Adam
47229 57323
27000 24000
"female" "male"
I
!"David Walli i" "Mary Roger ;"
57401 57802
245a0 270d0
"male" "female"
[tCarolyn
578os
24o o "femaW'
IRobert Lavso_"
57824
22500
i
Adler" Smith"
"male'*
"male"
We se _ , that the fileicontainsi the four variables and that Stata has surrounded the string variables with double quotes. I
1 I
i 3 TechnicalNote
[
!
outfi_e is careful _o columnize the data in :case you want to read it using formatted input. [n the example a_bove,the firs_tstring has a }',-16s display format. Stata wrote two leading blanks and then placed th+ string in a |6-character field, outfile always right-justifies string variables even when
I
the displa__format requests left-justification.
!
The fi!st number h_s a Y,9.0g format. Th_ number is written as two blanks followed by the number. _ght-justified in a 9-character field. The second number has a Y,9.0gc format, outfile ignores tt'_ comma part of the format and also writes this number as two blanks followed bv_the number, right-justified in a 9-character field. :
!
,
The ]aatt entry is really a numeric _ariable,:: but it ha:s an associated value label. Its tbrmat is
} "
Y,-9.0g. 4o Stata wrot_ two blanks and the2 tight-justiSed the value label in a 9-character field Again. ou{fileright-jt_stifies value labels e_en:;when the display formal specifies left-justification.
i
•
I outfile -- Write ASCII-format
dataset
[
I
I
Syntax
outfile[var//s,] using te,,ameexp][inra,,e][,
dictio=y
no!abel noquote replace wide ]
Description outfile writes data to a disk :file in ASCII format, a format that can be read by other programs. The new file is not in Stata format; see [R] save for instructions on saving data for subsequent use in Stata. The data saved by outfile can be read back by infile; see [R] infile. Iffilename is specified without an extension. '.raw' is assumed unless the dictionary option is specified, in which case '.dct' is assumed.
Options comma causes Stata to write the file in comma-separated-value format. In this format, values are separated by commas rather than blanks. Missing values are written as two consecutive commas. dictionary writes the file in Stata's data dictionary format. See [R] infile (fixed format) description of dictionaries. Neither comma or wide may be specified with dictionary. nolabel causes Stata to write the; numeric values of labeled variables. labels enclosed in double quotes. noquote
for a
The default is to write the
prevents Stata from placing double quotes around the contents of string variables.
replace permits outfile to overwrite an existing dataset, replace mav not be abbreviated. wide causes Stata to write the data. with one observation into lines of 80 characters or fewer.
per line. The default is to split observations
Remarks outfile enables data to be sent to a disk file for processin_ by a non-Stata program. Each observation is written as one or more records records that will not exceed 80 characters unless you specify the wide option. The values of the variables are written using their current display formats, and unless the comma option is specified, each is prefixed with two blanks. If you specify the dictionary option, the data are written in the same way, but in front of the data outfile writes a data dictionary describing the contents of the file.
483
i
orth rely uses th( Christoffel-Darboux Both _rtlaog and ,rthpoly
recurrence formula (Abramowitz and Stegun 1968).
normalize thd orthogonal variables such that
Q_Q=MX !
where It _ -- diag(w_,w2,...,wN)
with wl:,w2,...,WN
the weights (all 1 if weights are not
I
specifiedi), and M is t_e sum of the weights (the number of observations if weights are not specified).
i i
Referenqes
I
Abramowiiz.M. and I. 4' Stegun, ed. 1968.Handbook of Mathemat/ca/Functions,7th printing.Washington.DC: Nation_dBureauof Standards.
!
Golub,G. !H.and C. F. Va_Loan. 1996,Matr/x CompUtations,3d ed. Baltimore:JohnsHopkinsUniversityPress,pp.
218-2_9. I
Sribney, _,_(Reprints,voI. M, 1995.sg37:5,Orthogonalpolynomials.S}aa TechnicalBulletin25: 17-18. Reprintedin Stata Technical Bultetii pp. 96-98.
i I
!_o }
i AlsO,See
Related: I
R] regress
Backgrot_nd:
_] 23 Estimation and I_gst-estimation, commands
:
Some of the correlations problems removed,
orthog -- O_hogonal variables and o_hogonal polynomials 481 among the powers of weight are very large, but this does not create any
for regress. Howevel; we may wish to look at the quadratic trend with the constant the cubic _end with the quadratic and constant removed, etc. orthpoly will generate
polynomial terms with this property: . orthpoly weight, generate(pw*)
dog(4) poly(P)
. regress mpg pwl-pw4 Source Model Residual Total
SS
df
MS
1652.73666
4
413.184164
790.722803
69
11.4597508
2443.45946
73
33.4720474
mpg
Coef.
pwl pw2 pw3 pw4 _cons
-4.638252 ,8263545 -.3068616 -.209457 21.2973
Std. Err. .3935245 .3935245 .3935245 .3935245 .3935245
t -11.79 2.10 -0.78 -0.53 54.12
Number of obs = F( 4, 69) = Prob > F =
74 36.06 0.0000
R-squared Adj R-squared Root MSE
0.6764 0.6576 3.3852
P>ItJ
= = =
[95_ Conf. Interval]
0.000 0.039 0.438 0.596 0.000
-5.423312 .0412947 -1.091921 -.9945168 20.51224
-3.853192 1.611414 .4781982 .5756028 22.08236
Compare the p-values of the terms in the natural-polynomial regression with those in the orthogonalpolynomial regression. With orthogonal polynomials, it is easy to see that the pure cubic and quartic trends are nonsignificant and that the constant, linear, and quadratic terms each have p < 0.05. The matrix P obtained with the poly () option can be used to transform coefficients for orthogonal polynomials to coefficients for natural polynomials: orthpoly weight, poly(P) deg(4) . matrix b = e(b)*P matrix list b b[1,5] yl
degl .02893016
deg2 -.00002291
deg3 5.745e-09
deg4 -4.862e-13
_cons 23.944212
<1
Methodsand Formulas orthog orthog's
and orthpoly are implemented orthogonalization
as ado-files.
can be written in matrix notation as
x=QR where X is the N × (d + l) matrix representation of varlist plus a column N x (d + I) matrix representation of newvarlist plus a column of ones (d in vaHist and N = number of observations). The (d + 1) × (d + I ) matrix triangular matrix: i.e.. R would be upper triangular if the constant were first, so the first row/column has been permuted with the last row/column.
of ones. and Q is the = number of variables R is a permuted upper but the constant is last,
Q and R are obtained using a modified Gram-Schmidt procedure; see Golub and Van Loan (1996) for details. Note that the traditional Gram-Schmidt procedure is notoriously unsound, but the modified procedure is quite good. orthog performs two passes of this procedure.
'
[ { t
480
orthog 0 _hogonalvariablesandiorthogonal polynomials ! _ompare '
rtrulk
trunk difference
I
, ,,i
i
r t_/_k>t runk
74
jo!ntly
74
count
defined
to_al
minimum:
average
maximum
8.88e-15
I.92e-14
3.55e-14
8,88e-15
1.92e-14
3.55e-14
74
I In [hii example,
the:recovered
variable rtrank
is almost exactly the same
as the original _runk.
Vehenodthoeonalizingman',, variables, this procedure can be performed as a check of the numerical soundnessof the orth_gonalization, Because of the ordering of the orthogonalization procedure, the last variable and the ariables near the end of the varlist are the most important ones to check.
-
The o_thpoly cot mand effectively does for polynomial terms what the orthog command does for an arbitrary set of variables.
°
i
I,
> Examplei i
A_aini considerthe auto.dta dataset.Supposewe wish to fit themode] mpg
=
+ _I weight
+/_2
we_g ht2 + _3 weigh
t3 + ;_4 weight4 + e
We will first compute he regression with natuial polynomials: !
!
i a
double
w2 = wl*wl
, g_n double
w3 = w2*wl
. g,_n double
w4 = w3*wl
. c_rrelate
i!
wl-_4
I
w2
wl
w3
wl
1 .(300
i
w2
0.(
i
w3
O.¢.565
O. 9916
I.0000
i
w4
O. t 279
O. 9679
O. 9922
!
915
1.0000 1. 0000
. r,_gress mpg wl-v4 88
I
w4
.i
df
_
Model !Residual
MS
Number
,,
_( 4,
652.73666 '90.722803
4 69
413._84164 11.4fl_97508
!443.45946
73
33.4_20474
!
i }
Adj Total
mpg
Coef.
Std.
Err.
_i
.0289302
_2 w3
.-. 0000229 5.74e-09
.0000566 _.19e-08
w4
' 4.86e-13
_cons
23.94421 I
Root
t
,1161939
69) =
Prob > F R-squared
. _ i
0.25
P>It I
74
of obs =
R-squared MSE
[95Y, Conf.
30.06
= =
0.0000 0.6764
=
0.6576
=
3,3852
Interval]
O. 804
-. 2028704
.2607307
i -0.40 0.48
O. 687 0.631
-. 0001359 -1.80e-08
.0000901 2,95e_08
9.14e-13
-0.53
0.596
-2.31e-12
1.34e_12
86.60667
'
0.783
-148.83t4
196.7198
i!
0,28
W
odhog -- Odhogonal variables and odhogonal polynomials
. regress
price
length
Source
weight
SS
Model Residual Total
price
weight headroom
trunk
df
MS
Number F( 4,
of obs 69)
74 10.20
4
59004145.0
Prob
=
0.0000
399048816
69
5783316.17
R-squared Adj R-squared
= =
0.3716 0.3352
635065396
73
8699525.97
Root
=
2404.9
Std.
Err.
t
> F
= =
236016580
Coef.
length
headroom
MSE
P>ltl
[95_
Conf.
-185.747
479
Interval]
-I01.7092
42.12534
-2.41
0.018
4.753066 -711.5679
1.120054 445.0204
4.24 -1.60
0.000 0.114
2.518619 -1599.359
-17.67148 6.987512 176.2236
trunk
114.0859
109.9488
1.04
0.303
-105.2559
333.4277
_cons
11488.47
4543.902
2.53
0.014
2423.638
20553.31
However, we may believe a priori that length is the most impo_ant predicton followed by weight, followed by headroom, followed by trunk. Hence, we would like to remove the "effect" of length from all the other predictors; remove weight from headroom and trunk; and remove headroom from trunk. We can do this by running orthog, and then we estimate the model again using the orthogonal variables: • orthog
length
• regress
price
Source Model
weight
headroom
olength
I
trunk,
oweight
SS
i
gen(olength
oheadroom df
oweight
oheadroom
otrunk)
matrix(R)
otrunk
MS
Number F( 4,
of obs 69)
74 10.20
236016580
4
59004145.0
Prob
=
0.0000
399048816
69
5783316.17
R-squared Adj R-squared
= =
0.3716 0.3352
635065396
73
8699525.97
Root
=
2404.9
price
Coef.
Std.
olength
1265.049
279.5584
4.53
oweight oheadroom
1175.765 -349.9916
279.5584 279.5584
4.21 -1.25
0.000 0.215
1.04 22.05
Residual Total
Err.
otrunk
290.0776
279.5584
_cons
6165.257
279.5584
Using the matrix R, we can transform the metric of original predictors: • matrix matrix
t
> F
= =
MSE
P>ltt
[95_
Conf.
Interval]
0.000
707.3454
1822.753
618.0617 -907.6955
1733.469 207.7122
0.303
-267.6262
847.7815
0.000
5607.553
the results obtained using the o_hogonal
6722.961
predictors back to
b = e(b)*inv(K)" list
b
b[1,5] length yl
-101.70924
weight 4.7530659
headroom -711.56789
trunk 114.08591
_cons 11488.475
Technical Note The matrix R obtained using the matrix() option with orthog can also be used to recover .¥ (the original vartist) from Q (the orthogonalized newvarlist) one variable at a time. Continuing with the previous example, we illustrate how to recover the trunk variable: .
matrix
• matrix
C = R[l...,"trunk"]" score
double
rtrunk
= C
!
478 orthog -- )rthogonal variables and orthogonalpolynomials Notei that the coef_cients corresponding tcr the constant term are placed in the last column of the matrix. The last r_bwof the matrix is all tero except for the last column, which corresponds to the _onstant term. 1
Remarks
Orthogonal variables are useful for two reasons. The first is numerical accuracy for highly collinear variables. Stata's regress and other estimation commands can face a large amount of collinearity and still produce accurate results. But, at some point, these commands will drop variables due to collinearity. If you know with certainty that the variables are not perfectly collinear, you may want to retain all of their effects in the model. By using orthog or orthpoly to produce a set of orthogonal variables, all variables will be present in the estimation results.

Users are more likely to find orthogonal variables useful for the second reason: ease of interpreting results. orthog and orthpoly create a set of variables such that the "effects" of all the preceding variables have been removed from each variable. For example, if one issues the command

. orthog x1 x2 x3, generate(q1 q2 q3)

the effect of the constant is removed from x1 to produce q1; the constant and x1 are removed from x2 to produce q2; and finally the constant, x1, and x2 are removed from x3 to produce q3. Hence,

    q_1 = r_{01} + r_{11} x_1
    q_2 = r_{02} + r_{12} x_1 + r_{22} x_2
    q_3 = r_{03} + r_{13} x_1 + r_{23} x_2 + r_{33} x_3

This can be generalized and written in matrix notation as

    X = Q R

where X is the N x (d+1) matrix representation of varlist plus a column of ones, and Q is the N x (d+1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations). The (d+1) x (d+1) matrix R is a permuted upper triangular matrix; i.e., R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Since Stata's estimation commands list the constant term last, this allows R, obtained via the matrix() option, to be used to transform estimation results.
> Example
Consider Stata's auto.dta dataset. Suppose we postulate a model in which price depends on the car's length, weight, headroom, and trunk size (trunk). These predictors are collinear, but not extremely so; the correlations are not that close to 1:

. correlate length weight headroom trunk
(obs=74)

             |   length   weight headroom    trunk
    ---------+-------------------------------------
      length |   1.0000
      weight |   0.9460   1.0000
    headroom |   0.5163   0.7266   1.0000
       trunk |   0.4835   0.6722   0.6620   1.0000

regress certainly has no trouble estimating this model:
"itle I °rth°g
-- Orth°g°nal ,
variables and °rth°g°nal
p°ly , n°mials
]
Syntax orthog
[varlis,]
tweightl
[matrix(matname)
orthpoly
varname
[if
expl
[in
range],
g_enerate(newvarlist)
]
[weight]
{ generate(newvarlist)
Iif
exp]
[in range],
[p_oly(matname)
} [ degree(#)
]
orthpoly requires that either generate() or poly(), or both. be specified, iweights, fweights, pweights, and aweights are allowed, see [U] 14.1.6 weight.
Description
orthog orthogonalizes a set of variables, creating a new set of orthogonal variables (all of type double), using a modified Gram-Schmidt procedure (Golub and Van Loan 1996). Note that the order of the variables determines the orthogonalization; hence, the "most important" variables should be listed first.

orthpoly computes orthogonal polynomials for a single variable.

Options
generate(newvarlist) is not optional; it creates new orthogonal variables of type double. For orthog, newvarlist will contain the orthogonalized varlist. If varlist contains d variables, then so will newvarlist. For orthpoly, newvarlist will contain orthogonal polynomials of degree 1, 2, ..., d evaluated at varname, where d is as specified by degree(d). newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar1-newvard or newvar*. For these two styles of abbreviation, new variables newvar1, newvar2, ..., newvard are generated.

matrix(matname) (orthog only) creates a (d+1) x (d+1) matrix containing the matrix R defined by X = QR, where X is the N x (d+1) matrix representation of varlist plus a column of ones, and Q is the N x (d+1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations).

degree(#) (orthpoly only) specifies the highest degree polynomial to include. Orthogonal polynomials of degree 1, 2, ..., d = # are computed. The default is d = 1.

poly(matname) (orthpoly only) creates a (d+1) x (d+1) matrix called matname containing the coefficients of the orthogonal polynomials. The orthogonal polynomial of degree i <= d is

    matname[i, d+1] + matname[i, 1]*varname + matname[i, 2]*varname^2 + ... + matname[i, i]*varname^i
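As a sketch of orthpoly in use (an illustration, not this entry's own example; it assumes the automobile dataset is in memory), one might generate a degree-4 orthogonal polynomial in weight and use it in a regression:

. orthpoly weight, generate(w1-w4) degree(4) poly(P)
. regress mpg w1-w4

Because w1-w4 are orthogonal, dropping w4 and refitting leaves the coefficients on w1-w3 unchanged, which is one of the conveniences discussed in the Remarks.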
order -- Reorder variables in dataset

Technical Note
If your data contain variables named year1, year2, ..., year19, year20, aorder will order them correctly even though, to most computer programs, year10 is alphabetically between year1 and year2.

Methods and Formulas
aorder is implemented as an ado-file.

References
Gleason, J. R. 1997. dm51: Defining and recording variable orderings. Stata Technical Bulletin 40: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 49-52.
Weesie, J. 1999. dm74: Changing the order of variables in a dataset. Stata Technical Bulletin 52: 8-9. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 61-62.

Also See
Complementary:  [R] describe
Related:        [R] edit, [R] rename
order -- Reorder variables in dataset

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978

Sorted by:
     Note:  dataset has changed since last saved
If we now wanted length to be the last variable in our dataset, we could type order make mpg price weight rep78 length, but it would be easier to use move:

. move length rep78
. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
rep78           int    %8.0g                  Repair Record 1978
length          int    %8.0g                  Length (in.)

Sorted by:
     Note:  dataset has changed since last saved
We now change our mind and decide that we would prefer that the variables be alphabetized:

. aorder
. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.4% of memory free)

              storage  display     value
variable name   type   format      label      variable label
length          int    %8.0g                  Length (in.)
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
rep78           int    %8.0g                  Repair Record 1978
weight          int    %8.0gc                 Weight (lbs.)

Sorted by:
     Note:  dataset has changed since last saved
Title
order -- Reorder variables in dataset

Syntax
order varlist

move varname1 varname2

aorder [varlist]

Description
order changes the order of the variables in the current dataset. The variables specified in varlist are moved, in order, to the front of the dataset.

move also reorders variables. move relocates varname1 to the position of varname2 and shifts the remaining variables, including varname2, to make room.

aorder alphabetizes the variables specified in varlist and moves them to the front of the dataset. If no varlist is specified, _all is assumed.
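As a concrete illustration of the three commands (hypothetical variables a, b, c, d stored in that order; this is a sketch, not an example from this entry):

. order c        /* new order: c a b d */
. move d a       /* d takes a's slot:  c d a b */
. aorder         /* alphabetical:      a b c d */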
Remarks
> Example
When using order, you must specify a varlist, but it is not necessary to specify all the variables in the dataset. For example,

. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
mpg             int    %8.0g                  Mileage (mpg)
make            str18  %-18s                  Make and Model
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978

Sorted by:
     Note:  dataset has changed since last saved

. order make mpg
. describe
oprobit -- Maximum-likelihood ordered probit estimation
Saved Results
oprobit saves in e():

Scalars
  e(N)          number of observations
  e(k_cat)      number of categories
  e(df_m)       model degrees of freedom
  e(r2_p)       pseudo R-squared
  e(ll)         log likelihood
  e(ll_0)       log likelihood, constant-only model
  e(chi2)       chi-squared
  e(N_clust)    number of clusters

Macros
  e(cmd)        oprobit
  e(depvar)     name of dependent variable
  e(wtype)      weight type
  e(wexp)       weight expression
  e(clustvar)   name of cluster variable
  e(vcetype)    covariance estimation method
  e(chi2type)   Wald or LR; type of model chi-squared test
  e(offset)     offset
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(cat)        category values
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
Methods and Formulas
Please see the Methods and Formulas section of [R] ologit.
References
Aitchison, J. and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131-140.
Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman-Barton GOF. Stata Technical Bulletin 33: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145-147.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18-21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149-153.
Wolfe, R. and W. W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199-204.
Also See
Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] logistic, [R] mlogit, [R] ologit, [R] probit, [R] svy estimators
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize, [R] sw
oprobit -- Maximum-likelihood ordered probit estimation

Hypothesis tests and predictions
See [U] 23 Estimation and post-estimation commands for instructions on obtaining the variance-covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
> Example
In the above example, we estimated the model oprobit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is "poor" and the highest "excellent". We have five categories and so must type five names following predict; the choice of names is up to us:

. predict poor fair avg good exc
(option p assumed; predicted probabilities)

. list make model exc good if rep77==.

             make      model         exc        good
 13.          AMC     Spirit    .0006044    .0351813
             Ford     Fiesta    .0002927    .0222789
 44.        Buick       Opel    .0043803    .1133763
            Merc.    Monarch    .0093209    .1700846
          Peugeot        604    .0734199    .4202766
            Plym.    Horizon     .001413    .0590294
            Plym.    Sapporo    .0197543    .2466034
            Pont.    Phoenix    .0234156     .266771
For ordered probit, predict, xb produces S_j = x_{1j} b_1 + x_{2j} b_2 + ... + x_{kj} b_k. Ordered probit is identical to ordered logit, except that one uses a different distribution function for calculating probabilities. The ordered-probit predictions are then the probability that S_j + u_j lies between a pair of cut points \kappa_{i-1} and \kappa_i. The formulas in the case of ordered probit are

    \Pr(S_j + u < \kappa) = \Phi(\kappa - S_j)
    \Pr(S_j + u > \kappa) = 1 - \Phi(\kappa - S_j) = \Phi(S_j - \kappa)

Rather than using predict directly, we could calculate the predicted probabilities by hand:

. predict pscore, xb
. gen probexc = norm(pscore-_b[_cut4])
. gen probgood = norm(_b[_cut4]-pscore) - norm(_b[_cut3]-pscore)
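The probabilities for the remaining categories follow the same pattern; a sketch (these gen statements mirror the two above and are not part of the original text):

. gen probpoor = norm(_b[_cut1]-pscore)
. gen probfair = norm(_b[_cut2]-pscore) - norm(_b[_cut1]-pscore)
. gen probavg  = norm(_b[_cut3]-pscore) - norm(_b[_cut2]-pscore)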
oprobit -- Maximum-likelihood ordered probit estimation

Remarks
An ordered probit model is used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, "poor", "good", and "excellent", which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (e.g., residency in the north, east, south, and west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered.

In ordered probit, an underlying score is estimated as a linear function of the independent variables and a set of cut points. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cut points estimated for the outcome:

    \Pr(\text{outcome}_j = i) = \Pr(\kappa_{i-1} < \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + u_j \le \kappa_i)

u_j is assumed to be normally distributed. In either case, one estimates the coefficients \beta_1, \beta_2, ..., \beta_k together with the cut points \kappa_1, \kappa_2, ..., \kappa_{I-1}, where I is the number of possible outcomes. \kappa_0 is taken as minus infinity and \kappa_I is taken as plus infinity. All of this is a direct generalization of the ordinary two-outcome probit model.
> Example
In [R] ologit, we use a variation of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) to analyze the 1977 repair records of 66 foreign and domestic cars. We use ordered logit to explore the relationship of rep77 in terms of foreign (origin of manufacture), length (a proxy for size), and mpg. Here we estimate the same model using ordered probit rather than ordered logit:

. oprobit rep77 foreign length mpg
Iteration 0:  log likelihood = -89.895098
Iteration 1:  log likelihood = -78.141221
Iteration 2:  log likelihood = -78.020314
Iteration 3:  log likelihood = -78.020025

Ordered probit estimates                        Number of obs =         66
                                                LR chi2(3)    =      23.75
                                                Prob > chi2   =     0.0000
Log likelihood = -78.020025                     Pseudo R2     =     0.1321

------------------------------------------------------------------------------
       rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   1.704861   .4246786     4.01   0.000      .8725057    2.537215
      length |   .0468675    .012648     3.71   0.000       .022078    .0716571
         mpg |   .1304559   .0378627     3.45   0.001      .0562464    .2046654
-------------+----------------------------------------------------------------
       _cut1 |    10.1589   3.076749          (Ancillary parameters)
       _cut2 |   11.21003   3.107522
       _cut3 |   12.54561   3.155228
       _cut4 |   13.98059   3.218786
------------------------------------------------------------------------------

We find that foreign cars have better repair records, as do larger cars and cars with better mileage ratings.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyoprobit command in [R] svy estimators for a command designed especially for survey data.
cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates k new variables, where k is the number of observed outcomes. The first variable contains dlnL_j/d(x_j b); the second variable contains dlnL_j/d(_cut1_j); the third contains dlnL_j/d(_cut2_j); and so on. Note that if you were to specify the option score(sc*), Stata would create the appropriate number of new variables, and they would be named sc0, sc1, ....

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.

Options for predict
p, the default, calculates the predicted probabilities. If you do not also specify the outcome() option, you must specify k new variables, where k is the number of categories of the dependent variable. Say you estimated a model by typing oprobit result x1 x2, and result takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. If you specify the outcome() option, then you specify one new variable. Say that result takes on values 1, 2, and 3. Then typing predict p1, outcome(1) would produce the same p1.

xb calculates the linear prediction. You specify one new variable; for example, predict linear, xb. The linear prediction is defined ignoring the contribution of the estimated cut points.

stdp calculates the standard error of the linear prediction. You specify one new variable; for example, predict se, stdp.

outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either a single value of the dependent variable, or one of #1, #2, ..., with #1 meaning the first category of the dependent variable, #2 the second category, etc.

nooffset is relevant only if you specified offset(varname) for oprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
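Putting the predict options together, a hedged sketch following the rep77 model used in the Remarks (the new variable names are arbitrary; output omitted):

. oprobit rep77 foreign length mpg
. predict p1 p2 p3 p4 p5         /* all five predicted probabilities */
. predict pgood, outcome(#4)     /* probability of the fourth category only */
. predict lp, xb                 /* linear prediction, ignoring the cut points */
. predict selp, stdp             /* standard error of the linear prediction */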
oprobit -- Maximum-likelihood ordered probit estimation

Syntax
oprobit depvar [varlist] [weight] [if exp] [in range] [, table robust cluster(varname) score(newvarlist) level(#) offset(varname) maximize_options ]

by ...: may be used with oprobit; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
oprobit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
oprobit may be used with sw to perform stepwise estimation; see [R] sw.

Syntax for predict
predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp } outcome(outcome) nooffset ]

Note that with the p option, you specify either one or k new variables depending upon whether the outcome() option is also specified (where k is the number of categories of depvar). With xb and stdp, one new variable is specified.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
oprobit estimates ordered probit models of ordinal variable depvar on the independent variables varlist. The actual values taken on by the dependent variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata and up to 20 are allowed in Small Stata. See [R] logistic for a list of related estimation commands.
Options
table requests a table showing how the probabilities for the categories are computed from the fitted equation.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
oneway -- One-way analysis of variance
The Scheffé test (Scheffé 1953, 1959; also see Winer, Brown, and Michels 1991, 191-195) differs in derivation, but it attacks the same problem. Let there be k means for which we want to make all the pairwise tests. Two means are declared significantly different if

    t \ge \sqrt{(k-1)\, F(\alpha;\, k-1,\, \nu)}

where F(\alpha; k-1, \nu) is the \alpha-critical value of the F distribution with k-1 numerator and \nu denominator degrees of freedom. Scheffé's test has the nicety that it never declares a contrast significant if the overall F test is nonsignificant.

Turning the test around, Stata calculates a significance level

    \hat e = F\!\left(\frac{t^2}{k-1};\ k-1,\ \nu\right)

For instance, you have a calculated t statistic of 4.0 with 50 degrees of freedom. The simple t test says the significance level is .00021. The F test equivalent, 16 with 1 and 50 degrees of freedom, says the same. If you are doing three comparisons, however, you calculate an F test of 8.0 with 2 and 50 degrees of freedom, which says the significance level is .0010.
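These numbers can be verified directly with Stata's fprob() function, which returns the upper-tail probability of the F distribution; a sketch (not part of the original text):

. display fprob(1, 50, 16)    /* the single test: about .0002 */
. display fprob(2, 50, 8)     /* the three-comparison case: about .0010 */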
References
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.
Bartlett, M. S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society, Series A 160: 268-282.
Hochberg, Y. and A. C. Tamhane. 1987. Multiple Comparison Procedures. New York: John Wiley & Sons.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2d ed. New York: Springer-Verlag.
Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40: 87-104.
------. 1959. The Analysis of Variance. New York: John Wiley & Sons.
Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626-633.
Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.

Also See
Complementary:  [R] encode
Related:        [R] anova, [R] loneway, [R] table
Background:     [U] 21.8 Accessing results calculated by other programs
ultiple-comparison tests
oneway n one-way analysis of variance
Let's begin by reviewing the logic behind these adjustments. comparison of two means is
The "standard"
467
_ statistic for the
t=
/±
1
s_/ n_ + nj where s is the overall standard deviation, ffi is the measured average of ,Y in group i, and ni is the number of observations in the group. We perform hypothesis tests by calculating this t statistic. We simultaneously choose a critical level a and took up the t statistic corresponding to that level in a table. We reject the hypothesis if our calculated t exceeds the value we looked up. Alternatively, since we have a computer at our disposal, we calculate the significance-level e corresponding to our calculated t statistic and, if e < c_, we reject the hypothesis. This logic works well when we are performing a single test. Now consider what happens when we perform a number of separate tests, say n of them. Let's assume, just for discussion, that we set oLequal to 0.05 and that we will perform 6 tests. For each test we have a 0.05 probability of falsely rejecting the eq uality-of-means hypothesis. Overall, then, our chances of falsely rejecting at 1east one of the hypotheses is 1 - (1 - .05) 6 _ .26 if the tests are independent. The idea behind multiple-comparison tests is to control for the fact that we will perform multiple tests and to reduce our overall chances of falsely rejecting each hypothesis to c_ rather than letting it increase with each additional test. (See Miller 1981 and Hochberg and Tamhane 1987 for rather advanced texts on multiple-comparison procedures.) The Bonferroni adjustment (see Miller I981; also see Winer, Brown, and Michels 1991, 158-166) does this by (falsely but approximately) asserting that the critical level we should use. a, is the true critical level a divided by the number of tests n, that is, a = a'/n. For instance, if we are going to perform 6 tests, each at the .05 significance lev el, we want to adopt a critical level of .05/6 _ .00833. We can just as easily apply this logic to e, the significance level to our critical level a. If a comparison has a calculated significance adjusted for the fact of n comparisons, is n- e. If a comparison has and we perform 6 tests, then its "real" significance is .072. If we cannot reject the hypothesis. If we adopt a critical level of .10, we
associated with our t statistic, as of e, then its "real" significance, a significance level of, say, .012, adopt a crilical level of .05, we can reject it.
Of course, this calculation can go above 1, but that just means that there is no a < 1 for which we could reject the hypothesis. (This situation arises due to the crude nature of the Bonferroni adjustment.) Stata handles this case by simply calling the significance level t. Thus. the formula for the Bonferroni significance level is eb = min(1, en ) where n - k(k - 1)/2 is the number of comparisons. The Sidg_kadjustment {Si&ik 1967; also see Winer, Brown, and Michels 1991. 165-166) different and provides a tighter bound. It starts with the assertion that a=l-(1-a) Turning this formula around and substituting
1/n
calculated significance
e_=min{1,1-(1-e)
is slightly
levels, we obtain
n}
For example, if the calculated significance is 0.012 and we perform 6 tests, the "'real" significance is approximately 0.07.
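Both adjustments are easy to reproduce by hand for the example just given; a sketch (not part of the original text):

. display min(1, 6*.012)      /* Bonferroni: .072 */
. display 1 - (1-.012)^6      /* Sidak: approximately .0699 */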
Methods and Formulas
The model of one-way analysis of variance is

    y_{ij} = \mu + \alpha_i + \epsilon_{ij}

for levels i = 1, ..., k and observations j = 1, ..., n_i. Define \bar y_i as the (weighted) mean of y_{ij} over j and \bar y as the overall (weighted) mean of y_{ij}. Define w_{ij} as the weight associated with y_{ij}, which is 1 if the data are unweighted. w_{ij} is normalized to sum to n = \sum_i n_i if aweights are used and is otherwise unnormalized. w_i refers to \sum_j w_{ij} and w refers to \sum_i w_i.

The between-group sum of squares is then

    S_1 = \sum_i w_i (\bar y_i - \bar y)^2

The total sum of squares is

    S = \sum_i \sum_j w_{ij} (y_{ij} - \bar y)^2

The within-group sum of squares is given by S_e = S - S_1.

The between-group mean square is s_1^2 = S_1/(k-1) and the within-group mean square is s_e^2 = S_e/(w-k). The test statistic is F = s_1^2/s_e^2. See, for instance, Snedecor and Cochran (1989).
Bartlett's test
Bartlett's test assumes that you have m independent, normal random samples and tests the hypothesis \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2. The test statistic, M, is defined as

    M = \frac{(T-m)\ln\hat\sigma^2 - \sum_i (T_i-1)\ln\hat\sigma_i^2}
             {1 + \frac{1}{3(m-1)}\left(\sum_i \frac{1}{T_i-1} - \frac{1}{T-m}\right)}

where there are T overall observations, T_i observations in the ith group, and

    \hat\sigma_i^2 = \frac{1}{T_i-1} \sum_{j=1}^{T_i} (y_{ij} - \bar y_i)^2
    \hat\sigma^2   = \frac{1}{T-m} \sum_{i=1}^{m} (T_i-1)\hat\sigma_i^2

An approximate test of the homogeneity of variance is based on the statistic M with critical values obtained from the chi-squared distribution of m-1 degrees of freedom. See Bartlett (1937) or Judge et al. (1985, 447-449).
Weighted data
> Example
oneway can work with both weighted and unweighted data. Let's assume that you wish to perform a one-way layout of the death rate on the four Census regions of the United States using state data. Your data contain three variables, drate (the death rate), region (the region), and pop (the population of the state).

To estimate the model, you type oneway drate region [weight=pop], although one typically abbreviates weight as w. We will also add the tabulate option to demonstrate how the table of summary statistics differs for weighted data:

. oneway drate region [w=pop], tabulate
(analytic weights assumed)

   Census |            Summary of Death Rate
   region |     Mean   Std. Dev.        Freq.       Obs.
----------+----------------------------------------------
       NE |    97.15        5.82     49135283          9
  N Cntrl |    88.10        5.58     58865670         12
    South |    87.05       10.40     74734029         16
     West |    75.65        8.23     43172490         13
----------+----------------------------------------------
    Total |    87.34       10.43    2.259e+08         50

                     Analysis of Variance
    Source            SS       df      MS          F     Prob > F
------------------------------------------------------------------
Between groups    2360.92281     3   786.974272   12.17    0.0000
 Within groups    2974.09635    46    64.6542685
------------------------------------------------------------------
    Total         5335.01916    49   108.877942

Bartlett's test for equal variances:  chi2(3) = 5.4971  Prob>chi2 = 0.139

When the data are weighted, the summary table has four rather than three columns. The column labeled "Freq." reports the sum of the weights. The overall frequency is 2.259e+08, meaning that there are approximately 226 million people in the U.S. The ANOVA table is appropriately weighted. Also see [U] 14.1.6 weight.
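The overall Freq. can be checked by summing the four group frequencies; a sketch (arithmetic only, not part of the original output):

. display 49135283 + 58865670 + 74734029 + 43172490    /* 225907472, i.e., 2.259e+08 */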
Saved Results
oneway saves in r():

Scalars
  r(N)           number of observations
  r(F)           F statistic
  r(df_r)        within-group degrees of freedom
  r(mss)         between-group sum of squares
  r(df_m)        between-group degrees of freedom
  r(rss)         within-group sum of squares
  r(chi2bart)    Bartlett's chi-squared
  r(df_bart)     Bartlett's degrees of freedom
Underneath that number is reported "0.001". This is the Bonferroni-adjusted significance of the difference. The difference is significant at the 0.1% level. Looking down the column, we see that concentration 3 is also worse than concentration 1 (4.2% level), as is concentration 4 (3.6% level). Based on this evidence, we would use concentration 1 if we grew apple trees.
> Example
We can just as easily obtain the Scheffé-adjusted significance levels. Rather than specifying the bonferroni option, we specify the scheffe option. We will also add the noanova option to prevent Stata from redisplaying the ANOVA table:

. oneway weight treatment, noanova scheffe

       Comparison of Average weight in grams by Fertilizer
                           (Scheffe)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
         |
       3 |     -33.25    25.9167
         |      0.039      0.101
         |
       4 |      -34.4    24.7667      -1.15
         |      0.034      0.118      0.999

The differences are the same as we obtained in the Bonferroni output, but the significance levels are not. According to the Bonferroni-adjusted numbers, the significance of the difference between fertilizer concentrations 1 and 3 is 4.2%. The Scheffé-adjusted significance level is 3.9%.
We will leave it to you to decide which results are more accurate.
> Example
Let's conclude this example by obtaining the Šidák-adjusted multiple-comparison tests. We do this to illustrate Stata's capabilities to calculate these results. It is understood that searching across adjustment methods until you find the results you want is not a valid technique for obtaining significance levels.

. oneway weight treatment, noanova sidak

       Comparison of Average weight in grams by Fertilizer
                            (Sidak)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
         |
       3 |     -33.25    25.9167
         |      0.041      0.116
         |
       4 |      -34.4    24.7667      -1.15
         |      0.035      0.137      1.000

We find results that are similar to the Bonferroni-adjusted numbers.
[no]standard includes or suppresses only the standard deviations from the table produced by the tabulate option. See tabulate above.

[no]freq includes or suppresses only the frequencies from the table produced by the tabulate option. See tabulate above.

[no]obs includes or suppresses only the reported number of observations from the table produced by the tabulate option. If the data are not weighted, the number of observations is identical to the frequency, and by default only the frequency is reported. If the data are weighted, the frequency refers to the sum of the weights. See tabulate above.

bonferroni reports the results of a Bonferroni multiple-comparison test.

scheffe reports the results of a Scheffé multiple-comparison test.

sidak reports the results of a Šidák multiple-comparison test.
Remarks
The oneway command reports one-way analysis-of-variance (ANOVA) models. To perform a one-way layout of a variable called endog on exog, type oneway endog exog.

> Example
You run an experiment varying the amount of fertilizer used in growing apple trees. You test four concentrations, using each concentration in three groves of twelve trees each. Later in the year, you measure the average weight of the fruit. If all had gone well, you would have had three observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. You are left with the following dataset:

. use apple
(Apple trees)

. describe

Contains data from apple.dta
  obs:            10                          Apple trees
 vars:             2                          19 Jul 2000 16:04
 size:           140 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
treatment       int    %8.0g                  Fertilizer
weight          double %10.0g                 Average weight in grams

Sorted by:

. list
        treatm~t     weight
  1.           1      117.5
  2.           1      113.8
  3.           1      104.4
  4.           2       48.9
  5.           2       50.4
  6.           2       58.9
  7.           3       70.4
  8.           3       86.9
  9.           4       87.7
 10.           4       67.3
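The analysis then proceeds with oneway and one of the multiple-comparison options; the Bonferroni-adjusted output discussed elsewhere in this entry comes from a command of this form (sketch only; output omitted):

. oneway weight treatment, bonferroni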
pkcollapse -- Generate pharmacokinetic measurement dataset

Remarks
pkcollapse generates all the summary pharmacokinetic measures.
> Example
We demonstrate the use of pkcollapse with the data described at the end of [R] pk. We have drug concentration data on 15 subjects. Each subject is measured at 13 time points over a 32-hour period. Some of the records are

. list

        id   seq   time      concA      concB
  1.     1     1      0          0          0
  2.     1     1     .5   3.073403   3.712592
  3.     1     1      1   5.188444   6.230602
  4.     1     1    1.5   5.898577   7.885914
  5.     1     1      2   5.096378   9.241735
  6.     1     1      3   6.094085   13.10507
             (output omitted)

Although pksumm allows us to view all the pharmacokinetic measures, we can create a dataset containing the measures by using pkcollapse:

. pkcollapse time concA concB, id(id) stat(auc) keep(seq)
. list

        id   seq   auc_concA   auc_concB
  1.     1     1    150.9643    218.5551
  2.     2     1    146.7606    133.3201
  3.     3     1    160.6548    126.0635
  4.     4     1    157.8622    96.17461
  5.     5     1    133.6957    188.9038
  6.     7     1     160.639    223.6922
  7.     8     1    131.2604    104.0139
  8.     9     1    168.5186    237.8962
  9.    10     2    137.0627    139.7382
 10.    12     2    153.4038    202.3942
 11.    13     2    163.4593    136.7848
 12.    14     2    146.0462    104.5191
 13.    15     2    158.1457    165.8654
 14.    18     2    147.1977     139.235
 15.    19     2    164.9988    166.2391
 16.    20     2    145.3823    158.5146

The resulting dataset contains one observation per subject.
Methods and Formulas
pkcollapse is implemented as an ado-file.
The statistics generated by pkcollapse are described in [R] pkexamine.
Also See
Related:        [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm
Background:     [R] pk
pkcross -- Analyze crossover experiments

Syntax
pkcross outcome [if exp] [in range] [, param(#) treatment(varname) carryover(varname|none) sequence(varname) period(varname) id(varname) model(string) sequential ]

Description
pkcross is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkcross analyzes data from a crossover design experiment. When analyzing pharmaceutical trial data, if the treatment, carryover, and sequence variables are known, the omnibus test for separability of the treatment and carryover effects is calculated.

Options
param(#) specifies which of the 4 parameterizations to use for the analysis of a 2x2 crossover experiment. This option is ignored with higher-order crossover designs. The default is param(3). See the technical note on 2x2 crossover designs for more details.

parameterization 1 estimates the overall mean, the period effects, the treatment effects, and the carryover effects, assuming that no sequence effects exist.

parameterization 2 estimates the overall mean, the period effects, the treatment effects, and the period-by-treatment interaction, assuming that no sequence effects and no carryover effects exist.

parameterization 3 estimates the overall mean, the period effects, the treatment effects, and the sequence effects, assuming that no carryover effects exist. This is the default parameterization.

parameterization 4 estimates the overall mean, the sequence effects, the treatment effects, and the sequence-by-treatment interaction, assuming that no period or crossover effects exist. When the sequence-by-treatment interaction is equivalent to the period effect, this reduces to the third parameterization.

sequence(varname) specifies the variable that contains the sequence in which the treatment was administered. If this option is not specified, sequence(sequence) is assumed.

treatment(varname) specifies the variable that contains the treatment information. If this option is not specified, treatment(treat) is assumed.

carryover(varname|none) specifies the variable that contains the carryover information. If carry(none) is specified, the carryover effects are omitted from the model. If this option is not specified, carryover(carry) is assumed.

period(varname) specifies the variable that contains the period information. If this option is not specified, period(period) is assumed.

id(varname) specifies the variable that contains the subject identifiers. If this option is not specified, id(id) is assumed.

model(string) specifies the model to be fit. For higher-order crossover designs, this can be useful if you want to fit a model other than the default. However, anova (see [R] anova) can also be used to estimate a crossover model. The default model for higher-order crossover designs is outcome predicted by sequence, period, treatment, and carryover effects. By default, the model statement is model(sequence period treat carry).

sequential specifies that sequential sums of squares are to be estimated.
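For example, to fit the default effects without carryover terms and with sequential sums of squares, one might type (a hypothetical call using the pkshape default variable names):

. pkcross outcome, carryover(none) sequential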
Remarks
pkcross is designed to analyze crossover experiments. Use pkshape first to reshape your data; see [R] pkshape. pkcross assumes that the data were reshaped by pkshape or are organized in the same manner as produced with pkshape. Washout periods are indicated by the number 0. See the technical note in this entry for more information on analyzing 2x2 crossover experiments.

Technical Note
The 2x2 crossover design cannot be used to estimate more than four parameters because there are only four pieces of information (the four cell means) collected. pkcross uses ANOVA models to analyze the data, so one of the four parameters must be the overall mean of the model, leaving just three degrees of freedom to estimate the remaining effects (period, sequence, treatment, and carryover). Thus, the model is overparameterized. The estimation of treatment and carryover effects requires the assumption of either no period effects or no sequence effects. Some researchers maintain that it is a bad idea to estimate carryover effects at the expense of other effects. This is a limitation of this design. pkcross implements four parameterizations for this model. They are numbered sequentially from one to four and are described in the Options section of this entry.
> Example
Consider the example data published in Chow and Liu (2000) and described in [R] pkshape. We have entered and reshaped the data with pkshape, and have variables that identify the subjects, periods, treatments, sequence, and carryover treatment. To compute the ANOVA table, use pkcross:
. pkcross outcome

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            276.00     1      276.00     0.37    0.5468
  Residuals                16211.49    22      736.89     4.41    0.0005
Intrasubjects
  Treatment effect            62.79     1       62.79     0.38    0.5463
  Period effect               35.97     1       35.97     0.22    0.6474
  Residuals                 3679.43    22      167.25
-------------------------------------------------------------------------
Total                      20265.68    47

Omnibus measure of separability of treatment and carryover =  29.2893%

There is evidence of intersubject variability, but there are no other significant effects. The omnibus test for separability is a measure reflecting the degree to which the study design allows the treatment effects to be estimated independently of the carryover effects. The measure of separability of the treatment and carryover effects indicates approximately 29% separability, which can be interpreted as the degree to which the treatment and carryover effects are orthogonal; that is, the treatment and carryover effects are about 29% orthogonal. This is a characteristic of the design of the study. For a complete discussion, see Ratkowsky, Evans, and Alldredge (1993). Compared to the output in Chow and Liu (2000), the sequence effect is mislabeled as a carryover effect. See Ratkowsky, Evans, and Alldredge (1993), section 3.2, for a complete discussion of the mislabeling.
By specifying param(1), we obtain parameterization 1 for this model:

. pkcross outcome, param(1)

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Treatment effect             301.04     1      301.04     0.67    0.4189
Carryover effect             276.00     1      276.00     0.61    0.4388
Period effect                255.62     1      255.62     0.57    0.4561
Residuals                  19890.92    44      452.07
-------------------------------------------------------------------------
Total                      20265.68    47

Omnibus measure of separability of treatment and carryover =  29.2893%
i ;
_ Example Consider th ._case of tw0-treatment, four-sequence, two-period crossover design. This design is commonly referred to as Bal_am's design. Ratkowsky et al. (1993) published the following data from trial: an amantadine +
rl_]
520
pkcross-- Analyzecrossoverexperiments
. list

        id    seq   period1   period2   period3
  1.     1    -ab         9      8.75      8.75
  2.     2    -ab        12      10.5      9.75
  3.     3    -ab        17        15      18.5
  4.     4    -ab        21        21      21.5
  5.     1    -ba        23        22        18
  6.     2    -ba        15        15        13
  7.     3    -ba        13        14     13.75
  8.     4    -ba        24     22.75      21.5
  9.     5    -ba        18     17.78     16.75
 10.     1    -aa        14      12.5        13
 11.     2    -aa        27     24.25      22.5
 12.     3    -aa        19     17.25     16.25
 13.     4    -aa        30     28.25     29.75
 14.     1    -bb        21        20      19.5
 15.     2    -bb        11      10.5        10
 16.     3    -bb        20      19.5     20.75
 17.     4    -bb        25      22.5      23.5
The sequence identifier must be a string with zeros to indicate washout or baseline periods, or a number. If the sequence identifier is numeric, the order option must be specified with pkshape. If the sequence identifier is a string, pkshape will create sequence, period, and treatment identifiers without the order option. In this example, the dash is used to indicate a baseline period, which is an invalid code for this purpose. As a result, the data must be encoded; see [R] encode.

. encode seq, gen(num_seq)
. pkshape id num_seq period1 period2 period3, order(0aa 0ab 0ba 0bb)
. pkcross outcome, se

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            285.82     3       95.27     1.01    0.4180
  Residuals                 1221.49    13       93.98    59.96    0.0000
Intrasubjects
  Period effect               15.13     2        7.56     6.34    0.0048
  Treatment effect             8.48     1        8.48     8.86    0.0056
  Carryover effect             0.11     1        0.11     0.12    0.7366
  Residuals                   29.56    30        0.99
-------------------------------------------------------------------------
Total                       1560.59    50

Omnibus measure of separability of treatment and carryover =  64.6447%

In this example, the sequence specifier used dashes instead of zeros to indicate a baseline period during which no treatment was given. For pkcross to work, we need to encode the string sequence variable and then use the order option with pkshape. A word of caution: encode does not necessarily choose the first sequence to be sequence 1, as in this example. Always double-check the sequence numbering when using encode.
To finish the analysis that was started in [R] pk, little additional work is needed. The data were reshaped with pkshape and are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2
The model is fit using pkcross:

. pkcross outcome

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            378.04     1      378.04     0.29    0.5961
  Residuals                17991.26    14     1285.09     1.40    0.2691
Intrasubjects
  Treatment effect           455.04     1      455.04     0.50    0.4931
  Period effect              419.47     1      419.47     0.46    0.5102
  Residuals                12860.78    14      918.63
-------------------------------------------------------------------------
Total                      32104.59    31

Omnibus measure of separability of treatment and carryover =  29.2893%
> Example
Consider the case of a six-treatment crossover trial where the squares are not variance balanced. The following dataset is from a partially balanced crossover trial published by Ratkowsky et al. (1993):

. list

       cow    seq    period1   period2   period3   period4   block
  1.     1   adbe       38.7      37.4      34.3      31.3       1
  2.     2   baed       48.9      46.9        42      39.6       1
  3.     3   ebda       34.6      32.3      28.5      27.1       1
  4.     4   deab       35.2      33.5      28.4      25.1       1
  5.     1   dafc       32.9      33.1      27.5      25.1       2
  6.     2   fdca       30.4      29.4      26.7      23.1       2
  7.     3   cfda       30.8      29.3      26.4      23.2       2
  8.     4   acdf       25.7      26.1      23.4      18.7       2
  9.     1   efbc       25.4        26      23.9      19.9       3
 10.     2   becf       21.8      23.9      21.7      17.6       3
 11.     3   fceb       21.4        22      19.4      16.6       3
 12.     4   cbfe       22.8        21      18.6      16.1       3

In cases where there is no variance balance in the design, a square or blocking variable is needed to indicate in which treatment cell a sequence was observed, but the mechanical steps are the same.

. pkshape cow seq period1 period2 period3 period4
. pkcross outcome, model(block cow|block period|block treat carry) se

Number of obs =      48                      R-squared     =  0.9968
Root MSE      = .730751                      Adj R-squared =  0.9906

Source        |     Seq. SS     df         MS          F     Prob > F
--------------+--------------------------------------------------------
Model         |   2650.0419     30   88.3347302     165.42    0.0000
              |
block         |  1607.17045      2   803.585226    1504.85    0.0000
cow|block     |  628.621899      9   69.8468777     130.80    0.0000
period|block  |  407.531876      9   45.2813195      84.80    0.0000
treat         |  2.48979215      5   .497958429       0.93    0.4846
carry         |  4.22788534      5   .845577068       1.58    0.2179
              |
Residual      |  9.07794631     17   .533996842
--------------+--------------------------------------------------------
Total         |  2659.11985     47   56.5770181

When the model statement is used and the omnibus measure of separability is desired, specify the variables in the treatment(), carryover(), and sequence() options to pkcross.
Methods and Formulas
pkcross is implemented as an ado-file.
pkcross uses ANOVA to fit models for crossover experiments; see [R] anova.

The omnibus measure of separability is

    S = 100(1 - V)\%

where V is Cramér's V and is defined as

    V = \left\{ \frac{\chi^2 / N}{\min(r-1,\, c-1)} \right\}^{1/2}
The chi-squared statistic is calculated as

    \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

where O and E are the observed and expected counts in a table of the number of times each treatment is followed by the other treatments.
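Since V is Cramér's V for the treatment-by-carryover table, it can also be inspected directly with tabulate's V option; a sketch (not part of the original text):

. tabulate treat carry, V    /* reports Cramér's V for the two-way table */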
References
Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis and Application. New York: Marcel Dekker.

Also See
Related:        [R] pkcollapse, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
Title
pkequiv -- Perform bioequivalence tests

Syntax
pkequiv outcome treatment period sequence id [if exp] [in range] [, compare(string) limit(#) level(#) noboot fieller symmetric anderson tost ]

Description
pkequiv is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkequiv performs bioequivalence testing for two treatments. By default, pkequiv calculates a standard confidence interval symmetric about the difference between the two treatment means. pkequiv also calculates confidence intervals symmetric about zero and intervals based on Fieller's theorem. Additionally, pkequiv can perform interval hypothesis tests for bioequivalence.
Options
compare(string) specifies the two treatments to be tested for equivalence. In some cases, there may be more than two treatments, but the equivalence can only be determined between any two treatments.

limit(#) specifies the equivalence limit. The default is 20%. The equivalence limit can only be changed symmetrically; that is, it is not possible to have a 15% lower limit and a 20% upper limit in the same test.

level(#) specifies the confidence level, in percent, for confidence intervals. Note that this is not controlled by the set level command. The default is level(90).

noboot prevents the estimation of the probability that the confidence interval lies within the confidence limits. If this option is not specified, this probability is estimated by resampling the data.

fieller specifies that an equivalence interval based on Fieller's theorem is to be calculated.

symmetric specifies that a symmetric equivalence interval is to be calculated.

anderson specifies that the Anderson and Hauck hypothesis test for bioequivalence is to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.

tost specifies that the two one-sided hypothesis tests for bioequivalence are to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.
Remarks
pkequiv is designed to conduct tests for bioequivalence based on data from a crossover experiment. pkequiv requires that the user specify the outcome, treatment, period, sequence, and id variables. The data must be in the same format as produced by pkshape; see [R] pkshape.

> Example
We will conduct equivalence testing on the data introduced in [R] pk. After shaping the data with pkshape, the data are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2

We now can conduct a bioequivalence test between treat == A and treat == B.

. pkequiv outcome treat period seq id

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -30.296      30.296        -11.332       26.416
ratio:              80%        120%        92.519%     117.439%

probability test limits are within equivalence limits =  0.6350

The default output for pkequiv shows a confidence interval for the difference of the means (test limits), the ratio of the means, and the federal equivalence limits. The classic confidence interval can
be constructed around the difference between the average measure of effect for the two drugs or around the ratio of the average measure of effect for the two drugs. pkequiv reports both the difference measure and the ratio measure. For these data, U.S. federal government regulations state that the confidence interval for the difference must be entirely contained within the range [-30.296, 30.296] and, for the ratio, between 80% and 120%. In this case, the test limits are within the equivalence limits. Although the test limits are inside the equivalence limits, there is only a 63% assurance that the observed confidence interval will be within the equivalence limits in the long run. This is an interesting case because although this sample shows bioequivalence, the evaluation of the long-run performance indicates that there may be problems. These fictitious data were generated with high intersubject variability, which causes poor long-run performance.

If we conduct a bioequivalence test with the data published in Chow and Liu (2000), which we introduced in [R] pk and fully describe in [R] pkshape, we observe that the probability that the test limits are within the equivalence limits is very high. The data from Chow and Liu (2000) can be seen in expanded form in [R] pkshape. The equivalence test is

. pkequiv outcome treat period seq id

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -16.512      16.512         -8.698        4.123
ratio:              80%        120%        89.464%     104.994%

probability test limits are within equivalence limits =  0.9970

For these data, the test limits are well within the equivalence limits, and the probability that the test limits are within the equivalence limits is 99.7%.
> Example
Using the data published in Chow and Liu (2000), we compute a confidence interval that is symmetric about zero:

. pkequiv outcome treat period seq id, symmetric

    Westlake's symmetric confidence interval for bioequivalence

                      [Equivalence limits]        [ Test mean ]
Test formulation:       75.145      89.974            80.272

The reported equivalence limit is constructed symmetrically about the reference mean, which is equivalent to constructing a confidence interval symmetric about zero for the difference in the two drugs. In the above output, we see that the test formulation mean of 80.272 is within the equivalence limits, indicating that the test drug is bioequivalent to the reference drug.

pkequiv will display interval hypothesis tests of bioequivalence if you specify the tost and/or the anderson options. For example,
. pkequiv outcome treat period seq id, tost anderson

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -16.512      16.512         -8.698        4.123
ratio:              80%        120%        89.464%     104.994%

probability test limits are within equivalence limits =  0.9980

Schuirmann's two one-sided tests
          upper test statistic =    -5.036     p-value =  0.000
          lower test statistic =     3.810     p-value =  0.001

Anderson and Hauck's test
       noncentrality parameter =     4.423
                test statistic =    -0.613     empirical p-value =  0.0005

Both of Schuirmann's one-sided tests are highly significant, suggesting that the two drugs are bioequivalent. A similar conclusion is drawn from the Anderson and Hauck test of bioequivalence.
Saved Results
pkequiv saves in r():

Scalars
  r(stddev)   pooled sample std. dev. of period differences from both sequences
  r(uci)      upper confidence interval for a classic interval
  r(lci)      lower confidence interval for a classic interval
  r(delta)    delta value used in calculating a symmetric confidence interval
  r(u3)       upper confidence interval for Fieller's confidence interval
  r(l3)       lower confidence interval for Fieller's confidence interval
Methods and Formulas
pkequiv is implemented as an ado-file.

The lower confidence interval for the difference in the two treatments in the classic shortest confidence interval is

    L_1 = (\bar Y_T - \bar Y_R) - t_{(\alpha,\, n_1+n_2-2)}\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

The upper limit is

    U_1 = (\bar Y_T - \bar Y_R) + t_{(\alpha,\, n_1+n_2-2)}\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

The limits for the ratio measure are
    L_2 = \left(\frac{L_1}{\bar Y_R} + 1\right) 100\%

and

    U_2 = \left(\frac{U_1}{\bar Y_R} + 1\right) 100\%

where \bar Y_T is the mean of the test formulation of the drug, \bar Y_R is the mean of the reference formulation of the drug, and t_{(\alpha, n_1+n_2-2)} is the t distribution with n_1 + n_2 - 2 degrees of freedom. \hat\sigma_d^2 is the pooled sample variance of the period differences from both sequences, defined as

    \hat\sigma_d^2 = \frac{1}{n_1+n_2-2} \sum_{k=1}^{2} \sum_{i=1}^{n_k} (d_{ik} - \bar d_{\cdot k})^2
The upper and lower limits for the symmetric confidence interval are \bar Y_R + \Delta and \bar Y_R - \Delta, where \Delta is defined by

    k_1\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = \Delta - (\bar Y_T - \bar Y_R)

and (simultaneously)

    k_2\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = -\Delta - (\bar Y_T - \bar Y_R)

and k_1 and k_2 are computed iteratively to satisfy the above equalities and the condition

    \int_{k_1}^{k_2} f(t)\, dt = 1 - 2\alpha

where f(t) is the probability density function of the t distribution with n_1 + n_2 - 2 degrees of freedom.
See Chow and Liu (2000) for details about calculating the confidence interval based on Fieller's theorem.

The two test statistics for the two one-sided tests of equivalence are

    T_L = \frac{(\bar Y_T - \bar Y_R) - \theta_L}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

and

    T_U = \frac{(\bar Y_T - \bar Y_R) - \theta_U}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

where -\theta_L = \theta_U and are the regulated confidence limits.
The logic of the Anderson and Hauck test is tricky, and readers are encouraged to read Chow and Liu (2000) for a complete explanation. However, the test statistic is

    T_{AH} = \frac{(\bar Y_T - \bar Y_R) - (\theta_L + \theta_U)/2}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

and the noncentrality parameter is estimated by

    \hat\delta = \frac{\theta_U - \theta_L}{2\, \hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

The empirical p-value is calculated as

    p = F_t\!\left(|T_{AH}| - \hat\delta\right) - F_t\!\left(-|T_{AH}| - \hat\delta\right)

where F_t is the cumulative distribution function of the t distribution with n_1 + n_2 - 2 degrees of freedom.
;Reference _,
!
Cho_. S, C. dnd J, P. Liu. _)0. Design and Analysis df Bioavaitabili O, and BioequivalenceStudies. 2d ed. New York:Mar_t Dekker. i Neter, L. M, _t. Kutner. C, i. Nachtsheim,and W. Wa_serman.19%, Applied Linear StatisticalModels. 4th ed, i
t
Chicago:IrWin. ! Ratkowsky,D iA.. M. A. EvanS.and J. R. Alldredge,1993.'Cross-overExperiments:Design,AnalysisandApplic_tion_ Ne_ York:lMarcelDekker
Also See

Related:        [R] pkcollapse, [R] pkcross, [R] pkexamine, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
pkexamine -- Calculate pharmacokinetic measures

Syntax

pkexamine time concentration [if exp] [in range] [, fit(#) trapezoid graph { line | log | exp(#) } graph_options ]

by ... : may be used with pkexamine; see [R] by.
Description

pkexamine is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.
pkexamine calculates pharmacokinetic measures from time-and-concentration subject-level data. pkexamine computes and displays the maximum measured concentration, the time at the maximum measured concentration, the time of the last measurement, the elimination rate, the half-life, and the area under the concentration-time curve (AUC). Three estimates of the area under the concentration-time curve from 0 to infinity (AUC_{0,\infty}) are also calculated.
Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC_{0,\infty}. The default is the last 3 points, which should be viewed as a minimum; the appropriate number of points will depend on your data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most functions. In cases where the curve is very irregular, trapezoid may give better results.

line and log specify which of the estimates of the AUC_{0,\infty} to display when graphing the AUC_{0,\infty}. These options are ignored unless specified with the graph option.

exp(#) specifies that the exponential fit for the AUC_{0,\infty} be plotted. You must specify the maximum time value to which you want to plot the curve, and this time value must be greater than the maximum time measurement in the data. If you specify 0, the curve will be plotted to the point where the linear extension would cross the x-axis. This option is not valid with the line or log options and is ignored unless the graph option is also specified.

graph tells pkexamine to graph the concentration-time curve.

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
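For instance, in the example below the last measurement is at time 32, so to plot the exponential extension out to a hypothetical time of 50 (any value greater than 32 would do), one might type

. pkexamine time conc, graph exp(50)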
Remarks

pkexamine computes summary statistics for a given patient in a pharmacokinetic trial. If by idvar: is specified, statistics will be displayed for each subject in the data.
> Example

Chow and Liu (2000) present data on a study examining primidone concentrations versus time for a subject over a 32-hour period after dosing.

. list time conc

        time   conc
  1.       0      0
  2.      .5      0
  3.       1    2.8
  4.     1.5    4.4
  5.       2    4.4
  6.       3    4.7
  7.       4    4.1
  8.       6      4
  9.       8    3.6
 10.      12      3
 11.      16    2.5
 12.      24      2
 13.      32    1.6

We use pkexamine to produce the summary statistics.

. pkexamine time conc, graph
        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0279
                    Half life =  24.8503

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          107.759          142.603         142.603

  Fit based on last 3 points.

(graph: the concentration-time curve, plotted against analysis time)
The maximum concentration of 4.7 occurs at time 3, and the time of the last observation (Tmax) is 32. In addition to the AUC calculated from 0 to the maximum value of time, pkexamine also reports the area under the curve computed by extending the curve using each of three methods: a linear fit to the log of the concentration, a linear regression line, and a decreasing exponential regression line. See the Methods and Formulas section for details on these three methods.

By default, all extensions to the AUC are based on the last three points. Looking at the concentration-time graph for these data, it seems more appropriate to use the last seven points to estimate the AUC_{0,\infty}:

. pkexamine time conc, fit(7)
        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0349
                    Half life =  19.8354

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          131.027           96.805          129.181

  Fit based on last 7 points.

This decreased the estimate of the AUC_{0,\infty} for all extensions. To see a graph of the AUC_{0,\infty} using a linear extension, specify the graph and line options.

. pkexamine time conc, fit(7) graph line

        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0349
                    Half life =  19.8354

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          131.027           96.805          129.181

  Fit based on last 7 points.
(graph: the concentration-time curve with its linear extension to the time axis; the extension crosses the axis at analysis time 46.4557)
Saved Results

pkexamine saves in r():

Scalars
    r(auc)       area under the concentration curve
    r(half)      half life of the drug
    r(ke)        elimination rate
    r(tmax)      time at last concentration measurement
    r(cmax)      maximum concentration
    r(tomc)      time of maximum concentration
    r(auc_line)  AUC_{0,\infty} estimated with a linear fit
    r(auc_exp)   AUC_{0,\infty} estimated with an exponential fit
    r(auc_ln)    AUC_{0,\infty} estimated with a linear fit of the natural log
Methods and Formulas

pkexamine is implemented as an ado-file.

The AUC_{0,t_max} is defined as

    AUC_{0,t_{max}} = \int_0^{t_{max}} C_t\, dt

where C_t is the concentration at time t. By default, the integral is calculated numerically using cubic splines. However, if the trapezoidal rule is used, the AUC_{0,t_max} is given as

    AUC_{0,t_{max}} = \sum_{i=2}^{k} \frac{C_{i-1} + C_i}{2}\, (t_i - t_{i-1})

The AUC_{0,\infty} is the AUC_{0,t_max} plus AUC_{t_max,\infty}, or

    AUC_{0,\infty} = \int_0^{t_{max}} C_t\, dt + \int_{t_{max}}^{\infty} C_t\, dt

When using the linear extension to the AUC_{0,t_max}, the integration is cut off when the line crosses the x-axis. The log extension is a linear extension on the log concentration scale. The area for the exponential extension is

    AUC_{t_{max},\infty} = \int_{t_{max}}^{\infty} e^{-(\beta_0 + \beta_1 t)}\, dt = \frac{e^{-(\beta_0 + \beta_1 t_{max})}}{\beta_1}

Finally, the elimination rate K_eq is the negative of the parameter estimate for a linear regression of log concentration on time and is given in the standard manner:

    K_{eq} = -\,\frac{\sum_{i=1}^{k} (t_i - \bar{t})(\ln C_i - \overline{\ln C})}{\sum_{i=1}^{k} (t_i - \bar{t})^2}

and

    t_{1/2} = \frac{\ln 2}{K_{eq}}
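As a quick check of the trapezoidal formula, the sum can be accumulated by hand. This is a minimal sketch, assuming variables named time and conc with one subject's data in memory:

. sort time
. generate double trap = (conc + conc[_n-1])/2 * (time - time[_n-1])
. summarize trap, meanonly
. display "trapezoidal AUC[0,tmax] = " r(sum)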
I_ _'_"
:3_q
pl(examine -- Calculate pharmacokinetic measures
If,
"
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.
Also See

Related:        [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
pkshape -- Reshape (pharmacokinetic) Latin square data

Syntax

pkshape id sequence period1 period2 [periodlist] [, order(string) outcome(newvar) treatment(newvar) carryover(newvar) sequence(newvar) period(newvar) ]

Description

pkshape is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkshape reshapes the data for use with anova, pkcross, and pkequiv. Latin square and crossover data are often organized in a manner that cannot be easily analyzed with Stata. pkshape will reorganize the data in memory for use in Stata.
Options

order(string) specifies the order in which treatments were applied. If the sequence() specifier is a string variable that specifies the order, this option is not necessary. Otherwise, order() specifies how to generate the treatment and carryover variables. Any string variable can be used to specify the order. In the case of crossover designs, any washout periods can be indicated with the number 0.

outcome(newvar) specifies the name for the outcome variable in the reorganized data. By default, outcome(outcome) is used.

treatment(newvar) specifies the name for the treatment variable in the reorganized data. By default, treatment(treat) is used.

carryover(newvar) specifies the name for the carryover variable in the reorganized data. By default, carryover(carry) is used.

sequence(newvar) specifies the name for the sequence variable in the reorganized data. By default, sequence(sequence) is used.

period(newvar) specifies the name for the period variable in the reorganized data. By default, period(period) is used.
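For example, a 2 x 2 crossover dataset could be reshaped with custom names for all five generated variables; this sketch assumes hypothetical wide variables period1 and period2 (the option abbreviations follow the examples below):

. pkshape id seq period1 period2, order(ab ba) outcome(auc) treat(drug) carry(resid) period(prd) sequence(grp)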
Remarks

Often, data from a Latin square experiment are naturally organized in a manner that Stata cannot easily manage. pkshape will reorganize Latin square type data so that they can be used with anova (see [R] anova) or any pk command. This includes the classic 2 x 2 crossover design commonly used in pharmaceutical research, as well as many other Latin square designs.
> Example

Consider the example data published in Chow and Liu (2000). There are 24 patients, 12 in each sequence. Sequence 1 consists of the reference formulation followed by the test formulation; sequence 2 is the test formulation followed by the reference formulation. The measurements reported are the AUC_{0,t_max} for each patient and for each period.

. list, noobs

    ID   Sequence   Period1   Period2
     1          1    74.675    73.675
     4          1      96.4     93.25
     5          1    101.95   102.125
     6          1     79.05     69.45
    11          1     79.05    69.025
    12          1     85.95      68.7
    15          1    69.725    59.425
    16          1    86.275    76.125
    19          1   112.675   114.875
    20          1    99.525    116.25
    23          1    89.425    64.175
    24          1    55.175    74.575
     2          2    74.825     37.35
     3          2    86.875    51.925
     7          2    81.675    72.175
     8          2      92.7      77.5
     9          2     50.45    71.875
    10          2    66.125    94.025
    13          2    122.45   124.975
    14          2    99.075    85.225
    17          2     86.35    95.925
    18          2    49.925      67.1
    21          2      42.7    59.425
    22          2    91.725    114.05
The outcome for a single person is in two different variables, and the treatment that was applied to an individual is a function of the period and the sequence. To analyze these data using anova, all the outcomes need to be in one variable, and each covariate needs to be in its own variable. To reorganize these data, use pkshape:

. pkshape id seq period1 period2, order(ab ba)
. sort seq id treat
. list

        id   sequence   outcome   treat   carry   period
  1.     1          1    74.675       1       0        1
  2.     1          1    73.675       2       1        2
  3.     4          1      96.4       1       0        1
  4.     4          1     93.25       2       1        2
  5.     5          1    101.95       1       0        1
  6.     5          1   102.125       2       1        2
  7.     6          1     79.05       1       0        1
  8.     6          1     69.45       2       1        2
  9.    11          1     79.05       1       0        1
 10.    11          1    69.025       2       1        2
 11.    12          1     85.95       1       0        1
 12.    12          1      68.7       2       1        2
 13.    15          1    69.725       1       0        1
 14.    15          1    59.425       2       1        2
 15.    16          1    86.275       1       0        1
 16.    16          1    76.125       2       1        2
 17.    19          1   112.675       1       0        1
 18.    19          1   114.875       2       1        2
 19.    20          1    99.525       1       0        1
 20.    20          1    116.25       2       1        2
 21.    23          1    89.425       1       0        1
 22.    23          1    64.175       2       1        2
 23.    24          1    55.175       1       0        1
 24.    24          1    74.575       2       1        2
 25.     2          2     37.35       1       2        2
 26.     2          2    74.825       2       0        1
 27.     3          2    51.925       1       2        2
 28.     3          2    86.875       2       0        1
 29.     7          2    72.175       1       2        2
 30.     7          2    81.675       2       0        1
 31.     8          2      77.5       1       2        2
 32.     8          2      92.7       2       0        1
 33.     9          2    71.875       1       2        2
 34.     9          2     50.45       2       0        1
 35.    10          2    94.025       1       2        2
 36.    10          2    66.125       2       0        1
 37.    13          2   124.975       1       2        2
 38.    13          2    122.45       2       0        1
 39.    14          2    85.225       1       2        2
 40.    14          2    99.075       2       0        1
 41.    17          2    95.925       1       2        2
 42.    17          2     86.35       2       0        1
 43.    18          2      67.1       1       2        2
 44.    18          2    49.925       2       0        1
 45.    21          2    59.425       1       2        2
 46.    21          2      42.7       2       0        1
 47.    22          2    114.05       1       2        2
 48.    22          2    91.725       2       0        1
Now the data are organized into separate variables that indicate each factor level for each of the covariates, so the data may be used with anova or pkcross; see [R] anova and [R] pkcross.

> Example

Consider the study of background music on bank teller productivity published in Neter et al. (1996). The data are

   Week   Monday   Tuesday   Wednesday   Thursday   Friday
     1     18(D)     17(C)       14(A)      21(B)    17(E)
     2     13(C)     34(B)       21(E)      16(A)    15(D)
     3      7(A)     29(D)       32(B)      27(E)    13(C)
     4     17(E)     13(A)       24(C)      31(D)    25(B)
     5     21(B)     26(E)       26(D)      31(C)     7(A)
The numbers are the productivity scores, and the letters represent the treatment. We entered these data into Stata as

. list

       id     seq   day1   day2   day3   day4   day5
  1.    1   dcabe     18     17     14     21     17
  2.    2   cbead     13     34     21     16     15
  3.    3   adbec      7     29     32     27     13
  4.    4   eacdb     17     13     24     31     25
  5.    5   bedca     21     26     26     31      7

We reshape these data with pkshape:

. pkshape id seq day1 day2 day3 day4 day5
. list
       id   sequence   outcome   treat   carry   period
  1.    1          1        18       1       0        1
  2.    2          2        13       2       0        1
  3.    3          3         7       3       0        1
  4.    4          4        17       5       0        1
  5.    5          5        21       4       0        1
  6.    1          1        17       2       1        2
  7.    2          2        34       4       2        2
  8.    3          3        29       1       3        2
  9.    4          4        13       3       5        2
 10.    5          5        26       5       4        2
 11.    1          1        14       3       2        3
 12.    2          2        21       5       4        3
 13.    3          3        32       4       1        3
 14.    4          4        24       2       3        3
 15.    5          5        26       1       5        3
 16.    1          1        21       4       3        4
 17.    2          2        16       3       5        4
 18.    3          3        27       5       4        4
 19.    4          4        31       1       2        4
 20.    5          5        31       2       1        4
 21.    1          1        17       5       4        5
 22.    2          2        15       1       3        5
 23.    3          3        13       2       5        5
 24.    4          4        25       4       1        5
 25.    5          5         7       3       2        5
In this case, the sequence variable is a string variable that specifies how the treatments were applied, so the order option is not used. In cases where the sequence variable is a string and the order is specified, the arguments from the order option are used. We could now produce an ANOVA table:

. anova outcome seq period treat

               Number of obs =      25     R-squared     =  0.8666
               Root MSE      = 3.96232     Adj R-squared =  0.7331

      Source |  Partial SS    df       MS           F     Prob > F
       Model |     1223.60    12   101.966667      6.49     0.0014
    sequence |       82.00     4        20.50      1.31     0.3226
      period |      477.20     4       119.30      7.60     0.0027
       treat |      664.40     4       166.10     10.58     0.0007
    Residual |      188.40    12        15.70
       Total |     1412.00    24   58.8333333
> Example

Consider the Latin square crossover example published in Neter et al. (1996). The example is about apple sales given different methods for displaying apples.
   Pattern   Store   Week 1   Week 2   Week 3
      1        1      9(B)     12(C)    15(A)
               2      4(B)     12(C)     9(A)
      2        1     12(A)     14(B)     3(C)
               2     13(A)     14(B)     3(C)
      3        1      7(C)     18(A)     6(B)
               2      5(C)     20(A)     4(B)

If the data were entered into Stata as
. list

       id   seq   p1   p2   p3   square
  1.    1     1    9   12   15        1
  2.    2     1    4   12    9        2
  3.    3     2   12   14    3        1
  4.    4     2   13   14    3        2
  5.    5     3    7   18    6        1
  6.    6     3    5   20    4        2

then the data can be reorganized using descriptive names for the outcome variables.

. pkshape id seq p1 p2 p3, order(bca abc cab) seq(pattern) period(order) treat(displays)
. anova outcome pattern order displays id|pattern

               Number of obs =      18     R-squared     =  0.9562
               Root MSE      = 1.59426     Adj R-squared =  0.9069

      Source |  Partial SS    df       MS           F     Prob > F
       Model |  443.666667     9   49.2962963     19.40     0.0002
     pattern |  .333333333     2   .166666667      0.07     0.9370
       order |  233.333333     2   116.666667     45.90     0.0000
    displays |      189.00     2        94.50     37.18     0.0001
  id|pattern |       21.00     3         7.00      2.75     0.1120
    Residual |  20.3333333     8   2.54166667
       Total |      464.00    17   27.2941176
These are the same results reported by Neter et al. (1996).
> Example

Returning to the example from the [R] pk entry, the data are

. list

        id   seq   auc_concA   auc_concB
  1.     1     1    150.9643    218.5551
  2.     2     1    146.7606    133.3201
  3.     3     1    160.6548    126.0635
  4.     4     1    157.8622    96.17461
  5.     5     1    133.6957    188.9038
  6.     7     1     160.639    223.6922
  7.     8     1    131.2604    104.0139
  8.     9     1    168.5186    237.8962
  9.    10     2    137.0627    139.7382
 10.    12     2    153.4038    202.3942
 11.    13     2    163.4593    136.7848
 12.    14     2    146.0462    104.5191
 13.    15     2    158.1457    165.8654
 14.    18     2    147.1977     139.235
 15.    19     2    164.9988    166.2391
 16.    20     2    145.3823    158.5146

. pkshape id seq auc_concA auc_concB, order(ab ba)
. sort id
. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       1       0        1
  2.     1          1   218.5551       2       1        2
  3.     2          1   146.7606       1       0        1
  4.     2          1   133.3201       2       1        2
  5.     3          1   126.0635       2       1        2
  6.     3          1   160.6548       1       0        1
  7.     4          1   96.17461       2       1        2
  8.     4          1   157.8622       1       0        1
  9.     5          1   188.9038       2       1        2
 10.     5          1   133.6957       1       0        1
 11.     7          1    160.639       1       0        1
 12.     7          1   223.6922       2       1        2
 13.     8          1   131.2604       1       0        1
 14.     8          1   104.0139       2       1        2
 15.     9          1   237.8962       2       1        2
 16.     9          1   168.5186       1       0        1
 17.    10          2   137.0627       2       0        1
 18.    10          2   139.7382       1       2        2
 19.    12          2   202.3942       1       2        2
 20.    12          2   153.4038       2       0        1
 21.    13          2   163.4593       2       0        1
 22.    13          2   136.7848       1       2        2
 23.    14          2   104.5191       1       2        2
 24.    14          2   146.0462       2       0        1
 25.    15          2   165.8654       1       2        2
 26.    15          2   158.1457       2       0        1
 27.    18          2    139.235       1       2        2
 28.    18          2   147.1977       2       0        1
 29.    19          2   164.9988       2       0        1
 30.    19          2   166.2391       1       2        2
 31.    20          2   158.5146       1       2        2
 32.    20          2   145.3823       2       0        1

These data can be analyzed with pkcross or anova.
Methods and Formulas

pkshape is implemented as an ado-file.
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pksumm; [R] anova
Background:  [R] pk
pksumm -- Summarize pharmacokinetic data

Syntax

pksumm id time concentration [if exp] [in range] [, fit(#) trapezoid stat(measure) nodots notimechk graph graph_options ]

where measure is one of

    auc      area under the concentration-time curve (AUC_{0,t_max})
    aucline  area under the concentration-time curve from 0 to infinity using a linear extension
    aucexp   area under the concentration-time curve from 0 to infinity using an exponential extension
    auclog   area under the log concentration-time curve extended with a linear fit
    half     half life of the drug
    ke       elimination rate
    cmax     maximum concentration
    tomc     time of maximum concentration
    tmax     time at last concentration
Description

pksumm is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pksumm obtains the first four moments from the empirical distribution of each pharmacokinetic measurement and tests the null hypothesis that the distribution of that measurement is normally distributed.
Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC_{0,\infty}. The default is fit(3), the last 3 points. This should be viewed as a minimum; the appropriate number of points will depend on the data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most situations. In cases where the curve is very irregular, the trapezoidal rule may give better results.

stat(measure) specifies the statistic that pksumm should graph. The default is stat(auc). If the graph option is not specified, this option is ignored.

nodots suppresses the progress dots during calculation. By default, a period is displayed for every call to calculate the pharmacokinetic measures.

notimechk suppresses the check that the follow-up time for all subjects is the same. By default, pksumm expects the maximum follow-up time to be equal for all subjects.

graph requests a graph of the distribution of the statistic specified with stat().

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
_,,tt
o_z
pKsumm -- _ummanze pharmacokinetic data
-7
Remarks

pksumm produces summary statistics for the distribution of nine common pharmacokinetic measurements. If there are more than eight subjects, pksumm will also compute a test for normality on each measurement. The nine measurements summarized by pksumm are listed above and are described in the Methods and Formulas sections of [R] pkexamine and [R] pk.
> Example

We demonstrate the use of pksumm with the data described in [R] pk. We have drug concentration data on 15 subjects, each measured at 13 time points over a 32-hour period. A few of the records are

. list

        id   time       conc
  1.     1      0          0
  2.     1     .5   3.073403
  3.     1      1   5.188444
  4.     1    1.5   5.898577
  5.     1      2   5.096378
  6.     1      3   6.094085
(output omitted)
183.    15      0          0
184.    15     .5    3.86493
185.    15      1   6.432444
186.    15    1.5   6.969195
187.    15      2   6.307024
188.    15      3   6.509584
189.    15      4   6.555091
190.    15      6   7.318319
191.    15      8   5.329813
192.    15     12   5.411624
193.    15     16   3.891397
194.    15     24   5.167516
195.    15     32   2.649686
We can use pksumm to view the summary statistics for all the pharmacokinetic parameters.

. pksumm id time conc

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00
For the 15 subjects, the mean AUC_{0,t_max} is 150.74 and sigma^2 = 123.07. The skewness of -0.26 indicates that the distribution is slightly skewed left. The p-value of 0.69 for the chi-squared test of normality indicates that we cannot reject the null hypothesis that the distribution is normal.

If we were to consider any of the three variants of the AUC_{0,\infty}, we would see that there is huge variability and that the distribution is heavily skewed. A skewness different from 0 and a kurtosis different from 3 are expected because the distribution of the AUC_{0,\infty} is not normal. We now graph the distribution of the AUC_{0,t_max} and specify the graph option.

. pksumm id time conc, graph bin(20)

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00

(graph: histogram of the AUC_{0,t_max}, titled "Area Under Curve (AUC)")
By default, pksumm plots the distribution of the AUC_{0,t_max}. To plot one of the other pharmacokinetic measurements, we need to specify the stat() option. For example, we can ask Stata to produce a plot of the AUC_{0,\infty} using the log extension:

. pksumm id time conc, stat(auclog) graph bin(20)

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00

(graph: histogram of the AUC_{0,\infty}, labeled "Linear fit to log concentration")
Methods and Formulas

pksumm is implemented as an ado-file.

The chi-squared test for normality is conducted with sktest; see [R] sktest for more information on the test of normality.

The statistics reported by pksumm are identical to those reported by summarize and sktest; see [R] summarize and [R] sktest.
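As a sketch of the relationship, the normality test pksumm reports for the AUC could be reproduced by collecting pkexamine results for each subject and then running sktest; the statsby call and generated variable name below are illustrative, not output from the examples above:

. statsby "pkexamine time conc" auc=r(auc), by(id)
. sktest auc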
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape
Background:  [R] pk
plot -- Draw scatterplot using typewriter characters

Syntax

plot yvar1 [yvar2 [yvar3]] xvar [if exp] [in range] [, columns(#) encode hlines(#) lines(#) vlines(#) ]

by ... : may be used with plot; see [R] by.

Description

plot produces a two-way scatterplot of yvar against xvar using typewriter characters. If more than one yvar is specified, a single diagram is produced that overlays the plot of each yvari against xvar.

graph provides more sophisticated capabilities than does plot; see the Stata Graphics Manual.
Options colmms(#) ipecifies the dolumn width of the plot. The number of columns must lie between 30 and I33: the ddfault is 75. _ote that the plot occupids ten fewer columns than the number specific& The extra _n columns ire used to label the diagam. ! i i _ I
encode plots _oints that oct:ur more than once with _ symbol representing the number of occurrences. Points that ioccur once a!e plotted with an asterisk (*). twice with the numeral two (2), three times with the ntimeral three d3), and so on. Points tMt occur ten times are plotted with an 'A. eleven with a B ._and so on. u+td Z. The letter Z is used subsequently, encode may not be specified if there is _ore than on_ war.
! _
hlines(#) c_uses a horiz_)ntal line of dashes (-) io be drawn across the diagram every #-th line. where # re_resents a nulnber between 0 and the line height (lines) of the plot. Specifvin_ # as
i
0. which i_ the default. !esults in no horizontal lines.
i !
lines(#) spdcifies the lin_ height of the plot. The number of lines must lie between 10 and 83; the default is 4_3.Note that !the plot occupies three fewer lines than the number specified. The three extra lines iare used to l_bet the diagram
l
vlines(#)c',i es a verti I line of bars (1) to be drawn on the diagam every #-th column, where # is a number between 0 and the column width columns) of the plot. Speci_'ing # as 0. which is the default, results in lno vertical lines.
i
i ! !
I
i
2
Remarks

plot displays a line-printer plot: a scatter diagram drawn using characters available on an ordinary typewriter or line printer. As a result, this scatter diagram can be displayed on any monitor, printed on any printer, and edited by any word processor. The diagram necessarily has a rougher appearance than one designed to be displayed on a graphics monitor.
> Example

We use the plot command to display the function y = x^2 for values of x ranging between -10 and 10. Each point is plotted with an asterisk (*). The minimum and maximum values of yvar and xvar are marked, and the variable names are displayed along the axes.

. set obs 21
obs was 0, now 21
. generate x = _n - 11
. generate y = x*x
. plot y x

(plot: line-printer scatter of y against x; y runs from 0 to 100, x from -10 to 10, tracing a parabola)
> Example

You can reduce the size of a graph by specifying the number of lines and columns to be used. In this version, we plot y = x^2 in 16 lines and 50 columns:

. plot y x, lines(16) columns(50)

(plot: the same parabola drawn in a 16-line by 50-column diagram)

> Example

You can use the hlines() and vlines() options to add horizontal and vertical lines to a graph. We place a horizontal line every 5 lines and a vertical line every 10 columns by typing

. plot y x, hlines(5) vlines(10)

(plot: the parabola with dashed horizontal grid lines every 5 lines and vertical bars every 10 columns)

> Example

Real data can be messier to plot than the simple mathematical function used in the previous examples. The following plot displays the combinations of miles per gallon (mpg) and weight for 74 automobiles:

. plot mpg weight

(plot: Mileage (mpg) against Weight (lbs.); mpg runs up to 41, weight from 1760 to 4840)

Although it is not revealed by this graph, several automobiles have virtually the same mpg-weight combination; some of the asterisks represent more than one observation. The encode option reveals this:
. plot mpg weight, encode

(plot: the same diagram with coincident points shown as 2, 3, and so on)

Each '*' in this diagram represents one point, each '2' represents two points, and so on.
> Example

You can graph up to three y-variables at a time against the x-variable. The first variable is plotted with A's, the second with B's, and the third with C's. Below we graph the price of domestic and foreign cars, dprice and fprice respectively, against weight:

. plot dprice fprice weight

(plot: dprice (A) and fprice (B) against Weight (lbs.); prices run from 3291 to 15906, weight from 1760 to 4840)

The graph indicates that domestic cars typically cost less per pound than foreign cars.
Also See

Related:  Stata Graphics Manual
poisson -- Poisson regression

Syntax

poisson depvar [indepvars] [weight] [if exp] [in range] [, irr level(#) exposure(varname) offset(varname) robust cluster(varname) score(newvar) noconstant constraints(numlist) nolog maximize_options ]

poisgof

by ... : may be used with poisson; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
poisson may be used with sw to perform stepwise estimation; see [R] sw.
poisson shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { n | ir | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

poisson estimates a Poisson maximum-likelihood regression of depvar on indepvars, where depvar is a nonnegative count variable.

Persons who have panel data should see [R] xtpois.

poisgof, which may be used following poisson, performs a goodness-of-fit test of the model. If the test is significant, this would indicate that the Poisson regression model is inappropriate. In this case, you could try a negative binomial model; see [R] nbreg.
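A typical workflow, sketched with hypothetical variables y, x1, and x2:

. poisson y x1 x2
. poisgof                     // goodness-of-fit test of the Poisson model
. nbreg y x1 x2               // if poisgof rejects, one alternative; see [R] nbreg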
Options

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. irr may be specified at estimation or when replaying previously estimated results.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1; thus, exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, see [U] 23.13 Weighted estimation, but also see [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvar) creates newvar containing u_j = d lnL_j / d(x_j b) for each observation j in the sample. The score vector is sum u_j x_j, i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

Options for predict

n, the default, calculates the predicted number of events, which is exp(x_j b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j b + offset_j) if offset(varname) was specified; or exp(x_j b) * exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j b), the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
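The equivalence of exposure() and offset() can be seen directly; this sketch uses hypothetical variables y, x, and n:

. generate lnn = ln(n)
. poisson y x, offset(lnn) irr
. poisson y x, exposure(n) irr      // identical results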
Remarks

The basic idea of Poisson regression was outlined by Coleman (1964, 378-379). See Feller (1968, 156-164) for information about the Poisson distribution. See Long (1997, chapter 8), McNeil (1996, chapter 6), and Selvin (1995, chapter 12) for an introduction to Poisson regression. Also see Selvin (1996) for a discussion of the analysis of spatial distributions, including a discussion of the Poisson distribution.

Poisson regression is used to estimate models of the number of occurrences (counts) of an event. The Poisson distribution has been applied to diverse events such as the number of soldiers kicked to death by horses in the Prussian army (Bortkewitsch 1898); the pattern of hits by buzz bombs launched against London during World War II (Clarke 1946); telephone connections to a wrong number (Thorndike 1926); and disease incidence, typically with respect to time, but occasionally with respect to space. The basic assumptions are

1. There is a quantity called the incidence rate that is the rate at which events occur. Examples are 5 per second, 20 per 1,000 person-years, 17 per square meter, and 38 per cubic centimeter.

2. The incidence rate can be multiplied by exposure to obtain the expected number of observed events. For example, a rate of 5 per second multiplied by 30 seconds means 150 events are expected; a rate of 20 per 1,000 person-years multiplied by 2,000 person-years means 40 events are expected; and so on.

3. Over very small exposures epsilon, the probability of finding more than one event is small compared with epsilon.

4. Nonoverlapping exposures are mutually independent.

With these assumptions, to find the probability of k events in an exposure of size E, divide E into n subintervals E_1, E_2, ..., E_n and approximate the answer as the binomial probability of observing k successes in n trials. If you let n go to infinity, you obtain the Poisson distribution.

In the Poisson regression model, the incidence rate for the jth observation is assumed to be given by

    r_j = e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}

If E_j is the exposure, the expected number of events C_j will be

    C_j = E_j\, e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} = e^{\ln(E_j) + \beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}

This model is estimated by poisson. Without the exposure() or offset() options, E_j is assumed to be 1 (equivalent to assuming that exposure is unknown), and controlling for exposure, if necessary, is your responsibility.

One often wants to compare rates, and this is most easily done by calculating incidence rate ratios (IRR). For instance, what is the relative incidence rate of chromosome interchanges in cells as the intensity of radiation increases; the relative incidence rate of telephone connections to a wrong number as load increases; or the relative incidence rate of deaths due to cancer for females relative to males? That is, one wants to hold all the x's in the model constant except one, say the ith. The incidence rate ratio for a one-unit change in x_i is

    \frac{e^{\beta_0 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_k x_k}}{e^{\beta_0 + \cdots + \beta_i x_i + \cdots + \beta_k x_k}} = e^{\beta_i}

More generally, the incidence rate ratio for a \Delta x_i change in x_i is e^{\beta_i \Delta x_i}. The lincom command can be used after poisson to display incidence rate ratios for any group relative to another; see [R] lincom.
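A sketch of the mechanics, with hypothetical variables y, x1, x2, and n:

. poisson y x1 x2, exposure(n)
. lincom x1, irr              // e^{b1}: IRR for a one-unit change in x1
. lincom 2*x1, irr            // e^{2 b1}: IRR for a two-unit change in x1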
> Example

Chatterjee, Hadi, and Price (2000, 164) give the number of injury incidents and the proportion of flights for each airline out of the total number of flights from New York for nine major U.S. airlines in a single year:

. list

       airline   injuries        n   XYZowned
  1.         1         11   0.0950          1
  2.         2          7   0.1920          0
  3.         3          7   0.0750          0
  4.         4         19   0.2078          0
  5.         5          9   0.1382          0
  6.         6          4   0.0540          1
  7.         7          3   0.1292          0
  8.         8          1   0.0503          0
  9.         9          3   0.0629          1

To their data we have added a fictional variable, XYZowned. We will imagine that an accusation has been made that the airlines owned by XYZ Company have a higher injury rate.

. poisson injuries XYZowned, exposure(n) irr

Iteration 0:   log likelihood = -23.027197
Iteration 1:   log likelihood = -23.027177
Iteration 2:   log likelihood = -23.027177

Poisson regression                         Number of obs   =          9
                                           LR chi2(1)      =       1.77
                                           Prob > chi2     =     0.1836
Log likelihood = -23.027177                Pseudo R2       =     0.0370

    injuries        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    XYZowned   1.463467    .406872     1.37   0.171     .8486578    2.523675
           n   (exposure)

We specified irr to see the incidence rate ratios rather than the underlying coefficients. We estimate that XYZ Airlines' injury rate is 1.46 times larger than that for other airlines, but the 95% confidence interval is .85 to 2.52; we cannot even reject the hypothesis that XYZ Airlines has a lower injury rate.
. gen lnN = ln(n)
. poisson injuries XYZowned lnN

Iteration 0:   log likelihood = -22.333875
Iteration 1:   log likelihood = -22.332276
Iteration 2:   log likelihood = -22.332276

Poisson regression                         Number of obs   =          9
                                           LR chi2(2)      =      19.15
                                           Prob > chi2     =     0.0001
Log likelihood = -22.332276                Pseudo R2       =     0.3001

    injuries       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    XYZowned    .6840667   .3895877     1.76   0.079    -.0795111    1.447645
         lnN    1.424169   .3725155     3.82   0.000     .6940517    2.154285
       _cons    4.863891   .7090501     6.86   0.000     3.474178    6.253603

In this case, rather than specifying the exposure() option, we explicitly included the variable that would normalize for exposure in the model. We did not specify the irr option, so we see coefficients rather than incidence rate ratios. We started with the model

    rate = e^{\beta_0 + \beta_1 XYZowned}
    count = n\, e^{\beta_0 + \beta_1 XYZowned} = e^{\ln(n) + \beta_0 + \beta_1 XYZowned}

which amounts to constraining the coefficient on ln(n) to 1. This is what was estimated when we specified the exposure(n) option. In the above model, rather than constraining the coefficient to be 1, we estimated the coefficient ourselves.

The estimated coefficient on ln(n) is 1.42, a respectable distance away from 1, and consistent with our speculation that larger airlines also use larger airplanes. With this small amount of data, however, we also have a wide confidence interval that includes 1.

Our estimated coefficient on XYZowned is now .684, and the implied incidence rate ratio is e^.684 = 1.98 approximately (which we could also see by typing poisson, irr). The 95% confidence interval for the coefficient still includes 0 (the interval for the incidence rate ratio includes 1), so while the point estimate is now larger, we still cannot be very certain of our results.

Our expert opinion would be that, while there is insufficient evidence to support the charge, there is enough evidence to justify collecting more data.
The first step is to enter these data into Stata, which we have done:

. list

       agecat   smokes   deaths   pyears
  1.        1        1       32   52,407
  2.        2        1      104   43,248
  3.        3        1      206   28,612
  4.        4        1      186   12,663
  5.        5        1      102    5,317
  6.        1        0        2   18,790
  7.        2        0       12   10,673
  8.        3        0       28    5,710
  9.        4        0       28    2,585
 10.        5        0       31    1,462
agecat 1 corresponds to 35-44, agecat 2 to 45-54, and so on. The most "natural" analysis of these data would begin by introducing indicator variables for each age category and a single indicator for smoking:

. tab agecat, gen(a)

     agecat        Freq.     Percent        Cum.
          1            2       20.00       20.00
          2            2       20.00       40.00
          3            2       20.00       60.00
          4            2       20.00       80.00
          5            2       20.00      100.00
      Total           10      100.00

. poisson deaths smokes a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -33.823284
Iteration 1:   log likelihood = -33.600471
Iteration 2:   log likelihood = -33.600153
Iteration 3:   log likelihood = -33.600153

Poisson regression                         Number of obs   =         10
                                           LR chi2(5)      =     922.93
                                           Prob > chi2     =     0.0000
Log likelihood = -33.600153                Pseudo R2       =     0.9321

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
      smokes   1.425519   .1530838     3.30   0.001     1.154984    1.759421
          a2   4.410584   .8605197     7.61   0.000     3.009011    6.464997
          a3    13.8392   2.542638    14.30   0.000     9.654328    19.83809
          a4   28.51678   5.269878    18.13   0.000     19.85177    40.96395
          a5   40.45121   7.775511    19.25   0.000     27.75326    58.95885
      pyears   (exposure)

. poisgof

        Goodness-of-fit chi2  =  12.13244
        Prob > chi2(4)        =    0.0164

In the above, we began by using tabulate to create the indicator variables. tabulate created a1 equal to 1 when agecat = 1 and 0 otherwise; a2 equal to 1 when agecat = 2 and 0 otherwise; and so on. See [U] 28 Commands for dealing with categorical variables.

We then estimated our model, specifying irr to obtain incidence rate ratios. We estimate that smokers have 1.43 times the mortality rate of nonsmokers. We also notice, however, that the model does not appear to fit the data well; the goodness-of-fit chi-squared tells us that, given the model, we can reject the hypothesis that these data are Poisson distributed at the 1.64% significance level.

So let us now back up and be more careful. We can most easily obtain the incidence rate ratios within age categories using ir; see [R] epitab:

. ir deaths smokes pyears, by(agecat) nocrude nohet

      agecat        IRR      [95% Conf. Interval]    M-H Weight
           1   5.736638     1.463519    49.39901      1.472169  (exact)
           2   2.138812     1.173668    4.272307      9.624747  (exact)
           3    1.46824     .9863626    2.264174      23.34176  (exact)
           4    1.35606     .9082155     2.09649      23.25315  (exact)
           5   .9047304     .6000946    1.399699      24.31435  (exact)
    Combined   1.424882     1.154704    1.757784
We find that the mortality incidence ratios are greatly different within age category, being highest for the youngest categories and actually dropping below 1 for the oldest. (In the last case, we might argue that those who smoke and who have not died by age 75 are self-selected to be particularly robust.)

Seeing this, we will now parameterize the smoking effects separately for each age category, although we will begin by combining age categories 3 and 4:

. gen sa1 = smokes*(agecat==1)
. gen sa2 = smokes*(agecat==2)
. gen sa34 = smokes*(agecat==3 | agecat==4)
. gen sa5 = smokes*(agecat==5)
. poisson deaths sa1 sa2 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.635422
Iteration 1:   log likelihood = -27.788819
Iteration 2:   log likelihood = -27.573604
Iteration 3:   log likelihood = -27.572645
Iteration 4:   log likelihood = -27.572645

Poisson regression                         Number of obs   =         10
                                           LR chi2(8)      =     934.99
                                           Prob > chi2     =     0.0000
Log likelihood = -27.572645                Pseudo R2       =     0.9443

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
         sa1   5.736638   4.181257     2.40   0.017     1.374811    23.93711
         sa2   2.138812   .6520701     2.49   0.013     1.176691    3.887609
        sa34   1.412229   .2017485     2.42   0.016     1.067343    1.868557
         sa5   .9047304   .1855513    -0.49   0.625     .6052658     1.35236
          a2    10.5631   8.067702     3.09   0.002     2.364153    47.19624
          a3     47.671    34.3741     5.36   0.000     11.60056    195.8978
          a4   98.22766   70.85013     6.36   0.000     23.89324    403.8245
          a5     199.21   145.3357     7.26   0.000     47.67694     832.365
      pyears   (exposure)

. poisgof

        Goodness-of-fit chi2  =  .0774185
        Prob > chi2(1)        =    0.7808
Note that the goodness-of-fit chi-squared is now small; we are no longer running roughshod over the data. Let us now consider simplifying the model. The point estimate of the incidence rate ratio for smoking in age category 1 is much larger than that for smoking in age category 2, but the confidence interval for sa1 is similarly wide. Is the difference real?

. test sa1=sa2

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0

           chi2(  1) =    1.56
         Prob > chi2 =   0.2117

The point estimates may be far apart, but there are insufficient data, and we may be observing random differences. With that success, might we also combine the smokers in age categories 3 and 4 with those in 1 and 2?

. test sa34=sa2, accum

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0
 ( 2)  - [deaths]sa2 + [deaths]sa34 = 0.0

           chi2(  2) =    4.73
         Prob > chi2 =   0.0938

Combining age categories 1 through 4 may be overdoing it; the 9.38% significance level is enough to stop us, although others may disagree.

Thus, we now estimate our final model:

. gen sa12 = (sa1|sa2)
. poisson deaths sa12 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.967194
Iteration 1:   log likelihood = -28.524666
Iteration 2:   log likelihood = -28.514535
Iteration 3:   log likelihood = -28.514535

Poisson regression                         Number of obs   =         10
                                           LR chi2(7)      =     933.11
                                           Prob > chi2     =     0.0000
Log likelihood = -28.514535                Pseudo R2       =     0.9424

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        sa12   2.636259   .7408403     3.45   0.001     1.519791    4.572907
        sa34   1.412229   .2017485     2.42   0.016     1.067343    1.868557
         sa5   .9047304   .1855513    -0.49   0.625     .6052658     1.35236
          a2   4.294559   .8385329     7.46   0.000     2.928987    6.296797
          a3   23.42263   7.787716     9.49   0.000     12.20738    44.94164
          a4   48.26309   16.06939    11.64   0.000     25.13068    92.68856
          a5   97.87965   34.30881    13.08   0.000     49.24123     194.561
      pyears   (exposure)

The above strikes us as a fair representation of the data.

Saved Results

poisson saves in e():

Scalars
    e(N)         number of observations
    e(k)         number of variables
    e(k_eq)      number of equations
    e(k_dv)      number of dependent variables
    e(df_m)      model degrees of freedom
    e(r2_p)      pseudo R-squared
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(N_clust)   number of clusters
    e(rc)        return code
    e(chi2)      chi-squared
    e(p)         significance
    e(ic)        number of iterations
    e(rank)      rank of e(V)

Macros
    e(cmd)       poisson
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(wtype)     weight type
    e(wexp)      weight expression
    e(clustvar)  name of cluster variable
    e(vcetype)   covariance estimation method
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  Wald or LR; type of model chi-squared test
    e(offset)    offset
    e(cnslist)   constraint numbers
    e(predict)   program used to implement predict

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators
    e(ilog)      iteration log (up to 20 iterations)

Functions
    e(sample)    marks estimation sample
Methods and Formulas

poisson and poisgof are implemented as ado-files.

The Poisson probability is

    \Pr(Y = y) = \frac{e^{-\nu}\,\nu^{y}}{y!}

The log likelihood (with weights w_i and offsets) and scores are given by

    \xi_i = \mathbf{x}_i\boldsymbol{\beta} + \mathrm{offset}_i

    f(y_i) = \frac{e^{-\exp(\xi_i)}\, e^{\xi_i y_i}}{y_i!}

    \ln L = \sum_{i=1}^{n} w_i \left( -e^{\xi_i} + \xi_i y_i - \ln y_i! \right)

    \mathrm{score}(\boldsymbol{\beta})_i = y_i - e^{\xi_i}
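For unweighted data, the log likelihood can be reproduced by hand after estimation. This is a minimal sketch, assuming the dependent variable is named y and a poisson model has just been fit (predict's xb includes any offset, per the options above):

. predict double xi, xb                          // xi = x_j b + offset_j
. generate double lnf = -exp(xi) + xi*y - lngamma(y+1)
. summarize lnf, meanonly
. display "lnL by hand = " r(sum) "    e(ll) = " e(ll)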
References

Bortkewitsch, L. von. 1898. Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Clarke, R. D. 1946. An application of the Poisson distribution. Journal of the Institute of Actuaries 22: 48.

Coleman, J. S. 1964. Introduction to Mathematical Sociology. New York: Free Press.

Doll, R. and A. B. Hill. 1966. Mortality of British doctors in relation to smoking; observations on coronary thrombosis. In Epidemiological Approaches to the Study of Cancer and Other Chronic Diseases, ed. W. Haenszel. National Cancer Institute Monograph 19: 204-268.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

------. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Hilbe, J. and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18-20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186-189.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

McNeil, D. 1996. Epidemiological Research Methods. Chichester, England: John Wiley & Sons.

Poisson, S. D. 1837. Recherches sur la probabilite des jugements en matiere criminelle et en matiere civile, precedes des regles generales du calcul des probabilites. Paris: Bachelier.

Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rothman, K. J. and S. Greenland. 1998. Modern Epidemiology. 2d ed. Philadelphia: Lippincott-Raven.

Rutherford, E., J. Chadwick, and C. D. Ellis. 1930. Radiations from Radioactive Substances. Cambridge: Cambridge University Press.

Selvin, S. 1995. Practical Biostatistical Methods. Belmont, CA: Duxbury Press.

------. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.

Thorndike, F. 1926. Applications of Poisson's probability summation. Bell System Technical Journal 5: 604-624.

Tobias, A. and M. J. Campbell. 1998. sts13: Time-series regression for counts allowing for autocorrelation. Stata Technical Bulletin 46: 33-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 291-296.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] epitab, [R] glm, [R] nbreg, [R] svy estimators, [R] xtpois
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
pperron -- Phillips-Perron test for unit roots

Syntax

pperron varname [if exp] [in range] [, noconstant lags(#) trend regress ]

pperron is for use with time-series data; see [R] tsset. You must tsset your data before using pperron.
varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description

pperron performs the Phillips-Perron test for unit roots on a variable. The user may optionally exclude the constant, include a trend term, and/or include lagged values of the difference of the variable in the regression.
Options

noconstant suppresses the constant term (intercept) in the model.

lags(#) specifies the number of Newey-West lags to use in the calculation of the standard error.

trend specifies that a trend term should be included in the associated regression. This option may not be specified if noconstant is specified.

regress specifies that the associated regression table should appear in the output. By default, the regression table is not produced.
Remarks

Hamilton (1994) and Fuller (1976) give excellent overviews of this topic; see especially chapter 17 of the former. Phillips (1986) and Phillips and Perron (1988) present statistics for testing whether a time series has a unit-root, autoregressive component.
> Example

Here, we use the international airline passengers dataset (Box, Jenkins, and Reinsel 1994, Series G). This dataset has 144 observations on the monthly number of international airline passengers from 1949 through 1960.

. pperron air

Phillips-Perron test for unit root          Number of obs   =   143
                                            Newey-West lags =     4

                               Interpolated Dickey-Fuller
             Test        1% Critical    5% Critical    10% Critical
          Statistic         Value          Value           Value
 Z(rho)      -6.564        -19.943        -13.786         -11.057
 Z(t)        -1.844         -3.496         -2.887          -2.577

* MacKinnon approximate p-value for Z(t) = 0.3588

Note that we fail to reject the hypothesis that there is a unit root in this time series by looking either at the MacKinnon approximate asymptotic p-value or at the interpolated Dickey-Fuller critical values.

> Example

In this example, we examine the Canadian lynx data from Newton (1988, 587). Here we include a time trend in the calculation of the statistic.

. pperron lynx, trend

Phillips-Perron test for unit root          Number of obs   =   113
                                            Newey-West lags =     4

                               Interpolated Dickey-Fuller
             Test        1% Critical    5% Critical    10% Critical
          Statistic         Value          Value           Value
 Z(rho)     -38.365        -27.487        -20.752         -17.543
 Z(t)        -4.585         -4.036         -3.448          -3.148

* MacKinnon approximate p-value for Z(t) = 0.0011

We reject the hypothesis that there is a unit root in this time series.
Saved Results

pperron saves in r():

Scalars
    r(N)      number of observations
    r(lags)   number of lagged differences used
    r(pval)   MacKinnon approximate p-value (not included if noconstant specified)
    r(Zt)     Phillips-Perron tau test statistic
    r(Zrho)   Phillips-Perron rho test statistic
Methods and Formulas

pperron is implemented as an ado-file.

In the OLS estimation of an AR(1) process with Gaussian errors,

    y_i = \rho y_{i-1} + \epsilon_i

where \epsilon_i are independent and identically distributed as N(0, \sigma^2) and y_0 = 0, the OLS estimate (based on an n-observation time series) of the autocorrelation parameter \rho is given by

    \hat{\rho}_n = \frac{\sum_{i=1}^{n} y_{i-1} y_i}{\sum_{i=1}^{n} y_i^2}

We know that if |\rho| < 1, then \sqrt{n}(\hat{\rho}_n - \rho) \to N(0, 1-\rho^2). If this result were valid for the case \rho = 1, the resulting distribution would collapse to a point mass (the variance would be zero).

It is this motivation that drives one to check for the possibility of a unit root in an autoregressive process. In order to compute the test statistics, we compute the Phillips-Perron regression

    y_i = \alpha + \rho y_{i-1} + \epsilon_i

where we may exclude the constant or include a trend term (i). There are two statistics, Z_\rho and Z_\tau, calculated as

    Z_\rho = n(\hat{\rho}_n - 1) - \frac{n^2 \hat{\sigma}^2}{2 s_n^2} \left( \hat{\lambda}_n^2 - \hat{\gamma}_{0,n} \right)

    Z_\tau = \sqrt{\frac{\hat{\gamma}_{0,n}}{\hat{\lambda}_n^2}}\; \frac{\hat{\rho}_n - 1}{\hat{\sigma}} - \frac{1}{2}\left( \hat{\lambda}_n^2 - \hat{\gamma}_{0,n} \right) \frac{n \hat{\sigma}}{\hat{\lambda}_n s_n}

    \hat{\gamma}_{j,n} = \frac{1}{n} \sum_{i=j+1}^{n} \hat{u}_i \hat{u}_{i-j}

    \hat{\lambda}_n^2 = \hat{\gamma}_{0,n} + 2 \sum_{j=1}^{q} \left( 1 - \frac{j}{q+1} \right) \hat{\gamma}_{j,n}

    s_n^2 = \frac{1}{n-k} \sum_{i=1}^{n} \hat{u}_i^2

where \hat{u}_i is the OLS residual, k is the number of covariates in the regression, q is the number of Newey-West lags to use in the calculation of \hat{\lambda}_n, and \hat{\sigma} is the OLS standard error of \hat{\rho}.

The critical values (which have the same distribution as the Dickey-Fuller statistic; see Dickey and Fuller (1979)) included in the output are linearly interpolated from the table of values that appear in Fuller (1976), and the MacKinnon approximate p-values use the regression surface published in MacKinnon (1994).
i
BOX,EnglevaoodG. E. P.,cllffs,G. M.Nj:JenkinS,Prentic,__Hall.and G. C. Reinsel. !994. Time Series Analysis: Forecastingand Control.3d ed.
l
Dickey..D A. an_ W. A. Fuller. 1979. Distributionof the estimatorsfor autore_ressive_ time series with a umt root. Journalof theiAmerican Stati;ticalAssociation74: 427-431. Fuller,W. A. 197B.Introduction:o Statistical TimeSeries. New York:John Wiley& Sons. Hakkio.
C
S.
19i)4., sts6:
Apprcximate
p-values
for unit
root
and cointegration
tests.
Srata
Technical
Bulletin
17:
i
25-28. Repriniedin Stata Te_hnical BulletinReprints,vo]. 3, pp. 219-224. Hamilton,J. D. 1}94. Time Seri_s Analysis. Princeton:PrincetonUniversityPress.
i l
MacKinnon.J. G.!1994.Approxilaateasymptoticdistributionfunctionsfor unit-rootand cointegrationtests. Jottrnal(ff Businessand _conomic Statisics 12: 167-176. Newton,H, J. 19,_8,TIMESLAB A Time SeriesLaboratory.PacificGrove.CA: Wadsworth& Brooks/Cole.
I. g
Phillips.P.aC.B. _986. Time series regression,xith a unit root Economemca56: I021-104._."
}
Phillips,P_C. B. _nd R Pen-on. 988. Testingfor a unit root in time series regression.Biomemka 75: 335-346.
!
Also See

Complementary:  [R] tsset
Related:        [R] dfuller
prais
m Prais-Winsten I
regression
and Cochrane-Orcutt
[
I
regression nrl
II
i,
i
:-
Syntax

prais depvar [varlist] [if exp] [in range] [, corc ssesearch rhotype(rhomethod)
    twostep robust cluster(varname) hc2 hc3 noconstant hascons savespace
    nodw level(#) nolog maximize_options ]

prais is for use with time-series data; see [R] tsset. You must tsset your data before using prais.

depvar and varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

prais shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | residuals | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description prais estimates a linear regression of depvar on varlist that is corrected for first-order serially-correlated residuals using the Prais-Winsten (1954) transformed regression estimator, the Cochrane-Orcutt (1949) transformed regression estimator, or a version of the search method suggested by Hildreth and Lu (1960).
Options

corc specifies that the Cochrane-Orcutt transformation be used to estimate the equation. With this option, the Prais-Winsten transformation of the first observation is not performed, and the first observation is dropped when estimating the transformed equation; see Methods and Formulas below.

ssesearch specifies that a search be performed for the value of ρ that minimizes the sum of squared errors of the transformed equation (Cochrane-Orcutt or Prais-Winsten transformation). The search method employed is a combination of quadratic and modified bisection search using golden sections.
rhotype(rhomethod) selects a specific computation for the autocorrelation parameter ρ, where rhomethod can be

    regress    ρ_reg    = β from the residual regression e_t = β e_{t-1}
    freg       ρ_freg   = β from the residual regression e_t = β e_{t+1}
    tscorr     ρ_tscorr = e'e_{t-1} / e'e, where e is the vector of residuals
    dw         ρ_dw     = 1 - dw/2, where dw is the Durbin-Watson d statistic
    theil      ρ_theil  = ρ_tscorr (N - k)/N
    nagar      ρ_nagar  = (ρ_dw N² + k²)/(N² - k²)
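For instance, a minimal sketch of requesting the Durbin-Watson-based computation of ρ (with hypothetical variables y and x) is

    . prais y x, rhotype(dw)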
The prais estimator can use any consistent estimate of ρ to transform the equation, and each of these estimates meets that requirement. The default is regress, and it produces the minimum sum of squares solution (ssesearch option) for the Cochrane-Orcutt transformation; no computation will produce the minimum sum of squares solution for the full Prais-Winsten transformation. See Judge, Griffiths, Hill, Lütkepohl, and Lee (1985) for a discussion of each of the estimates of ρ.

twostep specifies that prais will stop on the first iteration after the equation is transformed by ρ, the two-step efficient estimator. Although it is customary to iterate these estimators to convergence, they are efficient at each step.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). See [U] 23.11 Obtaining robust variance estimates.

Note that all estimates from prais are conditional on the estimated value of ρ. This means that robust variance estimates in this case are only robust to heteroskedasticity and are not generally robust to misspecification of the functional form or omitted variables. The estimation of the functional form is intertwined with the estimate of ρ, and all estimates are conditional on ρ. Thus, we cannot be robust to misspecification of functional form. For these reasons, it is probably best to interpret robust in the spirit of White's (1980) original paper on estimation of heteroskedastic-consistent covariance matrices.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. Specifying cluster() implies robust.

hc2 and hc3 specify an alternative bias correction for the robust variance calculation; for more information, see [R] regress. hc2 and hc3 may not be specified with cluster(). Specifying hc2 or hc3 implies robust.
hascons indicates that a user-defined constant, or a set of variables that in linear combination form a constant, has been included in the regression. For some computational concerns, see the discussion in [R] regress.
sNcifies that pz ais attempt to save as much space as possible by retaining only those
!!
variables for eslimation. Theused original are isrestored after space estimation. This option rarely usedre+uired hnd should g meratty be only data if there insufficient to estimate a modelis
[
without the _ption.
!
nodw suppresses reporting of the Durbin-Watson statistic.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
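As a minimal sketch drawing several of these options together (y, x, and the cluster variable id are hypothetical), each of the following requests a robust variance estimate, since hc3 and cluster() imply robust:

    . prais y x, robust
    . prais y x, hc3
    . prais y x, cluster(id)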
Options for predict

xb, the default, calculates the fitted values, the prediction of x_j b for the specified equation. This is the linear predictor from the estimated regression model; it does not apply the estimate of ρ to prior residuals.

residuals calculates the residuals from the linear prediction.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value. As computed for prais, this is strictly the standard error from the variance in the estimates of the parameters of the linear model, under the assumption that ρ is estimated without error.
Remarks

The most common autocorrelated error process is the first-order autoregressive process. Under this assumption, the linear regression model may be written

$$ y_t = x_t\beta + u_t $$

where the errors satisfy

$$ u_t = \rho u_{t-1} + e_t $$

and the $e_t$ are independent and identically distributed as $N(0,\sigma^2)$. The covariance matrix $\Psi$ of the error term $u$ may then be written as

$$ \Psi = \frac{1}{1-\rho^2}
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{pmatrix} $$
The Prais-Winsten estimator is a generalized least squares (GLS) estimator. The Prais-Winsten method (as described in Judge et al. 1985) is derived from the AR(1) model for the error term described above. Whereas the Cochrane-Orcutt method uses a lag definition and loses the first observation in the iterative method, the Prais-Winsten method preserves that first observation. In small samples, this can be a significant advantage.
Technical Note

To estimate a model with autocorrelated errors, you must specify your data as time series and have (or create) a variable denoting the time at which an observation was collected. The data for the regression should be equally spaced in time.
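For example, if the observations are already in time order but no time variable exists, a minimal sketch is to create one from the observation number and declare it:

    . generate t = _n
    . tsset t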
Example

You wish to estimate a time-series model of usr on idle but are concerned that the residuals may be serially correlated. We will declare the variable t to represent time by typing

    . tsset t

We can obtain Cochrane-Orcutt estimates by specifying the corc option:

    . prais usr idle, corc
    Iteration 0:  rho = 0.0000
    Iteration 1:  rho = 0.3518
     (output omitted)
    Iteration 13: rho = 0.5708

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      29
    -------------+------------------------------      F(  1,    27) =    6.49
           Model |  40.1309584     1  40.1309584      Prob > F      =  0.0168
        Residual |  166.898474    27  6.18142498      R-squared     =  0.1938
    -------------+------------------------------      Adj R-squared =  0.1640
           Total |  207.029433    28  7.39390831      Root MSE      =  2.4862

    (coefficient table omitted; the estimated equation appears below)

    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.4662

The estimated model is

    usr_t = -.125 idle_t + 14.55 + u_t        and        u_t = .5708 u_{t-1} + e_t
We can also estimate the model with the Prais-Winsten method:

    . prais usr idle
    Iteration 0:  rho = 0.0000
    Iteration 1:  rho = 0.3518
     (output omitted)
    Iteration 14: rho = 0.5535

    Prais-Winsten AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      30
    -------------+------------------------------      F(  1,    28) =    7.12
           Model |  43.0076941     1  43.0076941      Prob > F      =  0.0125
        Residual |  169.165739    28  6.04163354      R-squared     =  0.2027
    -------------+------------------------------      Adj R-squared =  0.1742
           Total |  212.173433    29  7.31632528      Root MSE      =   2.458

    ------------------------------------------------------------------------------
             usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            idle |  -.1356522   .0472195    -2.87   0.008    -.2323769   -.0389275
           _cons |   15.20415   4.160391     3.65   0.001     6.681978    23.72633
    -------------+----------------------------------------------------------------
             rho |      .5535
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.476004

where the Prais-Winsten estimated model is

    usr_t = -.1357 idle_t + 15.20 + u_t        and        u_t = .5535 u_{t-1} + e_t

As the results indicate, for these data there is little to choose between the Cochrane-Orcutt and Prais-Winsten estimators, whereas the OLS estimate of the slope parameter is substantially different.
Example

We have data on quarterly sales, in millions of dollars, for five years, and we would like to use this information to model sales for company X. First, we estimate a linear model by OLS and obtain the Durbin-Watson statistic using dwstat; see [R] regression diagnostics.

    . regress csales isales

          Source |       SS       df       MS         Number of obs =      20
    -------------+------------------------------      F(  1,    18) =14888.15
           Model |  110.256901     1  110.256901      Prob > F      =  0.0000
        Residual |  .133302302    18  .007405683      R-squared     =  0.9988
    -------------+------------------------------      Adj R-squared =  0.9987
           Total |  110.390204    19  5.81001072      Root MSE      =  .08606

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1762828   .0014447   122.02   0.000     .1732475    .1793181
           _cons |  -1.454753   .2141461    -6.79   0.000    -1.904657   -1.004849
    ------------------------------------------------------------------------------

    . dwstat

    Durbin-Watson d-statistic(  2,    20) =  .7347276
Noting that the Durbin-Watson statistic is far from 2 (the expected value under the null hypothesis of no serial correlation) and well below the 5% lower limit of 1.2, we conclude that the disturbances are serially correlated. (Upper and lower bounds for the d statistic can be found in most econometrics texts, e.g., Harvey 1993. The bounds have been derived for only a limited combination of regressors and observations.) We correct for the autocorrelation using the ssesearch option of prais to search for the value of ρ that minimizes the sum of squared residuals of the Cochrane-Orcutt transformed equation. Normally the default Prais-Winsten transformation would be used with such a small dataset, but the less efficient Cochrane-Orcutt transformation will allow us to demonstrate an aspect of the estimator's convergence.
    . prais csales isales, corc ssesearch
    Iteration 1:  rho = 0.8944, criterion = -.07298558
    Iteration 2:  rho = 0.8944, criterion = -.07298558
     (output omitted)
    Iteration 15: rho = 0.9588, criterion = -.07167037

    Cochrane-Orcutt AR(1) regression -- SSE search estimates

          Source |       SS       df       MS         Number of obs =      19
    -------------+------------------------------      F(  1,    17) =  553.14
           Model |  2.33199178     1  2.33199178      Prob > F      =  0.0000
        Residual |  .071670369    17  .004215904      R-squared     =  0.9702
    -------------+------------------------------      Adj R-squared =  0.9684
           Total |  2.40366215    18  .133536786      Root MSE      =  .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761624
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419
It was noted in the Options section that, with the default computation of ρ, the Cochrane-Orcutt method produces an estimate of ρ that minimizes the sum of squared residuals, the same criterion as the ssesearch option. Given that the two methods produce the same results, why would the search method ever be preferred? It turns out that the back-and-forth iterations employed by Cochrane-Orcutt can often have difficulty converging if the value of ρ is large. Using the same data, the Cochrane-Orcutt iterative procedure requires over 350 iterations to converge, and a higher tolerance must be specified to prevent premature convergence:

    . prais csales isales, corc tol(1e-9) iterate(500)
    Iteration 0:   rho = 0.0000
    Iteration 1:   rho = 0.5312
    Iteration 2:   rho = 0.5866
    Iteration 3:   rho = 0.7161
    Iteration 4:   rho = 0.7373
    Iteration 5:   rho = 0.7550
     (output omitted)
    Iteration 377: rho = 0.9588
    Iteration 378: rho = 0.9588
    Iteration 379: rho = 0.9588

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      19
    -------------+------------------------------      F(  1,    17) =  553.14
           Model |  2.33199171     1  2.33199171      Prob > F      =  0.0000
        Residual |  .071670369    17  .004215904      R-squared     =  0.9702
    -------------+------------------------------      Adj R-squared =  0.9684
           Total |  2.40366208    18  .133536782      Root MSE      =  .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761625
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419

Once convergence is achieved, the two methods produce identical results.
q
Saved Results

prais saves in e():

Scalars
    e(N)        number of observations
    e(mss)      model sum of squares
    e(df_m)     model degrees of freedom
    e(rss)      residual sum of squares
    e(df_r)     residual degrees of freedom
    e(r2)       R-squared
    e(r2_a)     adjusted R-squared
    e(F)        F statistic
    e(rmse)     root mean square error
    e(ll)       log likelihood
    e(N_clust)  number of clusters
    e(rho)      autocorrelation parameter ρ
    e(dw)       Durbin-Watson d statistic of transformed regression
    e(dw_0)     Durbin-Watson d statistic for untransformed regression
    e(tol)      target tolerance
    e(max_ic)   maximum number of iterations
    e(ic)       number of iterations
    e(N_gaps)   number of gaps

Macros
    e(cmd)       prais
    e(depvar)    name of dependent variable
    e(clustvar)  name of cluster variable
    e(rhotype)   method specified in rhotype() option
    e(method)    twostep, iterated, or SSE search
    e(vcetype)   covariance estimation method
    e(tranmeth)  corc or prais
    e(cons)      noconstant or not reported
    e(predict)   program used to implement predict

Matrices
    e(b)  coefficient vector
    e(V)  variance-covariance matrix of the estimators

Functions
    e(sample)  marks estimation sample
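As a minimal sketch, the saved results can be inspected after estimation in the usual way (y and x are hypothetical variable names):

    . prais y x
    . display e(rho)
    . matrix list e(b)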
Methods and Formulas

prais is implemented as an ado-file.

Consider the command 'prais y x z'. The 0-th iteration is obtained by estimating a, b, and c from the standard linear regression

$$ y_t = a x_t + b z_t + c + u_t $$

An estimate of the correlation in the residuals is then obtained. By default, prais uses the auxiliary regression

$$ u_t = \rho u_{t-1} + e_t $$
This can be changed to any of the computations noted in the rhotype() option.

Next we apply a Cochrane-Orcutt transformation (1) for observations t = 2, ..., n

$$ y_t - \rho y_{t-1} = a(x_t - \rho x_{t-1}) + b(z_t - \rho z_{t-1}) + c(1-\rho) + v_t \qquad (1) $$

and the transformation (1') for t = 1

$$ \sqrt{1-\rho^2}\,y_1 = a\sqrt{1-\rho^2}\,x_1 + b\sqrt{1-\rho^2}\,z_1 + c\sqrt{1-\rho^2} + \sqrt{1-\rho^2}\,v_1 \qquad (1') $$

Thus, the differences between the Cochrane-Orcutt and the Prais-Winsten methods are that the latter uses equation (1') in addition to equation (1), whereas the former uses only equation (1) and necessarily decreases the sample size by one.

Equations (1) and (1') are used to transform the data and obtain new estimates of a, b, and c. When the twostep option is specified, the estimation process is halted at this point, and these are the estimates reported. Under the default behavior of iterating to convergence, this process is repeated until the change in the estimate of ρ is within a specified tolerance.

The new estimates are used to produce fitted values

$$ \widehat{y}_t = \widehat{a}x_t + \widehat{b}z_t + \widehat{c} $$

and then ρ is re-estimated, by default using the regression defined by

$$ y_t - \widehat{y}_t = \rho(y_{t-1} - \widehat{y}_{t-1}) + u_t \qquad (2) $$

We then re-estimate equation (1) using the new estimate of ρ and continue to iterate between (1) and (2) until the estimate of ρ converges. Convergence is declared after iterate() iterations or when the absolute difference in the estimated correlation between two iterations is less than tol(); see [R] maximize. Sargan (1964) has shown that this process will always converge.

Under the ssesearch option, a combined quadratic and bisection search using golden sections is used to search for the value of ρ that minimizes the sum of squared residuals from the transformed equation. The transformation may be either the Cochrane-Orcutt (1 only) or the Prais-Winsten (1 and 1').

All reported statistics are based on the ρ-transformed variables, and there is an assumption that ρ is estimated without error. See Judge et al. (1985) for details.
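To make the transformation concrete, the following minimal sketch performs a single Cochrane-Orcutt step by hand for the usr/idle example of the Remarks; the scalar r and the variables yt and xt are created here only for illustration, and the data are assumed to be tsset:

    * hold the current estimate of rho in a scalar
    . scalar r = .5708
    * apply equation (1) to the dependent variable and the regressor
    . generate double yt = usr - r*L.usr
    . generate double xt = idle - r*L.idle
    * re-estimate the slope and intercept from the transformed data
    . regress yt xt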
The Durbin-Watson d statistic reported by prais and dwstat is

$$ d = \frac{\sum_{j=1}^{n-1}\left(u_{j+1} - u_j\right)^2}{\sum_{j=1}^{n} u_j^2} $$

where $u_j$ represents the residual of the jth observation.
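As a minimal sketch, d can be reproduced by hand after any regression (the data are assumed tsset; u, d2, and u2 are hypothetical variable names created only for illustration):

    . regress y x
    . predict double u, residuals
    * squared successive differences and squared residuals
    . generate double d2 = (u - L.u)^2
    . generate double u2 = u^2
    * summarize saves the column sum in r(sum)
    . summarize d2
    . scalar num = r(sum)
    . summarize u2
    . display "Durbin-Watson d = " num/r(sum)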
Acknowledgment

We thank Richard Dickens of the Centre for Economic Performance at the London School of Economics and Political Science for testing and assistance with an early version of this command.
References

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Cochrane, D. and G. H. Orcutt. 1949. Application of least-squares regression to relationships containing autocorrelated error terms. Journal of the American Statistical Association 44: 32-61.

Durbin, J. and G. S. Watson. 1950 and 1951. Testing for serial correlation in least-squares regression. Biometrika 37: 409-428 and 38: 159-178.

Hardin, J. W. 1995. sts10: Prais-Winsten regression. Stata Technical Bulletin 25: 26-29. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 234-237.

Harvey, A. C. 1993. The Econometric Analysis of Time Series. Cambridge, MA: MIT Press.

Hildreth, C. and J. Y. Lu. 1960. Demand relations with autocorrelated disturbances. Agricultural Experiment Station Technical Bulletin 276. East Lansing, MI: Michigan State University.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Prais, S. J. and C. B. Winsten. 1954. Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.

Sargan, J. D. 1964. Wages and prices in the United Kingdom: a study in econometric methodology. In Econometric Analysis for National Economic Planning, ed. P. E. Hart, G. Mills, and J. K. Whitaker, 25-64. London: Butterworths.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary: [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xi
Related:       [R] regress, [R] regression diagnostics
Background:    [U] 16.5 Accessing coefficients and standard errors,
               [U] 23 Estimation and post-estimation commands,
               [U] 23.11 Obtaining robust variance estimates
Title

predict -- Obtain predictions, residuals, etc., after estimation

Syntax

After single-equation (SE) estimators

    predict [type] newvarname [if exp] [in range] [, xb stdp nooffset other_options ]

After multiple-equation (ME) estimators

    predict [type] newvarname [if exp] [in range] [, equation(eqno[, eqno]) xb stdp
        stddp nooffset other_options ]
Description

predict calculates predictions, residuals, influence statistics, and the like after estimation. Exactly what predict can do is determined by the previous estimation command; command-specific options are documented with each estimation command. Regardless of command-specific options, the actions of predict share certain similarities across estimation commands:

1) Typing predict newvarname creates newvarname containing "predicted values", numbers related to the E(y_j|x_j). For instance, after linear regression, predict newvarname creates x_j b and, after probit, creates the probability Φ(x_j b).

2) predict newvarname, xb creates newvarname containing x_j b. This may be the same result as (1) (e.g., linear regression) or different (e.g., probit), but regardless, option xb is allowed.

3) predict newvarname, stdp creates newvarname containing the standard error of the linear prediction x_j b.

4) predict newvarname, other_options may create newvarname containing other useful quantities; see help or the reference manual entry for the particular estimation command to find out about other available options.

5) Adding the nooffset option to any of the above requests that the calculation ignore any offset or exposure variable specified by including the offset(varname) or exposure(varname) options when you estimated the model.

predict can be used to make in-sample or out-of-sample predictions:

6) In general, predict calculates the requested statistic for all possible observations, whether they were used in estimating the model or not. predict does this for standard options (1) through (3) and generally does this for estimator-specific options (4).

7) To restrict the prediction to the estimation subsample, type

    . predict newvarname if e(sample), ...

8) Some statistics make sense only with respect to the estimation subsample. In such cases, the calculation is automatically restricted to the estimation subsample, and the documentation for the specific option states this. Even so, you can still specify if e(sample) if you are uncertain.
9) predict's ability to make out-of-sample predictions even extends to other datasets. In particular, you can

    . use ds1
    (estimate a model)
    . use two              /* another dataset */
    . predict hat, ...     /* fill in the predictions */
Options

xb calculates the linear prediction from the estimated model. That is, all models can be thought of as estimating a set of parameters b1, b2, ..., bk, and the linear prediction is yhat_j = b1 x1j + b2 x2j + ... + bk xkj, often written in matrix notation as yhat_j = x_j b. In the case of linear regression, the values yhat_j are called the predicted values or, for out-of-sample predictions, the forecast. In the case of logit and probit models, for example, yhat_j is called the logit or probit index. It is important to understand that the x1j, x2j, ..., xkj used in the calculation are obtained from the data currently in memory and do not have to correspond to the data on the independent variables used to estimate the model (obtaining b1, b2, ..., bk).

stdp calculates the standard error of the prediction after any estimation command. Here the prediction is understood to mean the same thing as the "index", namely x_j b. The statistic produced by stdp can be thought of as the standard error of the predicted expected value, or mean index, for the observation's covariate pattern. This is also commonly referred to as the standard error of the fitted value. The calculation can be made in or out of sample.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_{1j}b - x_{2j}b) between equations 1 and 2 is calculated.
equation(eqno[, eqno]), synonym outcome(), is relevant only when you have previously estimated a multiple-equation model. It specifies to which equation you are referring.

equation() is typically filled in with one eqno; it would be filled in that way with options xb and stdp, for instance. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours.

If you do not specify equation(), results are as if you specified equation(#1).

Other statistics refer to between-equation concepts; stddp is an example. In those cases, you might specify equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.
nooffset may be combined with most statistics and specifies that the calculation should be made ignoring any offset or exposure variable specified when the model was estimated. This option is available even if not documented for predict after a specific command. If neither the offset(varname) option nor the exposure(varname) option was specified when the model was estimated, specifying nooffset does nothing.

other_options refers to command-specific options that are documented with each command.
Remarks

Remarks are presented under the headings

    Estimation-sample predictions
    Out-of-sample predictions
    Residuals
    Single-equation (SE) estimation
    Multiple-equation (ME) estimation

Most of the examples are presented using linear regression, but the general syntax is applicable to all estimators.

One can think of any estimation command as estimating a set of coefficients b1, b2, ..., bk corresponding to the variables x1, x2, ..., xk, along with a (possibly empty) set of ancillary statistics γ1, γ2, ..., γm. All estimation commands save the bi's and γi's. predict accesses that saved information and combines it with the data currently in memory to make various calculations. For instance, the linear prediction is yhat_j = b1 x1j + b2 x2j + ... + bk xkj. Among other things, predict can make that calculation. The data on which predict makes the calculation can be the same data on which the model was estimated or a different dataset; it does not matter. predict uses the saved parameter estimates from the model and, for each observation in the data, obtains the corresponding values of x, and then combines them to produce the desired result.

Estimation-sample predictions

Example
:
We have a _4-observatio I dataset on automobiles, including the mileage rating (mpg), the car's }
weight (_eigh!),
and wheth._r the car is foreign ffo_eign).
• regres_ _!
mpg weight
SoUrce
I
M(del
I
Resit [ual
I
_ T_tal
I
I
......
Number
of obs =
22
I
427.990298
Prob
> F
=
0.0005
20
24.493666_
R-squared
=
0.4663
2t
43.7077922
Adj R'squared Root MSE
= =
0.4396 4.9491
917.8
_3636
_
!
............ C_ef. -.01 )426
._?ns
48. !)183
To obtain the
MS
489.8 T3338
If we were to ty _e predict
I
df
427,9_0298
weight
1 i_
if foreign _S
mpg
!
We estimate the model
Std. Err. .0024942 5.871851
t
P>[t [
[95Y, Conf.
Interval]
-4.18
0.000
-.0156287
- .0052232
8,3'3
O. 000
36,66983
61,16676
mpg now, we would obtain
e linear predictions for all 74 observations.
_edictions _iusI for the sample on which we estimated the model, we could type
. predict I pmpg
if e(s_unple)
(option (52 missihg x_ assumed; values ge_erated) f:.tted values) !
:
!
In this e×ample_
I
e_;timatedI the nlodel and the: e are no missing values among the relevam variables. Had there been missing ,,,alues._e (sample) ,'ould also account for t_ose.
e(sample)
is true only for foreign cars because we typed
if
foreign
when we
!
I
By the way, the if e(sample) restriction can be used with any Stata command, so to obtain summary statistics on the estimation sample, we could type

    . summarize if e(sample)
    (output omitted)

q
Out-of-sample predictions

By out-of-sample predictions, we mean predictions extending beyond the estimation sample. In the example above, typing predict pmpg would generate linear predictions using all 74 observations.

predict will work on other datasets, too. You can use a new dataset and type predict to obtain results for that sample.
Example

Using the same auto dataset, assume that you wish to estimate the model

    mpg = β1 weight + β2 weight² + β3 foreign + β4

We first create the weight2 variable and then type the regress command:

    . use auto
    (1978 Automobile Data)
    . generate weight2=weight^2
    . regress mpg weight weight2 foreign

          Source |       SS       df       MS         Number of obs =      74
    -------------+------------------------------      F(  3,    70) =   52.25
           Model |  1689.15372     3   563.05124      Prob > F      =  0.0000
        Residual |   754.30574    70  10.7757963      R-squared     =  0.6913
    -------------+------------------------------      Adj R-squared =  0.6781
           Total |  2443.45946    73  33.4720474      Root MSE      =  3.2827

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
         weight2 |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
         foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
           _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913
    ------------------------------------------------------------------------------

Were we to type predict pmpg now, we would obtain predictions for all 74 cars in the current data. Instead, we are going to use a new dataset.

The dataset newautos.dta contains the make, weight, and place of manufacture of two cars, the Pontiac Sunbird and the Volvo 260. Let's use the dataset and create the predictions:

    . use newautos
    (New Automobile Models)
    . list
                make   weight    foreign
     1. Pont. Sunbird    2690   Domestic
     2.     Volvo 260    3170    Foreign

    . predict mpg
    (option xb assumed; fitted values)
    weight2 not found
    r(111);

Things did not work. We typed predict mpg, and Stata responded with the message "weight2 not found". predict can calculate predicted values on a different dataset only if that dataset contains the variables that went into the model. In this case, our data do not contain a variable called weight2. weight2 is just the square of weight, so we can create it and try again:

    . generate weight2=weight^2
    . predict mpg
    (option xb assumed; fitted values)
    . list
                make   weight    foreign    weight2        mpg
     1. Pont. Sunbird    2690   Domestic    7236100   23.47137
     2.     Volvo 260    3170    Foreign   1.00e+07   17.78846

We obtained our predicted values. The Pontiac Sunbird has a predicted mileage rating of 23.5 mpg, whereas the Volvo 260 has a predicted rating of 17.8 mpg. By way of comparison, the actual mileage ratings are 24 for the Pontiac and 17 for the Volvo.
q
Residuals

Example

With many estimators, predict can calculate more than predicted values. With most regression-type estimators, we can, for instance, obtain residuals. Using our regression example, we return to our original data and obtain residuals by typing

    . use auto, clear
    (1978 Automobile Data)
    . predict double resid, residuals
    . summarize resid

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
       resid |     74   -1.78e-15    3.214491  -5.636126   13.85172

Notice that we did this without re-estimating the model. Stata always remembers the last set of estimates, even as we use new datasets.

It was not necessary to type the double in predict double resid, residuals, but we wanted to remind you that you can specify the type of a variable in front of the variable's name; see [U] 14.4.2 Lists of new variables. We made the new variable resid a double rather than the default float.

If you want your residuals to have a mean as close to zero as possible, remember to request the extra precision of double. If we had not specified double, the mean of resid would have been roughly 10^-8 rather than 10^-14. Although 10^-14 sounds more precise than 10^-8, the difference really does not matter.

For linear regression, predict can also calculate standardized residuals and studentized residuals with the options rstandard and rstudent; for examples, see [R] regression diagnostics.
Single-equation (SE) estimation

If you have not read the discussion above on using predict after linear regression, please do so. Also note that predict's default calculation almost always produces a statistic in the same metric as the dependent variable of the estimated model, e.g., predicted counts for Poisson regression. In any case, xb can always be specified to obtain the linear prediction.

predict is also willing to calculate the standard error of the prediction, which is obtained by using the inverse-matrix-of-second-derivatives estimate for the covariance matrix of the estimators.
Example

After most binary outcome models (e.g., logistic, logit, probit, cloglog, scobit), predict calculates the probability of a positive outcome if you do not tell it otherwise. You can specify the xb option if you want the linear prediction (also known as the logit or probit index). The odd abbreviation xb is meant to suggest Xβ. In logit and probit models, for example, the predicted probability is p = F(Xβ), where F() is the logistic or normal cumulative distribution function, respectively.

    . logistic foreign mpg weight
    (output omitted)
    . predict phat
    (option p assumed; Pr(foreign))
    . predict idxhat, xb
    . summarize foreign phat idxhat

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
     foreign |     74    .2972973    .4601885          0          1
        phat |     74    .2972973    .3052979    .000729   .8980594
      idxhat |     74   -1.678202    2.321509  -7.223107   2.175845

Since this is a logit model, we could obtain the predicted probabilities ourselves from the predicted index

    . gen phat2 = exp(idxhat)/(1+exp(idxhat))

but using predict without options is easier.

q
Example

For all models, predict attempts to produce a predicted value in the same metric as the dependent variable of the model. We have seen that for dichotomous outcome models, the default statistic produced by predict is the probability of a success. Similarly, for Poisson regression, the default statistic produced by predict is the predicted count for the dependent variable. You can always specify the xb option to obtain the linear combination of the coefficients with an observation's x values (the inner product of the coefficients and x values). For poisson (without an explicit exposure), this is the natural log of the count.

    . poisson injuries XYZowned
    (output omitted)
    . predict injhat
    (option n assumed; predicted number of events)
    . predict idx, xb
    . gen exp_idx = exp(idx)
    . summarize injuries injhat exp_idx idx

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
    injuries |      9    7.111111    5.487359          1         19
      injhat |      9    7.111111     .833333          6   7.666667
     exp_idx |      9    7.111111     .833333          6   7.666667
         idx |      9    1.955174     .122561   1.791759   2.036882

We note that our "hand-computed" prediction of the count (exp_idx) exactly matches what was produced by the default operation of predict.
If our model has an exposure-time variable, we can use predict to obtain the linear prediction with or without the exposure. Let's verify what we are getting by obtaining the linear prediction with and without exposure, transforming these predictions to count predictions, and comparing them with the default count prediction from predict. We must remember to multiply by the exposure time when using predict ..., nooffset.

    . poisson injuries XYZowned, exposure(n)
    (output omitted)
    . predict double injhat
    (option n assumed; predicted number of events)
    . predict double idx, xb
    . gen double exp_idx = exp(idx)
    . predict double idxn, xb nooffset
    . gen double exp_idxn = exp(idxn)*n
    . summarize injuries injhat exp_idx exp_idxn idx idxn

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
    injuries |      9    7.111111    5.487359          1         19
      injhat |      9    7.111111     3.10936   2.919621   12.06158
     exp_idx |      9    7.111111     3.10936   2.919621   12.06158
    exp_idxn |      9    7.111111     3.10936   2.919621   12.06158
         idx |      9    1.869722    .4671044   1.071454   2.490025
        idxn |      9     4.18814    .1904042   4.061204   4.442013

Looking at the identical means and standard deviations for injhat, exp_idx, and exp_idxn, we see that it is possible to reproduce the default computations of predict for poisson estimations. We have also demonstrated the relationship between the count predictions and the linear predictions with and without exposure.

q
Multiple-equation (ME) estimation

If you have not read the above discussion on using predict after SE estimation, please do so. With the exception of the ability to select specific equations to predict from, the use of predict after ME models follows almost exactly the same form as it does for SE models.

Example

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of ME commands that do not have separate discussions on obtaining predictions would also be well-advised to read the predict section in [R] mlogit, even if their interest is not in multinomial logistic regression. As a general introduction to the ME models, we will demonstrate predict after sureg:

    . sureg (price foreign displ) (weight foreign length)
    Seemingly unrelated regression

    Equation      Obs  Parms      RMSE     "R-sq"       chi2        P
    price          74      2  2202.447     0.4348   45.20554   0.0000
    weight         74      2  245.5238     0.8988   658.8548   0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    price        |
         foreign |   3137.894   697.3805     4.50   0.000     1771.054    4504.735
    displacement |   23.06938   3.443212     6.70   0.000     16.32081    29.81795
           _cons |   680.8438   859.8142     0.79   0.428    -1004.361    2366.049
    -------------+----------------------------------------------------------------
    weight       |
         foreign |   -154.883    75.3204    -2.06   0.040    -302.5082   -7.257674
          length |   30.67594   1.531981    20.02   0.000     27.67331    33.67856
           _cons |  -2699.498   302.3912    -8.93   0.000    -3292.173   -2106.822
    ------------------------------------------------------------------------------

sureg estimated two equations, one called price and the other weight; see [R] sureg.

    . predict pred_p, equation(price)
    (option xb assumed; fitted values)
    . predict pred_w, equation(weight)
    (option xb assumed; fitted values)
    . summarize price pred_p weight pred_w

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
       price |     74    6165.257    2949.496       3291      15906
      pred_p |     74    6165.257    1678.805    2664.81   10485.33
      weight |     74    3019.459    777.1936       1760       4840
      pred_w |     74    3019.459    726.0468   1501.602   4447.996

You may specify the equation by name, as we did above, or by number: equation(#1) means the same thing as equation(price) in this case.
Methods and Formulas

Denote the previously estimated coefficient vector by b and its estimated variance matrix by V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let us write x_j for the jth observation currently in memory.

The predicted value (xb option) is defined as

$$ \widehat{y}_j = x_j b + \text{offset}_j $$

The standard error of the prediction (stdp) is defined as

$$ s_{pj} = \sqrt{x_j V x_j'} $$

The standard error of the difference in linear predictions between equations 1 and 2 is defined as

$$ s_{dpj} = \sqrt{(x_{1j},\, -x_{2j},\, 0,\, \ldots,\, 0)\; V\; (x_{1j},\, -x_{2j},\, 0,\, \ldots,\, 0)'} $$

See the individual estimation commands for the computation of command-specific predict statistics.
Also See

Related:    [R] regress, [R] regression diagnostics, [P] _predict
Background: [U] 23 Estimation and post-estimation commands
Title

probit -- Maximum-likelihood probit estimation

Syntax

probit depvar [indepvars] [weight] [if exp] [in range] [, level(#) noconstant
    robust cluster(varname) score(newvarname) asis offset(varname) nocoef
    maximize_options ]

dprobit [depvar indepvars [weight] [if exp] [in range]] [, at(matname) classic
    probit_options ]

by ... : may be used with probit and dprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

probit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { p | xb | stdp } rules asif nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

probit estimates a maximum-likelihood probit model.

dprobit estimates maximum-likelihood probit models and is an alternative to probit. Rather than reporting the coefficients, dprobit reports the change in the probability for an infinitesimal change in each independent, continuous variable and, by default, reports the discrete change in the probability for dummy variables. You may not specify the noconstant option with dprobit. probit may be typed without arguments after dprobit estimation to see the model in coefficient form.

If estimating on grouped data, see the bprobit command described in [R] glogit.

A number of auxiliary commands may be run after probit, logit, or logistic; see [R] logistic for a description of these commands. See [R] logistic for a list of related estimation commands.
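As a minimal sketch of how the two commands relate (using the automobile data that appear in the examples below):

    . probit foreign weight mpg
    . dprobit foreign weight mpg
    . probit

The first command reports coefficients in the Z metric, the second reports the marginal effects dF/dx, and the bare probit redisplays the dprobit model in coefficient form.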
Options

Options for probit

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nocoef specifies that the coefficient table is not to be displayed. This option is sometimes used by programmers but is of no use interactively.

noconstant suppresses the constant term (intercept) in the probit model. This option is not available for dprobit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis requests that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variables and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. Specifying this option forces retention of perfect predictor variables and their associated perfectly predicted observations.

offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for dprobit

at(matname) specifies the point around which the transformation of results is to be made. The default is to perform the transformation around x̄, the mean of the independent variables. If there are k independent variables, matname may be 1 x k or 1 x (k+1); that is, it may optionally include a final element 1 reflecting the constant. at() may be specified when the model is estimated or when results are redisplayed.

classic requests that the mean effects be calculated using the formula f(x̄b)b_i in all cases. If classic is not specified, f(x̄b)b_i is used for continuous variables, but the mean effects for dummy variables are calculated as Φ(x̄₁b) - Φ(x̄₀b). Here x̄₁ is x̄ but with element i set to 1, x̄₀ is x̄ but with element i set to 0, and x̄ is the mean of the independent variables or the vector specified by at(). classic may be specified at estimation time or when the results are redisplayed. Results calculated without classic may be redisplayed with classic, and vice versa.

probit_options are any of the options allowed by probit; see Options for probit, above.

Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model.

nooffset is relevant only if you specified offset(varname) for probit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
dependent
(left-
> Example You have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a probit model explaining whether a ca" is foreign based on its weight and mileage. Here is an overview of your data:
r I
_'_
'
!dec I
Contain_
!
size:
_
i
variabl_
i
make mpg
I
data
obs: v_rs:
from
°,3
_uto.dta
7_ 'I 1,9911
name
1978 7 Jul (99,7Z
stora_ type
of
memory
display format
'/,-18s _8.Og
weight
!
int
_8.0gc
foreign
!
byte
_,8.0g
Data
free)
value label
strl int
Aatomobile 2000 13:51
variable
label
Make and Model Mileage (mpg) Weight origin
Car
(Ibs.)
type
S_rted _y: foreign No_e: ! . inspect
dataset
las
changed
since
last
saved
foreign
foreign: Car type
Numberof Observations
i
Total
!
#
*
Negative
# #
'_
#
#
r
# 0 (2 !
|
_ique
NonIntegers
Integers -
Zero Positlve
52 22
52 22
Total
74
74
Missing 1
74
values
f_reign
is
la_eled
and
all
values
ar_
documented
in
the
label.
The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car.

The model you wish to estimate is

    Pr(foreign = 1) = Φ(β0 + β1 weight + β2 mpg)

where Φ is the cumulative normal distribution.

To estimate this model, you type

    . probit foreign weight mpg
    Iteration 0:  log likelihood =  -45.03321
    Iteration 1:  log likelihood = -29.244141
     (output omitted)
    Iteration 5:  log likelihood = -26.844189

    Probit estimates                          Number of obs   =         74
                                              LR chi2(2)      =      36.38
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.844189               Pseudo R2       =     0.4039

    ------------------------------------------------------------------------------
         foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0023355   .0005661    -4.13   0.000    -.0034451   -.0012259
             mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
           _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
    ------------------------------------------------------------------------------

You find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant. See [R] maximize for an explanation of the output.

q
Technical Note

Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus, if your dependent variable takes on the values 0 and 1, 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated as successes.

If you prefer a more formal mathematical statement, when you type probit y x, Stata estimates the model

    Pr(y_j ≠ 0 | x_j) = Φ(x_j b)

where Φ is the standard cumulative normal.
Robust standard errors

If you specify the robust option, probit reports robust standard errors as described in [U] 23.11 Obtaining robust variance estimates. In the case of the model of foreign on weight and mpg, the robust calculation increases the standard error of the coefficient on mpg by almost 15 percent:

    . probit foreign weight mpg, robust nolog

    Probit estimates                          Number of obs   =         74
                                              Wald chi2(2)    =      30.26
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.844189               Pseudo R2       =     0.4039

    ------------------------------------------------------------------------------
                 |             Robust
         foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0023355   .0004934    -4.73   0.000    -.0033025   -.0013686
             mpg |  -.1039503   .0593548    -1.75   0.080    -.2202836    .0123829
           _cons |   8.275464   2.539176     3.26   0.001      3.29877    13.25216
    ------------------------------------------------------------------------------

Without robust, the standard error for the coefficient on mpg was reported to be .052 with a resulting confidence interval of [-.21, -.00].

robust with the cluster() option has the ability to relax the independence assumption required by the probit estimator to being just independence between clusters. To demonstrate this, we will switch to a different dataset.

You are studying unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. For our purposes, we will use the variables age (the women were 14-26 in 1968, and your data thus span the age range of 16-46), grade (years of schooling completed, ranging from 0 to 18), not_smsa (28% of the person-time was spent living outside an SMSA, a standard metropolitan statistical area), south (41% of the person-time was in the South), and southXt (south interacted with year, treating 1970 as year 0). You also have the variable union. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union.
You estimate the following model, ignoring that the women are observed an average of 5.9 times each in these data:

    . probit union age grade not_smsa south southXt
    Iteration 0:  log likelihood =  -13864.23
    Iteration 1:  log likelihood = -13548.436
    Iteration 2:  log likelihood = -13547.308
    Iteration 3:  log likelihood = -13547.308

    Probit estimates                          Number of obs   =      26200
                                              LR chi2(5)      =     633.84
                                              Prob > chi2     =     0.0000
    Log likelihood = -13547.308               Pseudo R2       =     0.0229

    ------------------------------------------------------------------------------
           union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0059461   .0015798     3.76   0.000     .0028496    .0090425
           grade |   .0263901   .0036651     7.20   0.000     .0192066    .0335735
        not_smsa |  -.1303911   .0202523    -6.44   0.000    -.1700848   -.0906975
           south |  -.4027254    .033989   -11.85   0.000    -.4693426   -.3361081
         southXt |   .0033088   .0029253     1.13   0.258    -.0024247    .0090423
           _cons |  -1.113091   .0657808   -16.92   0.000    -1.242019   -.9841628
    ------------------------------------------------------------------------------
The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, you find a large southern effect against unionization and little time trend. The robust and cluster() options provide a way to estimate this model and obtain correct standard errors:

    . probit union age grade not_smsa south southXt, robust cluster(id)
    Iteration 0:  log likelihood =  -13864.23
    Iteration 1:  log likelihood = -13548.436
    Iteration 2:  log likelihood = -13547.308
    Iteration 3:  log likelihood = -13547.308

    Probit estimates                          Number of obs   =      26200
                                              Wald chi2(5)    =     165.75
                                              Prob > chi2     =     0.0000
    Log likelihood = -13547.308               Pseudo R2       =     0.0229

                              (standard errors adjusted for clustering on idcode)
    ------------------------------------------------------------------------------
                 |             Robust
           union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0059461   .0023567     2.52   0.012      .001327    .0105651
           grade |   .0263901   .0078378     3.37   0.001     .0110282    .0417518
        not_smsa |  -.1303911   .0404109    -3.23   0.001     -.209595   -.0511873
           south |  -.4027254   .0514458    -7.83   0.000    -.5035573   -.3018935
         southXt |   .0033088   .0039793     0.83   0.406    -.0044904    .0111081
           _cons |  -1.113091   .1188478    -9.37   0.000    -1.346028   -.8801534
    ------------------------------------------------------------------------------
These standard errors are roughly 50% larger than those reported by the inappropriate conventional calculation. By comparison, another model we could estimate is an equal-correlation population-averaged probit model:

    . xtprobit union age grade not_smsa south southXt, i(id) pa
    Iteration 1: tolerance = .04796083
    Iteration 2: tolerance = .00352657
    Iteration 3: tolerance = .00017886
    Iteration 4: tolerance = 8.654e-06
    Iteration 5: tolerance = 4.150e-07
probit -- Maximum-likelihood probit estimation GEE population-averaged Group variable: Link: Family: Correlation:
model
Scale parameter: .
,
Number of obs Number of groups Obs per group: min avg max Wald chi2(5) Prob > chi2
idcode probit binomial exchangeable 1
union
Coef.
age grade not_smsa south southXt _cons
.0031597 .0329992 -.0721799 -.409029 .0081828 -I.184799
Std. Err, .0014678 .0062334 .0275189 .0372213 .002545 .0890117
z 2.15 5.29 -2.62 -10.99 3.22 -13.31
P>IzI 0.031 0.000 0,009 0.000 0.001 0.000
= = = = = = =
26200 4434 1 5.9 12 241.66 0.0000
[95_ Conf. Interval] .0002829 .020782 -.1261159 -.4819815 .0031946 -1.359259
.0060366 .0452163 -.0182439 -.3360765 .0131709 -1.01034
The coefficient estimates are similar, but these standard errors are smaller than those produced by probit, robust cluster(). This is as we would expect. If the equal-correlation assumption is valid, the population-averaged probit estimator above should be more efficient.

Is the assumption valid? That is a difficult question to answer. The population-averaged estimates correspond to an assumption of exchangeable correlation within person. It would not be unreasonable to assume an AR(1) correlation within person, or to assume that the observations are correlated but that we do not wish to impose any structure. See [R] xtgee for full details.

What is important to understand is that probit, robust cluster() is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation.
dprobit

A probit model is defined

    Pr(y_j ≠ 0 | x_j) = Φ(x_j b)

where Φ is the standard cumulative normal distribution and x_j b is called the probit score or index.

Since x_j b has a normal distribution, interpreting probit coefficients requires thinking in the Z (normal quantile) metric. For instance, pretend we estimated the probit equation

    Pr(y_j ≠ 0) = Φ(.08233 x1 + 1.529 x2 - 3.139)

The interpretation of the x1 coefficient is that each one-unit increase in x1 leads to increasing the probit index by .08233 standard deviations. Learning to think in the Z metric takes practice and, even if you do, communicating results to others who have not learned to think this way is difficult.

A transformation of the results helps some people think about them. The change in the probability somehow feels more natural, but how big that change is depends on where we start. Why not choose as a starting point the mean of the data? If x̄1 = 21.29 and x̄2 = .42, then we would report something like .0257, meaning the change in the probability calculated at the mean. We could make the calculation as follows. The mean normal index is .08233 × 21.29 + 1.529 × .42 - 3.139 = -.7440, and the corresponding probability is Φ(-.7440) = .2284. Adding our coefficient of .08233 to the index and recalculating the probability, we obtain Φ(-.7440 + .08233) = .2541. Thus, the change in the probability is .2541 - .2284 = .0257.
r
probit-- Maximum-likelihood probitestimation
587
In practice, people make this calculation somewhat differently and produce a slightly different number. Rather than make the calculation for a one-unit change in x, they calculate the slope of the probability function. Doing a little calculus, they derive that the change in the probability for a change in x1 is

$$ \frac{\partial p}{\partial x_1} = \phi(\bar{x}b)\, b_1 $$

the height of the normal density multiplied by the x1 coefficient. Going through this calculation, they obtain .0249. The difference between .0257 and .0249 is not much; they differ because .0257 is the exact answer for a one-unit increase in x1, whereas .0249 is the answer for an infinitesimal change, extrapolated out.
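Both numbers can be verified by hand; the following is a minimal sketch using Stata's norm() and normden() functions with the index values from the text:

    * the mean normal index from the example above
    . scalar xb = .08233*21.29 + 1.529*.42 - 3.139
    * exact one-unit change in probability: .0257
    . display norm(xb + .08233) - norm(xb)
    * infinitesimal (slope) answer: .0249
    . display normden(xb)*.08233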
dprobit with the classic option transforms results as an infinitesimal change extrapolated out.

Example

Consider the automobile data again:
    . use auto, clear
    (1978 Automobile Data)
    . gen goodplus = rep78>=4 if rep78~=.
    (5 missing values generated)
    . dprobit foreign mpg goodplus, classic
    Iteration 0:  log likelihood = -42.400729
    Iteration 1:  log likelihood = -27.643138
    Iteration 2:  log likelihood = -26.953126
    Iteration 3:  log likelihood = -26.942119
    Iteration 4:  log likelihood = -26.942114

    Probit estimates                          Number of obs   =         69
                                              LR chi2(2)      =      30.92
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.942114               Pseudo R2       =     0.3646

    ------------------------------------------------------------------------------
     foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
    ---------+--------------------------------------------------------------------
         mpg |   .0249187   .0110853     2.30   0.022   21.2899   .003192  .046646
    goodplus |     .46276   .1187437     3.81   0.000    .42029   .230027  .695493
       _cons |  -.9499603   .2281006    -3.82   0.000         1  -1.39703  -.502891
    ---------+--------------------------------------------------------------------
      obs. P |   .3043478
     pred. P |   .2286624  (at x-bar)
    ------------------------------------------------------------------------------
        z and P>|z| are the test of the underlying coefficient being 0
After estimation with dprobit, the untransformed coefficient results can be seen by typing probit without options:

. probit

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

  foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      mpg |   .0823331   .0358292    2.30   0.022      .0121091    .152557
 goodplus |   1.528992   .4010866    3.81   0.000       .742877   2.315108
    _cons |  -3.138737   .8209689   -3.82   0.000     -4.747807  -1.529668
W,,..L,,t
--
r_ux.llU.l-,iKellnOOO
esUmat|on
proDIt
There is one case in which one can argue that the classic, infinitesimal-change-based adjustment could be improved upon, and that is the case of a dummy variable. A dummy variable is a variable that takes on the values 0 and 1 only; 1 indicates that something is true and 0 that it is not. goodplus is such a variable. It is natural to summarize its effect by asking how much goodplus being true changes the outcome probability over that of goodplus being false. That is, "at the means", the predicted probability of foreign for a car with goodplus = 0 is Phi(.08233 x 21.29 - 3.139) = .0829. For the same car with goodplus = 1, the probability is Phi(.08233 x 21.29 + 1.529 - 3.139) = .5569. The difference is thus .5569 - .0829 = .4740. When we do not specify the classic option, dprobit makes the calculation for dummy variables in this way. Even though we estimated the model with the classic option, we can redisplay results with it omitted:
. dprobit

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

   foreign |     dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
-----------+-------------------------------------------------------------------
       mpg |  .0249187    .0110853    2.30   0.022   21.2899   .003192  .046646
 goodplus* |  .4740077    .1114816    3.81   0.000    .42029   .255508  .692508
-----------+-------------------------------------------------------------------
    obs. P |  .3043478
   pred. P |  .2286624  (at x-bar)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
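The discrete-change figure for goodplus can be verified by hand from the underlying coefficients; a sketch using normprob() and the estimates shown above:

. display normprob(.08233*21.2899 + 1.529 - 3.139) - normprob(.08233*21.2899 - 3.139)

which displays approximately .4740, agreeing with the dF/dx entry for goodplus.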
Technical Note

at(matname) allows you to evaluate effects at points other than the means. Let's obtain the effects for the above model at mpg = 20 and goodplus = 1:

. matrix myx = (20,1)

. dprobit, at(myx)

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

   foreign |     dF/dx   Std. Err.      z    P>|z|        x   [    95% C.I.   ]
-----------+-------------------------------------------------------------------
       mpg |  .0328237    .0144157    2.30   0.022       20    .004569  .061078
 goodplus* |  .4468843    .1130835    3.81   0.000        1    .225245  .668524
-----------+-------------------------------------------------------------------
    obs. P |  .3043478
   pred. P |  .2286624  (at x-bar)
   pred. P |  .5147238  (at x)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
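The predicted probability "at x" can likewise be checked against the underlying coefficients; a sketch:

. display normprob(.08233*20 + 1.529 - 3.139)

which displays roughly .5147 (any small discrepancy reflects rounding of the displayed coefficients).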
Model identification

The probit command has one more feature, and it is probably the most useful. It will automatically check the model for identification and, if the model is underidentified, drop whatever variables and observations are necessary for estimation to proceed.
> Example

Have you ever estimated a probit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following small amount of data:

    Outcome y    Independent variable x
        0                   1
        0                   1
        0                   1
        1                   0

Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which in turn means that the probit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem.

Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that, at each step, making the coefficient a little bigger or a little smaller works wonders. It continues on its merry way until either (1) the whole thing comes crashing to the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meanwhile, you have been waiting. In addition, the estimates that you finally receive, if any, may be nothing more than numerical roundoff.
Stata watches for these sorts of problems, alerts you, fixes them, and then properly estimates the model.
Let's return to our automobile data. Among the variables in the data is one called repair that takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data:
             |            repair
    Car type |      1       2       3 |   Total
-------------+------------------------+--------
    Domestic |     10      27       9 |      46
     Foreign |      0       3       9 |      12
-------------+------------------------+--------
       Total |     10      30      18 |      58
Notice that all the cars with poor repair records (repair==1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair==1 category would have to be zero. This in turn means that the probit coefficient must be minus infinity, and that would set most computer programs buzzing.
Let's try Stata on this problem. First, we make up two new variables, rep_is_1 and rep_is_2, that indicate the repair category:

. generate rep_is_1 = repair==1

. generate rep_is_2 = repair==2
The statement generate rep_is_1 = repair==1 creates a new variable, rep_is_1, that takes on the value 1 when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_is_2, which takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate our model:

. probit foreign rep_is_1 rep_is_2
note: rep_is_1~=0 predicts failure perfectly
      rep_is_1 dropped and 10 obs not used

Iteration 0:  log likelihood = -26.992087
Iteration 1:  log likelihood = -22.276479
Iteration 2:  log likelihood = -22.229184
Iteration 3:  log likelihood = -22.229138

Probit estimates                                  Number of obs =         48
                                                  LR chi2(1)    =       9.53
                                                  Prob > chi2   =     0.0020
Log likelihood = -22.229138                       Pseudo R2     =     0.1765

   foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+-----------------------------------------------------------------
  rep_is_2 |  -1.281552    .4297324   -2.98   0.003     -2.123812  -.4392916
     _cons |   1.21e-16     .295409    0.00   1.000      -.578991    .578991
Remember that all the cars with poor repair records (rep_is_1) are domestic, so the model cannot be estimated, or at least it cannot be estimated if we restrict ourselves to finite coefficients. Stata noted that fact. It said, "note: rep_is_1~=0 predicts failure perfectly". This is Stata's mathematically precise way of saying what we said in English: when rep_is_1 is not equal to 0, the car is domestic. Stata then went on to say, "rep_is_1 dropped and 10 obs not used". This is Stata eliminating the problem. First, the variable rep_is_1 had to be removed from the model because it would have an infinite coefficient. Then, the 10 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated.
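The empty cell that causes the problem can be seen directly with tabulate before estimating; a sketch (the counts follow from the repair tabulation given earlier):

. tabulate rep_is_1 foreign

 rep_is_1 |  Domestic    Foreign |     Total
----------+----------------------+----------
        0 |        36         12 |        48
        1 |        10          0 |        10
----------+----------------------+----------
    Total |        46         12 |        58

The rep_is_1==1 row contains no foreign cars, which is exactly the condition probit reported.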
Technical Note

Stata is pretty smart about catching these problems. It will catch "one-way causation by a dummy variable", as we demonstrated above.
Stata also watches for "two-way causation", that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated.

Stata also checks your data for collinear variables; it will say "so-and-so dropped due to collinearity". No observations need to be eliminated in this case, and model estimation will proceed without the offending variable.

It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included age in our model, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
probit (and logit and logistic) will also occasionally display messages such as

    note: 4 failures and 0 successes completely determined.

The cause of this message and what to do if you get it are described in [R] logit.
Obtaining predicted values

Once you have estimated a probit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination x_j b, where x_j are the independent variables in the jth observation and b is the estimated parameter vector. This is known as the index function, since the cumulative density indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data.
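For instance, to obtain the index and then convert it to a probability yourself (a sketch; the new variable names idx and phat are hypothetical):

. predict idx, xb

. generate phat = normprob(idx)

phat will equal, observation by observation, the probability that predict calculates without arguments.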
One can calculate the unadjusted-for-replicated-covariate-patterns diagonal elements of the hat matrix, or leverage, by typing

. predict pred

. predict stdp, stdp

. generate hat = stdp^2*pred*(1-pred)
> Example

In the previous example, we estimated the probit model probit foreign rep_is_1 rep_is_2. To obtain predicted probabilities, we type

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)

. summarize foreign p

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, probit dropped rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules

. summarize foreign p p2
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5
      p2 |      58    .2068966    .2016268         0         .5
predict's asif option will ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif

. summarize foreign p p2 p3

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5
      p2 |      58    .2068966    .2016268         0         .5
      p3 |      58    .2931034    .2016268        .1         .5
Which is right? By default, predict uses the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is correct only if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
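A quick way to see which observations are affected (a sketch): the 10 rule-excluded cars are exactly those with a missing default prediction but a nonmissing rules prediction,

. count if p==. & p2~=.
10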
Performing hypothesis tests

After estimation with probit, you can perform hypothesis tests using the test or testnl commands; see [U] 23 Estimation and post-estimation commands.
Saved Results

probit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         probit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
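These saved results can be used directly in later calculations. For instance, after any probit estimation (a sketch, reusing the model from the earlier example):

. quietly probit foreign mpg goodplus

. display "pseudo R-squared = " e(r2_p)

. matrix list e(b)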
dprobit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(pbar)        fraction of successes observed in data
    e(xbar)        average probit score
    e(offbar)      average offset

Macros
    e(cmd)         dprobit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict
    e(dummy)       string of blank-separated 0s and 1s; 0 means the corresponding
                   independent variable is not a dummy, 1 means that it is

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(dfdx)        marginal effects
    e(se_dfdx)     standard errors of the marginal effects

Functions
    e(sample)      marks estimation sample
Methods and Formulas

Probit analysis originated in connection with bioassay, and the word probit, a contraction of "probability unit", was suggested by Bliss (1934). For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), Long (1997), or Powers and Xie (2000).
The log-likelihood function for probit is

    \ln L = \sum_{j \in S} w_j \ln \Phi(x_j b) + \sum_{j \notin S} w_j \ln\{1 - \Phi(x_j b)\}

where Phi is the cumulative standard normal distribution, S is the set of observations with a positive outcome, and w_j denotes the optional weights. ln L is maximized as described in [R] maximize.

If robust standard errors are requested, the calculation described in Methods and Formulas of [R] regress is carried forward with u_j = \{\phi(x_j b)/\Phi(x_j b)\} x_j for the positive outcomes and -[\phi(x_j b)/\{1 - \Phi(x_j b)\}] x_j for the negative outcomes, where phi is the normal density. q_c is given by its asymptotic-like formula.
Turning to dprobit, which is implemented as an ado-file, let b and V denote the coefficients and variance matrix calculated by probit, and let b_i refer to the ith element of b. For continuous variables, or for all variables if classic is specified, dprobit reports

    b_i^* = \phi(\bar{x} b)\, b_i

The corresponding variance matrix is D V D', where D = \phi(\bar{x} b)\{I - (\bar{x} b)\, b \bar{x}\}.
For dummy variables taking on values 0 and 1, when classic is not specified, dprobit makes the discrete calculation associated with the dummy changing from 0 to 1:

    b_i^* = \Phi(\bar{x}_1 b) - \Phi(\bar{x}_0 b)

where \bar{x}_0 = \bar{x}_1 = \bar{x}, except that the ith elements of \bar{x}_0 and \bar{x}_1 are set to 0 and 1, respectively. The variance of b_i^* is given by d V d', where d = \phi(\bar{x}_1 b)\,\bar{x}_1 - \phi(\bar{x}_0 b)\,\bar{x}_0. Note that in all cases, dprobit reports test statistics z_i based on the underlying coefficients b_i.
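The dummy-variable calculation can be replicated from probit's coefficients; a sketch for the goodplus effect in the earlier model (variable names from that example):

. quietly probit foreign mpg goodplus

. quietly summarize mpg if e(sample)

. display normprob(_b[mpg]*r(mean) + _b[goodplus] + _b[_cons]) - normprob(_b[mpg]*r(mean) + _b[_cons])

which reproduces the .4740077 reported by dprobit.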
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357-365.

Bliss, C. I. 1934. The method of probits. Science 79: 38-39, 409-410.

Hamilton, L. C. 1992. Regression with Graphics. Pacific Grove, CA: Brooks/Cole Publishing Company.

Hilbe, J. 1996. sg54: Extended probit regression. Stata Technical Bulletin 32: 20-21. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 131-132.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Powers, D. A. and Y. Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego, CA: Academic Press.
Also See

Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] roc, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] biprobit, [R] clogit, [R] cusum, [R] glm, [R] glogit, [R] hetprob,
                [R] logistic, [R] logit, [R] scobit, [R] svy estimators, [R] xtclog,
                [R] xtgee, [R] xtlogit, [R] xtprobit

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
l
Title priest--iOneI ! . llll !
,
i
a
tw -sTple
i
testsi uof proportions Ii i lll
I
i
-
"
1
Syntax

    prtest varname = # [if exp] [in range] [, level(#)]

    prtest varname1 = varname2 [if exp] [in range] [, level(#)]

    prtest varname [if exp] [in range], by(groupvar) [level(#)]

    prtesti #obs1 #p1 #p2 [, level(#) count]

    prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]

by ... : may be used with prtest (but not prtesti); see [R] by.
Description

prtest performs tests of the equality of proportions using large-sample statistics. In the first form, prtest tests that varname has a proportion of #. In the second form, prtest tests that varname1 and varname2 have the same proportion. In the third form, prtest tests that varname has the same proportion within the two groups defined by groupvar.

prtesti is the immediate form of prtest; see [U] 22 Immediate commands.

The bitest command is a better version of the first form of prtest in that it gives exact p-values. Researchers are advised to use bitest when possible, especially for small samples; see [R] bitest.
Options

by(groupvar) specifies a numeric variable that contains the group information for a given observation. This variable must have only two values. Do not confuse the by() option with the by ... : prefix; both may be specified.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

count specifies that integer counts instead of proportions are being used in the immediate forms of prtest. In the first syntax, prtesti expects #obs1 and #p1 to be counts, #p1 <= #obs1, and expects #p2 to be a proportion. In the second syntax, prtesti expects all four numbers to be integer counts, with #obs1 >= #p1 and #obs2 >= #p2.
Remarks

The prtest output follows the output of ttest in providing a lot of information. Each proportion is presented along with a confidence interval. The appropriate one- or two-sample test is performed, and the two-sided and both one-sided results are included at the bottom of the output. In the case of a two-sample test, the calculated difference is also presented with its confidence interval. This command may be used for both large-sample testing and large-sample interval estimation.
> Example

In the first form, prtest tests whether the proportion in the sample is equal to a known constant. Assume you have a sample of 74 automobiles. You wish to test whether the proportion of automobiles that are foreign is different from 40 percent.

. prtest foreign=.4

One-sample test of proportion               foreign: Number of obs =       74

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
 foreign |  .2972973    .0531331   5.59533   0.0000      .1931583   .4014363

Ho: proportion(foreign) = .4

 Ha: foreign < .4           Ha: foreign ~= .4           Ha: foreign > .4
     z = -1.803                 z = -1.803                  z = -1.803
 P < z = 0.0357             P > |z| = 0.0713            P > z = 0.9643

The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is .40 at the 5% significance level.
> Example

We have two headache remedies that we give to patients. Each remedy's effect is recorded as 0 for failing to relieve the headache and 1 for relieving it. We wish to test the equality of the proportion of people relieved by the two treatments.

. prtest cure1=cure2

Two-sample test of proportion                 cure1: Number of obs =       50
                                              cure2: Number of obs =       59

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   cure1 |       .52    .0706541    7.3598   0.0000      .3815205   .6584795
   cure2 |  .7118644    .0589618   12.0733   0.0000      .5963013   .8274275
---------+--------------------------------------------------------------------
    diff | -.1918644    .0920245                         -.372229  -.0114998
         | under Ho:    .0931155   -2.0605   0.0394

Ho: proportion(cure1) - proportion(cure2) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.060                 z = -2.060                  z = -2.060
 P < z = 0.0197             P > |z| = 0.0394            P > z = 0.9803

You find that the proportions are statistically different from each other at any level greater than 3.9%.
Immediate form
> Example

prtesti is like prtest except that you specify summary statistics rather than variables as arguments. For instance, you are reading an article which reports the proportion of registered voters among 50 randomly selected eligible voters as .52. You wish to test whether the proportion is .7:

. prtesti 50 .52 .70

One-sample test of proportion                     x: Number of obs =       50

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |       .52    .0706541    7.3598   0.0000      .3815205   .6584795

Ho: proportion(x) = .7

 Ha: x < .7                 Ha: x ~= .7                 Ha: x > .7
     z = -2.777                 z = -2.777                  z = -2.777
 P < z = 0.0027             P > |z| = 0.0055            P > z = 0.9973
> Example

In order to judge teacher effectiveness, we wish to test whether the same proportion of people from two classes will answer an advanced question correctly. In the first classroom of 30 students, 40% answered the question correctly, whereas in the second classroom of 45 students, 67% answered the question correctly:

. prtesti 30 .4 45 .67

Two-sample test of proportion                     x: Number of obs =       30
                                                  y: Number of obs =       45

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |        .4    .0894427   4.47214   0.0000      .2246955   .5753045
       y |       .67    .0700952   9.55843   0.0000       .532616    .807384
---------+--------------------------------------------------------------------
    diff |      -.27    .1136368                        -.4927241  -.0472759
         | under Ho:    .1169416  -2.30885   0.0210

Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.309                 z = -2.309                  z = -2.309
 P < z = 0.0105             P > |z| = 0.0210            P > z = 0.9895
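The displayed z statistic can be reproduced by hand from the pooled proportion; a sketch:

. display (30*.4 + 45*.67)/(30 + 45)
.562

. display (.4 - .67)/sqrt(.562*(1-.562)*(1/30 + 1/45))

which displays approximately -2.309, the value shown in the output.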
Saved Results

prtest and prtesti save in r():

Scalars
    r(z)      z statistic
    r(P_#)    proportion for variable #
    r(N_#)    number of observations for variable #
Methods and Formulas

prtest and prtesti are implemented as ado-files.

A large-sample (1 - \alpha)100% confidence interval for a proportion p is

    \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}

and a (1 - \alpha)100% confidence interval for the difference of two proportions is given by

    (\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}

where \hat{q} = 1 - \hat{p} and z is calculated from the inverse normal distribution.

The one-tailed and two-tailed tests of a population proportion use a normally distributed test statistic calculated as

    z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}

where p_0 is the hypothesized proportion. A test of the difference of two proportions also uses a normally distributed test statistic, calculated as

    z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p \hat{q}_p (1/n_1 + 1/n_2)}}

where

    \hat{p}_p = \frac{x_1 + x_2}{n_1 + n_2}

and x_1 and x_2 are the total number of successes in the two populations.
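For instance, the one-sample test of the foreign proportion shown earlier can be reproduced by hand; a sketch:

. display (.2972973 - .4)/sqrt(.4*.6/74)

which displays approximately -1.803, matching the reported z.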
Also See

Related:      [R] bitest, [R] ci, [R] hotel, [R] oneway, [R] sdtest, [R] signrank, [R] ttest

Background:   [U] 22 Immediate commands