Stata Reference Manual: Release 7
Volume 2, H-P

Stata Press
College Station, Texas

Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Copyright (c) 1985-2001 by Stata Corporation
All rights reserved
Version 7.0
Typeset in TeX
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 1-881228-47-9 (volumes 1-4)
ISBN 1-881228-48-7 (volume 1)
ISBN 1-881228-49-5 (volume 2)
ISBN 1-881228-50-9 (volume 3)
ISBN 1-881228-51-7 (volume 4)
This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means--electronic, mechanical, photocopying, recording, or otherwise--without the prior written permission of Stata Corporation (StataCorp).

StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice.

The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright (c) 1979, 1993 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979, April 1993.

The Stata for Windows installation software was produced using Wise Installation System, which is Copyright (c) 1994-2000 Wise Solutions, Inc. Portions of the Macintosh installation software are Copyright (c) 1990-2000 Aladdin Systems, Inc., and Raymond Lau.

Stata is a registered trademark and NetCourse is a trademark of Stata Corporation. Alpha and DEC are trademarks of Compaq Computer Corporation. AT&T is a registered trademark of American Telephone and Telegraph Company. HP-UX and HP LaserJet are registered trademarks of Hewlett-Packard Company. IBM and OS/2 are registered trademarks and AIX, PowerPC, and RISC System/6000 are trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. Macintosh is a registered trademark and Power Macintosh is a trademark of Apple Computer, Inc. MS-DOS, Microsoft, and Windows are registered trademarks of Microsoft Corporation. Pentium is a trademark of Intel Corporation. PostScript and Display PostScript are registered trademarks of Adobe Systems, Inc. SPARC is a registered trademark of SPARC International, Inc. Stat/Transfer is a trademark of Circle Systems. Sun, SunOS, Sunview, Solaris, and NFS are trademarks or registered trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. UNIX and OPEN LOOK are registered trademarks and X Window System is a trademark of The Open Group Limited. WordPerfect is a registered trademark of Corel Corporation.
The suggested citation for this software is

StataCorp. 2001. Stata Statistical Software: Release 7.0. College Station, TX: Stata Corporation.
Title

hadimvo -- Identify multivariate outliers
Syntax

    hadimvo varlist [if exp] [in range], generate(newvar1 [newvar2]) [p(#)]
Description

hadimvo identifies multiple outliers in multivariate data using the method of Hadi (1992, 1994), creating newvar1 equal to 1 if an observation is an "outlier" and 0 otherwise. Optionally, newvar2 can also be created containing the distances from the basic subset.
Options

generate(newvar1 [newvar2]) is not optional; it identifies the new variable(s) to be created. Whether you specify two variables or one, however, is optional. newvar1--which is required--will contain 1 if the observation is an outlier in the Hadi sense and 0 otherwise. Specifying gen(odd) would call this variable odd. newvar2, if specified, will contain the distances (not the distances squared) from the basic subset. Specifying gen(odd dist) creates odd and also creates dist containing the Hadi distances.

p(#) specifies the significance level for the outlier cutoff; 0 < # < 1. The default is p(.05). Larger numbers identify a larger proportion of the sample as outliers. If # is specified greater than 1, it is interpreted as a percent. Thus, p(5) is the same as p(.05).
Remarks

Multivariate analysis techniques are commonly used to analyze data from many fields of study. The data often contain outliers. The search for subsets of the data which, if deleted, would change results markedly is known as the search for outliers. hadimvo provides one computer-intensive but practical method for identifying such observations.

Classical outlier detection methods (e.g., Mahalanobis distance and Wilks' test) are powerful when the data contain only one outlier, but the power of these methods decreases drastically when more than one outlying observation is present. The loss of power is usually due to what are known as masking and swamping problems (false negative and false positive decisions), but in addition, these methods often fail simply because they are affected by the very observations they are supposed to identify.

Solutions to these problems often involve an unreasonable amount of calculation and therefore computer time. (Solutions involving hundreds of millions of calculations for samples as small as 30 have been suggested.) The method developed by Hadi (1992, 1994) attempts to surmount these problems and produce an answer, albeit second best, in finite time.

A basic outline of the procedure is as follows: A measure of distance from an observation to a cluster of points is defined. A base cluster of r points is selected and then that cluster is continually redefined by taking the r + 1 points "closest" to the cluster as the new base cluster. This continues until some rule stops the redefinition of the cluster.
Ignoring many of the fine details, given k variables, the initial base cluster is defined as r = k + 1 points. The distance that is minimized in selecting these k + 1 points is a covariance-matrix distance on the variables with their medians removed. (We will use the language loosely; if we were being more precise, we would have said the distance is based on a matrix of second moments, but remember, the medians of the variables have been removed. We would also discuss how the k + 1 points must be of full column rank and how they would be expanded to include additional points if they are not.) Given the base cluster, a more standard mean-based center of the r-observation cluster is defined and the r + 1 observations closest in the covariance-matrix sense are chosen as a new base cluster. This is then repeated until the base cluster has r = int{(n + k + 1)/2} points. At that point, the method continues in much the same way, except a stopping rule based on the distance of the additional point and the user-specified p() is introduced. Simulation results are presented in Hadi (1994).
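In symbols, the distance being minimized is the usual Mahalanobis-type distance computed from the basic subset (a sketch in our notation, with Hadi's small-sample correction factors omitted): for a basic subset b with mean vector $\bar{X}_b$ and covariance matrix $S_b$,

$$D_i = \sqrt{(X_i - \bar{X}_b)'\, S_b^{-1}\, (X_i - \bar{X}_b)}$$

Observations are ranked by $D_i$ at each step, and the p() option enters through the cutoff against which the distances of candidate points are compared.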
Examples

    . hadimvo price weight, gen(odd)            /* id the outliers */
    . list make price weight if odd
    . summ price weight if ~odd                 /* summary stats for clean data */
    . drop odd
    . hadimvo price weight, gen(odd D)          /* make D */
    . gen id=_n                                 /* make an index variable */
    . graph D id                                /* index plot of D */
    . graph price weight [w=D]                  /* 2-way scatter, outliers big */
    . graph price weight [w=1/D]                /* same, outliers small */
    . summarize D, detail
    . sort D
    . list make D odd
    . hadimvo price weight mpg, gen(odd2 D2) p(.01)
    . regress price weight ... if ~odd2
Identifying outliers

You have a theory about x1, x2, ..., xk, which we will write as F(x1, x2, ..., xk). Your theory might be that x1, x2, ..., xk are jointly distributed normally, perhaps with a particular mean and covariance matrix; or your theory might be that

$$x_1 = \beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \qquad u \sim N(0, \sigma^2)$$

or your theory might be

$$x_1 = \beta_{10} + \beta_{12} x_2 + \beta_{14} x_4 + u_1$$
$$x_2 = \beta_{20} + \beta_{21} x_1 + \beta_{23} x_3 + u_2$$

or your theory might be anything else; it does not matter. You have some data on x1, x2, ..., xk, which you will assume are generated by F(.), and from that data you plan to estimate the parameters (if any) of your theory and then test your theory in the sense of how well it explains the observed data.
What if, however, some of your data are generated not by F(.) but by G(.), a different process? For example, you have a theory on how wages are assigned to employees in a firm and, for the bulk of employees, that theory is correct. There are, however, six employees at the top of the hierarchy for whom wages are set by a completely different process. Or, you have a theory on how individuals select different health insurance options except that, for a handful of individuals already diagnosed with serious illness, a different process controls the selection. Or, you are testing a drug that reduces trauma after surgery except that, for a few patients with a high level of a particular protein, the drug has no effect. Or, in another drug experiment, some of the historical data are simply misrecorded.

The data values generated by G(.) rather than F(.) are called contaminant observations. Of course, the analysis should be based only on the observations generated by F(.), but in practice we do not know which observations those are. In addition, if it happened by chance that some of the contaminants are within a reasonable distance from the center of F(.), it becomes impossible to determine whether they are contaminants. Accordingly, we adopt the following operational definition: Outliers are observations that do not conform to the pattern suggested by the majority of the observations in a dataset. Thus, observations generated by F(.) but located at the tail of F(.) are considered outliers. On the other hand, contaminants that are within a statistically reasonable distance from the center of F(.) are not considered outliers.

It is well worth noting that outliership is strongly related to the completeness of the theory--a grand unified theory would have no outliers because it would explain all processes (including, one supposes, errors in recording the data). Grand unified theories, however, are difficult to come by and are most often developed by synthesizing the results of many special theories.

Theoretical work has tended to focus on one special case: data containing only one outlier. As mentioned above, the single-outlier techniques often fail to identify multiple outliers, even if applied recursively. One of the classic examples is the star cluster data (a.k.a. the Hertzsprung-Russell diagram) shown in the figure below (Rousseeuw and Leroy 1987, 27). For 47 stars, the data contain the (log) light intensity and the (log) effective temperature at the star's surface. (For the sake of illustration, we treat the data here as bivariate data--not as regression data--i.e., the two variables are treated similarly with no distinction between which variable is dependent and which is independent.)

This graph presents a scatter of the data along with two ellipses expected to contain 95% of the data. The larger ellipse is based on the mean and covariance matrix of the full data. All 47 stars are inside the larger ellipse, indicating that classical single-case analysis fails to identify any outliers. The smaller ellipse is based on the mean and covariance matrix of the data without the five stars identified by hadimvo as outliers. These observations are located outside the smaller ellipse. The dramatic effects of the outliers can be seen by comparing the two ellipses. The volume of the larger ellipse is much greater than that of the smaller one, and the two ellipses have completely different orientations. In fact, their major axes are nearly orthogonal to each other; the larger ellipse indicates a negative correlation (r = -0.2) whereas the smaller ellipse indicates a positive correlation (r = 0.7). (Theory would suggest a positive correlation: hot things glow.)
(Figure omitted: scatter of log light intensity versus log effective temperature for the 47 stars, with the two 95% ellipses described above.)
The single-outlier techniques make calculations for each observation under the assumption that it is the only outlier and the remaining n - 1 observations are generated by F(.), producing a statistic for each of the n observations. Thinking about multiple outliers is no more difficult. In the case of two outliers, consider all possible pairs of observations (there are n(n-1)/2 of them) and, for each pair, make a calculation assuming the remaining n - 2 observations are generated by F(.). For the three-outlier case, consider all possible triples of observations (there are n(n-1)(n-2)/(3 x 2) of them) and, for each triple, make a calculation assuming the remaining n - 3 observations are generated by F(.). Conceptually, this is easy; practically, it is difficult because of the rapidly increasing number of calculations required (there are also theoretical problems in determining how many outliers to test simultaneously). Techniques designed for detecting multiple outliers, therefore, make various simplifying assumptions to reduce the calculation burden and, along the way, lose some of the theoretical foundation. This loss, however, is no reason for ignoring the problem and the (admittedly second best) solutions available today. It is unreasonable to assume that outliers do not occur in real data. If outliers exist in the data, they can distort parameter estimation, invalidate test statistics, and lead to incorrect statistical inference. The search for outliers is not merely to improve the estimates of the current model but also to provide valuable insight into the shortcomings of the current model. In addition, outliers themselves can sometimes provide valuable clues as to where more effort should be expended. In a drug experiment, for example, the patients excluded as outliers might well be further researched to understand why they do not fit the theory.
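To make the growth in the calculation burden concrete, the number of cases to examine is a binomial coefficient:

$$\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{n}{3} = \frac{n(n-1)(n-2)}{6}$$

so, for example, n = 100 observations yield 4,950 pairs, 161,700 triples, and 3,921,225 quadruples.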
Multivariate, multiple outliers

hadimvo is an example of a multivariate, multiple outlier technique. The multivariate aspect deserves some attention. In the single-equation regression techniques for identifying outliers, such as residuals and leverage, an important distinction is drawn between the dependent and independent variables--the y and the x's in y = xb + u. The notion that y is a linear function of x can be exploited and, moreover, the fact that some point (y_i, x_i) is "far" from the bulk of the other points has different meanings if that "farness" is due to y_i or x_i. A point that is far due to x_i but, despite that, still close in the y_i-given-x_i metric adds precision to the measurements of the coefficients and may not indicate a problem at all. In fact, if we have the luxury of designing the experiment, which means choosing the values of x a priori, we attempt to maximize the distance between the x's (within the bounds of x we believe are covered by our linear model) to maximize that precision. In that extreme case, the distance of x_i carries no information as we set it prior to running the experiment. More recently, Hadi and Simonoff (1993) exploit the structure of the linear model and suggest two methods for identifying multiple outliers when the model is fitted to the data (also see [R] regression diagnostics).

In the multivariate case, we do not know the structure of the model, so (y_i, x_i) is just a point and the y is treated no differently from any of the x's--a fact which we emphasize by writing the point as (x_1i, x_2i) or simply (X_i). The technique does assume, however, that the X's are multivariate normal or at least elliptically symmetric. This leads to a problem if some of the X's are functionally related to the other X's, such as the inclusion of x and x^2, interactions such as x_1 x_2, or even dummy variables for multiple categories (in which one of the dummies being 1 means the other dummies must be 0). There is no good solution to this problem. One idea, however, is to perform the analysis with and without the functionally related variables and to subject all observations identified to further study (see What to do with outliers below).
An implication of hadimvo being a multivariate technique is that it would be inappropriate to apply it to (y, x) when x is the result of experimental design. The technique would know nothing of our design of x and would inappropriately treat "distance" in the x-metric the same as distance in the y-metric. Even when x is multivariate normal, unless y and x are treated similarly, it may still be inappropriate to apply hadimvo to (y, x) because of the different roles that y and x play in regression. However, one may apply hadimvo to x alone to identify outliers which, in this case, are called leverage points. (We should also mention here that if hadimvo is applied to x when it contains constants or any collinear variables, those variables will be correctly ignored, allowing the analysis to continue.)

It is also inappropriate to use hadimvo (and other outlier detection techniques) when the sample size is too small. hadimvo uses a small-sample correction factor to adjust the covariance matrix of the "clean" subset. Because the quantity n - (3k + 1) appears in the denominator of the correction factor, the sample size must be larger than 3k + 1. Some authors would require the sample size to be at least 5k, i.e., at least five observations per variable.

With these warnings, it is difficult to misapply this tool assuming that you do not take the results as more than suggestive. hadimvo has a p() option that is a "significance level" for the outliers that are chosen. We quote the term significance level because, although great effort has been expended to really make it a significance level, approximations are involved and it will not have that interpretation in all cases. It can be thought of as an index between 0 and 1, with increasing values resulting in the labeling of more observations as outliers, and with the suggestion that you select a number much as you would a significance level--it is roughly the probability of identifying any given point as an outlier if the data truly were multivariate normal. Nevertheless, the terms significance level or critical values should be taken with a grain of salt. It is suggested that one examine a graphical display (e.g., an index plot) of the distances, perhaps for different values of p(). The graphs give more information than a simple yes/no answer. For example, a graph may indicate that some of the observations (inliers or outliers) are only marginally so.
What to do with outliers

After a reading of the literature on outlier detection, many people are left with the incorrect impression that once outliers are identified, they should be deleted from the data and analysis should be continued. Automatic deletion (or even automatic down-weighting) of outliers is not always correct because outliers are not necessarily bad observations. On the contrary, if they are correct, they may be the most informative points in the data. For example, they may indicate that the data do not come from a normally distributed population, as is commonly assumed by almost all multivariate techniques.
The proper use of this tool is to label the outliers and then subject them to further study, not simply to discard them and continue the analysis with the rest of the data. After further study, it may indeed turn out to be reasonable to discard the outliers, but some mention of them must certainly be made in the presentation of the final results. Other corrective actions may include correction of errors in the data, deletion or down-weighting of outliers, redesigning the experiment or sample survey, collecting more data, etc.
Saved Results

hadimvo saves in r():

Scalars
    r(N)    number of outliers
Methods and Formulas

hadimvo is implemented as an ado-file. Formulas are given in Hadi (1992, 1994).
Acknowledgment

We would like to thank Ali S. Hadi of Cornell University for his assistance in writing hadimvo.
References

Gould, W. W. and A. S. Hadi. 1993. smv6: Identifying multivariate outliers. Stata Technical Bulletin 11: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 163-168.

Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54: 761-771.

------. 1994. A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B 56: 393-396.

Hadi, A. S. and J. S. Simonoff. 1993. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88: 1264-1272.

Rousseeuw, P. J. and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: John Wiley & Sons.
Also See

Related:  [R] mvreg, [R] regression diagnostics, [R] sureg
Title

hausman -- Hausman specification test
Syntax

    hausman, save

    hausman [, {more | less} constant alleqs skipeqs(eqlist) sigmamore
        prior(string) current(string) equations(matchlist) ]

    hausman, clear

where matchlist in equations() is

    #:# [, #:# [, ...]]

For instance, equations(1:1), equations(1:1, 2:2), or equations(1:2).
Description

hausman performs Hausman's (1978) specification test.
Options

save requests that Stata save the current estimation results. hausman will later compare these results with the estimation results from another model. A model must be saved in this fashion before a test against other models can be performed.

more specifies that the most recently estimated model is the more efficient estimate. This is the default.

less specifies that the most recently estimated model is the less efficient estimate.

constant specifies that the estimated intercept(s) are to be included in the model comparison; by default, they are excluded. The default behavior is appropriate for models where the constant does not have a common interpretation across the two models.

alleqs specifies that all the equations in the model be used to perform the Hausman test; by default, only the first equation is used.

skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation numbers are not allowed in this context, as it is the equation names, along with the variable names, that are used to identify common coefficients.

sigmamore allows you to specify that the two covariance matrices used in the test be based on a common estimate of disturbance variance (sigma^2)--the variance from the fully efficient estimator. This option provides a proper estimate of the contrast variance for so-called tests of exogeneity and over-identification in instrumental variables regression; see Baltagi (1998, 291). Note that this option can only be specified when both estimators save e(sigma) or e(rmse).

prior(string) and current(string) are formatting options that allow you to specify alternate wording for the "Prior" and "Current" default labels used to identify the columns of coefficients.
equations(matchlist) specifies, by number, the pairs of equations that are to be compared. If equations() is not specified, then equations are matched on equation names. equations() handles the situation where one estimator uses equation names and the other does not. For instance, equations(1:2) means equation 1 is to be tested against equation 2. equations(1:1, 2:2) means equation 1 is to be tested against equation 1 and equation 2 is to be tested against equation 2. If equations() is specified, options alleqs and skipeqs are ignored.

clear discards the previously saved estimation results and frees some memory; it is not necessary to specify hausman, clear before specifying hausman, save.
Remarks

hausman is a general implementation of Hausman's (1978) specification test that compares an estimator that is known to be consistent with an estimator that is efficient under the assumption being tested. The null hypothesis is that the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and a comparison estimator that is known to be consistent for the true parameters. If the two models display a systematic difference in the estimated coefficients, then we have reason to doubt the assumptions on which the efficient estimator is based.

To use hausman, you

    . (estimate the less efficient model)
    . hausman, save
    . (estimate the fully efficient model)
    . hausman

Alternatively, you can turn this around:

    . (estimate the fully efficient model)
    . hausman, save
    . (estimate the less efficient model)
    . hausman, less
> Example

We are studying the factors that affect the wages of young women in the United States between 1970 and 1988 and have a panel-data sample of individual women over that time span.
    . describe

    Contains data from nlswork.dta
      obs:        28,534                      National Longitudinal Survey.
                                                Young Women 14-26 years of age
                                                in 1968
     vars:             6                      1 Aug 2000 09:48
     size:       485,078 (88.4% of memory free)

                  storage  display    value
    variable name   type   format     label      variable label
    idcode          int    %8.0g                 NLS id
    year            byte   %8.0g                 interview year
    age             byte   %8.0g                 age in current year
    msp             byte   %8.0g                 1 if married, spouse present
    ttl_exp         float  %9.0g                 total work experience
    ln_wage         float  %9.0g                 ln(wage/GNP deflator)

    Sorted by:  idcode  year
    Note:  dataset has changed since last saved
We believe that a random-effects specification is appropriate for individual-level effects in our model. We estimate a fixed-effects model that will capture all temporally constant individual-level effects.

    . xtreg ln_wage age msp ttl_exp, fe

    Fixed-effects (within) regression            Number of obs      =     28494
    Group variable (i) : idcode                  Number of groups   =      4710

    R-sq:  within  = 0.1373                      Obs per group: min =         1
           between = 0.2571                                     avg =       6.0
           overall = 0.1800                                     max =        15

                                                 F(3,23781)         =   1262.01
    corr(u_i, Xb)  = 0.1476                      Prob > F           =    0.0000

         ln_wage |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
             age |  -.005485    .000837    -6.55   0.000   -.0071256   -.0038443
             msp |  .0033427   .0054868     0.61   0.542   -.0074118    .0140971
         ttl_exp |  .0383604   .0012416    30.90   0.000    .0359268    .0407941
           _cons |  1.593953   .0177538    89.78   0.000    1.559154    1.628752
    -------------+------------------------------------------------------------
         sigma_u |  .37674223
         sigma_e |  .29751014
             rho |  .61591044   (fraction of variance due to u_i)
    -------------+------------------------------------------------------------
    F test that all u_i=0:    F(4709,23781) =     7.76     Prob > F = 0.0000
We assume that this model is consistent for the true parameters and save the results by typing

    . hausman, save
Now, we estimate a random-effects model as a fully efficient specification of the individual effects under the assumption that they follow a random-normal distribution. These estimates are then compared to the previously saved results using the hausman command.
    . xtreg ln_wage age msp ttl_exp, re

    Random-effects GLS regression                Number of obs      =     28494
    Group variable (i) : idcode                  Number of groups   =      4710

    R-sq:  within  = 0.1373                      Obs per group: min =         1
           between = 0.2552                                     avg =       6.0
           overall = 0.1797                                     max =        15

    Random effects u_i ~ Gaussian                Wald chi2(3)       =   5100.33
    corr(u_i, X)       = 0 (assumed)             Prob > chi2        =    0.0000

         ln_wage |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
             age | -.0069749   .0006882   -10.13   0.000   -.0083238   -.0056259
             msp |  .0046594   .0051012     0.91   0.361   -.0053387    .0146575
         ttl_exp |  .0429635   .0010169    42.25   0.000    .0409704    .0449567
           _cons |  1.609916   .0159176   101.14   0.000    1.578718    1.641114
    -------------+------------------------------------------------------------
         sigma_u |  .32648519
         sigma_e |  .29751014
             rho |  .54633481   (fraction of variance due to u_i)
    . hausman

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |    Prior       Current     Difference        S.E.
    -------------+----------------------------------------------------------
             age |   -.005485    -.0069749     .0014899       .0004764
             msp |   .0033427     .0046594    -.0013167       .0020206
         ttl_exp |   .0383604     .0429635    -.0046031       .0007124

               b = less efficient estimates obtained previously from xtreg
               B = fully efficient estimates obtained from xtreg

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =   275.44
                    Prob>chi2 =    0.0000
Using the current specification, our initial hypothesis that the individual-level effects are adequately modeled by a random-effects model is resoundingly rejected. We realize, of course, that this result is based on the rest of our model specification and that it is entirely possible that random effects might be appropriate for some alternate model of wages.

hausman is a generic implementation of the Hausman test and assumes that the user knows exactly what they want tested. The test between random and fixed effects is so common that Stata provides a special command for use after xtreg. We could have obtained the above test in a slightly different format by typing

    . xthausman
    Hausman specification test

                     ---- Coefficients ----
                 |    Fixed        Random
         ln_wage |   Effects       Effects     Difference
    -------------+-------------------------------------------
             age |   -.005485    -.0069749      .0014899
             msp |   .0033427     .0046594     -.0013167
         ttl_exp |   .0383604     .0429635     -.0046031

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[S^(-1)](b-B),  S = (S_fe - S_re)
                             =   275.44
                    Prob>chi2 =    0.0000
> Example

A stringent assumption of multinomial and conditional logit models is that outcome categories for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this assumption requires that the inclusion or exclusion of categories does not affect the relative risks associated with the regressors in the remaining categories.

One classic example of a situation where this assumption would be violated involves the choice of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and drives the Chevrolet to work. Clearly, "drives the Ford" is a closer substitute to "drives the Chevrolet" than it is to "rides a train" (at least for most people). This means that excluding "drives the Ford" from the model could be expected to affect the relative risks of the remaining options, and the model would not obey the IIA assumption.

Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. Choice of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and gender. The indemnity category is allowed to be the base category, and the model including all three outcomes is estimated.
    . mlogit insure age male

    Iteration 0:   log likelihood = -555.85446
    Iteration 1:   log likelihood = -551.32913
    Iteration 2:   log likelihood = -551.32802

    Multinomial regression                        Number of obs   =        615
                                                  LR chi2(4)      =       9.05
                                                  Prob > chi2     =     0.0598
    Log likelihood = -551.32802                   Pseudo R2       =     0.0081

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Prepaid      |
             age | -.0100251   .0060181    -1.67   0.096   -.0218204    .0017702
            male |  .5095747   .1977893     2.58   0.010    .1219148    .8972345
           _cons |  .2633838   .2787574     0.94   0.345   -.2829708    .8097383
    -------------+------------------------------------------------------------
    Uninsure     |
             age | -.0051925   .0113821    -0.46   0.648    -.027501     .017116
            male |  .4748547   .3618446     1.31   0.189   -.2343477    1.184057
           _cons | -1.756843   .5309591    -3.31   0.001   -2.797504   -.7161824
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)

    . hausman, save
Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden 1984.) We re-estimate the model, excluding the uninsured outcome, and perform a Hausman test against the fully efficient full model.

    . mlogit insure age male if insure ~= "Uninsure":insure
    Iteration 0:   log likelihood =  -394.8693
    Iteration 1:   log likelihood =  -390.4871
    Iteration 2:   log likelihood = -390.48643

    Multinomial regression                        Number of obs   =        570
                                                  LR chi2(2)      =       8.77
                                                  Prob > chi2     =     0.0125
    Log likelihood = -390.48643                   Pseudo R2       =     0.0111

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Prepaid      |
             age | -.0101521   .0060049    -1.69   0.091   -.0219214    .0016173
            male |  .5144003   .1981735     2.60   0.009    .1259875    .9028132
           _cons |  .2678043   .2775562     0.96   0.335   -.2761959    .8118046
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)

    . hausman, alleqs less constant

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |   Current       Prior      Difference        S.E.
    -------------+----------------------------------------------------------
             age |  -.0101521    -.0100251    -.0001269            .
            male |   .5144003     .5095747     .0048256      .012334
           _cons |   .2678043     .2633838     .0044205            .

               b = less efficient estimates obtained from mlogit
               B = fully efficient estimates obtained previously from mlogit

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =     0.08
                    Prob>chi2 =    0.9944
First, note that the somewhat subtle syntax of the if condition on the mlogit command was simply used to identify the "Uninsured" category using the insure value label; see [U] 15.6.3 Value labels.

Second, on examining the output from hausman, we see that there is no evidence that the IIA assumption has been violated. Since the Hausman test is a standardized comparison of model coefficients, using it with mlogit requires that the base category be the same in both competing models. In particular, if the most frequent category (the default base category) is being removed to test for IIA, then you must use the basecategory() option in mlogit to manually set the base category to something else.
The missing values for the square root of the diagonal of the covariance matrix of the differences are not comforting, but they are also not surprising. This covariance matrix is guaranteed to be positive definite only asymptotically, and assurances are not made about the diagonal elements. Negative values along the diagonal are possible, and the fourth column of the table is provided mainly for descriptive use.

We can also perform the Hausman IIA test against the remaining alternative in the model.

    . mlogit insure age male if insure ~= "Prepaid":insure
    Iteration 0:   log likelihood = -132.59915
    Iteration 1:   log likelihood = -131.78009
    Iteration 2:   log likelihood = -131.76808
    Iteration 3:   log likelihood = -131.76807

    Multinomial regression                        Number of obs   =        338
                                                  LR chi2(2)      =       1.66
                                                  Prob > chi2     =     0.4356
    Log likelihood = -131.76807                   Pseudo R2       =     0.0063

          insure |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    Uninsure     |
             age | -.0041055   .0115807    -0.35   0.723   -.0268033    .0185923
            male |  .4591072   .3595663     1.28   0.202   -.2456298    1.163844
           _cons | -1.801774   .5474476    -3.29   0.001   -2.874752   -.7287968
    -------------+------------------------------------------------------------
    (Outcome insure==Indem is the comparison group)
    . hausman, alleqs less constant

                     ---- Coefficients ----
                 |     (b)          (B)         (b-B)    sqrt(diag(V_b-V_B))
                 |   Current       Prior      Difference        S.E.
    -------------+----------------------------------------------------------
             age |  -.0041055    -.0051925      .001087      .0021357
            male |   .4591072     .4748547    -.0157475      .1333464
           _cons |  -1.801774    -1.756843    -.0449311            .

               b = less efficient estimates obtained from mlogit
               B = fully efficient estimates obtained previously from mlogit

        Test:  Ho:  difference in coefficients not systematic

                     chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                             =    -0.18    chi2 < 0 ==> model estimated on these
                                           data fails to meet the asymptotic
                                           assumptions of the Hausman test
In this case, the chi-squared statistic is actually negative. We might interpret this as strong evidence that we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test, particularly when the sample is relatively small--there are only 45 uninsured individuals in this dataset.
Are we surprised by the results of the Hausman test in this example? Not really. Judging from the z statistics on the original multinomial logit model, we were struggling to identify any structure in the data with the current specification. Even when we were willing to make the assumption of IIA and estimate the most efficient model under this assumption, few of the effects could be identified as statistically different from those on the base category. Trying to base a Hausman test on a contrast (difference) between two poor estimates is just asking too much of the existing data.

For an example applying the Hausman test to the endogeneity of variables in a simultaneous system, see [R] ivreg.
Saved Results

hausman saves in r():

Scalars
    r(chi2)    chi-squared statistic
    r(df)      degrees of freedom for the statistic
    r(p)       p-value for the chi-squared
Acknowledgment

Portions of hausman are based on an early implementation by Jeroen Weesie, Utrecht University, Netherlands.
Methods and Formulas

hausman is implemented as an ado-file.

The Hausman statistic is distributed as chi-squared and is computed as

$$H = (\beta_c - \beta_e)'(V_c - V_e)^{-1}(\beta_c - \beta_e)$$

where

    beta_c is the coefficient vector from the consistent estimator
    beta_e is the coefficient vector from the efficient estimator
    V_c is the covariance matrix of the coefficients from the consistent estimator
    V_e is the covariance matrix of the coefficients from the efficient estimator
In cases where the difference in the variance matrices is not positive definite, a Moore-Penrose generalized inverse is used. As noted in Gourieroux and Monfort (1989, 125-128), the choice of generalized inverse is not important asymptotically. The degrees of freedom for the statistic are the rank of the difference in the variance matrices. When the difference is positive definite, this is the number of common coefficients in the models being compared.
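To make the formula concrete, the statistic can be reproduced by hand from saved estimation results (a minimal sketch based on the earlier xtreg example; it drops _cons, as hausman does by default, and assumes the coefficients are stored in the order age, msp, ttl_exp, _cons):

    . quietly xtreg ln_wage age msp ttl_exp, fe
    . matrix bc = e(b)                          /* consistent (fixed-effects) estimates */
    . matrix Vc = e(V)
    . quietly xtreg ln_wage age msp ttl_exp, re
    . matrix be = e(b)                          /* efficient (random-effects) estimates */
    . matrix Ve = e(V)
    . matrix d = bc[1,1..3] - be[1,1..3]        /* contrast, constant dropped */
    . matrix V = Vc[1..3,1..3] - Ve[1..3,1..3]  /* difference of covariance matrices */
    . matrix H = d*syminv(V)*d'                 /* 1x1 matrix holding the statistic */
    . matrix list H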
References

Baltagi, B. H. 1998. Econometrics. New York: Springer-Verlag.

Gourieroux, C. and A. Monfort. 1989. Statistics and Econometric Models, Vol. 2. New York: Springer-Verlag.

Hausman, J. 1978. Specification tests in econometrics. Econometrica 46: 1251-1271.

Hausman, J. and D. McFadden. 1984. Specification tests in econometrics. Econometrica 52: 1219-1240.

McFadden, D. 1974. Measurement of urban travel demand. Journal of Public Economics 3: 303-328.
Also See

Related:  [R] lrtest, [R] test, [R] xtreg, [R] xtregar
Title

heckman -- Heckman selection model
Syntax

Basic syntax

    heckman depvar [varlist], select(varlist_s) [ twostep ]

or

    heckman depvar [varlist], select(depvar_s = varlist_s) [ twostep ]

Full syntax for maximum likelihood estimates only

    heckman depvar [varlist] [weight] [if exp] [in range],
        select([depvar_s =] varlist_s [, offset(varname) noconstant])
        [ robust cluster(varname) score(newvarlist) nshazard(newvarname)
          mills(newvarname) offset(varname) noconstant constraints(numlist)
          first noskip level(#) iterate(0) nolog maximize_options ]

Full syntax for Heckman's two-step consistent estimates only

    heckman depvar [varlist] [if exp] [in range], twostep
        select([depvar_s =] varlist_s [, noconstant])
        [ noconstant rhosigma | rhotrunc | rholimited | rhoforce
          nshazard(newvarname) mills(newvarname) first level(#) ]

by ... : may be used with heckman; see [R] by.

pweights, aweights, fweights, and iweights are allowed with maximum likelihood estimation; see [U] 14.1.6 weight. No weights are allowed if twostep is specified.

heckman shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    xb                  x_j b, fitted values (the default)
    ycond               E(y_j | y_j observed)
    yexpected           E(y_j*), y_j taken to be 0 where unobserved
    nshazard or mills   nonselection hazard (also called inverse of Mills' ratio)
    psel                Pr(y_j observed)
    xbsel               linear prediction for selection equation
    stdpsel             standard error of the linear prediction for selection equation
    pr(a,b)             Pr(a < y_j < b)
    e(a,b)              E(y_j | a < y_j < b)
    ystar(a,b)          E(y_j*), y_j* = max{a, min(y_j, b)}
    stdp                standard error of the prediction
    stdf                standard error of the forecast

where a and b may be numbers or variables; a equal to "." means minus infinity; b equal to "." means plus infinity.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

heckman estimates regression models with selection using either Heckman's two-step consistent estimator or full maximum likelihood.
Options

select(...) specifies the variables and options for the selection equation. It is an integral part of specifying a Heckman model and is not optional.

twostep specifies that Heckman's (1979) two-step efficient estimates of the parameters, standard errors, and covariance matrix are to be produced.
robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.11 Obtaining robust variance estimates.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily independent within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() implies robust; that is, specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and ancillary parameter in the model.
The first new variable specified will contain $u_{1j} = \partial \ln L_j / \partial(x_j\beta)$ for each observation j in the sample, where $\ln L_j$ is the jth observation's contribution to the log likelihood.

The second new variable: $u_{2j} = \partial \ln L_j / \partial(z_j\gamma)$
The third: $u_{3j} = \partial \ln L_j / \partial(\operatorname{atanh}\rho)$
The fourth: $u_{4j} = \partial \ln L_j / \partial(\ln\sigma)$

If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on.

The jth observation's contribution to the score vector is

$$\big(\,\partial \ln L_j/\partial\beta,\ \partial \ln L_j/\partial\gamma,\ \partial \ln L_j/\partial(\operatorname{atanh}\rho),\ \partial \ln L_j/\partial(\ln\sigma)\,\big) = \big(\,u_{1j}x_j,\ u_{2j}z_j,\ u_{3j},\ u_{4j}\,\big)$$

The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores.
nshazard(newvarname) and mills(newvarname) are synonyms; either will create a new variable containing the nonselection hazard--what Heckman (1979) referred to as the inverse of the Mills' ratio--from the selection equation. The nonselection hazard is computed from the estimated parameters of the selection equation.

offset(varname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.

noconstant omits the constant term from the equations. This option may be specified on the regression equation, the selection equation, or both.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. constraints(numlist) may not be specified with twostep.
first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation.

noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.
level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

iterate(0) produces Heckman's (1979) two-step parameter estimates with standard errors computed from the inverse Hessian of the full information matrix at the two-step solution for the parameters. As an alternative, the twostep option computes Heckman's two-step consistent estimates of the standard errors. iterate(#) can also be used to restrict the maximum number of iterations during optimization; see [R] maximize.

rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options that specify how the two-step estimator, option twostep, handles unusual cases where the two-step estimate of rho is outside the admissible range for a correlation, [-1, 1]. When abs(rho) > 1, the two-step estimate of the coefficient variance-covariance matrix may not be positive definite and thus may be unusable for testing. The default is rhosigma.

rhotrunc specifies that rho be truncated to lie in the range [-1, 1]. If the two-step estimate is less than -1, rho is set to -1; if the two-step estimate is greater than 1, rho is set to 1. This truncated value of rho is used in all computations to estimate the two-step covariance matrix.
rhosigma specifies that rho be truncated, as with option rhotrunc, and that the estimate of sigma be made consistent with the truncated estimate of rho; so sigma-hat = beta_m divided by the truncated rho (see Methods and Formulas for the definition of beta_m). Both the truncated rho and the new estimate of sigma are used in all computations to estimate the two-step covariance matrix.
rholimited specifies that rho be truncated only in the computation of the diagonal matrix D as it enters Vtwostep and Q; see Methods and Formulas. In all other computations, the untruncated estimate of rho is used.

rhoforce specifies that the two-step estimate of rho be retained even if it is outside the admissible range for a correlation. This may, in rare cases, lead to a nonpositive-definite covariance matrix.

These options have no effect when estimation is by maximum likelihood, the default. They also have no effect when the two-step estimate of rho is in the range [-1, 1].

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You will likely never need to specify any of these options except for iterate(0) and possibly difficult. If the iteration log shows many "not concave" messages and it is taking many iterations to converge, you may want to try using the difficult option and see if that helps it converge in fewer steps.
Options for predict

xb, the default, calculates the linear prediction x_j b.

ycond calculates the expected value of the dependent variable conditional on the dependent variable being observed, i.e., selected; E(y_j | y_j observed).

yexpected calculates the expected value of the dependent variable (y_j*), where that value is taken to be 0 when it is expected to be unobserved; y_j* = Pr(y_j observed) E(y_j | y_j observed). The assumption of 0 is valid for many cases where nonselection implies nonparticipation (e.g., unobserved wage levels, insurance claims from those who are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved disease incidence).

nshazard and mills are synonyms; either calculates the nonselection hazard--what Heckman (1979) referred to as the inverse of the Mills' ratio--from the selection equation.

psel calculates the probability of selection (or being observed): Pr(y_j observed) = Pr(z_j gamma + u_2j > 0).

xbsel calculates the linear prediction for the selection equation.

stdpsel calculates the standard error of the linear prediction for the selection equation.

pr(a,b) calculates Pr(a < x_j b + u_1 < b), the probability that y_j|x_j would be observed in the interval (a, b).

    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_1 < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_1 < ub);
    and pr(20,ub) calculates Pr(20 < x_j b + u_1 < ub).

    a = . means minus infinity; pr(.,30) calculates Pr(x_j b + u_1 < 30);
    pr(lb,30) calculates Pr(x_j b + u_1 < 30) in observations for which lb = .
    (and calculates Pr(lb < x_j b + u_1 < 30) elsewhere).

    b = . means plus infinity; pr(20,.) calculates Pr(x_j b + u_1 > 20);
    pr(20,ub) calculates Pr(x_j b + u_1 > 20) in observations for which ub = .
    (and calculates Pr(20 < x_j b + u_1 < ub) elsewhere).

e(a,b) calculates E(x_j b + u_1 | a < x_j b + u_1 < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a, b), which is to say, y_j|x_j is censored. a and b are specified as they are for pr().

ystar(a,b) calculates E(y_j*), where y_j* = a if x_j b + u_j < a, y_j* = b if x_j b + u_j > b, and y_j* = x_j b + u_j otherwise, which is to say, y_j* is truncated. a and b are specified as they are for pr().

stdp calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

stdf calculates the standard error of the forecast. This is the standard error of the point prediction for a single observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those by stdp; see [R] regress Methods and Formulas.

nooffset is relevant when you specify offset(varname) for heckman. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
Remarks

The Heckman selection model (Gronau 1974, Lewis 1974, Heckman 1976) assumes that there exists an underlying regression relationship

    y_j = x_j beta + u_1j            (regression equation)

The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if

    z_j gamma + u_2j > 0             (selection equation)

where

    u_1 ~ N(0, sigma)
    u_2 ~ N(0, 1)
    corr(u_1, u_2) = rho

When rho is not 0, standard regression techniques applied to the first equation yield biased results. heckman provides consistent, asymptotically efficient estimates for all the parameters in such models.

In one classic example, the first equation describes the wages of women. Women choose whether to work and thus, from our point of view as researchers, whether we observe their wages in our data. If women made this decision randomly, we could ignore the fact that not all wages are observed and use ordinary regression to estimate a wage model. Such a random-participation-in-the-labor-force assumption, however, is unlikely to be true; women who would have low wages may be unlikely to choose to work, and thus the sample of observed wages is biased upward.

In the jargon of economics, women choose not to work when their personal reservation wage is greater than the wage offered by employers. Thus, women who choose not to work might have even higher offer wages than those who do work--they may have high offer wages, but they have even higher reservation wages. One could tell a story that competency is related to wages, but competency is rewarded more at home than in the labor force.
In any case, in this problem--which is the paradigm for most such problems--a solution can be found if there are some variables that strongly affect the chances for observation (the reservation wage) but not the outcome under study (the offer wage). Such a variable might be the number of children in the home. (Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult for anyone to take such results seriously since the functional-form assumptions have no firm basis in theory.)
> Example

In the syntax for heckman, depvar and varlist are the dependent variable and regressors for the underlying regression model to be estimated (y = X beta), and varlist_s are the variables (Z) thought to determine whether depvar is observed or unobserved (selected or not selected). In our female wage example, the number of children at home would be included in the second list. By default, heckman will assume that missing values (see [U] 15.2.1 Missing values) of depvar imply that the dependent variable is unobserved (not selected). With some datasets, it is more convenient to specify a binary variable (depvar_s) that identifies the observations for which the dependent variable is observed/selected (depvar_s != 0) or not observed (depvar_s = 0); heckman will accommodate either type of data.

We have a (fictional) dataset on 2,000 women, of whom 1,343 work:

    . summarize age educ married children wage

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |    2000      36.208    8.28656         20         59
       education |    2000      13.084   3.045912         10         20
         married |    2000       .6705   .4701492          0          1
        children |    2000      1.6445   1.398963          0          5
            wage |    1343    23.69217   6.305374    5.88497   45.80979
We will assume that the hourly wage is a function of education and age, whereas the likelihood of working (the likelihood of the wage being observed) is a function of marital status, the number of children at home, and (implicitly) the wage (via the inclusion of age and education, which we think determine the wage):
    . heckman wage educ age, select(married children educ age)

    Iteration 0:   log likelihood = -5178.7009
    Iteration 1:   log likelihood = -5178.3049
    Iteration 2:   log likelihood = -5178.3045

    Heckman selection model                      Number of obs      =      2000
    (regression model with sample selection)     Censored obs       =       657
                                                 Uncensored obs     =      1343

                                                 Wald chi2(2)       =    508.44
    Log likelihood = -5178.304                   Prob > chi2        =    0.0000

                 |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    -------------+------------------------------------------------------------
    wage         |
       education |  .9899537   .0532565    18.59   0.000    .8855729    1.094334
             age |  .2131294   .0206031    10.34   0.000    .1727481    .2535108
           _cons |  .4857752   1.077037     0.45   0.652   -1.625179     2.59673
    -------------+------------------------------------------------------------
    select       |
         married |  .4451721   .0673954     6.61   0.000    .3130794    .5772647
        children |  .4387068   .0277828    15.79   0.000    .3842534    .4931601
       education |  .0557318   .0107349     5.19   0.000    .0346917    .0767718
             age |  .0365098   .0041533     8.79   0.000    .0283694    .0446502
           _cons | -2.491015   .1893402   -13.16   0.000   -2.862115   -2.119915
    -------------+------------------------------------------------------------
         /athrho |  .8742086   .1014225     8.62   0.000    .6754241    1.072993
        /lnsigma |  1.792559    .027598    64.95   0.000    1.738468     1.84665
    -------------+------------------------------------------------------------
             rho |  .7035061   .0512264                     .5885365    .7905862
           sigma |  6.004797   .1657202                      5.68862    6.338548
          lambda |  4.224412   .3992265                     3.441942    5.006881
    -------------+------------------------------------------------------------
    LR test of indep. eqns. (rho = 0):   chi2(1) =    61.20   Prob > chi2 = 0.0000
heckman assumes that wage is the dependent variable and that the first variable list (educ and age) are the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus, we estimated the model

    wage = beta_0 + beta_1 educ + beta_2 age + u_1

and we assumed that wage is observed if

    gamma_0 + gamma_1 married + gamma_2 children + gamma_3 educ + gamma_4 age + u_2 > 0

where u_1 and u_2 have correlation rho.
The reported results for the wage equation are interpreted exactly as though we observed wage data for all women in the sample; the coefficients on age and education level represent the estimated marginal effects of the regressors in the underlying regression equation. The results for the two ancillary parameters require some explanation. heckman does not directly estimate rho; to constrain rho within its valid limits, and for numerical stability during optimization, it estimates the inverse hyperbolic tangent of rho:

$$\operatorname{atanh}\rho = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$$
This estimate is reported as /athrho. In the bottom panel of the output, heckman undoes this transformation for you; the estimated value of rho is .7035061. The standard error for rho is computed using the delta method, and its confidence intervals are the transformed intervals of /athrho.

Similarly, sigma, the standard error of the residual in the wage equation, is not directly estimated; for numerical stability, heckman instead estimates ln sigma. The untransformed sigma is reported at the end of the output: 6.004797.

Finally, some researchers--especially economists--are used to the selectivity effect summarized not by rho but by lambda = rho sigma. heckman reports this, too, along with an estimate of the standard error and confidence interval.
Technical Note

If each of the equations in the model had contained many regressors, the heckman command could become quite long. An alternate way of specifying our wage model would make use of Stata's global macros. The following lines are an equivalent way of estimating our model:

    . global wageeq "wage educ age"
    . global seleq "married children educ age"
    . heckman $wageeq, select($seleq)
o Technical Note The reported model X _ test is a Wald test of all coefficients in the regression model (except the constant) being 0. heckman is an estimation command, so you can use test, testnl, or lrtest to perform tests against whatever nested alternate model you choose; see [R] test, [R] testnl, and [R] lrtest. The estimation of P and cr in the form atanh p and In cr extends the range of these parameters to infinity in both directions, thus avoiding boundary problems during the maximization. Tests of p must be made in the transformed units. However, since atanh(0) - 0, the reported test for atanh p = 0 is equivalent to the test for ,o = O. The likelihood-ratio test reported at the bottom of the output is an equivalent test for p = 0 and is computationally the comparison of the joint likelihood of an independent probit model for the selection equation and a regression model on the observed wage data against the heckman model likelihood. The z -- 8.619 and X _ of 61.20, both significantly different from zero. clearly justify the Heckman selection equation with this data. []
Example heckman supports the HuberAVhite/sandwich estimator of variance under the robust and cluster() options, or when the pweights are used for population weighted data: see IU] 23.11 Obtaining robust variance estimates. We can obtain robust standard errors for our wage model by specifying clustering on county of residence (the county variable).
h$ckman-- Heckmanselection model
23
i
• heckman
wage
educ
Iteration
O:
log likelihood
= -5178.7009
Iteration
I:
log likelihood
= -5178.3049
Iteration
2:
log likelihood
= -5178,3045
Heckman
selection
(regression
select(married
children
model
model
Log likelihood
age,
with
educ age)
Number sample
selection)
= -5178.304
Coef.
cluster(county)
of obs
=
2000
Censored obs Uncensored obs
= =
657 1343
Wald
=
272.17
=
0.0000
chi2(1)
Prob > chi2 (standard
errorsiladjusted :!
Robust
i!
Std,
Err.
for clustering
P>lzl
[957 Conf.
on county)
Interval]
! wage education age _cons
.9899537
.0600061
16.$0
0.000
.8723438
1.107564
.2131294 .4857752
.020995 1.302103
I0,i5 0._7
0.000 0.709
.17198 -2.066299
.2542789 3.03785
.4451721 .4387068
.0731472 .0312386
6.09 14.04
0.000 0.000
.3018062 .3774802
.5885379 .4999333
5._6
0.000
.0341645
.0772991
0.000 0.000
.0285954 -2,717059
.0444242 -2.264972
0.000
.5991596
1.149258
0.000
1.741902
1.843216
select married children education age _cons
.0110039 •004038 .1153305
9.$4 -21._0
.8742086
.1403337
6._3
1.792559
.0258458
69._6
rho
•7035061
,0708796
.5364513
.817508
sigma lambda
6.004797 4.224412
.155199 .5186709
5.708189 3.207835
6.316818 5.240988
Prob
= 0.0000
/athrho /Insigma
Wald
.0557318 •0365098 -2.491015
test
of indep,
eqns.
(rho = 0): chi2(l!
=
38.81
> chi2
The robust standard errors tend to be a bit larger, bit we do not notice any systematic differences. This is not surprising since the data were not constructed to have any county-specific correlations or other characteristics that would deviate from the assumptions of the Heckman model. q
Example The default statistic produced by predict after tieckman is the expected value of the dependenl variable from the underlying distribution of the regression model. In our wage model, this is the expected wage rate among all women, regardless of whether they were observed to participate in the labor force. • predict heckwage (option xb assumed;
fitted
values) }
It is instructive to compare these predicted wage v_lues from the Heckman model with an ordinary regression model--a model without the selection adiustment:
;_r;
24
heckman-Heckmanselectionmodel wage educ age
. regress
]
Source
13524.0337
wage
age _cons education
(option
MS
2
39830.8609
Total
• predict
df
i
Model Residual
)
SS
1340
53354,8948
1342
I
Coef.
Std.
I
.1465739 6.084875 .8965829
Number of obs F( 2, 1340)
= =
1343 227.49
6762.01687
Prob
=
0.0000
29.7245231
R-squared
=
0.2535
39.7577456
Adj R-squared Root MSE
= =
0.2524 5.452
Err.
t
.0187135 .8896182 .0498061
7.83 6.84 18.00
> F
P> It J
[95Y, Conf.
0.000 O. 000 0.000
.109863 4.339679 .7988765
Interval]
.1832848 7. 830071 .9942893
re.age xb assumed;
. summarize
fitted
heckwage
values)
re.age
'
Variable
0bs
Mean
Std.
Dev.
Min
Max
(
heckwage
2000
21. 15532
3. 83965
14.6479
32. 85949
regwage
2000
23. 12291
3. 241911
17. 98218
32. 66439
Since this dataset was concocted, we know the true coefficients of the wage regression equation to be 1, 0.2, and 1, respectively. We can compute the true mean wage for our sample. • gen
truewage
• sum
truewage
= i +
,2*age
+ l*educ
Variable
I
0bs
Mean
truewage
I
2000
21. 3256
Std.
Dev.
3.797904
Min
Max
15
32.8
Whereas the mean of the predictions from heckmanis within 18 cents of the true mean wage, ordinary regression yields predictions that are on average about $1.80 per hour too high due to the selection effect. The regression predictions also show somewhat less variation than the true wages. The coefficients from heckman are so close to the true values that they are not worth testing. Conversely, the regression equation is significantly off, but seems to give the right sense. Would we be led far astray if we relied on the OLS coefficients? The effect of age is off by over 5 cents per year of age and the coefficient on education level is off by about 10%. We can test the OLS coefficient on education level against the true value using test. • test (I)
educ
= 1
education F(
= 1.0
1, 1340) = Prob > F =
4.31 0.0380
Not only is the OL$ coefficient on education substantially lower than the true parameter, the difference from the true parameter is statistically significant beyond the 5% level. We can perform a similar test for the OLS age coefficient: • test (1)
age
=
.2
age
=
.2
F(
1, 1340) = Prob > F =
8.15 0.0044
We find even stronger evidence that the OLS regression results are biased away from the true parameters. <1 !
!
tw,_..
heckman-- Heckmanselection model
25
> Example Several other interesting aspects of the Heckmah model can be explored with predict;. Continuing with our wage model, the expected wages for wOmen conditional on participating in the labor force can be obtained with the ycond option. Let's gdt these predictions and compare them with actual wages for women participating in the labor forcel • predict
hcndwage,
• stmm_lize
wage
ycond
hcndwage
Variable wage hcndwage
if wage
-=
Obs
Mean
Std ! Dev.
1343
23.69217
6.3_5374,
1343
23.68239
3.355087 i
Min
Max
5.88497
45.80979
16. 18337
33.7567
We see that the average predictions from beckman are very close to the observed levels but do not have exactly the same mean. These conditional w'age predictions are available for all observations in the dataset, but can only be directly compared with observed wages where individuals are participating in the labor force. What if we were interested in making predictions about mean wages for all women? In this case, the expected wage is 0 for those who are not exp_ted to participate in the labor force, with expected participation determined by the selection equation.: These values can be obtained with the yexpected option of predict. For comparison, a variable can be generated where the wage is set to 0 for nonparticipants. . predict
hexpwage,
yexpected
• gen wageO=
wage
(657 missing
values
generated)
. replace
wageO=
0 if wage
(657 real
changes
made)
• summarize Variable hexpwage wageO
hexpwage
== .
wageO
fibs
Mean
Stdi Dev.
2000
15. 92511
5.949336
2000
15,90929
12. _7081
Min 2.492469
Max 32.45858
0
45.80979
i
i
Again, we note that the predictions from heckman are very' close to the observed mean hourly wage rate for all women. Why aren't the predictions uging ycond and yexpected exactly equal to their observed sample equivalents? For the Heckman _odel, unlike linear regression, the sample moments implied by the optimal solution to the model likelihood do not require that these predictions exactly match observed data. Properly accounting for thh additional variation from the selection equation ,-quires that the model use more information thar_just the sample moments of the observed wao_es. q
Example Stata wilt also produce Heckman's (1979) two-step efficient estimator of the model with the twostep option. Maximum likelihood estinaation of the parameters can be time-consuming with large datasets and the two-step estimates may provide a_ood alternative in such cases. Continuing with the women's wage model, we can obtain the two-step estimates with Heckman's consistent covariance estimates by typing
!
I
',,
26
heckman m Heckman selection model
• heckman wage educ age, select(married children ednc age) twostep Heckman selection model -- two-step estimates (regression model with sample selection)
Coef, wage education age _cons
Std, Err.
z
Number of obs Censored obs Uncensored obs
= = =
2000 657 1343
Wald chi2(4) Prob > chi2
= =
551.37 0.0000
P>Izl
[95_ Conf. Interval]
.9825259 .2118695 .7340391
.0538821 .0220511 1.248331
18.23 9.61 0.59
0.000 0.000 0.557
.8789189 .1686502 -1.712645
1.088133 .2550888 3.180723
.4308575 .4473249 .0583645 .0347211 -2.467365
.074208 .0287417 .0109742 .0042293 .1925635
5.81 15.56 5.32 8.21 -12.81
0.000 0.000 0.000 0.000 0.000
.2854125 .3909922 .0368555 .0264318 -2.844782
.5763025 .5036576 .0798735 .0430105 -2.089948
select married children education age _cons mills lambda
4.001615
rho sigma lambda
0.67284 5.9473529 4.0016155
.6065388
6.60
0.000
2.812821
5.19041
.6065388
q
t] Technical Note The Heckman selection mode] depends strongly on the model being correct; much more so than ordinary regression. Running a separate probit or ]ogit for sample inclusion followed by a regression, referred to in the literature as the two-part model (Manning, Duan, and Rogers 1987) not to be confused with Heckman's two-step procedure--is an especially attractive alternative if the regression part of the model arose because of taking a logarithm of zero values. When the goal is to analyze an underlying regression model or predict the value of the dependent variable that would be observed in the absence of selection, however, the Heckman model is more appropriate. When the goal is to predict an actual response, the two-part model is usually the better choice. The Heckman selection model can be unstable when the model is not properly specified, or if a specific dataset simply does not support the model's assumptions. For example, let's examine the solution to another simulated problem.
(Continued
on next page)
heckman-- _man
• heckman
yt xl x2 x3,
Iteration Iteration Iteration Iteration Iteration
O: i: 2: 3: 4:
log log log log log
selec¢(zl
likelihood likelihood likelihood likelihood likelihood
selection model
27
z2) = = = = =
-t11.94996 -110.82258 -II0.17707 -107.70663 (not concave) -107.07729 (not concave)
(outputo_.ed ) Iteration 31: Iteration 32:
log likelihood = -104.08268 log likelihood = -104.08267 (backed up)
Heckman selection model
Number of obs
=
150
(regression model with sample selection)
Censored obs Uncensored obs
= =
87 63
Wald chi2(3)
=
8.64e+07
Prob
=
0.0000
Log likelihood
= -104.0827 Coef.
Std. Err.
z
> chi2
P>Izl
[957,Conf, Interval]
yt xl x2
.8974192 -2,525302
.0006338 1415._ .0003934 -6418.57
O.000 O.000
.896177 -2.526074
.8986615 -2.524531 2. 856651
x3 _cons
2.855786 .6971255
.0004417 6465.84 .0851518 8.I_
O.000 O.000
2.85492 ,5302311
zl
-.6830377
.0824049
-8.29
O.000
-.8445484
-.521527
z2 _cons
1.004249 -.361413
.1211501 .1165081
8. _ -3,ID
O. 000 O.002
.7667993 -.589"/647
1. 241699 -. 1330613
/athrho /insigma
15.12596 -.5402571
151.3627 ,1206355
0.10 -4._
0.920 O.000
-281.5395 -.7766984
311.7914 -.3038158
.8640198
select
i
rho sigma lambda
1 .5825984 .5825984
4.40e-
LR test ol indep, eqns. (rho = 0):
I
I !
11
-1
.0702821 .0702821
.459922 .4448481 chi2(i) =
25.67
1 .7379968 .7203488
Prob > chi2 = 0,0000
the form of the likelihood for the Heckman selectioh model, this implies a division by zero and it is surprising that the model solution turns out as will as it does. Reparameterizing p has allowed The model has converged to a value of p that is 1.0--within machine rounding tolerances. Given the estimation to converge, but we clearly have problems with the estimates. Moreover, if this had occurred in a large dataset, waiting over 32 iteration_ for convergence might take considerable time. This dataset was not intentionally developed to cause problems. It is actually generated by a "Heckman process" and when generated starting fromidifferent random values can be easily estimated. The luck of the draw in this case merely led to daia that, despite its source, did not support the assumptions of the Heckman model. The two-step model is generally more stable in chses where the data are problematic. It is even tolerant of estimates of p less than -1 and greater !than t. For these reasons, the two-step model may be preferred when exploring a large dataset. Still, if the maximum likelihood estimates cannot converge, or converge to a value of p that is at the bouhdary of acceptable values, there is scant support for estimating a Heckman selection model on the d_ta. Heckman (1979) discusses the implications of p being exactly t or 0, together with the implica!ions of other possible covariance relationships among the model's determinants.
_
l_,T
28
heckman --
Saved Results heckman
saves
Heckman selection model
in e():
Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
x2
e(k_eq)
number of equations
e(chi2_c)
X2 for comparison test
e(k_dv) e (N_eens)
number of dependent variables number of censored observations
e(p_c) e (p)
p-value for comparison test significance of comparison test
e (dr_m)
model degrees of freedom
e (rho)
p
e(11)
log likelihood
e(£c)
number of iterations
e(ll_O) e(N_clust)
log likelihood, constant-only model number of clusters
e(rank) e(rankO)
rank of e(V) rank of e(V) for constant-only
e(lambda)
A
model
e(selambda) standard errorof A
e(sigma)
sigma
typeof optimization
Macros e(cmd)
heckman
e(opt)
e(depv_')
name(s)of dependent variable(s)
e(chi2type) Wald or Lit; typeof modcl x2 test
e(title) e(title2)
title in estimation output secondary title in estimation output
e(chi2_ct)
Wald or LR; type of model )c2 test corresponding to e(chi2_c)
e(utype)
weight type
e(offset#)
offset for equation #
e (wexp) e(clustvar)
weight expression name of cluster variable
e (mills)
variable containing nonselection hazard (inverse of Mills')
e (method)
requested estimation method
e(predict)
program used to implement predict
e(vcetype) e (user)
covanance estimation method name of tikelihood-evaluator program
e(cnslist)
constraint numbers
e(b)
coefficient vector
e(V)
variance-covariance
e(ilog)
iteration log (up to 20 iterations)
Matrices matrix of
the estimators
Functions e(sample)
marks estimation sample
Methods and Formulas heckma_n is implemented 446-450)
provide
as an ado-file.
an introduction
Greene
Regression estimates using the nonselection maximum likelihood estimation. The regression
equation
(2000,
928-933)
to the Heckmm-a selection hazard
(Heckman
is
yj = xjO + ulj The selection
equation
is zj'7 + u2j
> 0
where
ul _ N(O, a) u2 _ N(0,
1)
1
I_'i''1 :-
cor_(_l,u_)= p
or Johnston
and DiNardo
(1997,
model. t979')
provide
starting
values
for
|
!
--
i necKman- Heclcmanselection model
2g
The log likelihood for observation j is
observed lj =
V/1-- "_
-_
a
/
-- Wj
ln(
wjln @(-zjT)
yj
yj not observed i
where _0
is the standard cumulative normal and wj is an optional weight for observation j.
In the maximum likelihood estimation, o-and p are not directly estimated. Directly estimated are In a and atanh p:
(
l+p
_
i i i
atanh p = _ ln\ _] The standard error of ,_ = #a is approximated through the propagation of error (delta) method: that is, Var(A) _ D Var{(atanh
p lncr)} D'
where D is the Jacobian of )_ with respect to at_h p and In a. The two-step estimates are computed using H_ckman's (1979) procedure. Probit estimates of the selection equation Pr(yj
observed I zj)-
_(zj")')
are obtained. From these estimates the nonselection hazard, what Heckman (t979) referred to as the inverse of the Mills' ratio, m j, for each observa¢ion 3 is computed as
¢(zjS) mj where ¢ is the normal density. We also define
Following Heckman, the two-step parameter estimates of /3 are obtained by augmenting the regression equation with the nonselection hazard m. Thus, the regressors become [X m] and we obtain the additional parameter estimate/3,a on the variable containing the nonselection hazard. A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the nonselecfion hazard. e'e +/3_ _--]j=l N N
5j
The two-step estimate of p is then _ = /3r,L c3 Heckman derived consistent estimates of the coefficient covariance matrix based on the augmented regression.
]
.......
--.o_..
.,vv,-..,t,ua|
.O_I¢_I_.I.I1JII
IIIUUI_I
Let W = [X m] and D be a square diagonal matrix of rank N with (1 _ P^2o_ j) on the diagonal elements.
Vtwostep - 2(W'W)-I(W'DW + Q)(W'W)-1 where q = _2(W'DW)Vp(W'DW) and Vp is the variance-covariance
estimate from the probit estimation of the selection equation.
References Greene. W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River. NJ: Prentice-Hall. Gronau. R. 1974. Wage comparisons: A selectivity bias. Joumat of Political Economy 82: 1119-1155. Heckman, J. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. The Annals of Economic and Social Measurement 5: 475--492. 1979. Sample selection bias as a specificationerror. Econometrica47: 153-16t. Johnston. J, and J. DiNardo. 1997. EconometricMethods. 4th ed. New York: McGraw-Hill. Lewis, H. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1119-1155. Manning, W. G., N. Duan. and W. H. Rogers. 1987, Monte Carlo evidence on the choice between sample selection and two-part models. Journal of Econometrics 35: 59-82.
Also See Complementary:
[R] adjust, [R] constraint, JR] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, JR] vce, [R] xi
Related:
[R] heckprob,
Background:
[U] [u] [U] [U]
[R] regress,
[R] tobit, [R] treatreg
16.5 Accessing coefficients and standard errors. 23 Estimation and post-estimation commands, 23.11 Obtaining robust variance estimates. 23.12 Obtaining scores
---Title heckprob -- Maximumd_kelihood probit estimation with selection
Syntax heckprob
dewar
[vartist] [,,eight]
[if
exp] [in
range],
select( [ depva,'s = ] varlists [, ,gffse_(varname) } [ robust
] )
cl__uster (vamame)
constraints by .,.
noconstant
(numlist)
s qcore(newv_rlist) first noconstant i noskip level(#) _ffset (varname) maximize_options
: may be used with heckprob; and i_eights
see [R] by.
_eights,
f_eights,
are allowed; see [U] 1_1.6
weight.
heckprob
shares the featuresof all estimationcommands;see [U] 23 Estimationand post-estimationcommands.
Syntaxforpredict predict
[type] newvarname
[if exp] [in range]
[, statistic nooffset
]
where statistic is /
pmargin
q'(xjb),
success probability (th_ default)
pll
_2(xjb,
_/.probit = 1, yj _ select zjg, p), predicted prolJability P'tyj
plO
_52(xjb,-z/g,-p),
predicted ,robability Pr(_/pr°bit = 1,_/;elect = O)
pO1
_52(-x3b,
predicted _robability P_yj
pO0
_2(-xjb,--zjg,
psel pcond
_(zjg), selection probability _52(xjb, zig, p)/_5(zjg), prob_ility
xb stdp
xyb, fitted values standard error of fitted values
xbsel
linear prediction for selection equation
stdpsel
standard error of the linear prediction for selection equation
zjg,-p),
_/_ probit
p),
predicted )robability
Pr(y
= l)
= O, yj
_ select
p.r°bit = O, y;elect
= 1) = O)
of success conditional on selection
q)() is the standard normal distribution function and q52() is the bivariate normal distribution function. These statistics are available both in and out of sample; type predict the estimation
...
if
e(sample)
...
sample.
Description heckprob
estimates maximum-likelihood probit models with sample selection. 31
if wanted only for
Options select(...) specifiesthe va_ables and optionsfor the selectionequation. It is an integral part of specifying a selection model and is not optional. robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator, robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robustis implied; see [U] 23.11 Obtaining robust variance estimates. clust;er (varname) specifies that the observations are independent across groups (clusters) but are not necessarily independent within groups, varname specifies to which group each observation belongs, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE),but not the estimated coefficients, cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() cluster()
implies robust; by itself.
that is, specifying robust
cluster()
is equivalent to typing
score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and the ancillary parameter in the model. The first new variable specified will contain ul_ -- OtnLj/O(xj/3) for each observation j in the sample, where lnLj is the 3th observation's contribution to the log likelihood. The second new variable: u2j = OlnLj/O(zj_') The third: u3j = OlnLj / O(atanh p) If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on. The jth observation's contribution to the score vector is { OtnLj/Ol30lnLj/O("/)
OlnLj/O(atanhp)}
= (UljXj
u2jzj
u3j)
The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores. first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation. noconstant omits the constant term from the equation. This option may be specified on the regression equation, the selection equation, or both. constraints (numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. noskS.p specifies that a full maximum likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time. level (#) specifies the confidence level, in percent, for confidence intervals The default is level (95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. ogfset (w_rname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.
hect(prob--:Maximum-likelih_od_pmbit estimationwith selection
33
_,
mca:imize_.options control the maximization process; see [R] maximize. With the possible exception of iterate (0) and trace, you should never ha_e to specify them.
Optionsfor predict pma.rgin, the default, calculates the univariate (ma@nal) predicted probability of success _e, probit 1). ptty j _. calculates the bivariate predicted probability P_yj
probit
=
__ probit plO calculates the bivariate predicted probability P,_y)
--
pll
i
1
- 1, _yjselect
_}_,probit
p01 calculates the bivariate predicted probability
P_yj
._. probit pO0 calculates the bivariate predicted probability P_(yj
psel
_ select , yj
=
l).
___ 0).
o select
= O, yj
-----1).
. select = O, yj
:
0).
calculates the univariate (marginal) predicted probability of selection Pr(y_ elect = l).
pcond calculates the conditional (on selection) predicted probability of success P_tYj-'" probit = l, yj-select = t)/Pr(y_ elect = 1). xb calculates the probit linear prediction xjb. stdp calculates the standard error of the prediction, it can be thought of as the standard error of the predicted expected value or mean for the obsel_'_tion's covariate patfern. This is also referred to as the standard error of the fitted value. xbsel
calculates the linear prediction for the selectibn equation.
stdpsel
calculates the standard error of the linear irediction for the selection equation.
nooffset is relevant only if you specified offset (varname) for heckprob. It modifies the calculations made by predict so that they ignore th_ offset variable; the linear prediction is treated as xj b rather than xj b + offsetj.
Remarks The probit model with sample selection (Van d_ Yen and Van Pragg 1981) assumes that there exists an underlying relationship y_ = xjj3 + ulj
latent
equation
probit
equation
such that we observe only the binary outcome , probit
yj
= (y_ > O)
The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if ffselect j
I =-
(Zjy-4:-U2j
>
0)
selection
where ul _ 2?{0, 1)
Nio,1) corr(ul,
=p
equation
i
F
F
o,_
necKproo
m MaxlmumqlKellhood
probit estimation
with
selection
When p _ 0, standard probit techniques applied to the first equation yield biased results, heckprob provides consistent, asymptotically efficient estimates for _t the parameters in such models.
> Example We use the data from Pindyck and Rubinfeld (1998). In this dataset, the variables are whether children a_end private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether one voted for an increase in prope_y taxes (vote). In this example, we alter the meaning of the data. Here we assume that we observe whether children attend private school only if the family votes for increasing the property taxes. This is not true in the dataset and we make this untrue assumption only to illustrate the use of this command. We observe whether children attend private school only if the head of household voted for an increase in property taxes. We assume that the vote is affected by the number of years in residence, the current prope_y taxes p_d. and the household income. We wish to model whether children are sent to private school based on the number of years spent in the current residence and the cu_ent prope_y taxes paid. . heckprob Fitting
private
probit
years
logptax,
sel(vote=years
Iteration
O:
log
likelihood
Iteration
I:
log
likelihood
= -18.407192
Iteration
2:
log
likelihood
= -16.1412S4
Iteration
3:
log
likelihood
= -15.953354
Iteration
4:
log
likelihood
= -15.887269
Iteration
5:
log
likelihood
= -15.883886
Iteration
6:
log
likelihood
= -15.883655
Fitting
selection
model:
O:
log likelihood
= -63.036914
Iteration
I:
log likelihood
= -58.581911
Iteration
2:
log
likelihood
= -58.497419
Iteration
3:
log
likelihood
= -58.497288
log
likelihood
= -74.380943
Comparlson: starting
values:
Iteration
O:
log
likelihood
Iteration
I:
log
likelihood
= -17.920826
Iteration
2:
log
likelihood
= -18.375362
= -40.895884
Iteration
3:
log likelihood
= -16.067451
Iteration
4:
log
likelihood
=
Iteration
5:
log
likelihood
= -15.760354
Iteration
6:
log
likelihood
= -15.753805
Iteration
7:
log
likelihood
= -15.753785
full
model
Iteration
0:
log
likelihood
= -75.010619
Iteration
I:
log
likelihood
= -74.287753
Fitting
-15.84787
Iteration
2:
log
likelihood
= -74.250148
Iteration
3:
log
likelihood
= -74.245088
Iteration
4:
log
likelihood
= -74.244973
Iteration
5:
log
likelihood
= -74.244973
Probit
Log
model
likelihood
logptax)
= -IZ.122381
Iteration
Fitting
loginc
model:
with
sample
= -74.24497
selection
(not
concave)
Number
=
95
Censored obs Uncensored obs
of obs
= =
36 59
Wald
chii(2)
=
Prob
> chi2
=
1.04 0.5935
heckprob- Maximum-likelihoodprobitestimationwith selection
Coef.
Std.
Err.
P> lzl
!z
[95_.
Conf.
35
Interval]
)
private years logptax _cons
-. 1142597
.1461711
-0 i78
0.434
-.4007498
.1722304
.3516095 -2,780664
1.01648 6.905814
0 i 35 -0_40
O. 729 0.687
-1.640655 -16.31581
2.343874 10.75448
-,0167511
.0147735
-li13
0.257
-.0457067
vote years
.0122045
loginc
.9923025
.4430003
2 i24
O.025
,1240378
I.860567
logptax _cons
-1. 278783 -.5458214
.5717544 4.070415
-2 _24 -0.13
O.025 O.893
-2.399401 -8,523689
-. 1581649 7.432046
/athrho
-.8663147
1.449995
-0.60
O.550
-3. 708253
1.975623
-. 6994969
.7405184
-. 9987982
.9622642
rho LR test
of indep,
eqns.
(rho = 0):
chi2(_)
=
0.27
Prob
> chi2
= 0.6020
J
The output shows several iteration logs. The first iteration log corresponds to running the probit model for those observations in the sample where we hav< observed the outcome. The second iteration log corresponds to running the selection probit model, _hich models whether we observe our outcome of interest. If p = 0, then the sum of the log likelihoods from these two models will equal the log likelihood of the probit model with sample selectioh; this sum is printed in the iteration log as the comparison log likelihood. The third iteration log shows starting values 'for the iterations. The finn iteration log is for estimating the full probit model with sample selection. A likelihoodratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the robust option, then this test would be presented as a Wald test instead of a likelihood-ratio test.
q
Example In the previous example, we could obtain robust standard errors by specifying the robust We also eIiminate the iteration logs using the nolog option. • heckprob Probit
private
model
Log likelihood
with
years
lo_ptax,
sample
selection
eel(vote=years
= -74.24497
loginc
lo_q_tax) nolog
robust
Number of obs Censored obs
= =
95 36
Uncensored
=
59
obs
Wald
chi2(2)
=
Prob
> chi2
=
2,55 0.2798
Kobust Coef.
Std. Err.
2
P>Iz[
[95_ Conf.
Interval]
i
private years
-.1142597
.1113949
i -1.03
0.305
-.3325896
•1040702
logptax _cons
.3516095 -2.780664
.735811 4.786602
0.48 -0.58
0.633 0.561
-I.090553 -12.16223
1.793773 6.600903
)
Vote
! i
years
-.0167511
.0173344
-0.97
0.334
-.0507258
.0172237
loginc
.9923025
.4228035
2.35
0.019
.1636229
1.820982
lo_ptax _cons
-1.2T8783 .5458214
.5095157 4.543884
-2._1 -0._2
0.012 0.904
-_.277416 -9.45167
-.280t505 8.360027
option.
!L
t_ir_
_
,=©_Rp, uu --
/athrho rho
maA..um-.Ke.nooa
-.8663147
1.630569
-. 6994969
.8327381
prODlI estimation
-0,53
Wald test of indep, eqns, (rho = 0): chi2(1) =
with
O.595
0.28
selection
-4,062171
2.329541
-, 9994077
.9812276
Prob > chi2 = 0.5952
Regardless of whether we specify the robustoption, it is clear that the outcome is not significantly different from the outcome obtained by estimating the probit and selection models separately. This is not surprising since the selection mechanism estimated was invented for the example rather than born from any economic theory.
> Example It is instructive to compare the marginal predicted probabilities with the predicted probabilities we would obtain ignoring the selection mechanism. To compare the two approaches, we will synthesize data so that we know the "true" predicted probabilities. First, we need to generate correlated error terms, which we can do using a standard Cholesky decomposition approach. For oar example, we will clear any data from memory and then generate errors that have correlation of .5 using the following commands. Note that we set the seed so that interested readers might type in these same commands and obtain the same results. clear set
seed
set
obs 5000
gen ci
12309
= invnorm(uniform())
gen c2 = invnorm(uniform()) matrix P = (1,.5\.5,1) matrix A = cholesky(P) local facl = A[2,1] local fac2 = A[2,2] gen ul = cl gen u2 = "facl"*cl + "fac2"*c2
We can check that the errors have the correct correlation using the corr command. We will also normalize the errors such that they have a standard deviation of one so that we can generate a bivariate probit model with known coefficients. We do that with the following commands. summarize ul replace ul = ul/sqrt(r(Var)) summarize u2 replace u2 = u2/sqrt(r(Var)) drop cl c2 gen xl = u_uifomn{)-.5 gen x2 = uniform()+i/3 gen yls = 0.5 gen
y2s
gen yl gen
+ 4.x1
= 3 - 3.x2 = (yls>0)
y2 = (y2s>0)
+ ul + .5*xl
+ u2
heckpmb -- Maximum-likelihood, , probit estimationwith selection
37
We have now created two dependent variables yl aM y2 that are defined by our specified coefficients. We also included error terms for each equation and ]he error terms are correlated. We run heckprob to verify that the data have been correctly _oenerate(]according to the model Yl -- .5 -}-4Xl _ ul Y2 = 3 + .5xl _ 3x2 + u2 where we assume that Yl is observed only if Y2 = J. • heckprob yl xl, sel(y2 = xl x2) nolog Probit model with sample selection
Log likelihood = -3600.854 Coef.
Std. Err.
Number of obs Censored obs Uncemsored obs
= = =
5000 1790 3210
Wald chi2(1) Prob > chi2
= =
941.68 0.0000
P>Iz]
[95_ Conf. Interval]
xl _cons
3.985923 .4852946
.1298904 .0464037
30._9 i0•_6
0.000 0.000
3.73i342 •3943449
4.240503 .5762442
xl x2 _cons
.5998148 -3.004937 3.0_1587
.0716655 .0829469 .0782817
8.37 -36•23 38.47 i
0.000 0.000 0.000
.4593531 -3.1/6751 2.858157
.7402765 -2.842364 3.165016
0.000
.4053964
.7427295
.3845569
.6307914
y2
/athrho rho
,574063
.0860559
,5183369
.062935
LR test of indep, eqns. (rho = 0):
6._7
chi2(I) =
46.58
Prob > chi2 = 0.0000
Now that we have verified that we generated data according to a known model, we can obtain and then compare predicted probabilities from the pi'obit model with sample selection and a (usual) probit model. predict pmarg (option pmargin assumed; Pr(yl=l)) probit yl xl if y2==l
(outputomitted) predict phat (option p assumed;
Pr(yl))
Using the (marginal) predicted probabilities from the probit model with sample selection (pmarg) and the predicted probabilities from the (usual) prob!t model (phat), we can also generate the "true" predicted probabilities from the synthesized yls variOble and then compare the predicted probabilities: • gen ptrue
= norm(yls)
• summarize pmarg ptrue phat Variable Obs
Mean
Std. Dev. i
Min
Max
.0658356 1.02e-06
.9933942 1
pmarg ptrue
5000 5000
.619436 .6090161
.32094_4 .34990_5
phat
5000
.6723897
.30632_)7
.096498
.9971064
i
_
38
heckprob m Maximum-likelihood
Here
we see that ignoring
the selection
probit estimation with selection
mechanism
(comparing
the phat
variable
with the true
ptrue variable) results in predicted probabilities that are much higher than the true values. Looking at the marginal predicted probabilities from the model with sample selection, however, results in more accurate
predictions.
<1
Saved Results in e():
heckprob saves Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
x_
e(k_eq)
number of equations
e (ch±2_c)
X_ for comparison test
e (k_dv) e (N_cens)
number of dependent variables number of censored observations
e (p_c) e (p)
p-value for comparison test significance of comparison test
e (df_.m) e(ll)
model degrees of freedom log likelihood
e (rho) e(ic)
p number of iterations
e(ll_O)
log likelihood, constant-only model
e (rank)
rank of
e(ll_c)
log likelihood, comparison model number of clusters
e(rank0)
rank of e(V) for constant-only model
e (cmd)
heckprob
e (opt)
e(depvar)
name(s) of dependent variable(s)
e(chi2type)
type of optimization Wald or LR; type of model x 2 test
e(title)
title in estimation output weight type
e(chi2_ct)
type of comparison X_ test
e (wtype)
e (offset)
offset
e(uexp) e(clustvar)
weight expression name of cluster variable
e(predict) e(cnslist)
program used to implement predict constraint numbers
e(vcetype)
covariance estimation method
e (user)
name of likelihood-evaluator program e (V)
variance-covariance
e (N_clust)
e (V)
Macros
Matrices e (b)
coefficient vector
e(ilog)
iteration log (up to 20 iterations)
matrix
of the estimators
Functions e(sample)
marks estimation sample
Methods and Formulas heckprob is implementedas an ado-file. Vande Venand VanPragg (1981)provide an introduction and an explanation The
probit
of this model.
equation
is
vj = (x9 + ulj > 0) The selection equation is zj'T + u2i
> 0
where
ul _ N(0, 1) uz _ N(O. 1)
corr(ul,u2)- p
heckprob-- Maximum-!ikelih0odprobit estimationwith selection
39
The log likelihood is
IES
_ti=0
+
{1- (z,-y+ offsetT)}
where S is the set of observations for which 9i is observed, (1)20 is the cumulative bivariate normal distribution function (with mean [0 0 ]'), _0 is the standard cumulative normal, and wi is an optional weight for observation i. In the maximum likelihood estimation, p is not directly estimated. Directly estimated is atanh p: i
7
From the form of the likelihood, it is clear that if p = 0, then the log likelihood for the probit model with sample selection is equal to the sum of the probit model for the outcome 9 and the selection model. A likelihood-ratio test may therefore be performed by comparing the likelihood of the full model with the sum of the log likelihoods fo_ the probit and selection models,
References Greene,W. H. 2000. EconometricAnalysis.4th ed. Upper Sa_le River, NJ: Prentice-Hall. Beckman.J. t979. Sampleselectionbias as a specificationerror. Economettica47: 153-161. Pindyek.R. and D. Rubinfeld.1998. EconometricModelsand EconomicForecasts.4th ed. New York:McGraw-Hill. Vande Ven,W. R M. M. and B. M. S. VanPragg. 1981.The demandfor deductiblesin private health insurance:A probit modelwith sample selection.JournaIof Econometric_17: 229-252.
Also See Complementary:
[R] adjust, [R] constraint, [R] l_com, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] Xi
Related:
[R] heckman, JR] probit, [R] treatreg
Background:
[u] [U] Iv] [u]
16.5 Accessing coefficients and standard errors, 23 Estimation and post-est_ation commands, 23.H Obtaining robust var] ance estimates, 23.12 Obta_iningscores
vw __'° ti"
Title he
u Obtain on-line help In
I
I
I
Syntax Windows, Macintosh, and Unix(GUI): help [ command or topic name ]
whelp[command or fopicname] Unix(console & GUI):
{help lma.n} [command or topic name ]
Description The help command displays help information on the specified command or topic. If help is not followed by a command or a topic name, a description of how to use the help system is displayed. Stata for Unix users may type help or mmamthey mean the same thing. Stata for Windows, Stata for Macintosh, and Stata for Unix(GUI) users may click on the Help menu. They may also type whelp something to display the help topic for something in Stata's Viewer. whelp typed by itself displays the table of contents for the on-line help.
Remarks See [U] 8 Stata's on-line help and search facilities for a complete description of how to use help. Q Technical Note When you type help something, Stata first looks along the S_ADOpath for something.hip; [U] 20.5 Where does Stata look for ado-files?. If nothing is found, it then looks in state.hip the topic.
Also See Complementary:
[R] search
Related:
[R] net search
Background:
[GSM]4 Help, [GSW]4 Help, [GSU] 4 Help, [U] 8 Stata's on-line help and search facilities
40
see for vl
Title i [ hetprOb - llliMaximum-l_etih°°d r 1 _ heter°skedastic "'i!_pr°bit estimati°n l llllllll I ,I I I
II I
i
i
ntax
eerar het(varlist,
[offset(varname)
c!luster(varname) nolrtest
'd [noconstant
level(#)
asis_robust
score (newvarl [newvar2:]) noskip offset (varname)
constraints
by ... : may be used with hetFob; fweights, iweights,
])
(numlist) nolqg maximize_options ] see [R] by.
and pweights are allowed; see [U] 14il.6 weight.
This command shares the features of all estimation commands;see [U] 23 Estimation and post-estimation commands.
Syntaxforpredict
i
predict[O?e]newvarname[ifexp][in r_ge] [, { p I xb [ sigma } nooffset] These statistics are available both in and out of sample; type predict the estimation sample.
...
if e(samp!e)
...
i
if wanted only for
scription hetprob
estimates a maximum-likelihoodhetero_kedasticprobit model.
See [R] logistic for a list of related estimation commands.
Options het(varlist [, of.fset(varname)]) specifies the independent variables and the offset variable, if there is one, in the variance function, her () is not optional. noconstant
suppresses the constant term (intercept}in the model.
level (#) specifiesthe confidencelevel, in percent, foi confidenceintervals. The default is level (95) or as set by set level; see [U] 23.5 Specifying !he width of confidence intervals. as is forces retention of perfect predictor variablesand their associatedperfectly predictedobservations and may produce instabilities in maximization; see [R] probit. robust specifies that the Huber/White/sandwichestinmtor of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust varianee estimates, robust combined with cluster () allows observations which are not independent within cluster (although the)' must be independent between clusters). If you specify pweights, robust is implied: see[U] 23.13 Weighted estimation, 41
i
42
hetprob -- Maximum-likelihood heteroskedastic probit estimation
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups, varname specifies to which group each observation belongs; e,g., cluster(personid) in data with repeated observations on individuals, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates, cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. but see the svyprobit command in [r<] svy estimators for a command designed especially for survey data. cluster() by itself.
implies robust;
specifying
robust
cluster()
is equivalent
to typing cluster()
score (newvarl [newvar2 ] ) creates newvarl containing uj = OlnLj/0(xj b) for each observation j in the sample. The score" vector is _ OlnLj/Ob = _ wjujxj; i.e.. the product of newvar with each covariate summed over observations. The second new variable, newvar2, contains vj = OlnLj/0(zjq,).
See [U] 23.12 Obtaining
scores.
noskip requests fitting of the constant-only model and calculation of the corresponding likelihood-ratio X 2 statistic for testing significance of the full model. By default, a Wald X 2 statistic is computed for testing the significance of the full model. offset (varname) to be 1.
specifies that varname is to be included in the model with coefficient constrained
nolrtest specifies that a Wald test of whether lnsigma2 LR test.
-- 0 should be performed
instead of the
constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. nolog
suppresses
maximize_options specify them.
the iteration log. control the maximization
process:
see [R] maximize.
You should never have to
Options for predict p, the default, calculates
the probability
of a positive outcome.
xb calculates the linear prediction. sigma
calculates the standard deviation.
noof fset is relevant only if you specified off set (varname) for het prob. It modifies the calculations made by predict so that they ignore the offset variable: the linear prediction is treated as xjb rather than xjb + offsetj.
Remarks hetprob performs maximum likelihood estimation of the heteroskedastic probit model, a generalization of the probit model. Let yj, j = 1,.., N be a binary outcome variable taking on the value 0 (failure) or 1 (success). In the probit model, the probability that yj takes on the value 1 is modeled as a nonlinear function of a linear combination of thc k independent variables xj -- (z 1j, x2_ .... , xkj): Pr(yj
- 1) - _b(xjb)
hetprob- Maximum-likelihoo_heteroskedasticprobitestimation
43
in which _0 is the cumulative distribution function (CDF) of a standard normal random variable. that is, a normally distributed (Gaussian) random varihble with mean 0 and variance 1. The linear combination of the independent variables, xjb, is cdmmonly called the index fimction or index. Heteroskedastic probit generalizes the probit model b_ generalizing _I,0 to a normal CDF with a variance no longer fixed at 1 but allowed to vary as a h_nction of the independent variables, hetprob models the variance as a multiplicative function of _ese m variables zj = (zlj, z2j,..., Zmj), following Harvey (1976): 2 i 2
={exp(zj )} :i I
Thus, the probability of success as a function of all the independent variables is Pr(yj=l
=@
xj
expzj-y Z
From this expression it is clear that, unlike the index Xjb, no constant term can be present in zj7 ]f the model is to be identifiable. i Suppose the binary outcomes yj are generated by tltresholding an unobserved random variable t_, which is normally distributed with mean xjb and varihnce 1 such that
YJ=
01
ifw_ 0
;This process gives the probit model: Pr(yj = 1) = Pr(wj
N 0) = _(xjb)
Now suppose that the unobserved wj are heteroskedasiic with variance crj2= {exp(zjb)}
2
Relaxing the homoskedastic assumption of the probit rhodel in this manner yields our muItiplicative heteroskedastic probit model:
Pr(yj
= 1)=
_{xj_/exp(zj'7)}
Z
> Example For this example, we generate simulated data for a simple heteroskedastic probit model and then estimate the coefficients using hetprob: • set obs
obs
was
O,
1000 now
1000
set
seed
1234567
gen
X = l-2*uniform()
gen
xhet
= uniform()
gen
sigma
gen
p = norm((O.3+2*x)/s±gma)
= exp(1.5*xhet)
gen
y = cond(un±form()<=p,l,O)
_,
_t;'
44
hetprob -- Maximum-likelihood heteroskedastic probit estimation
• hetprob Fitting
y x, het(xhet) comparison
model:
Iteration
O:
log
likelihood
= -688.53208
Iteration
1 :
log likelihood
= -592.23614
Iteration
2:
log likelihood
= -591.50687
Iteration
3:
log likelihood
= -591.50674
Fitting
full
model:
Iteration
O:
log
likelihood
= -591.50674
Iteration
I:
log
likelihood
= -572. 12219
Iteration
2:
log
likelihood
=
Iteration
3:
log
likelihood
= -569.48921
Iteration
4:
log
likelihood
= -569.47828
Iteration
5:
log
likelihood
= -569.47827
Heteroskedastic
probit
-570.7742
model
Number of obs Zero outcomes Nonzero
Log
likelihood
= -569.4783
y
Coef.
x
Std.
Err.
z
= =
outcomes
1000 452
=
548
Wald
chi2 (I)
=
78.66
Prob
> chi2
=
0.0000
P>Izl
[95Y, Conf.
Interval]
Y 2. 228031
.2512073
8,87
O. 000
i, 735673
2. 720388
_cons
.2493822
.0862833
2.89
O. 004
.08027
.4184943
xhet
I. 602537
.2640131
6.07
O. 000
I.085081
2. 119993
Prob
= 0.0000
insiEma2
Likelihood
ratio
test
of Insigma2=O:
chi2(1)
=
44.06
Above we created two variables, x and xhet,and then simulated
> chi2
the model
Pr(y=11=F{(80+ ,x)/exp( lxhet)} for/3o = 0.3,/31 = 2, and 71 -- 1.5. According to hetprob's output, all coefficients are significant and, as we would expect, the Wald test of the full model versus the constant-only model, e.g., the index consisting of/3o + filx versus that of just /30, is significant with X2(1) = 79. Likewise, the likelihood-ratio test of heteroskedasticity which tests the full model with heteroskedasticity against the full model without is significant with X2(1) = 44. See [R] maximize for further explanation of the output. Note that for this simple model hetprob took five iterations to converge. As stated elsewhere (Greene 2000, 829), this is a difficult model to fit and it is not uncommon for it to require many iterations or for the optimizer to print out warning and informative messages during the optimization. Slow convergence is especially common for models in which one or more of the independent variables appear in both the index and variance functions.
<1
Q Technical Note Stata interprets a value of 0 as a negative outcome (failure) and missing) as positive outcomes (successes). Thus, if your dependent and 1, 0 is interpreted as failure and 1 as success. If your dependent 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated
treats all other values (except variable takes on the values 0 variable takes on the values 0, as successes.
O
.............
S'
RObust standard errors If you [tJ] 23.11 re-estimate robust to
Maximum-likelihooq heteroskedasticprobit estimation specify the hetprob robust --option, hetprob reports robust standard errors as described 45 in Obtaining robust variance estimates. TO illustrate the effect of this option we will our coefficients using the same model and data in our example, only this time adding our hetprob command:
Example • hetprob
y x,
het(xhet)
robust
nolog
Heteroskedastic probit model
Log likelihood = -569.4783
Number of obs Zero outcomes Nonzero outcomes
= = =
I000 452 548
Wald chi2(I) Prob > chi2
= =
65.23 0.0000
Robust y
Coef.
x _cons
2. 22803 .249382 t
Std. Err.
z
P>Izl
[95_,Conf. Interval]
8.08 2.96
O,000 O.003
l. 687355 .0840853
Y .2758597 .0843367
2.768705 .4146789
i insigma2 xhet
i 1. 602537
Wald test of insigma2=O:
i
.2671326
6. O0 chi2(1) =
O.000 35.99
1.078967
2. 126107
Prob > chi2 = 0.0000
The robust standard errors for two of the three parameters are larger than the previously reported conventional standard errors. This is to be expected, even though (by construction) we have perfect model specification, since this option trades off efficient estimation of the coefficient variancecovariance matrix for robustness against misspecificat_on. 4 Specifying the cluster() option relaxes the usual assumption of independence between observations to the weaker assumption of independence jusi between clusters, that is, hetprob, robust cluster() is robust with respect to within-cluster coffelation, There is a cost in terms of efficiency with this option, since in this case hetprob inefficiefitly sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation (as do the xtgee population-averaged models).
Obtaining predicted values
]
Once you have estimated a model, you can use the predict command to obtain the predicted probabilities for both the estimation sample and other samples; see [U] 23 Estimation and postestimation commands and [R] predict, predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the index function combination xjb, where x7 are the independent variables in the jth observation and b is the estimated parameter vector. With the sigma option, predict calculates the predicted standard deviation oj = exp(zj2().
_;:,,,-
46
hetprob -- Maximum-likelihood
heteroskedastic
probit estimation
> Example We use predict to corapute the predicted model in order to compare these with the actual
probabilities and standard
deviations
based
on our
values:
predict phat (optionp assumed; Pr(y)) gen diff_p = phat - p • summarize diff_p Variable I
Obs
Mean
diff_p ]
1000
-.0107081
Std. Dev,
Min
Max
.0131869 -.0466331
.010482
• predict sigmahat, sigma gen diff_s = sigmahat - sigma . summarize diff s Variable
Ohs
Mean
Std. Dev.
diff_s
i000
.1558882
.1363698
Min
Max
,0000417
.4819107
Saved Results hetprob
saves
in e():
Scalars e (N)
number of observations
e (re)
return code
e (k)
number of variables
e (chi2)
2
e(k_eq)
number of equations
e(chi2_c)
X 2 for heteroskedasticity LR test
e(k_dv)
number of dependent variables
e (p_c)
p-value for heteroskedasticity LR test
e(N..f) e(N_s)
number of zero outcomes number of nonzero outcomes
e(df..m_c)
degrees of freedom for heteroskedasticity LR test
e(df..m) e (11)
model degrees of freedom log likelihood
e(p) e (ic)
significance number of iterations
e(ll_O) e(ll_c) e(N_clust)
log likelihood, constant-only model log likelihood, comparison model number of clusters
e(rank) e(rankO)
rank of e(V) rank of e(V) for constant-only model
e (cmd)
hetprob
e (opt)
type of optimization
e(depvar) e(title)
name of dependent variable title in estimation output
e(chi2type) e(chi2_ct)
Watd or LR; type of model x 2 test Wald or LR; type of model x 2 test
e(clustvar)
name of cluster variable
e(method) e(vcetype) e(user)
requested estimation method covariance estimation method name of iike]ihood-evaluator program
e(offset#) e(predict) e(cnslist)
offset for equation # program used to implement predict constraint numbers
e(b)
coefficient vector
e(V)
variance-covariance
e(ilog)
iteration log (up to 20 iterations)
Macros
corresponding to e(chi2_c)
Matrices
Functions e(sample)
marks estimation sample
the estimators
matrix of
I i i
hetprob--Maximum-likelih¢od heteroskedasticprobit estimation
47
i
Methodsand Formulas ?
hetprobis implemented as an ado-file. !
The heteroskedastic probit model is a generalizaiion of the probit model since it allows the scale
t
variables. of the inverse link function to vary from observation to observation as a function of the independent The log-likelihood function for the heteroskedas|ic probit model is
lnL = _ wj In_;{xjb/exp(zT)}+ }-'_wj In [1- _{xjb/exp(zT)}] jeS
jffS
where S is the set of all observations j such that yj -¢ 0 and is maximized as described in IN] maximize.
Wj
denotes the optional weights. In L
References Greene, W. H. 2000. Econometric Harvey. A. 1976. Estimating
Analysis. 4th ed. Upper _addle River. NJ: Prentice-Hall. i regression models with multiplicative heteroscedasticity. Ecooometrica
44: 461-465,
AlsoSee Complementary:
[R] adjust, [R] constraint, [R! lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xi
Related:
[R] biprobit, [R] ciogit, [R] casum, [R] glm, [R] glogit, JR] logistic, [R] logit, [R] mlogit, [R] olog!t, [R] probit, [R] scobit, [R] xtprobit
Background:
[U] 16.5 Accessing coefficienb and standard errors, [U] 23 Estimation and post-_timation commands, [u] 23.11 Obtaining robust variance estimates, [u] 23.12 Obtaining scores, [R] maximize
i
....... "_'"',F J
Title hilite -- Highlight a subset of points in a two-way scatterplot
Syntax hilito
yvar xvar [ifexp] [in range], hilite(exp2)
[ graph_options ]
Description The hilitecommand draws a two-way scatterplot highlighting the observations selected by exp2.
Options hilite(exp2) is not optional. It specifies an expression identifying the observations to be highlighted. graph_options are any of the options allowed with graph, twoway; see [G] graph options.
Remarks > Example You have data on 956 U.S. cities, including average January temperature, average July temperature, and region. The region variable is coded 1, 2, 3, and 4, with 4 standing for the West. You wish to make a graph showing the relationship between January and July temperatures, highlighting the fourth region: . hilite tempjan tempjuly, hilite(region==4)
region==4
ylabel xlabel
highlighted
80=
= =
©
60"
'
_ :,"'.3 =_ 4e
._.¢'""
C
•
g (g _'
",
20
io_ a •
.....
g=*
" .
i"_r
• J _
". ,tNi'_': " : "; _'.2"_.: , "-.:
=
,:
4
.; .
e
0
1 Average
July
48
Temperalure
,, :_
hilite -- Highlighta subset of pointsin a two-way scatterplot It is possible to use graph
to product graphs like ffiis, but hilite is often more convenient.
49
q
[3Technical Note By default, hilite uses'.' for the plotting Lvmbbl and additionally highlights using the o symbol. Its default is equivalent to specifying sTabol(.o)as one of the graph_options. You can vary the symbols used, but you must specify exactly two symbols. The first is used to plot all the data and the second is used for overplotting the highlighted Subset.
Methodsand Formulas hilite is implemented as an ado-file.
References Weesie,J. 1999.dr38: Enhancementto the hilite command.Stata TechnicalBulletin 50: 17-20. Reprintedin Stara TechnicalBulletin Reprints,vol. 9, pp. 98-101.
AlsoSee Related:
[R] separate
Background:
Stata Graphics Manual
'"::"
Title I
hist-
Categorical
variable histogram
[
II
Syntax hist
varna,ne [weight]
[if exp] [in range]
[. i._ncr(#) graph_options
]
fweights are allowed; see [U] 14.1.6 weight.
Description hist graphs a histogram of varname, the result being quite similar to graph hist, however, is intended for use with integer-coded categorical variables. hist determines the number of bins automatically, labels are centered below the corresponding bar.
the z-axis
hist may only be used with categorical variables maximum(varname) - minimum(varname) < 50.
with
varname,
is automatically a range
of less
histogram.
labeled, and the than
50;
i.e.,
Options incr(#) specifies how the z-axis is to be labeled, incr(1), the default if varname reflects 25 or fewer categories, labels the minimum, minimum + 1, minimum 4- 2..... maximum, incr (2), the default if there are more than 25 categories, would label the minimum, minimum + 2, ..., etc. graph_options xlabel(), saving().
refers to any of the options allowed with graph's histogram style excluding bin (), and xscale(). These do include, for instance, freq, ylabel(), by(), total, and See [G] histogram.
Remarks b, Example You have a categorical variable rep78 reflecting the repair records of automobiles. 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, and 5 - Excellent. You could type
(Continued
on next page)
5O
It is coded
h_t -- Categoricalvariablehistogram
51
• graph rep78, histogram bin(5)
to obtain a histogram. Youshould specie, bin(5) because your categorical variable takes on 5 values and you want one bar per value. (You could omit the option in this case, but only because the default value of bin() is 5; if you had 4 or 6 bars, you would have to specify it; see [G]histogram.) In any case, the resulting graph, while technically correct, ii aesthetically displdasing because the numeric code 1 is on the left edge of the first bar while the numeric code 5 is on the fight edge of the last bar. Using hist; is better: • hist rep78
434783
0
Repair
Record
1878
not only centers the numeric codes underneath the corresponding bar, it also automatically labels all the bars. hist
You are cautioned: hist is not a general replacement for graph, histogram, hist is intended for use with categorical data only, which is to say, floncontinuousdata. If you wanted a histogram of automobile
prices,
for instance,
you
would
still want
_o use the graph,
histogram
command.
;:
,_r,.
52
hist -- Categorical variable histogram
I> Example You may use any of Research and Survevs on election day data in Lipset (1993)--you • hist
candi
the options you would with graph, histogram. Using data collected by Voter based on questionnaires completed by 15,490 voters from 300 polling places originally printed in the New York Times• November 5. 1992 and reprinted draw the following graph:
[freq=pop],
by(inc) total ylab ytine noaxis title (Exit Polling By Family Income)
$50-75k
$75k+
6
= o o It.
.6
o,,_' .... L, ' Candidale
Exit Polling
voted
for,
1992
by Family
Income
[] Technical Note In both of these examples, each bar is labeled; if vour categorical you may not want to label them all. Typing
variable takes on many values,
hist myvar, incr(2)
would labeleveryotherbar.Specifying incr(3) would labeleverythirdbar,and so on,
Methods and Formulas hist is implemented
as an ado-file.
References Lipset, S. M. ]993. The significance
of the I992
Also See Related:
[R] spikeplot, [G] histogram
Background:
Stata Graphics Manual
election.
Political
Science
and Politic,_ 26_1_: 7-16,
Title hotel -- Ho_lling's
generalized means test
Syntax hotel varlist [weigh_ [iJ exp] [in range] [, by(varname)notable aweights
and fweights
]
are allow _d: see [U] 14,1.6 weight
DescriptiOn hotelperforms Hotelling's T-squared test for testing whether a set of means is zero or, alternatively, equal between two groups
i
Options by(varname) specifies a var able identifying two groups; the test of equality of means between groups is performed. If by '.) is not specified, a test of means being jointly zero is performed. i
t
notablesuppresses printing
table of the means being compared.
Remarks hotel performs Hotelling's T-squared test of whether a set of means is zero, or two sets of means are equal. It is a multivariate est that reduces to a standard t test if only one variable is specified.
I, t i I
_ Example You wi_h to _est whether a new fuel additive improves gas mileage in both stop-and-go and highway situatiotls. Taking tw_lye cars, you fill them with gas and run them on a highway-style track, recordingtheir gas mileage. Y_)uthen refill them and run them on a stop-and-go style track. Finally, you repeat the two runs but this time use fuel with the additive. Your dataset:is . describe Contains d_ta from gasexp.dta obS : 12 vats : size:
i
variable n_me
i ! i
id bmpgl ampgl bmpg2 ampg2
_
5 288 (c _.9% of memory free) storage type float float float float float;
lisplay _ormat
value label
/,9. Og /,9.Og /,9.0g /,9.0g /,9.0g
13 Jul
2000
13:22
variable label car id trackl before additiv_ trackl after additive_ track 2 before additive track 2 after additive
Sortdd by :
53
r'_
54
hotel -- Hoteiling's T-squared generalized means test [
To perfor_ zero:
the statistical test, you jointly test whether the differences
• g_n |
diffl
= ampgl
- bmpgl
g_n dill2 = ampg2 | hgtel diffl dill2
bmpg2
Variable
0bs
diffl dill2
12 12
I
Mean
Std.
1.75 2.083333
Dev.
2.70101 2.906367
1-g_oup Hotelling's T-squared = 9.6980676 F t, st statistic: ((12-2)/(12-1)(2)) x 9.6980676 H0:
Vector
of means is equal ¢o a vector F(2,10) = 4.4082
Prob
The meat
> F(2,10)
=
in before-and-after
Min
Max
-3 -3.5
5.5
results are
5
= 4.4082126
of zeros
0.0424
are different at the 4.24% significance level.
[] Technical _ote We used Hotelling's T-squared test because we were testing two differences jointly. Had there been onlvlone difference, we could have used a standard t test which would have yielded the same results as_totelling's''
test:
* W_ could have performed ampgl = bmpgl
the
test
like
this:
t test
Vari
Obs
Mean 22.75
12
• tt_st
20.68449
24.81551
2.730301
19.26525
22.73475
12
1.75
•7797144
2.70101
.0338602
3.46614
mean(ampgl
- bmpgl)
Ha:
= mean(diff)
mean(dill) ~= 0 t = 2,2444
P >
Itl =
mean(dill) > 0 _ = 2.2444
P > t =
0,0232
this: = 0
t test Std.
dlffl
12
1.75
.7797144
of freedom:
Ha: mean < 0 t = 2.2444 0.9768 this:
Err.
Std.
Dev.
2.70101
[95_
Conf,
.0338602
Interval] 3.46614
11 He:
Or like
= 0 Ha:
0.0463
Mean
t =
Interval]
3.250874
Obs
PI<
Conf,
.7881701
Variable
Degrees
[95_
.9384465
0,9768
dill1
One-_ample [
Dev.
21
Ha_ mean(dill) < 0 t = 2.2444
* Or like
Std.
Err.
12
Ho:
P < t =
Std.
meam(diffl) Ha:
P >
mean t =
Itl =
= 0
-= 0 2.2444 0.0463
Ha: mean > 0 t = 2.2444 P > t =
0.0232
""---
hotel-- Hotellin_sT-squaredgeneralizedmeanstest
55
. _otel dill1 i
Variable
i
[
0
Mean
diffl
Std.
1.75
Dev.
Min
Max
2.70101
-3
5
1-Sroup _otelling's T squared = 5.0373832 F _est statistic: ((i!2-I)/(12-I)(I))x 5.0373832 = 5.0373832
I
HO{ Vecter of means i 3 equal to a vector of zeros F(I,II) = 5.0374 Prob > F(I,II) = 0.0463
> Example Now"donsider a variation _n the experiment: rather than using 12 cars and running each car with and without the fuel additiv, you run 24 cars, 12 with the additive and 12 without. You have the following!dataset: . d_scribe
! i
I i
I
Contains data from ga: _xp2.dta o_s: 24 vats: 4 size: 480 97.4_ of memory free)
8 Sep 2000 12:19
[
! storage variable name type
display format
id mpgl mpg2 additive
_9.Og Z9.0g _9.0g _9.0g
float float float float
value label
Variable label
yesno
car id _rack 1 track 2 additive?
Sorted by: • tab additive additive?
Fr_q.
Percent
_um. i
no
12
50.:00
50.00
yes
12
50.00
100.00
Total
24
I00,00
jr
I
i
This is an_unpaired expefime_t because there is no natural pairing of the cars; we want to test that the rneanslof mpgl are equal for the two groups specified by additive as are the means of mpg2:
{
(Continued on next page)
_
=_
60
d9 -- ICD-9-CM diagnostic and procedure codes
] !
t ICD-9 codes and, if it does, icd9_] clean modifies the _ariable to contain the codes in either of two standard formats. Use of icd9[p]_clean is optional: all icd9[p] commands work equally well with cleaned or uncleaned codes. 'I_e_e are numerous ways of writing the same ICD-9 code, and icd9[p] clean is designed (1) to ensure c insistency, and (2) to make subsequent output look better. icd9[p] uncleaned) icd9[p] ge code. icd9 ICD-9 code
generate produces new variables based on existing variables containing (cleaned or ICD-9 codes, icd9[p] generate, main produces newvar containing the main code. aerate, description produces newvar containing a textual description of the ICD-9 p] generate, range() produces numeric newvar containing I if varname records an in the range listed and 0 otherwise.
icd9_p] lookup and icd9[p] search are utility routines that are useful interactively, icd9[p] lookup sin ply displays descriptions of the codes specified on the command line, so to find out what diagnostic _:913.1 means, you can type icd9 lookup e913.1. The data you have in memory are . I lrrelevant-_and remain unchanged when using icd9[p] lookup, icd9[p] search is like icd9[p] lookup ex!:ept that it turns the problem around; icd9[p] search looks for relevant ICD-9 codes from the dd_cription given on the command line. For instance, you could type icd9 search liver or icd9p
s._arch
liver
to obtain a list of codes containing the word "liver".
icd9[p] query displays the identity of the source from which the leD-9 codes were obtained and the textual _escription that icdg[p] uses. Note that! ICD-9 codes are commonly written in two bays,, with and without periods. For instance, with diagnostic codes, you can write 001, 86221. E8008. and V822, or you can write 001., 862.21, E800.8, and V82.2. With procedure codes, you can write 01, 50. 502. and 502l, or 0l., 50., 50.2. and 50.21. _he icd9[p] command does not care which syntax you use or even whether you are consistent. _ase also is irrelevant: v822, v82.2, v822. and v82.2 are all equivalent. Codes may be recorded w_h or without leading and trailing blanks.
Optionsfor use with icd9[p]check any tells ic the code, 230.52 option, conside list
59[p] check to verify that the codes fit the format of leD-9 codes but not to check whether are actually valid. This makes icd9_p] check run faster. For instance, diagnostic code _r 23052 if you prefer) looks valid, but there is no such ICD-9 code. Without the any 30.52 (or 23052) would be flagged as an error. With any. 230.52 (and 23052) is not d an error.
tells cd9[p] chock that invalid codes were found in the data. 1. 1.1.1. and perhaps 230.52 if any is n )t specified, are to be individually listed.
genaratet
ewvar) specifies that icd9[p]
check
is to create new variable newvar containing,
for
each observation, 0 if the code is valid and a number from 1 to I0 otherwise. The positive numberslindicate the kind of problem and correspond to the listing produced by icd9[p] check. For instance. 10 means the code could be valid, but it turns out not to be on the official list.
Options for use with icd9[p] clean dots specifies whether periods are to be included in the final format. Do you want the diagnostic codes recorded, for instance, as 86221 or 862.21? Without the dots option, the 86221 format would b_ used. With the dots option, the 862.21 format would be used. pad specifids that the codes are to be padded with spaces, front and back. to make the codes line up vertically.: in listings. Specifying pad makes the resulting codes look better when used with most other Stata commands.
i
icd9 -- ICD-9-CM diagnoSticand procedure codes
Options fOr i
61
with icd )[p]generate
main,descrip}ion,and ra _ge(icd9rangelist) specify what icd9[p] generate is to calculate. In all cases, varname specifies variable containing ICD,9 codes. main specifies ihat the malt .'ode is to be extracted from the IED-9 code. For procedure codes, the
i
i i
main code i_ the first tw_ characters. For diagnostic codes, the main Code is usually the first three or four characters (tlie characters before the dot if the code has dots) In any case, icdg[p] generate &Ses not care _hether the code is padded With blanks in front or how strangely it might be written; :icd9[p] gene rate will find the main code and extract it. The resulting variable is itself an ICD-Ocode and may be used with the other icd9[p] subcommands. This includes icd9[p] generate, ilain.
i ! i i_
descriptiondreates newva, containing descriptions of the ICD-9 codes.
I i{ i I
long is for _e with desc: iption.It specifies thai the new variable, in addition to containing the text describing the code, is to contain the code, too. Without long, newvar in an observation might contain "bro1_chus injury-closed". With long, it would contain " 862.21 t_ronchus injury-closed". end modifie_ long (speci _'ying end implies long) and places the code at the end of the string: "bronchus injury-closed 8I 2.21".
! i
i
range(icd9ran_elist) allows you to create indicator variables equal to l when the ICD-9 code is in the inclusive! range specifi _d.
Optionsfor usewith icd )[p]search
!
i
or specifies thai ICD-9 codes are to be searched for entries that contain any of the words specified after icd9[p I search,Th, default is to list only entries that contain all the words specified.
i
Remarks
I
code is
Let us begin _withdiagnost
codes--the
codes icd9 processes. The format of an ICD-9 diagnostic
[blanks { O-9,V,v} {0-9} {0-9} [. ] _0-9 r [0-9] jl [blanks] or
I
i i ; .
i
[blanks! E.e } { 0-9 } { 0-9 } {0-9 }[. ][0-9 [0-91 ] [blanl_s1 icd9 can dell with ICD-9 tiagnOstic codes written in any of the ways the above allows. Items in square brackets tare optional, the code might start with some number of blanks. Braces { } indicate required items. _he code the_ either has a digit from 0 to 9 the letter V (utmercase or lowercase) (first hne), or thei letter E (upl_ercase or lowercase ' '_.. second line).' After that, it has--two or mo re d_'g"_ts s, perhaps followed b a enod and th v u to tw e dvm ha s tollowed b ' more b' " i ! Y P " _t en it may ha "e p omor "_'siper p. " y
,anks .
l
_;;::
62
icd9 --
ICD-9-CM diagnostic and procedure codes
All of the following
meet the above
definition:
00: ool
')ol 001,9 O019 862 ._2 862,22 E80_). 2 e8092 V82|
Meeting t_ae above the above[definition, Examl_les|
definition of which
of currently
does not make the code valid. 15,186
defined
are currently
diagnostic
codes
There
are 233.100
possible
codes meeting
defined include
l
code
I
t i
I
description
001 001.0
cholera* cholera d/t vib cholerae
001.9 001.1
cholera cholera nos d/t vib el tot
999
complic medical care nec*
VOl
communicable dis contact*
V01. i VOI. 701.20 VOI. 3 VOl. 4 VOI. 5 VO1.6 VOl. 7 YO1.8 V01.9
tuberculosis contact cholera contactcontact poliomyelitis smallpox contact rubella contact rabies contact venereal dis contact viral dis contact nec communic dis contact nec communic dis contact nos
. . .
E800 E800.0 E800.1 E800.2 E800.3 E800.8 E800.9
rr rr rr rr rr rr rr
collision nos* collision nos-employ coll nos-passenger coll nos-pedestrian coll nos-ped cyclist coil nos-person nec coil nos-person nos
o , o
"Main ,eodes" refer to the part of the code to the left of the period. v82, and !_800 ..... E999 are main codes. There are 1.182 diagnostic
001,002 ..... main codes.
999. v0] .....
The m'Ain code corresponding to a detailed code can be obtained by taking the part of the code to the left lof the period, except for codes beginning with 176. 764. 765. v29. and v69. Those main codes are not defined, yet there are more detailed codes under them:
icd9 -- ICD-9-CM diagnostic and procedure codes I
cdde
d,:scription
[
1}'6 if(6.0 176,1
CDDE DOES NOT EXIST, but $ codes starting with 176 do exist: sl in - ka_si's sarcoma sl tisue - kpsi's srcma
764 754.0 754. O0
C )DE DOES NOT EXIST, but 44 codes starting with 7i54 do exist: It for-dates w/o let real* li :ht-for-dates winos
63
i
.5.
755 7_5.0 7_5. O0
C )DE DOES NOT EXIST, but 22 codes starting with %5 do exist: extreme immaturity* e_treme immatur wtnos
I
V_9 V:_9.0
O )DES DOES NOT EXIST, but 6 codes stating with V29 do exist: nt obsrv suspct infect
i
V_9.1
nt obsrv
I
V69 V_9.0 V619.1
O )DE DOES NOT EXIST, but 6 codes starting with V69 do exist: la k of physical exercse inirpprt diet eat habits
suspct
neurlgcl
"'_"
!
Our solution iis to define f)ur new codes:
t ! i
!
code
description
176 764 765 729 g69
kaposi's sarcoma (Stata)* light-for-dates (Stata)* immat & preterm (Stata)* nb suspct cnd (St,am)* lifestyle (stata)*
Thus, there are 15,186 + 5 = 15,191 diagnostic
i
Things are less confusing format of I CD-9iprocedure
!
I I
I
'
codes, of which
_'ith respect to procedure
codes_the
1,181 + 5 = 1,186 are main codes. codes processed
by icd9p.
The
co :les is [banks]
{0-9}{0-9}
[. ] [0-9 [0-9]]
[blanks]
Thus, there are i0,000 possil: e procedure codes, of which 4,275 are currently valid. The first two digits represent _e main co& of which 100 are feasible and 98 are currently used (00 and 17 are not used).
Descriptions The degcriptidns given for each of the codes is as found in the original source. The procedure codes • contain' the addition of fve new codes b_, us. An asterisk on the end of a description n_ d_cate_ "" _
i_ I
that the c°trespoiding ICD-9 tiagnostic code has subcategories. icd9[pJi quer_ reports thebriginal source of the information
1
on the codes:
r F J
64
icd9 -- ICD-9-CM diagnostic and procedure codes
• _cd9
query
_dta: I
i 1
Dataset from http://www.hcfa.gov/stats/pufiles.htm obtained 24aug1999 file http://www.hcfa,gov/stats/icdgv16.exe Codes 176, 764, 765, V29, and V69 defined
I
-- 765 176 kaposi's immat _ preterm sarcoma (Stata)* (Stata)* V29 nb suspct end (Stata)* V69 lifestyle (Stata)* cd9
query
J _d_a: Dataset obtained 24aug1999 • from http://www.hcfa.gov/stats/pufiles.htm file http://www.hcfa.gov/stats/icd9vl6.exe
> Example You t_ve a dataset containing up to three diagnostic codes and up to two procedures on a sample of 1,000 Ipatients: _se patients, _ist in 1/10 7.
patid I
I:
I_.
clear diagl 65450
diag2
3 2
710.02 23v.6
5 6 7 8 9
861.01 38601 705 v53.32 20200
procl 9383
proc2
37456
8383
17
2969
9337 7309 7878 0479
8385 951
i0
464.11
7548
diag3
E8247
20197
!
4641
Do not tD, to make sense of these data because, in constructing procedure codes were randomly selected.
this example,
the diagnostic
and
- _,-_Be_inlbvnoting that variable diagl is recorded sloppily--sometimes the dot notation is used and sometimes not, and sometimes there are leading blanks. That does not matter. We decide to begin by using icd9 cd9
clean clean
to clean up this variable:
diagl
di_gl contains invalid ICD-9 codes r (459) ;
icd9 clean refused because there are invalid codes among the 1.000 observations, check to find and flag the problem observations (or observation, as in this easel:
We can use icd9 :
i-|
!
[
) ) i
i
_....
icd9-_-ICD-9-CMdiagnostic and proce
. icd9 check diagl, gen(prob) diagllcontains i invalid ¢odes: I. Invalid placemer_ of period 2_ Too)many periods
t
0 0
I
3, 4_ 5i
Cod_ too short Cod# too long Invalid let chaz (not 0-9, E, or V)
0 0 0
i
6_
Invalid 2nd chax (not 0-9)
0
I
81 7_ 9.
Invalid 4th chat (not 0-9) Invalid Invalid 3rd 5th chat chat (not (not 0-9) 0-9)
0 0i
I
10,
Cod_ not defined
0
,ot.
i
i
. list pati_ diagl prob Lf prob
I
[
2.
patid 2
diagl 23v. 6
65
prob 7
Let's assume that _ve go back t_ the patient records and determine that this should have been coded 230.6: • replace d_agl = "230.6 (i re_l change made) . drop prob
if patid==2
We now tD,_againlto clean up t e formatting of the variable: • icd9 cleam diagl (643 dhange_ made) • lis_ in 1/10
_;
1, 2. 3. 4.
patid 1 2 3 4
diagl 65450 2306 71002 1026
diag2
I
5.
5
86101
6 7 8 9
38601 705 V5332 20200
2969
I )
6. 7. 8. 9.
[
10.
10
46411
20197
diag3
37456
procl 9383 8383
proc2 17
629
7548
E8247
9337 7309 7878 0479
8385 951
4641
)
Perhaps we prefer!the dot notati )n. icd9 clean can be used again on diagl, and now we will clean
l
Up diag2
and diag3:
• ted9 clea_ diagl, dots (936 _he/Ige_made) • icd9 clean diag2, dot_ (551 Changes made) • icd9 clea_ diag3, dote (i00 Changes made)
i
)
(Continued on next page)
_"
! i_d9 -- ICD-9-CM diagnostic and procedure codes
66
• lit
in
1
|
!
1/10
1
diagl 654.50
diag2
i.
patid
diag3
procl 9383
proc2
2.
2
230.6
374.56
3.
3
V10.02
8383
17
4.
4
102.6
5.
5
861.01
6.
6
386.01
7.
7
705
7309
8385
8.
8
V53.32
7878
951
9.
9
202.00
754.8
10.
10
464.11
201.97
629
We now turn to cleaning codes:
296.9
9337
E824.7
0479 4641
the procedure
codes.
We use icd9p
diag3
procl 93.83 83.83
(emphasis
on the p) to clean
these
l
. iccl9P clean procl, (816|changes made) |
dots
. ic_9p clean proc2, (140|changes made) !
dots
• li_t
in
1/10
I. i 2.
patid 1 2
diagl 654.50 230.6
diag2
3.
3
V10.02
4.
4
102.6
5.
5
861.01
6.
6
386.01
7.
7
705
73,09
83.85
8.
8
V53.32
78.78
95.1
9.
9
202.00
754.8
10.
10
464.11
201.97
374.56
296.9
icdl p check
E824.7
04.79 46.41
rules
clean
and icdgp
for the code:
clean
il does
not check
valid
ICD-9
procedure
codes;
168 missing
the code
is itself
proc2
contains
invalid
codes:
Invalid
2 3
Too many Code too
4 5
Code too long Invalid 1st char
(not
0-9)
0 0
6
Invalid
2nd
char
(not
0-9)
0
7
Invalid
3rd
char
(not
0-9)
0
8
Invalid
4th
char
(not
0-9)
0
Code
that
that the variable
values)
1
10
only verify
procl
contains
icd p check proc2
93,37
that both icd9
cleaned follows the construction icd_[p] check does that:
(proc
17
62.9
It is imDDKant to understand being valid.
proc2
not
placement
of period
0
periods short
0 0
1
defined
1
Total
Note that d_ag2 has an invalid generate( code. We;ould did above _ith icd9 check, .
find it using
icdg_] han create codes. For _nstance.
textual
new variables
containing
icd9p
descriptions
check,
generate(), just as we
of our diagnostic
and procedure
_
icd9 -- ICD-9-cM
i
diagnostic
and
proc_lure codes
67
• icd9 gen!tdl = diagl, desc . sort pared • list pared diagl tdl m 1/10 1. 2. 3. 4. 5 6
pa_id 1 2 3 4 5 6
diagl 654.50 230.6 VlO.02 102.6 861.01 386.01
tdl cerv incompet preg-unsp ca in situlanus nos hx-oral/ph_aTnxmalg nec yaws of bo_e _ joint heart contUsion-closed meniere dis co¢hlvestib
7 8 9
705 V53.32 202.00
disorders of sweat gland* ftng autmt¢ dfibrilla_or ndlr lymunsp xtrndl ors
7 8 9
I !
I
10 iCd9_] I0 464.11 ac tracheitis w obstruct Notethat _enerate, escription does notpreserve thesort order ofthedata(andneither does icdg[p] cheek unless you specify the any option),
Recall that pro6edure-codep:ioc2 had an invalid codel Even so, icdgp generate, is willing to_create a textual de: cription variable:
descript
. iedgp gen!tp2 : proc2, desc (I nonmissidg value inva!_idand so could not be labeled) . sortlpatid listipatid proc2 tp2 i:I1/10 i
pared
proc2
i
I.
' _:
2. 3,
i2 17 _3 5
i
5. 6. 7. 8.
_i
83.85 95.1
Y B
9.
itp2
musc/tend Img change nec form _ structur eye exam*
D
lo.
io
tp2 contains nothing when F "oc2 is 17 because 17 is not a valid procedure code. icdg[p] g*ner_te
can also reate variables containing main codes:
. icdg!gen main1 : diagl
I
main
listlpatid!diagl pati_ dinE1mainl [n I/I0 mainl 1. 1 654.50 654
I
3.
3 vtoo2 2 4
230.6 102.6
rio
5, 6,
_
861.01 386.01
861 386
7. 8.
7 8
705 V53.32
705 V53
10. 9.
109
464.11 202.00
464 202
2. 4. i
i
icdgp generate, :
230 102
_ain can sit ilarlygenerate main procedure codes. i
Sometimeslone i_ merely exa_fining an observation: • list
dins*
_f patid==56_
ion
_-
68
icd9-
ICD-9-CM diagnostic and procedure codes
I ...........
diagl 56 I.
diag2
diag3
526.4
If we woladered what 526.4 was, we could type ! . i_d9 |
lookup
1 m_tch
found:
526.4
icd9[p]
526.4
inflammation of jaw
]_ookup has the ability to list ranges of codes: I • i_d9
lookup
526/527
12 _atches found: 526 jaw diseases* 526.0 devel odontogenic cysts 526.1 fissural cysts of jaw 526.2 cysts of jaws nec 526.3 cent giant cell granulom 526.4 infl_mmation of jaw 526,5 alveolitis of jaw 526.8 other jaw diseases* 526.81 exostosis of jaw 526.89 [526.9 527
icd9[p]
st_arch /
jaw disease nec jaw disease nos salivary gland diseases*
has the ability to go from description to code:
• i_d9 search jaw
disease
|
4 m_tches found: |526 jaw diseases* 1526.8 other jaw diseases* 526.89 jaw disease nec
q 526.9
jaw disease nos
I
Saved Results icd9
ahd icd9p save in r()" Scalars r(e#) r(esum)
number of errors of type # total number
of errors
References Gould, W. 2p00. din76:ICD-9 diagnostic and procedure Technica I Bulletin Reprints. vol. 9. pp. 77-87. t l
codes.
Slate
Technical
Bulletir_ 54: 8-16.
Reprinted
in State
impute--- Predictmissingvalues
i
F
Syntax imputedepvdr
varlist [w_'ght]
Iif
exp] [in range],
g_enerate(newva,'l)
[ _ rp(nevar2)] aweights
and fweiShts
are allow_
see [V] 14.L6 weight.
Description impute mils in
missing vail _s; depvar is the variable whose missing values are to he imputed. vartist is tile list of variables m which the imputations are to be based and newvarl is the new '
i
i
variable to contain the imputations.
!
conducted dfficientty: this nece sitates a liner of 31 variables in varlist.
impute organi_zesthe casesby pa_ternsof missingdata so the missing-valheregressionscan be
Options generate(t/ewva'r
i [
I i
1) specifies he name of the new' variable to be created and is not optional.
varp(newvar2) specifies the n_me of a new variable to contain the variance (not the standard error'_ of the pr_edicti_n.
Remarks
i
In observation s where depva F is not equal to missing; newvarl is set equal to depvar and newvar2 (if specified) is set to zero. Whlere depvar is missing, neuvarl is imputed using the prediction from
i ! !
the best available subset of othelrwise present data. r_ewVar2(if specified) is setto the variance of the prediction. This variance is in tl_esense of predicts stdp option, although squared: see [R] predict. It is an estimate df how far thel prediction of the mean would differ from the actual data point were
l
_t known. This is not the only method or coping with missing data, but it is often much better than deleting cases with any migsing data, wl_ich is the default. For a discussion of different methods of imputation. see, for example, Little and Rubin (1987).
! i
Example imputemay be used in conjunction with, for instance, regression (or an5' estimation technique) to avoid the lo_s of an unacceptal:le number of cases due to missing data. Bear in mind, however, lhat the subsequent estimates may b : biased because any variable imputed by impute is only an estimate of the unknown, true value. In he case of linear regression, a reasonable bound (in fractional terms) for the bias!is gix,!enby the ratio of the mean of newvar2 to the variance of newvarl. Usualh'. the bias is toward zerO, meaning tl-at the effect of the variable will be underestimated. 69
.....
,v
,,,,_ut_
u
rr_olc(
missing
values
You have been hired by the restaurant industry to study expenditures on eating and drinking. You, have &ta on 898 U,S. cities: describe C_ntains data from emd.dta obs: _ars:
898 9 34,124 (96.6_
ize: v
1980 City Data 13 Jul 2000 14:00
iable name
type storage long float int float float
f_s it.eat iz:ome_pc l_.rsales_pc jaltemp
of memory free)
format
label
display _10.Og _9.0g _8.0g _9.0g _9.0g
value
variable label
precipitation
float
_9.0g
state/place code In(Dining sales per capita) Per capita money income in(retail sales per capita) Median January temperature (Fahrenheit) Annual precipitation (in.)
in,income me_ian_age hh_ize
float float float
_9.0g _9.0g _9.0g
in(median per capita income) Median age Persons per household
...... 1
So,ted by: !
You
beglnby running theregression: | • _egress In_eat In_rsales jantemp precip in_inc median_age Source
I
SS
df
MS
I Model I Residual ]
87.7285014 45.1948678
6 657
14.6214169 .068789753
E 132.923369
663
.200487736
Total I In_eat
Coef.
Std. Err.
t
P>It[
hhsize
Number of obs = F( 6, 657) =
664 212.55
Prob > F R-squared Adj R-squared Root MSE
0.0000 0.6600 0.6569 .26228
=
[95Z Conf. Interval]
In_:sales_pc jantemp pre:ipitat~n i.n_income
.6611241 .0019624 -.0014811 .I158486
.026623 .0007601 .0008433 .056352
24.83 2.58 -1.70 2.06
0,000 0.010 0.090 0-040
.6088476 .0004698 -.0030869 .0051969
.7134006 .003455 .000224Z .2265003
m,_dian_age hhsize .cons
-.0010863 -.0050407 -I.377592
.0002823 .0004243 .4777641
-3.85 -11.88 -2.88
0.000 0.000 0.004
-.0016407 -.0058739 -2.31572
-.0005319 -.0042076 -.459463
Despit, having data on 898 cities, your regression was estimated on only 664 cities_74% of the original 8 )8. Some 234 observations were unused due to missing data. In this case, when you type snrnmari: e, you discover that each of the independent variables has missing values, so the problem is not that one variable is missing in 26% of the observations, but that each of the variables is missing in some _servations. In fact, summarize revealed that each of the variables is missing in roughly 5% of th_ observations. We lost 26% of our data because, in aggregate. 26% of the observations have one )r more missing variables. Thus, |"+'eimpute each independent variable on the basis of the other independent variables: . i_pute In_rtl jantemp precip In_inc medage hhsize, gen(i_In_rtl) 4._0Y, (44) observations imputed impute } jantemp in rtl precip In_inc medage hhsize, gen(i_]antmp) 5.B0_, (53) observations imputed
f
impute -- Predict missing values
71
. impute _recip In rtL jantemp In_inc medage hhsize, gen(i_precip) 4.i56_(41) observati)ns imputed . impute In_inc In rtL jantemp precip medage!hl_size,gen(i_in_inc) 4.!34_ (39) observati)ns imputed . impute Medage In rt jantemp precip In inc hhsize, gen(i_medage) 4._5% (40) observati .ns imputed . impute lihsize In rt jantemp precip in_inc medage, gen(i_hhsize) 5._3_ (47) observati ,nsimputed
Thatdone,we can now re-estmatethe recession on the imputedvariables: • regress !in,eat i_Injsales Soul4ce Mod_l Residual ! _ Total
: i
In_eat
!
i _
S_
df
108.8_923 63.792_
i_jantmp i_precip i_in_inc i_median_ase i hhsize
.45
172.65:i145 Conf.
MS
6
18.1432051
891
,071596986
897
.192477308
Std. Err.
t
Number of obs = F( 6, 89_) = Prob > F =
898 253.41 0.0000
R-squared = Adj R-squexed = Root MSE =
0.6305 0.6280 .26758
P>ItI
[95% Conf. Interval]
i_im_rsales i_jantmp i_precip i_in_inc
.660_ )6 .0021G .9 -.0013_88 .095883
.0245827 .0006932 .0007646 .0510231
26,89 3.03 -1.74 1.88
0.000 0.002 0.083 0,061
.6126593 .0007414 -.0028275 -.0042764
.7091528 .0034625 .0001739 .1960024
i_median_age i_khsize _cons
-,0011_34 -.0052508 -I.143142
.0002584 .0003953 .4304284
-4.35 -13.28 -2.66
0.000 0.000 0,008
-.0016304 .0060267 -1.987914
-.0006163 -,004475 -.2983702
Notethat the regressionis no_ estimatedon all 898observations. <1
> Example impute canalsobe used 4th factor to extend[actorscoreestimatesto caseswith missing data.Forinstance, we havea /afiantof the automobile dataset (see[U] 9 Stata'son-linetutorials and sampledataSets)that conltinsa few additionalvariables.Wewill begin by factoringall but the price
vadable;
see [R] factor
• factor m_g-foreign, f ctors(4) (obs=66) (principal _actors; 4 factors retained) Eigenvalue Difference Proportion
Factor
Cumulative
3
i
1 2 3 4 5 6 7 8 9 I0
6. 99066 1.39528 O. 58952 O. 29870 0.24252 O. 12598 0.03628 -0.01457 -0.02732 -0.05591
5. 59588 0.80576 O. 29082 O. 05618 0.11654 O. 08970 0.05085 0.01275 O. 02860 0.05736
O. 7596 0.1516 O. 0641 O. 0325 O. 0264 O. 0137 0.0039 -0.0016 -0.0030 -0.006i
O.7596 0.9112 O. 9753 1. 0077 1. 0341 1.0478 1.0517 1.0501 1.0472 1.0411
tt I2 13
-0.11327 -0.11891 -0. 14605
O.00564 0,02714
-0.0i23 -0.0129 -0.0i59
1. 0288 1.0159 1.0000
r
72
impute
--
Predict missing values Factor Loadings
i Variable
1
mpg rep78 I rep77 headroom Irear_seat trunk
2
3
4
Uniqueness
-0.78200 -0.51076 -0.27332 0.56480 0.66135 0.72935
-0.02985 0.68322 0.70653 0.26549 0.20473 0.37095
-0.06546 -0.1i181 -0.32005 0.29651 0.36471 0.28176
0.33951 -0.01428 0.04710 0.16485 0.02062 0.12140
0,26803 0.25963 0.32145 0.49542 0.38727 0.23633
weight length turn ! displacement
0.95127 0.94621 0.88264 0.92199
0.10135 0.19595 -0.05607 0.06333
-0.18056 -0.05372 -0.08502 -0.17349
-0.09179 -0.10325 0.01169 -0.02554
0.04378 0.05274 0.21043 0.11518
_ar_ratio | order |foreign
-0.82782 -0.25907 -0.75728
0.06672 0.15344 0.30756
0.24558 0.01622 0.19130
-0.10994 0.14668 -0.29188
0.23787 0.88756 0.21014
I
There appear interpreta_on
to
be two we might
factors interpret
here. the
Let's pretend that we have given first factor as size. We now obtain
the first the factor
two factors scores:
an
||
. s_ore fl f2 (based on unrotated factors) (2 scorings not used) Scoring Coefficients 1 2
Variable I
mpg rep78 rep77 headroom _ear_seat I trunk
-0.02094 -0.03224 -0.11150 0.05530 0,03355 0.04603
0.11107 0.44562 0.27942 0.10017 0.02812 0.20622
I
0.12250 0.39997 0.04562 0.19281 -0.08534 0.00638
-0.13040 0.60223 -0.12825 0.11611 0.03528 0.06433
weight length turn displacement g_ar_ratio order
-0.06469
foreign Although nfissing observati(
is not v ]ues is:
(we
0.28292
revealed
by
this
output,
in 8 cases
would
see
that
if we typed
the
scores
summarize).
could
To
not
impute
. i_ _ute fl mpg-foreign, gen(i_fl) 10.91_ (8) observations imputed i, _ute f2 mpg-foreign, gen(i_f2) I0._1Z (8) observations imputed And
we _ ight
now
run
a regression
of price
(Continued
in terms
on next
of the
page)
two
thctors:
the
be calculated factor
scores
because
of
to all the
impute -- Predict imissingvalues i
. regre_s
price
i_f3
Source
i_f2 SS
df
MS
Number
of obs =
F( 2, Model
15¢_._23103
Residual
47£ 342293
Total
63_ )65396
price i_fl
t
73
74
71)=
2
79611551.5
Prob
71
6702004.13
R-squared:
=
0.2507
8699525.97
kdj R-squ_red Root MSE
= =
0.2296 2588.8
73
Err.
t
P>lt
> F
3oef.
Std.
3.347
315.7177
3.88
0.000
595.8234
{
[95Y. Conf.
=
ti.88 0.0000
Interval] 1854.87
i_f2
I
911:2878
339.9821
2.68
0.009
233.3827
1589.193
icons
J
626 1.285
301.7093
20.76
0.000
5660.694
6863,877
Methodsand Formulas imputeis implemented
Lsan ado-file.
Consider the command
repute y xl X2 ... Xk, gen(_)
When y is not missing,
varp(_).
=yand_=0.
Let y9 be an observatiol br which y is missing. A regressor list is formed containirig all x's for whic_ xij is not missing. If _e resulting list is empty, missing. !OtherWise a regres!iion of y on the list is dsdmated (see [R] regress) the predicted Value of yj (si,'e IN] predict), t,"j is defined as the square of the prediction, as Calculated by _redict, stdp; see [Ri predict.
from xl, x2 ..... xk _.3 and _j are set to and _j is defined as standard error of the
References i
Goldstein,R. 1996.sedl0: Patters of missingdata, Stata TechnicalBulletin32: 12-13, Reprintedin Stata Technical Bulletin Reprints.vol. 6. p. I 5.
i[
_.
1996. sedl0.I: Patternsof!missing data. update. Stata TechnicalBulletin 33: 2, Reprintedin Stata Technical BulletinReprints,vol. 6, pp. 15-116.
l,ittle.R. i. A. and D. B. Rubin. 1987.StatisticalAnalysis u4OJMissingData. New York:John Wiley& Sons. }
.
Mander.Ai asd D. Clayton.I999 sg116:Hotdeckimputation.Stata TechnicalBulletin 51: 32-34. Reprintedin Srata TechniralBulletin Reprints, v,)t.9, pp. 196-199. -_.
2000. sgll6A: Update to hotdeck imputation.Stata TechnicalBulletin 54: 26. Reprinted in Smta Technical BulletirtReprints, vol. 9, p. )9,
AlsoSee
i
Complementary:
[R] pr, diet
Related:
[R] ipelate, JR]regress
..... Title
,t ;
i
Quick reference for reading data into Stata
Description This er_try provides a quick reference for determining which method to use for reading non-Stata data into _hemoD,. See [U] 24 Commands to input clam for more details.
Remarks Summary bfthe different methods insheet o inshe, t reads text (ASCII) files created by a spreadsheet o The da a must be tab-separated space-s ,_parated. o A sing] _ observation
or comma-separated,
or a database program.
but not both simultaneously,
nor can it be
must be on only one line.
o The fir t line in the file can optionally contain the names of the variables. infile (fre_ format)--infile without a dictionary o The da
can be space-separated,
o Strings with embedded separat, d). o A singl _observation line.
tab-separated,
or comma-separated.
spaces or commas must be enclosed
in quotes (even if tab- or comma-
can be on more than one line or there can even be multiple observations
infix (fixe(_ format) o The da!
must be in fixed-column
format.
o A singl
observation
o infix
as simpler syntax than infile
can be on more than one line. (fixed format).
infile (fixe 1 format)--infile with a dictionary o The daa may be in fixed-column o A singl _ observation o infil_
format.
can be on more than one line.
(fixed format) has the most capabilities
74
for reading data.
per
infile-- Quicl_referencefor readingdata intoStata
75
I
Examples
I
l
> Example
topof exampl.raw i
1 0 0 0
0 0 I O
1 I 0
John Smith Paul Lin Jan Doe f Julie McDonald
m m f
endof exampl.raw-contains tab-separated data. The type command with the showtabs
option shows the tabs:
type eXampl.rau, slowtabs 1O1John Smithm OO.IPaulLin_T>m OIOJan Doe<3>f oO.Julie Mc[onaldf Z
It could be read in by • insheet a b c name gender using exampl
Example topof examp2.raw--
i
a,b, c,name, gender 1,0,I ,John Smith,m 0,0,I ,Pa_l Lin,m O,l,O,Jan Doe,f 0,0, Julie McDonald,:
!
endof examp2.rawcould be read in by
i
" insheet
using
examl,2
q
Example topof examp3.raw 1 0 0 0
0 0 I 0
i 1 0
"John Smith" m "Paul Lin" m "Jan Doe" f "Julie McDonald"
f
t
endof examp3.raw
contains tab-separated data _'ith strings in double quotes. ;
. type
examp3.raw, s]lowtabs
lO"John Sm th"m OO"Paul Li:"m O<:T>IO"JanDoe f OO."Julie M _Donald"f
It could be read in by i
• infile byte (a b c
I strl5 name strl gender hsing examp3
_
76
infile -- Quick reference for reading data into Stata
or
I
!
• zlsheet
a b c name
gender
using
examp3
!
Or
i_file
using
dict3
where the dictionary
dict3.dct
contains top of dict3.dct
/ infJle
dictionary
using
byte
a
byte
b
byte str15
c name
strl
gender
examp3
{
} end of dict3.dct
..
q
> Example top of examp4.raw 1 0 1 "John
Smith"
0 0 1 "Paul
Lin"
0 1
"Jan
Doe"
0 0 .
"3ulie
m
m
! !
f
McDonald"
f
end of examp4.rawcould be _ad in by • ii file
byte
• infile
using
(a b c) strl5
name
strl
gender
using
examp4
or dict4
i
l where
the _dictionary dict4,
dct
contains
i
-- top of dict4.dct
| inf_le
dictionary
using
byte
a
byte
b
byte strl5
c name
strl
gender
examp4
{
} end of dict4.dct
<3
> Example mp of examp5.raw I01_ John
Smith
O01z Paul
Lin
010J Jan Doe O0
Julie
McDonald end of examp5.raw
• i_fix
could be
a I b 2 c 3 str gender
:ad in by
4 str
name
5-19
using
examp5
i
infile-- Quickirdferencefor readingdata into Stata
77
or • imfix
}
using
dict5a
where d_ct5a.dct
contains -- topof dict5aAct--
infix didtionary
usinl a
str sir
examp5 1
b
2
c
3
gende: name
4 5-19
{
i
i
endof dict5a.dct--
or . i_file Using
dictSb
where dict5b.dctcontains ! !
topof dict5b.dct--_ infile dictionary
using
examp5
{
%If
I
byte
a
!
byte
b
_.If
i
byte strl
c gent er
Zlf Xis
strl5
name
%15s
} endof dict5b.dct i
> Example top of examp6.raw line There
I : a heading are a
total
of 4 lines
of heading.
The next line contains a useful heading: ---4+ .... I.... + .... 2 .... + .... 3.... + .... 4 .... +1
0
1
m
Jo_hn Smith
i
0
0
I
m
Paul Lin
i
0 0
i 0
0
f f
Jan Doe Julie McDonald
i i• • !
--
endofexamp6.raw
could be read in by . i_file using
where diCt6a.dct
dict6a
contains mpof dM6a.dct
i _
infi_le didtionary -firstline (5)
_ i
ex_mp6
{
byte byte _col_m_(17)
I i
usin_
-co!kunn(33)
%lf
byte strl
ender
strl5
ame
Y.15s
} endof dict6aAc_
q
11
_
78
_nfile
_
Ouick
_n_
_r
reading
d_ta
ir'_to
Stli_l_
or could bte read in by | • i_ix
5 first a I b 9 c 17 str gender 25 sir name 33-46 using examp6
or could @ read in by • infix using dict6b l where dict6b.dct contains ,_
top of dict6b.dct
infi_ dictionary using examp6 5 fifst a 1
l
b str str
{
9
c
17
gender name
25 33-46
} end ofdict6b.dct
> Example I a b _ gender name I 0 1
top of examp7.raw
John m I Smith
ooI Paul|Lin 010
't
oe
Jan O0
'4 Juli
McDonald
I
end of exampT.raw
could be --r_adin by • in_ile using dictTa
where dic}Ta.dctcontains
---
top of dict7a.dct
infi_ dictionary using examp7 { _firs_line (2) byte a byte b i _linel(2)
byte
c
_line (3)
strl str15
gender name _.15s
} end of dict7a.dct
Or, if you _,'anted to include variable labels: • fl._1 e using dict7b • in
infile -- Quick reference for reading data into Stata
where dictTb,
dct
79
contair mp of dict7b.dct
ififile dictionaryu-cLngexamp7{ __irstline (2) byte "Question 1" byte "question2" byte "Question3" _iine(2) str! mder "Genderof subject" _line (3) strl5
ime
_.15s
} end of dict7b.dct infix
could also read this data: • infix
2 first
3 lines
a i b 3 c 5 str gender
2:1 str name
3:1-15
using
examp7
or itcould be read in by
. :-infixusing
dict7 i
where dictZc.dct contair_s | iz[fixdictionaryusing examp7{ 2 ifirst I a
str str
I
I
b I
3
g der name
2:1 3 :1-15
top of dict7c.dct
end of dictTcAct or it could be read in by
where" i i
s ; i_fix dictionaryusing examp7 { 2 first a I b 3 c 5 / str
g_
1
str
n_
1-15
top of dict7d.dct
/ !
}
i
end of dict7d.dct
AlsoSee Complemental:
Background:
[R]
tit, [R] infile (fixed format),
[R]
_fix (fixed format),
[u]
4 Commands
[R] input,
to input data
[R] infile (free format), [R] insheet
Title _,_
I
infile _fixed format)|
Read ASCII (text) data in fixed format Iwith a dictionary I
]
Syntax infile
using
dfilename
[if exp] [in range]
[, _automatic
If dfilename
is specified
without
an extension,
.dot
is assumed.
;
lf filename2
is specified
without
an extension,
.raw
is assumed.
1
The synta
u_sing(fitename_)
clear
]
for a dictionary, a file created with an editor or word processor outside of Stata, is
[inf
.le]
top of dictionary
file
end of dictionary
file
dictionary [using filename] { * comments may be included freely _irecl(#) _firstlineof file (#) _lines (#) _line (#)
_newline[(') J
I
}
[_.pe]
varname
[:lblname]
['/.infmt]
["variable
labet"]
(you_data_might appearhere)
where ',in(mr is { 7,[#[.#]]{flgle} If using
ill,name /
is not specified,
If using ill, name is specified, extensiofi., raw is assumed.
] X[#]s }
the data are assumed
the data are assumed
to begin
on the line following
to be located
in filename.
the close brace.
If filename
is specified
without
an
Descriptk n infilc using reads data from a disk dataset that is not in Stata format, infile using does this by firs reading dfilename, called a dictionary, that describes the format of the data file, and then reads the ile containing the data. The dictionary is a file you create in an editor or word processor outside of Stata. The da
may be in the same file as the dictionary or in another file.
Anothe_ variation on infile omits the intermediate dictionary; see [R] infile (free format). This variation i_ easier to use. but will not read fixed-format files. On the other hand. although infile using will read free-format files, the variation is even better at it. An alternative to infile using for reading fixed-format files is infix; see [R] infix (fixed format). _nfix provides fewer features than infile using but is easier to use. I Stata his other commands for reading data. If you are not certain that infile are lookinl_ for, see [R] infile and [U] 24 Commands to input data.
80
i
using
is what you
6
•
i
infile (fixed format) -- Read ASCII (text)data in fixed format with a dictionary
81
Options automaticcauses Stata tc create value labels from the nonnumeric data it reads. It also automatically widens the display forn mt to fit the longest label.
i
using(filenamei) specific:s the name of a file containing the data. If using() is not specified, the data are assumed to fc low the dictionary in dfilename or, if the dictionary, specifies the name of Some other file, tha file is assumed to contain the data. If using(filenamei) is specified, fitename2 is used to ob din the data even if the dictionary itself says otherwise.
}! ; i
clear specifies that it is )kay for the new data tO replace what is currently in memory. To ensure that you do not lose solnething important, infi_le using will refuse to read new data if data are already in memory, cl_iar is one way you can tell infile using that it is okay. The other is to drop the data yourself _,_,typing drop _all before reading new data.
Dictionarydirective,, * marl_s comment lines. "_herever you wish to place a comment, begin the line with a *. Comments can !_appearmany times in the same dictionary. _lreci (#) is used only f_r reading da_asets that do not have end-of-litle delimiters (carriage return, line_'eed_or some coml:ination). Such files are often produced by mainframe computers and have bee n poorly translated from EBCDIC into ASCII. _lrecl() specifies the logical record !ength _lrecl()
[ i
requests thai infile
act as if a line ends every # characters.
_l_ecl() appears onl I once. and typically not at all, in a dictionara,. _firsttineoffile(#) _bbreviation ._.first())is also rarely specified. It states the line of the file where the data be_in. _:first() is not specified when the data follow the dictionary: Stata can!fi_ure that out for itself. _first () is instead specified when reading data from another file in _+hich the first line loes not contain data because of headers or other markers. F
_f_rst
() appears onl
once. and typically not at all, in a dictionary.
_line_. (#) states the nu: _ber of lines per observation in the file. Simple datasets typically have _li_aes(i). Large dat;lsets often have many lines (sometimes called records) per observation. _lines() is optional :yen when there is more ihan one line per observation because in:file can isometimes figure il out for itself. Still. if Alines(i) is not right for your data. it is best to spe_ifv the directive. _lines()
i
appears onl' once in a dictionary.
-line(#) tells infile tc .jump to line # of the observation. Distinguish _lines () from _line (). and consider a file with _lines (4). meaning four lines per observation. _line (2) says to go to the Second line of the observation. _line(4) says to go to the fourth line of the observation. You may.jump forward or b_ckward, infile does not care nor is there any inefficiency, in =,,oin_,_. forward to 21ine(a), reading: few variables, jumping back to _line(l), reading another variable, and jumping forward again to _line (3). It is not your responsib lity to ensure that, at the et_d of your dictionary, you are on _he last line of the !observation. infile knows how to get to |he next observation because it knows where you are iand it knows _lin._s(), the total number of lines per observation. _l_ne()
may appear, tnd typically does, many times in a dictionary.
f
I I I
II t
.................
I
.... ,
82 ...newlix goes _new to get
........
,
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary e [(#) ] is an alternative to _line (). _newl ine (1), which may be abbreviated _newline, orward one line. _newline(2) goes forward two lines. We do not reconmlend the use of ine() because _line() is better. If you are currently on line 2 of an observation and want to line 6, you could type _newline (4), but your meaning is clearer if you type _line (6).
_new: ine () may appear many times in a dictionary. _colnmr (#) jumps to column # on the current line. You may jump forward or backward within a line. _ column() may appear many times in a dictionary. _skip (#_) jumps forward # columns on the current line. _skip _skil_ () may appear many times in a dictionary.
() is just an alternative to _column
().
[t)_e] va/_ame [: lblname] [%infint] ["variable label"] instructs inf i le to read a variable. The simplest form f this instruction is the variable name itself: varname. First _nderstand
that at all times infile
is on some column
of some line of an observation.
infi_e starts by being on column 1 of line 1, so pretend that is where we are. Given the simplest directiige 'varname, infile goes through the following logic: Is the.lcurrent column blank? If so, skip forward until there is a nonblank column (or until the end o_the line). If we just skipped all the way to the end of the line, store a missing value in varna/_e. If we skipped to a nonblank column, be_in collecting what is there until we come to a blankbolumn or the end of the line. That is the data for varname, Now set the current column |
to wherever we are. !
The l@ic is a bit more complicated. For instance, when skipping forward to find the data, infile might _ncounter a quote. If so, it then collects the characters for the data by skipping forward until it find_ the matching quote. If you specified a %infmt, in:file skips the skipping-forward step and simpl_ collects the specified number of characters. Nevertheless, the general logic is (optionally) skip, _oltect, and reset. ! I
Remarks i ,!
in_il using follows a two-step process using d_ cript and
i_
1. infil and
to read your data. You type something
like ingile
using reads the file descript.dct, which tells infile about the format of the data:
1
2. infil_
using
then reads the data according to the instructions
recorded
in descript.dct.
descrip_.dct ! (the file could be named anything) is called a dictionary and descript, a text file I _ou create with an editor or word processor outside of Stata. As forithe data themselves, does not aatter.
they can be in the same file as the dictionary
det is just
or in a different file. It
Readingfi ee-formatfiles 1
' ;, I
Another variation of infile for reading free-format files is described in [R] infile (free format). We will _efer to the variation as infile without a dictionary. The distinction between the two variations is in the treatment of line breaks, infile without a dictionary does not consider them significant, infile with a dictionary does.
infile(fixedformat) -- ReadASCII(tAx!)data in fixed formatwith a dictionary
83
A' line, also known a', a record, physical record, or physical line (as opposed to observations, or logical records, or logica lines), is a string of characters followed by the line terminator. If you were to type the file, a line is _,hatwould appear on y0ur screen if your screen were infinitely wide. Your screen would have to be infinitely wide so that th_erewould be no possibility that a single line could take more than one line 9f 3,our screen, thus fooling you into thinking there are multiple lines when I
i i
i
i.
i
!
i
i
therei is only one. A logical line, on the other hand, is a sequence of one or more physical lines that represents a singl+ observation of yoar data. infile with a dictionary does not willy-nilly go to new physical linesi it goes to a new lJne between obser_,ations and it goes to a new line when you tell it to, but that is all. infile without a dictionary, on the other hand, goes to a new line whenever it needs to, which can be right in tl-e middIe of an observation. Thus, consider the following little bit of data which, we will tell you, is for three variables: 54 193 2
How do you interpret th,_se data? H, re is one interpretation: There are three observations. The first is 5, 4. and missing. The second is 1, 9, and 3. The thirc is 2. missing, and missing. That is the interpretation that ±nfile with a dictionary makes. H,re is another interp etation: There are two observations. The first is 5, 4, and 1. The second is 9, 3, and 2. That is the interpretation that infile without a dictionary makes. W_aichis right? We would have to ask the person who entered these data. The question is. are the line t_reaks significant? )o they mean anything? It' the line breaks are significant, we use infile with a dictionary. If the line breaks are not significant, we use infile without a dictionary The other distinction _etween the two infiles is that infile with a dictionary does not process comma-separated-value ormat. If your data are c0mma-separated, see [R] infile (free format) or [R] Msheet.
Example Omside of Stata yout ave typed into the file -,-highway"dct information on the accident rate per million vehicle miles aldng a stretch of highwav_ the speed limit on that highway, and the number of access points (on-ram_s and off-ramps) per rnile, Your file contains [
' "
top ofhigh_ay.dct,,example]
infile dictionary i{ acc rate Ispdlimit acc_pts
} 4.58 55 4.6 2.86 6O 4.4 1.61 . 2.2 3.02 60 4.7
endof highway.dct,examplet This file can be read by yping infile data: •
using
infile using hil;hvay
infile dictionary 1{ acc_rate Ispdlimit acc_pts }(4observations r4ad) |
!
highway.
Stata displays the dictionary and reads the
1 ...........
lnfile
J ............. 84
| st . II !
.................. (fixed format) -- Read ASCII (text) data in fixed format with a dictionary
ace_rate
I. 2 I. 3_. 4_
. '_ _
4.58 2.86 1.61 3.02
spdlimit 55 60 60
acc_pts 4.6 4.4 2.2 4.7
.Z
(> Example We ca: include variable labels in a dictionary so that after we infile the data, the data will be fully labe ed. We could change highway, dot to read i
"
;
I : i I ;
.
top of highway.dct,
example 2
end of highway.dct,
example 2
inf: le dictionary { * T]is is a comment and will be ignored by Stata * Y,u might type the source of the data here. acc_rate "Acc. Rate/Million Miles"
} 4.5_ 2.8_ 1.6_ 3.0"-
spdlimit acc_pts
"Speed Limit (mph)" "Access Pts/Mile"
55 4.6 60 4.4 . 2.2 60 4.7
Now wheJb we type infile
using
highway,
Stata not only reads the data but labels the variables.
<1 I > Example
l
We caniindicate the variable types in the dictionary. For instance, if we wanted to store acc_rate as a doub..e and spdlimit as a byte, we could change highway.dct to read top of highway.dct,
example 3
infi .e dictionary { * Th s is a comment and will be ignored by Stata * dm Yo imight type the source of the data here. le acc_rate "Acc. Rate/Million Miles" byt@ !
spdlimit acc_pts
"Speed Limit (mph)" "Access Pts/Mile"
2.861 60 4.4 1.61 . 2.2 3.02
60 4.7
Since we c 3 not indicate the variable type for acc_pts, (or the typ, specified by the set type command).
end of highway.dct
example 3
it is given the default variable type float
<1
"
tnfile (fixed for at) -- Read ASCII (text) data in fixed format with a dictionary
85
Example i
By specifyingthe typ,is, we can read stringvariablesas well as numericvariables.For instance, +
topof emp.dct
iinfile dictionary '
• data on employe( str20 name
I
int
age sex
"Name" "Age" "Sex coded
0 male
I female"
} I
!"Lisa Gilmore"
25
!Branten 32 1 "Bill Ross"
27 0 end of" emp.dct
The stringscan be delimiLedby singleor doublequotesand quotesmay be omittedaltogetherif the string _:contains no blanks or other special characters,
q t [3Technical Note You may attach value abels to variables in the dictionary using the colon notation: _opof emp2.dct infile
dictionary-
data on name, strl6 name i
se: , and age "Name"
sex: sexlbl int age
"Sex" "Age"
} _Arthur Doyle" Malt 22 _Mary Hope" Female 37 #Guy Fawkes" Male _ 8 #Sherry
Crooks"
Fel ale 25 end of emp2.dct
i
If you _want the value labels to be created automatically, you must specify the automatic option on the infile command. Tl-ese data could be read by typing infile using person2, automatic assuming the dictionary alld data are stored in the file person2.dct.
i
J
}
I t
1
i
_"Example The data need not be n the same file as the dictionary. We might leave the highway data in highway.raw and write dictionary called highway.dct describing it: topof highway.dct,example4 infile
dictionary
u _ing highway
* This
dictionary
r._ads the file highway.raw.
* file
were
* read
"dictionary ace_rate
called
spdlimit
}
aee_pts
lighway.txt,
{ the first
If the
line would
1sing highway.txt" _cc. Rate/Million Miles!' Speed " ccess
Limit
(mph)"
Pts/Mile" --
end of highway.dcl,example 4
I
86 intrile(fixed format) -- Read ASCII (text) data in fixed format with a dictionary _, Example The fir_tlineoffile following rakv dataset:
() directive allows you to ignore lines at the top of the file. Consider the top of mydata.raw
The f( flowing data was entered by Marsha Martinez. Helen troy. id in( _me educ sex age 1024 5000 HS Male 28 1025 27000 C Female 24
It was checked by
end of mydata.raw
Your diction1 u'y might read top of mydata.dct infil_ dictionary using mydata { _first (4) int id "Identification Number" income "Annual income" str2 educ "Highest educ level" str6 sex byte age
} end of mydata.dct
q
1 i
> Example The _lir_e () and _lines () directives instruc! Stata how to read your data when there are multiple records per _Sbservation. You have the following in mydata2 .raw: ri
top of mvdata2.raw
id incpme educ sex age 1024 2_000 HS Male | 28 1025 2_7000 C Femalei 1035 2 000 HS Male 32 1036 25000 C Female 25
1
You can read this with a dictionary mydata2, reads the daia: • infi_e using mydata2, clear
end of mydata2.raw
dct, which we will just let Stata list as it simultaneously
i infile(fixed formal -- Read ASCII (text) data in fixed format with a dictionary
87
z
in_ile dictionary usiag mydata2 { _first(2) _lines(3) int id "Identific ttion Number" income "Annual in :ome" sir2 educ "Highes _line(2) sir6 sex _line(3)
* * * *
Begin reading on line 2 Each obbervatiOn takes 3 lines. Since __ine is not specified, Stata assumes that it is I.
educ level" * Go to line 2 of the observation. * (values for sex are located on line 2) * Go to llne 3 of the observation. * (values for age are located on line 3)
int age
} (4 bbservations read)
. list Ii 2! 3_
id 1024 1025 1035
inc(ime 251)00 27_i00 26_I00
4i
1036
25( 00
Now, here is the really good ii
could jus( as wdll have
educ _S C HS
sex Male Female Male
age 28 24 32
C
Female
25
art: We read these variables in order but that was not necessaD_. We
usedrhedictionary:
top of mydata2p.dct
inf_le dictionary using mydata2 { _first (2) _lines (3) _line (1)
int
id "Identification number" income "_ual income"
_line(3) _line(2)
sti int st_
educ age sex
"Highest educ level"
} end of mydata2p.dct
We would obtain the same re_ults--and just as quickly--the only difference being that our variables in the fin_ dataset would be n the order specified: id, income, educ, age, and sex. q
Technical!Note i.
You can use _newline tO specify where breaks occur, if you prefer: Z
........ i
i
!_
topof highway.dct,example5--
infile dictionary { acc_rate "Ac :. Rate/Million Miles" spdlimit "S )eed Limit (mph)"
>
_newline acc_pts
"Ac :essPts/Mile"
4.58 55 4.6 2.861 60 4.4 1.61. 2.2 3.02 i 60 4.7 end of highway.dct, example 5
The line th)at reads '1. 61 .' ould have been read 1.61 (without the period), and the results would have been unchanged. Since _tictionaries do not go to new lines automatically, a missing value is assumed for all values not foulnd in the record.
!
88 ---
i_file (fixed format) -- Read ASCII (text) data in fixed format with a dictionary =
]
Readingfied-format files Values _n formatted data are sometimes packed one against the other with no intervening For instande, the highway data might appear as I top of highway.raw,example 6
,,
:',
blanks.
4.58_54.6
2.86 04.4 1.61| 2.2 3.02604.7
":
end of highway.raw,example6 The first f_ur columns of each record represent the accident rate; the next two columns, the speed limit; and _he last three columns, the number of access points per mile. To read:: these data, you must specify the %infmt in the dictionary. Numeric Y,infints are denoted by a leadir_g percent sign (%) followed optionally by a string of the form w or w.d, where w and d stand for @o integers. The first integer, w, specifies the width of the format. The second integer, d, specifies ti_enumber of digits that are to follow the decimal point. Logic requires that d be less than or equal tqw. Finally, a character denoting the format type (f, g, or e) is appended. For example, %9.2f spe_zifies an f format that is nine characters wide and has two digits following the decimal point.
Numericformats The f f_rmat indicates that infile is to attempt to read the data as a number. When you do not specify th_%infmt in the dictionary, infile assumes the %f format. The missing width w means that infille is to attempt to read the data in free format. At the _mrt of each observation, to 1, indic moves the occurrence is left at tl
infile
reads a record into its buffer and sets a column pointer
ating that it is currently on the first column. When infile processes a %f format, it "olumn pointer forward through white space. It then collects the characters up to the next of white space and attempts to interpret those characters as a number. The column pointer e first occurrence of white space following those characters, If the next variable is also
free forma I, the logic repeats. When ypu space. Instead, the result @ a is, on the first
explicitly specify the field width w, as in %wf, infile does not skip leading white it collects the next w characters starting at the column pointer and attempts to interpret number. The column pointer is left at the old value of the column pointer plus w, that character following the specified field.
Example If the d tta above are stored in highway, the data:
raw, you could create the following
infi e dictionary using highway { acc_rate Y,4f "Acc. Rate/Million spdlimit acc_pts
dictionary to read
top of highway.dct,
example 6
end of highwa?.dct,
example 6
Miles"
Y,2f "Speed Limit (mph)" Y,3f "Access Pts/Mile
} 1
Wh:ncolu_s you explicitly field width, not skip intervening and characters. The first are usedindicate for the the variable ace_rate,infile the does next two for spdlim-it, the last three for acc_pts.
<1
i
•
Q
|
....
infile (fixed format,l-- Read ASCII (text) data in fixed format with a dictionary
89
The d specification in the i'/,w.df indicates the number of implied decimal places in the data. For Technica_ Note instance, the string 212 read tin a _3.2f
format repre_ems the number 2.12. You should not specifv
d unless _¢ourdata have ele@nts of this form. The w alone is sufficient to tell ±nfile
how to read
i '_ i
data in which the decimal P_]int is explicitly indicated" When iyou specify d, it is taken on13 as a su_,_estion. If the decimal point is explicitly indicated in the data, ihat decimal point a_wa3s m errides the d specification. Decimal points are also not implied
t 1 I
if the data contain an E, e, I], or d, indicating scientific notation. Fields i!are right-just, fled Otefore lmptymo dec,mal points. Thus, as 0 2 by the _3. If format.
I
2
,
2 . and
2 are all read
a TechnicalNote The g and e formats are the same as the f format. YoUCanspecify any of these letters interchangeably. The letters g and e are inch ided as a convenience to those familiar with Fortran. In Fortran. the e format i_icates scientific n, .ration. For example, the number 250 could be indicated as 2,5E+02 or 2.5D402. Fortran prograr Imers would refer to this as an ET. 5 format, and in Stata. this format would _ indicated as 7'7.5. _. In Stata. however, you need specify only the field width w. so you could react this number usin 7'7f, 7,7g. or '/,7e.
i ! i I
The gi format is really a :or_ran output format that indicates a freer format than f. In Stata. the two formtats are identical. !
i ! i i
i !
Throughout this section,
a
Technical
ou may freely substitute the g or e formats for the f format.
Note
Be careful to distinguish b__tween'/,tints and '/,infints: '/,tints are also known as display formats--_hey describe how a variable is :o look when it is outputted; see [u] 15.5 Formats: controlling how data are!displayed. 7,ilTfi,ts are also known as input formats--they describe how a variable looks when it _s inputted. For instance, there is an output date format 7,d, but there is no corresponding input format. (See [U] 27 C, remands for dealing Mth dates for recommendations on how to read dates.) Fbr the other formats we have attempted to make the input and output definitions as similar as possible. Thus. we includ g. e. and f }',i_!fints,even though they all mean the same thing, since g, e, and f are also '/,tints.
String formats The s format is for read ng strings. The syntax is gu:s where the w is optional. If you do not specify the field width, your strings must be enclosed in quotes (single or double) or the_ must not contain Nny characters other than letters, numbers, and '_. This may surprise you, ,ut the s format can be: used for reading numeric variables and the f tbrmat c_n be used for rea ring string variables! When you specify the field width u, in the '/,'u,f format, all embedded blank., in the field are removed before the result is interpreted: They are not removed by the Xws /ormat
90
' ,I
_nfile(fixed format) -- Read ASCII (text) data in fixed format with a dictionary
For instance, the _3f format would read '- 2', '-2 ', or ' -2' as the number -2. The _3s format would notl be able to read '- 2' as a number, since the sign is separated from the digit, but it could read ' -2" or '-2 '. The %wf format removes blanks; datasets written by some FORTRAN programs separate the sign from the number. There ge, however, some slde-effects of this practice. The stnng 2 2 will be read as 22 by %3f format. Most FORTRAN compilers would read this number as 202. The %3s format would issue a wamingland store a missing value. Now c6nsider reading the string 'a b' into a string variable. Using a Z3s format, it will store as it appears t a b. Using a Y,3f format, however, it wilt be stored as ab--the middle blank will be removed. I Examples using the Xs format are provided line numbers.
below, right after we discuss specifying column and
|
Specifying column and line numbers _colu_() jumps to the specified column. For instance, the documentation of some dataset indicates that the variable age is recorded as a 2-digit number in column 47. You could read this by coding
I
_column(47)
age Y.2f
After this,i you are now at column recording _ex as 0 or 1,
you
[
_column(47)
49, so if immediately
_column(47) _column(49)
I
were a 1-digit number
age Y.2f sex Y.lf
could instead code
age Y.2f sex Y, lf
It makes np difference. If at column 50 were a 1-digit code for race, skip readirlg the sex code, you could code _column(47)
age
could code
or, if you tvanted to be explicit about it, you I I
following
and you wanted to read it but
age Y,2f
column(50) race Y, lf
You couldlequivalently
skip forward using _skip ():
1
_colunm(47)
age _2f
I
_skip(1)
race Zlf
One advar_tage of column() over _skip is that it lets you jump forward or backward in a record. |
If you war,ted to read race
and then age, you could code
_column(50) race Y, lf _column(47) age Y,2f
If the d tta you are reading have multiple lines per observation (sometimes said as multiple records per observ _tion), tell infile how many lines per record there are using _lines (): _ lines (4)
_lines () appears only ()nee in a dictionary. Good style says it should be placed near the top of the dictionary, but Stata does not care.
;infile (fixed format -- Read ASCII (text) d_ta in fixed format with a dictionary
91
When you want to go to a particular line, includb the _line() directive. In our example, let's assume race, sex, and age are recorded on the second line of each observation: _lines(4) _line (2) _column(47)
a_e Y,2f
_column(50)
Tce
}'.If
Let's assume id is recordedlon line 1. 1 lines (4)
|
_line(l) i I
i
_column(I) _line(2)
d
Y,4f
Y,tf _co1_(47) ace goX2_ _column(50)
_line() works like _colu as well be read by
m() in that you can jump forward or backwardl so these data could just
_lines(4) _line(2) _column(47) _column(S0)
_ge %2f ,race %If
_line (I) _colnmn(1)
_d
7,4f
Remember that this dataset aas 4 lines per observatibn and yet we have never referred to line (3) or line(4). That is okay./Jso note that, at the endof our dictionary, we are on line t. not 4. That is okay, loo, infile will stll get to the next observation correctly.
E3TechnicalNote
i l.
Anotl!er way to move bet' _een records'is _newline ().._newline () is to _line () as _skip () is Io _column(), which is to say, ..mewline () can only go forward. There is one difference: _skip() has its uses; ._.newline () L' useful only for backward capability with older versions of Stata. _skip()has its uses bec Lusesometimes one thinks in terms of columns and sometimes one thinks in tern'ts:of widths. Some d Ia documentation might very well include the sentence "At column 54 are recorded the answers to the 25 questions, one column alloted to each." If we want to read the answers io questions 1 and 5, it would indeed be na_tutalto code _column(54)
tl
_.If
_skip(3)
is %1_ [
i
Nobody has ever read data _ocumentation with the siatement, "Demographics are recorded on record 2and, 2 records after that, _re the income " values. " The ' " documentanon " would instead " " sa3,' " Record ,_ '_ contains the demographic ir formation and record 4, irJcome." The _newline() way of thinking Is based on what is convenien for the computer which does, after all, have to eject a certain number of records. That, however, no reason for making yot_ think that way.
i
Before that thought occ fred to us, Stata users specified _mewline() to go forward records. They stiil can, so their old [ictionaries will work. When you use _neTaline() and do not speci_ _ -lines(), it is your respoasibility to eject the right number of records so that, at the end of the dictionary, you are on the last record. In'this mode. when Stata re-executes the dictionary to process
i
the next iobservation, it doe: forward one record.
I
!
[
...................
B
I
...........
92
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary
'i':
Example of reading fixed-format files
:
> Example In thisIexample, "i each observation
i
Joh_ Dunbar 101 111111
P
i.
Sam! g. Newey,
i'
OlObOOOOO
!:
occupies two lines. The first two observations
Jr.
10001
101
North
10002
15663
42nd
in the dataset are
Street
Roustabout
Boulevard
The first )bservation tells us that the name of the respondent is John Dunbar; his id is 10001; his address i., 101 North 42nd Street; and that his answers to questions 1 through 10 are yes, no, yes, no, yes, _s, yes, yes, yes, and yes. The sei:ond observation tells us the name of the respondent is Sam K. Newey, Jr.; his id is 10002; his addre_ is 15663 Roustabout Boulevard; and that his answers to questions l through 10 were no, yes, no, _s, no, no, no, no, no, and no. (Probably John and Sam are not best friends.) In ord r to see the layout within the file, we can temporarily the appro!briate columns: ....
+ ....
Jol_
I ....
+ ....
2....
+ ....
3 ....
Dunbar
add two rulers to help our eyes see
+ ....
4 ....
+ ....
5 ....
+ ....
i0001
I01
North
42nd
Street
6 ....
10002
15663
+ ....
7 ....
+ ....
8
+ ....
7 ....
+ ....
8
lOlq 111111 Sam 010 ....
K. Newey, 000000 + ....
1....
Jr. + ....
2 ....
+ ....
3 ....
+ ....
4 ....
Roustabout + ....
5 ....
Boulevard + ....
6 ....
Each observation in the data appears in two physical lines within our text file. We had to check in our editorlto be sure that there really were newline characters (i.e., "hard returns") after the address. This is irr_ortant because some programs will wrap output for you and a single line may appear as many line_. The two seemingly identical files will differ in that one has a hard return and the other has a soft ireturn added only for display purposes. In our tidata, the name occupies columns 1 through 32; a person identifier occupies columns 33 through 37; and the address occupies columns 40 through 80. Our worksheet revealed that the widest address erred
in column 80.
The teat file containing these data is called fname, txt.
Our dictionary
file looks like
t
infi t e dictionary , I * Example * th_n
one
using
reading line.
fname.txt
in data The
where
next
line
{
top of fname.dct
observations
extend
tells
there
infile
across are
more
2 lines/obs:
_lin ,s(2)
,
_col mm(33) _ski _(2)
sir50
name
Z32s
"Name
long str50
id addr
Y,Sf Y,41s
"Person id" "Address"
of respondent"
_lin :(2) _col mm(1)
byte
ql
ZIg
"Question
!"
I
b}_te
q2
Y,lf
"Question
2"
i If
1
byte byte byte
q3 q4 q5
7,1f Y,lf Zlf
"Question "Question "Question
3" 4" 5"
byte
q6
Y,lf
"Question
6"
, I}
I"
7
infile (fixed form ) -- Read ASCII (text) data in fixed format with a dictionary byte
q8
%if
"Durst ion 8"
byte
q9
%If
"Du_stion
9"
byte
qlO
%If
"Question
I0"
93
} end of fname.dct
i
Up t6 five pieces of information may be supplied in the dictional3, for each variable: the location of the data, the storage tylie of the variable, the name of the variable, the input format, and the variable!label. !
i ! !
Thusl the str50 line sa_s that the first, variable, is to be given a.,, storage type of str50, should be called name, and have the '_ariable label Name of respondent. The %32s is the input format--this tells Stata how to read the _lata. The s tells Stata not to remove any embedded blanks: the 32 tells
li
Stata to go across 32 colur_ns when reading the data. The next line says that the second -,ariable- is to be assigned a storage type of long, named id, and labeled "Person id". Stata should start reading the information for this variable in column 33 The f tells Stata to remove any embedded blanks, and the 5 says to read across 5 columns.
I i • i i
The third variable is to De given a storage type of str50, called add.r, and labeled "Address". The _skip(2) directs Stat_ to skip 2 columns bef6re beginning to read the data for this variable, and the g4J.s instructs Stat_ to read across 41 colurnng and to not remove embedded blanks. line::(2)
instructs Stata o go to line 2 of the observation.
The remainder of the dzta is 0/1 coded--the answers to the questions. It would be convenient if we could use a shorthan to specify this portion of the dictionary, but we must supply explicit directives q
'3TechnicalNote i ! i
ii ! !
i
i
In the preceding exampl._, there were two pieces of information about location: where the data begin fo_ each variable (the _column(), _skip(), ,line ()) and how many columns the data span (the %32s, %5f, %41s, %lf). In our dictionary, some of this information was redundant. After reading name, Siata had finished with 32 columns of information. Unless instructed otherwise, Stata would _roceed to the next column--column 33--to begin reading information about id. The _column (33) was unnecessary. The _skip(2) was not.• however, unnecessary. Stata had read 37 columns of information and was ready to look at columl 38. Although the address information does not begin until column 40, columns "_8and 39 contain )lanks. Since these are leading blanks, instead of embedded blanks. $tata would jtist ignore them. Th .'re is no problem so far. The problem is with the %4is. If Stata begins reading _he address informa ion from column 38 and reads 41 columns, Stata would stop reading in column 78 (78 - 41 + 1 = 38), and the widest add?ess ends in column 80. We could have omitted the -skip(2) if we had specified an input format of X43s. The __l±ne(2) was necessary although we could have gotten to the second line by coding --newlineinstead. The _column(1)
could _ave been omitted. Afte_ the _line(),
Stata begins in column 1
See the following examp e for a dataset where both pieces of location infomaation are required.
r t
! mt.e (rlxeamrmatj -- Heaa ASCII (text) data in fixed format with a dictionary
7
D Example! The llowing file contains six variables in a variety of formats. Note that in the dictionary we read the _,ariables fifth and sixth out of order by forcing the column pointer. _
'r
topofexample.dct
in ile dictionary { -_I i double i ! i:
i .skip(2) ,column(21) ,_column(18)
i'_ _,
first second third
str4
%3f Zi.lf %6f
fourth %4s sixth 7,4.If fifth Y, if
1.2 L25.Te+252abcd 1 .232 1.3:35,7 52efgh2 5 1.4.457 52abcd 3 100. 1.5 155.TD+252efgh04 1.7 16 6 .57 52abed 5 t.71 [
end of example.dot
l
Assumin_| the above is stored in a file called example, dct, it can be infiled and listed by typing • i_file using example infile dictionary { i ! double i.skip(2)
str4
first second third fourth
7,Sf 7,2.if 7,6f
7,4s
sixth 7,4. If fifth _,2f
i.column(21) i_column(18)
} (5 _bservations
read)
list first i.2
second i.2
third 570
fourth abcd
sixth .232
fifth 1
2;{ 3 .i
1.3 I,4
1.3 I.4
5.7 57
efgh abcd
.5 i00
2 3
4.i 5.i
i.5 16.
i.5 1.6
570 .57
efgh abcd
I.7 1.71
4 5
1.
q {
Reading fi_(ed-block files
u Technical_ote The _l_'ecl (#) directive is used for reading datasets that do not have end,of-line delimiters (carriage return, lindfeed, or some combination): Such datasets are typical of IBM mainframes--where they are known as fixed block or FB. The word LRECLis IBM-mainframe jargon for logical record length. Fixed'block datasets are datasets where each # characters are to be interpreted as a record. For instance, @nsider the data 2 3 63
infile (fixed fo mat) -- Read ASCII (te_) data in fixed format with a dictionary'
95
In fixed-block format, tlq.'se data might be recorded top of mydata.ibm 1
212 423 63 end of mvdata.ibm
!
and you wouldbe told, m the side, thatthe LRECL is 4. If you then pass along that informationto inside,
it will be able
read the data: top of mydata.dct
infile dictionary using mydata.ibm { _Irecl(4)] int i_ int
}
a_e end of mydata.dct
When you do not spe( iCythe _lre¢l(#)
directive, in:file
i
assumes that each line ends with the
standardASCII delimiter whichcan be linefeed0r carriagereturn or linefeedfollowedby carriage return or carriage return 9llowed by finefeed). When you do specify _1reel data ih blocks of # char_ :ters and then acts as if that is a line.
i
(#), infile
reads the
A i:ornmon mistake ir processing fixed-block datasets is to be incorrect about the LRECL value, for instance, thinking the I.RECLis I60 when it is really 80. To understand what can happen, pretend we thought the LRECLin )ur data was 6 rather than 4, Taking the characters in groups of 6, the data
i
appearas 212
i
423 63 Stata has no way of verify ng that you have specifi_ the correct LRECLso, if the data appear incorrect,
verifyiyouhave the corre :t number. ThemaximumLRECLnfile allowsis 18,998withStatafor Unix,7,998with StataforWindows, and 3.998with Statafor /lacintosh.
References Gleason:, J. R. 1998. dm54: C tpturing comments from data dictionaries. in Siata Technical Bulletin Reprints, vol. 7. pp. 55-57. f
Stata Technical Bulletin 42: 3-4.
Gould, "iV,W. 1992. dml0: Inf ling data: Automatic dictionary' creation. Stata Technical Bulletin 9: 4-8. Stata Technical Bulletin Re_rints, vol. 2, pp. 28-34. Nash. J. D. 1994. dml9:
Me
4
ng raw data and dictionary, files. Stata Technical Bulletin 20: 3-5.
Technical Bulletin Reprints 1 vol. 4, pp. 22-25.
| AlsoSee Compl_mentary:
[R] utfile, [R] outsheet. [R] save
Related:
[R] afix (fixed format)
Background:
[u] ;4 Commands to input data, [R] afile
Reprinted
Reprinted in
Reprinted
in Stata
....................
i
Title '
...................
i
I infile ]i free format)-
Read unformatted ASCII (text)data [
i
]
Z'
Syntax ingil_
!i
]
varlist ['skip[(#)]
[varlist [_skip[(#)].
,.]I]
using filename [if exp] [in range]
If filename is specified without an extension, .raw is assumed
Descriptiln i
infil_reads into memory a disk dataset that is not in Stata format. Here _x!ediscuss using infile to read free-format data, meaning datasets where the knowledge of the forr_atting information is not necessary to make sense of them. Another variation on infile allows rea_ing fixed-format data; see [R] intile (fixed format). Yet another alternative is insheet, which is dasier to use if your data are tab- or comma, separated and contains one observation per line. Statalhas other commands for reading data, too. If you are not certain that infile is what you are lookin_ for, see IN] infile and [U] 24 Commands to input data, After tl_edata are read into Stata, the data can be saved as a Stata-format dataset; see [R] save.
Options
i
automati_causes Stata to create value labels from the nonnumeric data it reads. It also automatically widens the display format to fit the longest label. yvariable (#) specifies that the external data file is organized by variables rather than by observations. All the bbservations on the first variable appear, followed by all the observations on the second vanable_/ and so on. Time-series datasets sometimes come in this format 1
clear
spe}ifies that it is okay for the new data to replace what is currently in memory. To ensure
that youido not lose something important, infile will refuse to read new data if data are already in mem;ry, clear is one wa} you can tell infile that it is okay The other is to drop the data yourself_by typing drop _all before reading new data. i
Remarks infile_or,at least, the infilefeatures discussed here value forrn_tt.
reads data in free or comma-separated-
Remarkl are presented under the headings I I , ! t t
i 1
I
Reading free format data Reading comma-separated data Specifying variable types Reading string variables Skipping variables Skipping observations Reading time-series data
96
i I
infile (free formal -,--Read unformattedASCII (text) data
Readi
97
free format cmta
In ifree format, data • e separated by one or more white-space characters. White-space characters are blanks, tabs, and ne vlines (carriage return, linefeed, or carriage-retum_inefeed combinations). Thus. a single observatic may span any number:of lines. Numeric missing valu
Example
are indicated by single periods ('.').
J
In the file highway, r_w, you have information oft the accident rate per million vehicle miles along a stretch of highway, thelsoeed limit on that highway, and the number of access points (on-ramps and off-ramps) per mile.---Ifour file contains top of highway:raw, example1 _.58 55 4.6 2.86 60 4.4 i.61. 2.2
3.026o
4.7' endof highway•raw, example1 !
You can read these data by typing i infile ace_rate _ pdlimit {4 observations re_d)
acc_pts
using
highway
list I. 2. 3. 4.
acc_rate 4.58 2.86 1.61 3.02
s dlimit 55 60 60
acc.pts 4,6 4,4 2.2 4.7
Note that the spacing of tl e numbers in the original file is irrelevant.
[3TechniCalNote It isl not necessary that missing values be indicated by a single period. The third observation on the speed limit is missing in the previous example• The raw data file indicates this by recording a single period. Let's assume that instead the missing value was indicated by the word unknown. Thus, the raw data file appears a:
1
---
_ 4_58 55 4.6 2i86 60 4.4 1161 unknown 3 i02 60
top ofhighwayyaw,example22.2
4i7 endof highway.raw, example2 :
Here is the result of infilin
these data:
• _lnfite ace_rate s] llimit acc_pts using h_gh_ay ""/nk_ows" cannot be read as a number for spdlimit[3] (_ observations read
![
l
[1
98
iqfile (free format) m Read unformatted ASCII (text) data
infile wa_ed us that it did not know what to make of the word unknown, stored a missing, and then continhed to read the rest of the dataset. Thus, aside from the warning message, results are unchanged. ! Since not all packages indicate missing data in the same way, this feature can be useful when reading dat_ created by them. Whenever infile sees something it does not understand, it warns you, record_ a missing, and continues. If, on the other hand, the missing value were recorded not as unknown Nit as. say, 99. Stata would have no difficulty reading the number, but it would also store 99 rather than r_ssing. To convert such coded missing values to true missing values, see [R] mvencode.
i' .
[]
]
t
Reading comma-separated data In co_m_a-separated-value are are separated by either commas. You may intermix separated_'llue and free format.format, Missingdatavalues indicated bv single periods or by commamultiple commas which serve as place holders, or both. As with free format, a single observation may span any numbed of input lines.
Example We can imodify the format of highway, raw used in the previous example without affecting infile's ability to read it. The dataset can be read with the same command and the results would be the samd if the file instead contains 1
--
--
top of highway,raw. example 3
2.86, 4.58,1560,4.4 4.6 1.61, 12.2 3.02,_0 4.7 end of highway.raw,
example 3
<1
Specifying ,ariable types The variable names you type following the word infile are new variables. variable is
The syntax for a new
[type] new_varname[Llabel_name] A full discuJ;sion of this syntax can be found in [U] 14.4 varlists. As a quick review, new variables are, by defaldt, of type :float. This default can be overridden by preceding the variable name with a storage ty ,e (byte, int, long, float, double, or str#) or by using the set type command. A list of varia: _les placed in parentheses will be given the same type. For example. double_
_rst_var second_var...
causes first, var second_var _:
Ii
Iast_var)
... tast_var to all be of type double.
There is _lso a shorthand syntax for variable names with numeric suffixes. The varlist varl-var4 is equivalent to specifying varl vat2 var3 vat4.
I
,
infile (free format) -- Read unformattedASCII (text) data
99
> Example In!the highway exam _le, we could infile i
the data acc__rate,
spdtimit,
and acc_pts
and
force the variable spdli:Litto be of type int by typing •
infile
acc_rate
i(4 observations
int spdlimit
ace_pts
uaing
highway,
clear
r_ad)
We could force all vafia]:les to be of type double by typing • infile double(acc_rate (4 observations read)
spdlimit
aec_pts)
using
highway,
clear
We could call the three _ariables vl, v2, and v3 and make them all doubles bv typing i.
infile
double(vt-v3)
!(4 observations
using
highway,
clear
read)
q
Reading string variables B} explicitly specifyir g the types, we can read string variables as well as numeric variables.
Example Typing infile
str2(
; "Sherri Holliday" !Branton 32 1 "Bill
Ross"
name age sex using
myfile
would read top of myfile,raw
25 1
27,0
topof myfile.raw or even topof myfile.raw,variation2 "Sherri l,'Bill
golliday" 25,1 Ross', 27,
32
"Branton"
end of myfile.raw,
i
variation
2
Note ihat the spacing is i Televant and either single or double quotes may be used to delimit strings. The quotes_do not count when calculating the leng|h of strings, Quotes may be omitted altogether if the 'string contains no )lanks or other special characters (anything other than letters, numbers, or undergcores). Typing • infile str20 nam_ age sex using (3 observations re _d)
makes
name
a str20
an(
age
and
sex
myfile
floats.
• infile sir20 nam_ age int sex using (3 observations re Ld)
We
might have typed
myrtle
tomake sex an int or • infile striO nam._ int(age (3 observations re _d)
to make both age and sex ints.
seX) using
myfile
d
! ;_
100
infile (free format) -- Read unformatted ASCII (text) data
13Technical Note infile ican also handle nonnumeric data by using value labels. We will briefly review value labels, but you should see [U] 15.6.3 Value labels for a complete description. A value hbel
is a mapping from the set of integers to words. For instance, if you had a variable
J
called sex in your data that represented the sex of the individual, you might code 0 for mate and 1 for female. Yofi could then just remember that every time you see a value of 0 for sex, that observation refers to a male, whereas 1 refers to a female. Even better, you could inform State that 0 represents males and 1 represents lab_l
define
sexfmt
0 "Male"
females by typing
1 "Female"
Then you must tell State that this coding scheme is to be associated with the variable sex. This is typically ddne by typing • lab_l
values
sex
sex,mr
Thereafter, State will print Male rather than 0 and Female rather than 1 for this variable. State is :unique in that it has the ability to turn a value label around. Not only can it go from numeric c@es to words like "Male" and "Female", it can go from the words to the numeric code. We tell infite which value label goes with which variable by placing a colon (:) after the variable name and ts{ping the name of the value label. Before we do that, we use the label to inform S_tataof the coding.
define
command
Let's assume that we wish to infile a dataset containing the words Male and Female and that we wish to store numeric codes rather than the strings themselves. This will result in considerable data compression, especially if we store the numeric code as a byte. We have a dataset named persons .raw that contains name, sex, and age: top of persons.raw "ArthUrDoyle"Male 22 "Mary! Hope"Female37 "GuyFawkes"Male48 "Carrke
House"
Female
25 end of persons.raw
Here is hoW we read and encode it at the same time: label inf_le
define str16
sexfmt name
(4 observatlons
0 "Male" sex:sexfmt
i "Female" age using
persons
read)
list
i.
name
sex
age
Doyle
Male
22
Hope
Female
37
Guy Fawkes Carrie House
Male Female
48 25
Arthur
2.
Mary
3. 4.
The strl61in
the infile
command applies
only to the name variable: sex is a numeric variable.
as we can Orove by 1
lis_,
1. 2. 3. 4.
nolabel name
sex
age
Doyle
0
22
Mary Hope Guy Fawkes
1 0
37 48
1
25
Arthur
Carrie
House
_1
!
Int.e (_
tormm)-- -eaaumormmmaA:su. _jext)ama
,u,
D Technidal Note
i l
When infile is direct_d to use a value label arid it finds an entry in the file that does not match any ofithe codings record ;d in the label, it prints a warning message and stores missing for the observation. By specifyin_ the automatic option,you can instead have infileautomatically add entries to the value la _el, new Say!you have a dataset containing three variables. The first, region of the countr7, is a character string; the remaining two eariables, which we will just call varl and vat2, contain numbers. You have stored the data in a le called geog. raw: top of geog.raw
l
'_NE" _NCntrl"
31.23 29.52
South West
29.62 28.28
- -
87.78 98.92 114.69 218.92
.E
17.5o
44.3a
/_Cntrl
22.51
55.21 end of geog.raw
The easiest way to read tlris dataset is . infile str6 regica varl vat2 using geog
making region a string affable. You do not want to do this, however, because you are practicing for reaNng a dataset like his containing 20,000 observations. If region were numerically encoded and stored as a byte, th _.rewould be a 5-byte _aring per observation, reducing the size of the data by 100,000 bytes. Y( also do not want to bother with first creating the value label. Using the automatic option, infi: e creates the value label automatically as it encounters new regions. infile byte regi( a:regfmt varl vat2 usiflggeog, automatic (6 observations re_5) , list 1. 2. 3. 4. 5. 6,
vat1 31.23 29.52 29.62 28.28 17.5 22.51
region NE NCntrl South West NE NCntrl
vat2 87.78 98.92 114.69 218.92 44.33 55.21
±nfi_e automatically ere _tedand defined a new value label cal/ed regfmt. We can use tbe label list izomrnandto view i is contents: • label list regfmt :
regfm I 2 3 4
NE NCn ;rl Sou,;h Wes
It is not necessary that he value label be undefi_edprior to the use of infile with the automatic option. If the value label regfmt had been previOu._]ydefined as ;. label define reg Ymt 2 "West" i
the result of labellistafter the in_ilewould have been reEfmr : 2 3 4 5
West NE NCntrl South
•
.............
i
......
The automatic option is so convenient that you may see no reason for not using it. Here is one. 102 iinfile(free format)-- Read unformattedASCII(text) data Suppose _ou had a dataset containing, among other things, an individual's sex. You know that the sex variat_le is supposed to be coded male and female. If you read the data using the automatic 1 option and if one of the records contains fmlae, infile will blindly create a third sex rather than print a wdrning. 1 iIi Cl :
i t
.
Skippingariables Specifying _skip instead of a variable name directs infile to ignore the variable in that location. This feature makes it possible to extract manageable subsets from large disk datasets. A number of contiguou_ variables can be skipped by specifying _skip(#) Ignore. {
where # is the number of variables to
> Example In •the. _ighway example that started this section, the data file contained three variables: acc...xate, ] spcllamt; and acc_pts. You can read just the first two variables by typing t
• in_ile ace_rate
spdlimit _skip using highway
You can r_ad the first and last variables by typing in_ile ace_rate _skip acc_pts using highway, clear
You can r_ad just the first variable by typing :
• in_ile ace_rate _skip(2)using highway, clear {
ma_ be specified more than once. If you had a dataset containing four variables, say a, b, c, and d, _nd you wanted to read just a and c, you could type infile a _skip c _skip using filename, i _
,,
_skip
i I
1
Skipping observations Subsets! of observations can be extracted by specifying if exp, which also makes it possible to extract manageable subsets from large disk datasets. Do not, however, use the _variable __Nin exp. Use the ifirange modifier to refer to observation numbers within the disk dataset.
•
> Example
'
i
.Again r_ferring to the highway example, if you type • in] ile ace_rate spdlimit (2 oiservationsread)
ace_pts
iI
ace_rate>3
only obser 1 'ations for which ace_rate is greater than 3 will be infiled. You can type •
iniile
(30_
'
ace_rate
servations
to read onl_ the second,
spdlimit
acc_pts
in
2/4,
clear
read)
third,
and fourth
observations.
q }
I
infile (free format)-- Read unformattedASCII (text) data
Reading time-series ! ! ! i
103
clata
If you are dealing wilh time-series data, you may receive datasets organized by variables rather than by obser_'ations. All the observations on the first variable appear, followed by all the observations on the second variable, _nd so on. The byvariable(#)option specifies that the external data file is organized in this way. You specify the number of obseWations in the parentheses, since infile needs to know that numter in order to read the data properly. Alternatively, you can mark the end of one variable's data an:l the beginning of anottier's by placing a semicolon (';') in the raw data file. You may then specif a number larger than the number of observations in the dataset and leave it to 4nfile to determin, the actual number of observations. This method can also be used to read unbalanced data.
> Example YoU have time-series data on four years recorded in the file time.raw. information on year, amount, and cost and is organized by variable:
The dataset contains
top of time.raw i1980
1981 17 25
14 120
135
1982
198
30 150
180 end of time.raw
I
You can read these data _y typing :. infi!e year amot at cost using (4 observations re_d)
time,
byvariable(4)
list
I.
year 1980
amount 14
cost 120
2. 3.
1981
17
135
1982
25
4.
150
1983
30
180
If the ldata instead contai_ed semicolons marking the end of each series and had no information for amoum in 1983, the raw data might appear as 1980
I981
1982
14 17 25 ; i20 135 150 180
;
1983
; i
i
You could read these datI by typing • infile year amount (4 observations re_d) t . list
cost using
time,
I amount 14
cost 120
t.
year 1980
2.
1981
17
135
3.
1982
25
150
4.
1983
byvariable(lO0)
180
4
104
_
i:
infile (free format) -- Read unformatted ASCII (text) data
Also See! Complementary:
JR] outfile, JR] outsheet, [R] save
Related:
JR] infile (fixed format), JR] input, JR] insheet
Backgrodnd:
[U] 24 Commands to input data, [R] infile
[,n,x,,,.e0,oroa,,Re.d A=.extinxod fo nat t
_
i,,
i
i
i i
il
i
,
Syntax
in!fix specification sing filename [if exp] [in range] [, clear i i wherespecificationis # _irstlineoflile # lines #:
/ E
[byte lint Ifloat ilong ]double I str ] varlist and dfilename,
if it exists
[#-]#[-#]
contains
t
[
t
_
top of dictionary file --
I
infix dictionary Lsingfitename] { * comment, _recededby asterisk may appear freely specificafion_ _(yourdata might appear 5ere) end of dictionary file ......
4
If dfile_ame is specified wit
tan ex)ensiom .dot is assumed.
If fileng_me2or filename is specified without an extension, .rawis assumed. In the first svntax, if usingJ_lename 2 is not specified onthe command line and using file,atne is not specified in the _dictionarv_the data ard assumed to begin on the lifie following the close brace.
Description infix reads into memory
a disk dataset that is not in Stata format,
infix requires
that the data
_!
be in fixed-column forma You have alternatives t_ infix,
! !
(fixed format)--and it can read data in free format--see JR] infile (free format). Most people think infix is easier to use for reading fixed-format data, but infile has more features. If your data are not fi_ed-format,
another
is what you are looking In its first syntax, i !
i
is one. It can also read data in fixed-format--see
is insheet; See [R] insheet.
)r. see [R] irdile and [u]24 x reads the data in a two,step
Commands process.
[R] infile
If you are not certain that infix
to
input data.
You first create a disk file describing
how t_e data are recorde, t. You tell infix to read that file--called a dictionary--and from there infix goes on to read th_ data. The data can be in the same file as the dictionary, or a different file. In its second intermediate
i
inf
alternative
infile
syntax,
ou tell infix
how to read the data right on the command
file.
105
line with no
'i
106
:infix (fixedformat)
Read ASCII (text) data in fixed format
Options using(fi!ename2) specifies the name of a file containing the data. If using() is not specified, the data ate assumed to follow the dictionary in dfitename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. If using(fiIename2) is specified, filenamez is used to obtain the data even if the dictionary itself says otherwise. clear specifies that it is okay for the new data to replace what is currently in memory. To ensure that y+u do not lose something important, inf ix will refuse to read new data if data are already i in memory, clear is one way you can tell infix that it is okav. The other is to drop the data yourself by typing drop _all before reading new data. I
Specifications # first_ineoffile
(abbreviation first) is rarely specified. It states the line of the file where the
for its#lf, first is instead specified when only the data appear in a file and the first few lines of that fillecontain headers or other markers. data begin, first is not specified when the data follow the dictionary; infix can figure that out i firstl appears only once in the specifications.
i! i" :
# lines!states the number of lines per observation in the file. Simple datasets typically have '1 ; lines!. Large datasets often have many lines (sometimes called records) per observation, lines is optional even when there is more than one line per observation because infixcan sometimes figure _t out for itself. Still, if 1 lines is not fight for your data, it is best to specify the directive.
'
i, ,
lines Iappears only once in the specifications. #: tells infix to jump to line # of the observation. Consider a file with 4 lines, meaning four lines per observation. 2: says to go to the second line of the observation. 4: says to go to the fourth line of_the observation. You may jump forward or backward: infix does not care nor is there any inefficiency in going forward to 3:, reading a few variables, jumping back to 1:, reading anothei" variable, and jumping back again to 3 :. It is n0t your responsibility to ensure that, at the end of your specification, you are on the last line of!the observation, infix knows how to get to the next observation because it knows where you are and it knows lines, the total number of lines per observation #: may appear, and typically does, many times in the specifications. / is an alternative to #:. / goes forward one line. //goes forward two lines. We do not recommend the usd of / because #: is better. If you are currently on line 2 of an observation and want to get [_ "'
to linei6, you could type////, but your meaning is clearer if you type 6:. / may!appear many times in the specifications.
: : i
[byte I int Ifloat j long I double and, sdmetimes,_more than one.
I str ]varlist [#-]#[-#]
instructs infix
to read a variable
'_
Begin _y realizing that the simplest form of this is varname #, such as sex 20. That says that variabl_ varname is to be read from column # of the current line: variable sex is to be read from
: '_ t
column20 and here, sex is a one-digit number. varn " rn m fr m the column ran e s eclfied read ar_e#-#, such as age 21-23, says to readva a e o " g p " ; age frtm columns 21 through 23 and here, age is a three-digit number. You cab prefix the variable with a storage type. str name 25-44 means to read the string variable name _rom columns 25 through 44. If you do not specify str. the variable is assumed to be numeriC. "Youcan specify the numeric subtype if you wish.
infix (fixed format) _ Read A_Cll (text) data in fixeclformat-
i i
You can specify more than one variable, with or without a type. byte ql-q5 51-55 means read va_ables ql, q2, q3, q4. and q5 from column_; 51 through 55 and store the five variables as b_tes. Finally, you can spec fy the line on which the Variable(S) appear, age 2:21-23
i
107
says that age is
tO:be obtained from }he second line, column_ 21 through 23. Another way to do this is to put together the #: direct}ve _ith the input-_afiabte directive: 2: age 21-23. There is a difference. but not with respect t_ reading the variable age, Let s consider two alternatives: ;1: str name 25-4_ age 2:21-23 ql-q5 51-55 1:
[
str
name
25-44
2:
age
21"23
ql-q5
51-55
The difference is thai the first directive says variables ql through q5 are on line I whereas the seCond says they are an line 2. When the colon is p_t out front it says on which line variables are to be found when we do not explicitly say otherwise. Vc'hen the colon is put inside, it applies only to the variable under consideration.
Remafks There are two ways t9 use "infix il
One is to type the specifications that describe how to read the
fixed_format data on thelcommand line: .
infix
ace
rate
_-4
spdlimit
6-7
acc_pts
9-11
using
highway.raw /
The other is to type the specifications into a file Z
topof highway.dcI,exampleI
--i infix
dictionary acc rate spdlimit acc_pts
asing highway.raw t-4
{
3-7 I-II
} end of highway.dct,
example
I
and {hen, inside Stata. t, _e . infix
{
i
using
hig way.dct
Which you use makes r_o difference to Stata. The first form is more convenient if there are only a few variables and the second form is less prone to error if you are reading a big, complicated file The second form alkws two variations, the one we just showed--where file_and one where the data are in the same file as the dictionary:
the data are in another
topof highwav.dct,example2i
infix
dictionary acc_rate
{ i-4
spdlimit
_-7
acc_pts
)-II
} 4.58
55 .46
2.8660 1.61 3.02
4.4
2.2 60 4.7 --
>a ot6 that
in the first ex
ple, the top line of the file read infix
wheieas in the second toe line reads simply iMix {
dictionary.
data _.are it is implied t_at__the data follow the dictionary.
end of highway.tier
dictionary
example
using
2
highway.raw where the When you do not say.
108
infi]K(fixed format) -- Read ASCII (text) data in fixed format
'!.
> Example So let's complete the example we started. You have a dataset on the accident rate per million vehicle miles along a stretch of highway, the speed limit on that highway, and the number of access points per mile. You have created the dictionary file highway, dct which contains the dictionary and the data: top of highway.dct,example 2 infix d_ctionary { ace_rate I-4 spdlimit 6-7 acc_pts 9-11
} 4.58 2.86
55_ .46 6_ 4.4
1.61 3.02
i 2.2 6_ 4.7
|
! ! end of highway.dct,
example 2
You created this file outside of Stata using an editor or word processor. Inside Stata. you now read the data. infix lists the dictionary so you will know the directives it follows:
! i
: !
• infix_using highway infix dictionary { ace_rate 1-4 spdlimit 6-7 acc_pts 9-11
} (4 observations
read)
list 1. 2. 3. 4.
ace_rate 4.58 2.86 1.61 3.02
spdlimit 55 60 60
acc_pts .46 4.4 2.2 4.7
Note that we simply typed infix using highway rather than infix using highway.dct, When we do not specify the file extension, infix assumes we mean .dot. <1
Reading string variables When you do not say otherwise in your specification either on the command line or in the dictionary infix assumes variables are numeric. You specify that a variable is a string by placing str in front 9f its name: infix
id t-6
str name 7-36
age 38-39
str sex 41
uslng employee.raw
or top of emptoyee.dct infix d_etionary using employee.raw id t-6 isir name 7-36 age s_r sex
{
38-39 40
} end of empIoyee.dct
infix (fixedformat)--- Read ASCII(text) data in fixedformat
109
f Reading!multiple-lines-er-observation When a dataset has muir le lines per observation, sometimes said multiple records per observation. you specify the number ol lines per observation using lines and you specify on which line the elements appear using #:. . infix
2 lines
1: id 1-6
str name 7-36
2: age I-2
str sex 4
using emp2.raw
oF topofemp2Act iz_fixdictionary using emp2.raw { 2I:lines id sir name 2: age str sex
i"6 7' 36 1"i2 4
} end of emp2,dct
There _e lots of different _,ays to say the same thing.
,
> Example Consider the following
l
aw data:
i_ income educ / se_ age / rcode, 1024 25000 HS | Male 28 119503
top of mydata.raw answers
to questions
--
1-5
1025 27000 C Female 24 022113 1035 26000 HS Male 32 110321
f36 2sooo c Female 25 131232 ;
--
end of mydata.mw
This dmaset has 3 lines oer observation and the first line is just a comment. One possible set of specifi+ations to mad _is ktata'_is infix dictionary u i 2 first 3 lines I: id income str educ 2: str sex 3:
4
topof mydatalAct
ing mydata {
I-4 6-10 12-13 6-11
int age !13-14 rcode 16 ql-q5
7-16
I end of mydatal,dct
----_
although we pre_r i
110
infix (fixed format) -- Read ASCII (text) data in fixed format top of mydata2.dct infi_
dictionary
using
mydata
{
2 first 3 lines id
1:I-4
income
I: 6-10
' E_
sir
educ
1:12-13
i!i
sir
sex
2:6-11
I '
age rcode
2:13-14 3:6
ql-q5
3:7-16
} •end of mydata2.dct Either will read these data, so we will use the first and then explain why we prefer the second. • infix
using
mydatal
infix dictionary 2 first I:
using
lines id
mydata
{
1-4
income
6-10
str
educ
12-13
2:
str
sex
6-11
3:
int age rcode
13-14 6
ql-q5
7-16
} (4 observations • list
in
read)
I/2
Observation
1 id
1024
income
sex
Male
age
28
q2 q5
9 3
q! q4 Observation
1 0
25000
educ
HS
rcode
1
q3
5
2 id
1025
sex
Female
income
educ
C
age
27000 24
rcode
0
q3
1
ql
2
q2
2
q4
1
q5
3
Now, what is better about the second? What is better is that the location of each variable is completely documented on each line, in terms of both line number and column, Since infix does not care about the order in which we read the variables, we could take the dictionary, jumble the lines, and it would still work. For instance, .... infi:
dictionary first
using
mydata
top of mydata3.dct
{
lines str
sex
1
rcode
!
sir age id
educ
ql-q5 income
2:6-11 3:6 1:12-13 2:13-14 I: i-4 3:7-16 i: 6-10
}
t
end of mydam3.dct
!
[ ]
I
infix(fixedformat)--Read ASCII(text)datain fixedformat
111
wilt also read these data even though•for each observation, we start on line 2, go forward to line 3, jump back to line l, and end up on line 1. It is not even inefficient to do this because infix does not really jump to record 2, then record 3, then record 1 again, etc, infix takes what we say and organizes it efficiently. The order in which we say it makes no difference. Well, it does make one: the order of the variables in the resulting Stata dataset will be the order we specify. In this case the reordering is senselessbut, in real datasets, reordering variables is often desirable. Moreover, we often construct dictionaries, realize _at we omitted a variable, and then go back and modify them. By making each line complete in and of itself, we can add new variables anywhere in the dictionary and not worry that. because of our addition, something that occurs later will no longer read correctly. <1
Readingsubsetsof observations
i
If you wanted to read only the males from some raw data file, you might type • infix
id i-6
sir name 7-36
age 38-39
str sex 41
using employee.raw if sex=="M"
If your specification was instead recorded in a dictionary, you could type infix
using employee.dct i_ sex=="M"
In another dataset, if you wanted to read just the first t00 observations, you could type (
infix 2 lines > in i/i00
1:
id I-6
str name 7-36
2: age i-2
str sex 4
using empi.raw
Or, if the specification was instead recorded in a dictionary and you wanted observations 10l through 573, you could type • infix using emp2.dct in 101/573
Also See Complementary:
[R]outfile, [R] outsheet, [R] save
Related:
[R]intile (fixed format), [R]insheet
B_ckground:
[L] 24 Commands to input data, [R]intile
i
F °'; e
input -- Enter data from keyboard I
I II I
I III
I
I
I
Syntax input
[varlist]
[,_automatic label]
Description input allows you to type data directly into the dataset in memo_• alternative to input.
Also see [R] edit for a windowed
Options automatic causes Stata to create
value labels from the nonnumeric
data it encounters•
automatically widens the display format to fit the longest label. Specifying label even if you do not explicitly type the label option.
automatic
It also implies
label allows you to type the labels (strings) instead of the numeric values for variables associated with value labels. New value labels are not automatically created unless automatic is specified.
Remarks If there are no data in memory, when you type input you must specify a vartist• Stata will then prompt you to enter the new observations until you type end.
> Example You have data on the accident rate per million vehicle miles along a stretch of highway along with the speed limit on that highway. You wish to type these data directly into Stata: • input nothing to input r (104) ;
Typing input by itself does not provide enough information know the names of the variables you wish to create. • input ace_rate spdlimit 1. 2. 3. 4.
ace_rate 4.58 55 2.86 60 1.61 end
spdlimit
112
about your intentions.
Stata needs to
input -- Enter data from keyboard
113
:
! _ !
i
We typed input acc_rate spdlimit and Stata responded by repeating the variable names and then prompting us for the first observation. We then typed 4.58 and 55 and pressed Retth,'n. Stata prompte_ktusfor the second obsen, ation. We entered it and pressed Return. Stata prompted us for the third 6bservation. We knew that the accident rate is 1.61 per million vehicle miles, but we did not know the corresponding speed limit for the highway. We typed the number we knew, 1.61, followed by a period, the missing value indicator. When we pressed Return, Stata prompted us for the fourth 6bservation. We were finished entering our data, so we typed end in lowercase letters.
i i
We can now list
the data to verify that we have entered it correctly:
. list i. 2. 3.
acc_rate 4.58 2.86 1.61
spdlimit 55 60 Q
If you have data in memory and type input without a vartist, you will be prompted to enter adklitional information on all the variables. This continues until you type end. :
i
i
Examp You now have an additional observation you wish to add to the dataset. Typing input by itself tells Stata that you wish to add new observations: • i_ut 4, 5,
act_rate 3.02 60 end
spdlimit
St_ta rem/nded us of the names of our v_-iables and prompted us for the fomth observation. We entered 'the numbers 3,02 and 60 and pressed Return. Stats then prompted us for the fifth observation. We could add as many new observations as we wish. Since we needed to add only one observation, we typ_ _nd, Our dataset now has four observations. "xl
You may add new variables to the data in memory by typing input followed by the names of the new variables. Stata will begin by prompting yGu for the first observation, then the second, and so on, until you type end or enter the last observation.
'iExample i
! ,
In addition to the accident rate and speed limit, we now obtain data on the number of access points (omramps and off-ramps) per mile along each stretcl of highway. We wish to enter the new data.
I
• input acc_pts acc_pts t. 4.6 2. 4.4
3 2.2 I
i
4. _4.7
F
114 input -- Enter data from keyboard When we typed input acc_pts, Stata responded by prompting us for the first observation. There are 4.6 access points per mile for the first highway, so we entered 4.6 and pressed Return. Stata then prompted us for the second observation, and so on. We entered each of the numbers. When we entered the final observation, Stata automatically stopped prompting us--we did not have to type end. Stata knows that there are four observations in memory, and since we are adding a new variable, it stops automatically. We can, however, type end anytime we wish. If we do so, Stata fills the remaining observations on the new variables with m/ssing. To illustrate this, we enter one more variable to our data and then list the result: • input
junk
jun_ 1. 1 2. 2 3. end • list acc_rate 4.58
I.
spdlimit 55
acc_pts 4.6
60
4.4
2.86
2. 3,
1.61
4.
3• 02
junk 1 2
2.2 60
4.7
q
You can input string variables using input, but you must remember to explicitly indicate that the variables are strings by specifying the type of the variable before the variable's name.
> Example String variables are indicated by the types str#, where #represents the storage length, or maximum length, of the variable. For instance, a str4 variable has maximum length 4, meaning it can contain the strings a, ab, abe, and abed but not abede. Strings shorter than the maximum length can be stored in the variable, but strings longer than the maximum length cannot. You can create variables up to str80 in Stata. Since a str80 variable can store strings shorter than 80 characters, you might wonder why you should not make all your string variables str80. You do not want to do this because Stata allocates space for strings based on their maximum length. It would waste the computer's memory. Let's assume that we have no data in memory and wish to enter the following input
strl6
name
age
str6
name i.
"Arthur
2.
"Mary
3. Guy "Fawkes" 3.
"Guy
Hope"
Fawkes cannot
We first typed input sex a str6 variable.
sex age
sex
22 male
37
"female"
48 male be read
Fawkes"
4. "Kriste 5. end
:
Doyle"
data:
as a number
48 male
Yeager"
25 female
strl6 name age str6 sex, meaning that name is to be a strl6 variable and Since we did not specify anything about age, Stata made it a numeric variable.
Stata then prompted us to enter our data. On the first line, the name is Arthur Doyle, which we typed in double quotes. The double quotes are not really part of the string; they merely delimit the
_!lput
_
_l_,t;'w uam
llVli!
hVyL_)awu
J ,_,
beginning and end of the str ng. We followed that with Mr Doyle's age, 22, and his sex, male. We did not bpther to type doubk quotes around the word male because it contained no blanks or special characters. For the second o _servation,we did type the double quotes around female;it changed _othing. In the third observation w omitted the double quotes around the name, and Stata informed us that Fawkes c_uld not be read as number and repromptddus for the observation. When we omitted the double q_otes, Stata interpre:ed Guy as the name, Fa_rl_esas the age, and 48 as the sexl All of this would have been okay with Stata except for one problem: Fawkes looks nothing like a number, so Stata complained and gave :s another chance. This lime, we remembered to put the double quotes around ttie name.
i
Stata was satisfied, and _ continued. We entered lhe fourth observation and then typed end. Here is our dataset: • _ist 1. 2. 3. 4.
1 nam_ Arthur Doyle Mary Hope Guy Fawke. _ Kriste Yeagez
age 22 37 48 25
sex male female male female
q
I
>
I
Example Just as we indicated whic Lvariables Werestrings by placing a storage type in front of the variable name, we can indicate the .torage type of our numeric variables as well. Stata has five numeric storage types: byte, int, 1c ng, float, and double. When you do not specify the storage type. Stata assumes the variable is afl _at. Youmay want to review the definitions of numbers in [U] 15 Data.
! ' i i i i !i i i
,'_dditional Therei are two reasons you might The wantdefault to explicitly specify storage type: toforinduce precision or to co_vhy aserve memory. type float has the plenty of precision most circumstances because Stata performs all calculations in double precision no matter how the data are stored. I[ you were storing 9-digit Social Security Numbers, however, you would want to coerce a different storage type or else the last digit would be r0uhded, long would be the best choice: double would _,_orkequally well, b_]tit would waste memory. Sometimes you do not need to store a variable as float.If the variable contains only integers between -32,768 and 32,7i_6,it can be stored as an int and would take only half the space. If a variable contains only inti',gersbetween -127 and 126, it can be stored as a byte which would _:akeonly half again as mu( i space. For instance, in tile previous example we entered data for age ,_ithout explicitly specifyin, the storage type; hence, it was a float. It would have been better to _tore it as a byte.To do ti" tt. we would have typed input
strl6
name b _te age str6 nam _
_. "Arthur Doyle"
sex sex
12male
°I i
2. "Mary Hope" 37 'female" _. "Guy Fawkes" 48 male
i
4. "Kriste
Yeager"
age
25 female
5. end
i
Stata understands a number of shorthands. For instance,
_I
input
int(a
b) c
allows you to input three variables, Remember .{input
int
and c a float•
a b c
would make a an int *
a, b, and c, and makes both a and b ints
but both b and c floats.
. inputa longb double(cd) e would make a a float,b a long,c and d doubles,and e a float. Statahas a shorthandforvariable names withnumericsuffixes. Typingvl-v4 isequivalent to typing Vl v2 v3 v4. Thus, • linput
'
int(vl-v4)
inputs f6ur variables and stores them as ints. q
Q Technic,_l Note You may want to stop reading now. The rest of this section deals with using input with value labels. If you are not familiar with value labels, you should first read [U] 15.6.3 Value labels. Remdmber that value labels map numbers into words and vice versa. There are two aspects to the process. !First, we must define the association between numbers and words. We might tell Stata that 0 corresponds to male and 1 corresponds to female by typing label define sexlbl 0 "male" 1 "female". The correspondences are named, and in this case we have named the O_male l++female correspondence sexlbl. Next, iwe must associate this value label with a variable. If we had already entered the data and the variable was called sex, we would do this by typing label values sex sexlbl. We would have entered the data by typing O's and !'s, but at least now when we list the data, we would see the words rather than the underlying.numbers. We cab do better than that. After defining the value label, we can associate the value label with the type:variable at the time we input the data and tell Stata to use the value label to interpret what we l_bel • i_put
define strl6
I.
"Arthur
2.
"Mary
3.! "Guy
sexlbl name
byte(age
Hope"
1 "female"
sex:sexlbl),
name Doyle" 22 male
Fawkes"
4. "Kriste 5. end
0 "male"
age
label sex
37 "female" 48 male
Yeager"
25 female
After deft ing the value label, we typed our input command. Two things are noteworthy: We added the label option at the end of the command, and we typed sex:sexlbl for the name of the sex variable, T_e byte(...) around age and sex:sexlbl was not really necessary: it merely forced both age _nd sex to be stored as bytes. Let's first decipher sex : sexlbl, sex is the name of the variable we want to input. The : sexlbl part tells Stata thal the new variable is Lo be associated with the value label named sexlbl. The label option tells Stata that it is to look up any strings we type for labeled variables in their
input- Enter datafrom keyboard
117
corresponding value label and substitute the number when it stores the data. Thus, when we entered the first observation of ou • data, we typed male for Mr Doyle's sex even though the corresponding variable is numeric. Rather than complaining that ""mate" could not be read as a number", Stata accepted what we typed, 3oked up the number corresponding to male, and stored that number in the data.
i
The! fact that Stata has lctually stored a number rather than the words male or female is almost irrelevant. Whenever we ist the data or make a table, Stata will use the words male and female just as if those words were actually stored in the dht/set rather than their numeric codings: • list
I.
nm _e
age
se_
DoylLe
22
male
Ho],e
37
female
Guy Fawb is
48
male
95
female
Arthur
2.
Mary
3. i
, Kriste Yeag_ r tabulate sex sex
] req.
Percent
Cure,
male
2
50. O0
50.00
female
2
50. O0
I00. O0
Total
4
I00. O0
It is only almost irreleva at since we can make use of the underlying numbers in statistical analyses. For instance, if we were to ask Stata to calculate the mean of sex by typing sumrnarize sex, Stata would report 0.5. We woul interpret that to mean that one-half of our sample is female.
i i
Value labels are perman_ Itly associated with variables. Thus, once we associate a value label with a variaNe, we never have ti do so again. If we wanted to add another observation to these data, we
i
could type . input,
i i
label
i5. "Mark
Esman"
nam _ 26 male
age
sex
_. end
!_ i
[3Technical Note The automatic option ',utomates the definition of the value label. In the previous example, we _nformed Stata that male c, ,rresponds to 0 and female corresponds to 1 by typing label define sexlbl 0 "male" :t "female". It was not necessary to explicitly specify the mapping. Speci_,ing the aut6maticoption tells ;tara to interpret what we type as follows:
i i
ii
First, ;see if it is a numbeI If so, store that number and be done with it. If it is not a number, check
I I ! i_ i
I
the value label associated u th the variable in an attempt to interpret it. If an interpretation exists, store theIcorresponding nun: .tic code• If one does not exist, add a new numeric code corresponding to what was typed. Store th_ new number and update the value label so that the new correspondence is never t'orgotten. We can use these feature to reenter our age and sex data. Before reentering the data, we drop -all and label drop _all to prove tha_ we have nothing up our sleeve:
atop_an _abel
drop
_all
....
i
118
i input -- Enterdata from keyboard input
strl6
name
!
byte(age
sex:sexlbl),
name
i.
"Arthur
_.
"Mary
3.
"Guy
4.
"Kriste
Doyle" Hope"
22
37
Fawkes"
age
48
Yeager"
automatic sex
male
"female" male 25
female
. end i I
•
T We previouslydefinedthevaluelabelsexlbl so thatmale correspondedto 0 and female corresponded
to 1. Th+ label that Stata automatically created is slightly different but just as good: i
Sabel list sexlbl se: Ibl : I
male
0 2
female
Also See
'
Complementary:
[R] save
i
Related: i
[R] edit, [R] infile
Background:
[U] 24 Commands to input data
, i
t
I
i
, I
!
!_ !
i
_ in_iheet -- Read AS II (text)data created by a spreadsheet i i iHll i r i i iJ ii i iil ill [
i
Syntax i i
i i
i
insheet
[varlist] using
filename
[, _double [no]n_ames comma t__abclear
]
If filen_me is specifiedwithmt an extension, .raw is assumed.
Description
in_heet reads into rremory a disk dataset that is not in Stata format. ±nsheet is intended for readir_g files created by a spreadsheet or database program. Regardless of the creator, :i,nsheet reads text (ASCII) files where here is one observation per line and the values are separated by tabs or commas. In addition, the first line of the file can contain the variable names or not. The best thing
I i
[ about!insheet is that if you type . insheetusingill, name
i
insheet will read your lata; that's all there is to it. Stata has other comrr ands for reading data. If you are not sure that insheet
i
lookingfor, see [R] infih and [U] 24 Commands to input data. If y/ou want to save your data in "spregdsheet-style" forma
Options
is what you are
see [R] outsheet.
i
double forces Stata to st_age types.
t
tore variables as double_
rather than floats:
see IV] 15.2.2 Numeric
:-
[no]names informs Stata whether variable names are included on the first line of the file. Speci_,ing this option will speed insheet's processing--assuming you are right--but that is all. ±nsheet can determine for itse!f whether the file includes variable names.
1
comma tells Stata that the values are comma-separated. Specifying this option will speed insheet's pr0cessing--assumin_ you are right--but thai is all. insheetcan determine for itself whether the separation charact_ is a comma or a tab.
i i
!
tab prOcessing--assumin_ tells Stata the v_lues are right--but tab-separated. this can option will speed insheet's you are that Specifying ig all. insheet determine for itself whether the separation charact_:r is a tab or a comma.
i
clear specifies that it is okay for the new data |o replace what is currently in memory. To ensure that you do not lose sc mething important, insheetwill refuse to read new data if data are already in memory I clear
is _ne way you can tell ±nsheet
x_ourselfb_ typing drip
_all
that it is okay. The other is to drop _he data
before reading new data.
119
i
12o
Remarks
insheet-
Read ASCII (text) data created by a spreadsheet
There i_ nothing to using insheet.You type insheet and
insheet
using
filename
will read your data, That is, it will read your data if
1. It can find the file and 2. The file meets insheet's
expectations
as to the format in which it _s written.
Assuring I is easy enough; just realize that if you type infix using myfile, Stata interprets this as an instruction to read myfile.raw. If your file is called myfile.txt, type infix using myf ile. t,btt. As for the file's fo,-mat, most spreadsheets and some database programs write data in the form insheet expects, It is easy enough to look as we will show you--and it is even easier simply to try and see what happens. If typing • insheet
using
filenarrle
does not produce the desired result, you will have to try one of Stata's other infile commands: see [R] infile.
> Example You ha*e a raw data file on automobiles and can bd read by typing (5
called auto.raw.
This file was saved by a spreadsheet
insheet using auto vars, I0 obs)
That done, we can now look at what we just loaded: • describe Contains
data
obs:
I0 5
vats: size:
310
(99.8%
storage
of memory
free)
display
value
type
format
label
make_ price
strt3 int
%13s %8.0g
mpg
byte
Z8.0g
rep78
byte
%8.0g
foreign
strlO
ZlOs
variable
name
Sorted by: |Note:
dataset
has
changed
since
last
variable
label
saved
li_t
I. i 2._I 3,
make
price
mpg
AMC Concord AMC Pacer
4099 4749
22 17
3 3
foreign Domestic Domestic
Spirit
3799
22
4. Buick 5. Buick
Century Electra
4816 7827
20 15
3 4
Domestic Domestzc
6. Buick
LeSabre
5788
18
3
Domestzc
4453
26
7. !
AMC
rep78
Buick
Opel
Domestic
Domestic
insheet 8. BuickRegal 9. Buick Riviera 10. Buick
Read ASCII (text) data created by a spreadsheet
5189 10372
20 16
3 3
Domestic Domestic
4082
19
3
Domestic
Skylark
Note that these data contain a combination of string and numeric variables, insheet out by i_elfi
121
figured all that
i
i
3 Technical Note Now let's back up and look at the auto.raw screen: • Sype mal_e
These invisible
file. Stata's type command will display files to the
auto.raw mpg
rep78
foreign
AM¢ Concord
4099
22
3
Domestic
AMC Pacer
4749
17
3
Domestic
AMC Spirit
3799
22
Buick
Century
4816
20
3
Domestic
Buick Buick
Electra LeSabre
7827 5788
15 18
4 3
Domestic Domestic
Buick
Opel
4453
26
Buick Buick
Regal Riviera
5189 10372
20 16
3 3
Domestic Domestic
Buick
Skylark
4082
19
3
Domestic
data and
i
price
have
tab
hence
characters
Domestic
Domestic
between
indistinguishable
i !
]
i
values.
from
blanks,
Tab
characters
type's
showtabs
are
difficult option
to makes
see
since the
tabs
thev
are
visible: I
_ype
auto.raw,
showtabs
1
makepricempgrepZ8foreign AMC Concord4099223Domestic AMC Pacer4749173Domestic AMC Spirit379922.Domestic Buick Century4816203Domestic Buick Electra7827lS4Domestic Buick
LeSabre5788lS3Domestic
Buick
Opel4453<_>26.Domestic
Buick Buick
Rega!5189203Domestic KivieralO372i63Domestic
Buick
Skylark4082193Domestic
:
This is an example of the kind of data insheet is willing to read. The first line contains the variable names, Nthough that is not necessm/. What is nedessary is that the data values have tab characters between them. + insheet would be just as happy if the data values were separated by commas. Here is another vafiationi on auto. raw that insheet can read: type
auto2.raw
make,price,mpg,rep78,foreign AMC Concord,4099,22,3,Domestic AMC Pacer,4749,17,3,Domestic AMC Spirit,3799,22,,,Domestic Buick Buick
Century,48i6+,20,3,Domestic Electra,7827i, 15,4,Domestic
Buick
LeSabre,5788,18,3,Domestic
Buick
Opel,4453,26!,,Domestic
i
Buick Buick
Regal,5189,20,3,Domesti¢ Riviera,lO37_,16,3,Domestic
i
Buick
Skylark,4082.,19,3,Dom_sZic
["
It is way one easieror for the us other. human beings to see the commas rather than the tabs. but computers 122
insheet-
do not care O
Read ASCII (text) data created by a spreadsheet
!
D Example The file does not have to contain variable names. Here is another variation on auto. the first line. this time with commas rather than tabs separating the values:
raw without
type auto3, raw AMC Concord, 4099,22,3, Domestic AMC Pacer, 4749,17,3 ,Domestic (output omitted ) Buick
Skylark. 4082,19,3, Domestic
Here is what happens when we read it: insheet using auto3 you must start with an empty dataset r(I8);
Oops; we still have the data from the last example in memory. • insheet using auto3, clear (5 vars, I0 obs) . describe Contains data obs : vats : size:
variable name
10 5 310 (99.8Y,of memory free) storage type
display format
vl
strl3
Y,13s
v2 v3 v4 v5
int byte byte strlO
7,8.0g ]/,,8.0g XS. 0g Y, IOs
Sorted by : Note:
value label
variable label
dataset has changed since last saved
list vl AMC Concord AMC Pacer
v2 4099 4749
v3 22 17
v4 3 3
v5 Domestic Domestic
(output omitted ) i0. Buick Skylark
4082
19
3
Domestic
I. 2.
'j
The only difference is that rather than the variables being nicely named make, price, mpg, rep78, and foreign,they are named vl,v2, ..., v5. We could now give our variables nicer names: • rename vl make . rename v2 price
insheet-- Read ASCII (text) data created by a spreadsheet
123
!
Another alternative is to specie' the variable names when we read the data: • insheet make price mpg rep78 foreign u_ing auto3, clear
(5 vats,I0obs) list make AMC Concord AMC Pacer
i. 2.
price 4099 4749
mpg 22 17
rep78 3 3
4082
19
3
foreign Domestic Domestic
!
i
(outputomi,ed ) 10.
,i
Buick Skylark
Domestic
ii
If we use this approach, we must not specify too few variables • insheet make price mpg rep78 using aut03, clear too few variables specified error in line 11 of file r,(102) ;
i 4 I
|
or too many.
|
. insheet make price mpg rep78 foreign weight using auto3, clear too many variables specified e_ror in line 11 of file r,(103);
mat is why we recommend . insheet using
i
filename
/
It is not difficult to rename your variables afterwards should that be necessary,
q
> Example
I
About the only other thing that can go wrong is that the data are not appropriate for reading by insheet. Here is yet another version of the automobile data: type auto4.raw, showZabs "_AMCConcord" 4099 22 '_,AMC Pacer" 4749 17
3 3
Domestic Domestic
3 4
Domestic Domestic Domestic
"!AMCSpirit" '_BuickCentury" "Buick Electra"
3799 4816 7827
22 20 15
"Buick LeSabre" "Buick 0pel" "Buick Regal" ".'Buick Riviera"
5788 4453 5189 10372
18 26 20 16
3 3 3
Domestic Domestic Domestic Domestic
"_uick Skylark"
4082
19
3
Domestic
] ]
i
Note that we specified type's showtabs option and no tabs are shown. These data are not tabdelimited or comma-delimited and are not the kind of data insheet is designed to read. I,et's try insheetanyway:
] 1
(Continued on next page)
i
124
insheet -- Read ASCII (text) data created by a spreadsheet insheet using auto4, clear (I vat, I0 obs) describe Contains data obs : vats:
10 1
size:
430 (99.8Y,of memory free)
variable name
storage type
vl
display format
sir39
Sorted by: Note:
value label
variable label
7,39s
dataset has changed since last saved
• list vl I. AMC Concord 2. AMC Pacer (output omitted) 10, Buick Skylark
4099 4749
22 17
3 3
Domestic Domestic
4082
19
3
Domestic
When £nsheet tries to readdatathathave no tabsor commas, itisfooledintothinkingthedata contain justone variable. Ifyou had thesedata,you would have toreadthedatawithone ofStata's other commands such as infile (free format). <3
Also See Complementary:
[R] outfile, [R] outsheet, [R] save
Related:
[R] infile (free format)
Background:
[U] 24 Commands [R] intile
to input data,
Ilrt"w
1
e
r
I i;insect -- Display simple summary of data's ci laracteristics I IU ( i i II t _ _
m I,
_
I / "
I
i
yntax, :in.nspeict [varlist_ [±fexp] [in range] i byi ...
i
: may be used
with
inspect;
see [R] by.
)esCriiOn
i 1
i
The inspect command provides a quick summary of a numeric variable that differs from that provided by summ_arizeor tabulate. It reports ttie number of negative, zero, and positive values;
I
the nunlber of integers and nonintegers; the number of unique values: the number of missin_ and produces _asmall histogram. Its purpose is not anal)tical but to allow you to quickly gain familiarity with u_own data.
'
! l
ilt_vartisl, i
! i 1
Typing inspect
Example inspect
by itself produces an inspectio_ for all the variables in the dataset. If you specify
aninspectionofjustthosevafiablesisp resented. is not a replacement or substitute for s!!rnma/'ize and tabulate.
It is instead a data
management or information tool that lets you quickly g_in insight into the values stored ir_a variable, For instance, you receive data that purport to be on automobiles, and among the variables in tile dataset is one called mpg.Its variable label is l_ileage (mpg), which is surely suggestive. You i_aspect the variable • use
auto,
(1978
Automobile
inspect m]_g:
Data)
mpg
Mileage
! (mpg)
Number
of
Observations Non-
# #
Negative Zero
Total-
# #
Positive
74 .....
74
-
# #
#
#
74
74
-
#
#
#
,
i
clear
Total #
12
Missing
-
Integers -
-
41 (21 unique
Integers
74
values)
and you discover that the variable _s never missiflg; all 74 observations in the dataset have some value for rapg. Moreover. the values are all positive and they are all integers as well. Among those 74 observations are 2t unique (different) values. The variable ranges from 12 to 41. and you are provided with a small histogram that suggests the variable appears to be what it claims
i
lzs
[
_zo
znspec[ -- ulspiay slmpm summary ot data's characteristics
> Example Bob, a co-worker, presents you with some census data. Among the variables in the dataset is one called region which is labeled Census Region and is evidently a numeric variable. You inspect this variable. • use bobsdata • inspect region region:
Census region
Number of Observations Non-
#
# #
# # # #
Negative Zero Positive
#
#
#
#
Total
#
#
#
#
Hissing
# # #
5
Total 50 50
Integers
Integers
50 50
-
50
(5 unique values) region is labeled but I value is NOT documented in the label.
In this dataset something may be wrong, region takes on five unique values. The variable has a value label, however, and one of the observed values is not documented in the label. Perhaps there is a typographical error. A call to Bob would be in order.
> Example You call Bob and there was indeed an erron He fixes it and returns the data to you. Here is what inspect produces now: inspect region region:
# # #
Census region
# # # #
# # # # # #
Number of Observations NonNegative Zero Positive
# # # #
1
Total Missing 4
Total 50 50 -
Integers
Integers -
50 50
50
(4 unique values) region is labeled and all values are documented in the label.
4
I
inspect-- Displaysimplesummaryof data'scharacteristics
i
'
127
Example You receive data on the climate in 956 U.S. Cities. The variable tempjan records the Average January temperature in degrees Fahrenheit. The results of inspectare • inspect tempjan tempjan:
Average January temperature
Number of Observations Non-
i Negativ@ Zero Positiv4
# # # # # #
# # #
# # #
Total 954
Total Hissing
954 2
2.2 72.6 (More than 99 unique values)
Integers 78 78
Integers 876 ........ ! 876
i i
956
, i
In two of the 956 observations, tempjan is nlissiQg. Of the 954 cities that have a recorded tempjan, all are positive and 78 of them are integer valuesl tempjanvaries between 2.2 and 72.6. There are more than 99 unique values of tempj an in the dathset. (Stata stops counting unique values after 99.) <1
i
1
SaVediResults inspect
saves
i in r()"
I
Scalars r(N) r(N..Jaeg)
number of observatiqns number of negative 0bservations
r(N_O)
number of observations equal to 0
i ]
r (N_pc_s) number of positive observations r (N_negint) number of negative, Integer observations r(N_posint)
number of positive, integer observations
r(N_tmique) r(N_undoc)
number of unique values or. if more than 99 number of undocumented values or. if not labeled
AlsoSee
t
i i
Related: i
i !
[R] codebook, [R] compare, [R] describe, [R] Iv, [R] summarize, [R] table, [R] mbsmn, [R] tabulate
i ii
1
1 lue [ ipolate -- Linearly interpolate(extrapolate) I I values
]
I
]
I
Syntax ipolate war war by ...
[if
exp]
: may be used with ipolate;
[in range],
generate(newvar) [ epolate
]
see [R] by.
1
Description ipolate interpolated
creates newvar = yvar where yvar is not missing and fills in newvar (and optionally extrapolated) values of yvar where yvar is missing.
with linearly
Options generate (newvar) epolaZe
is not optional; it specifies the name of the new variable to be created.
specifes values are to be both interpolated and extrapolated.
Interpolation
only is the default.
Remarks > Example You have data points on y and x although, in some cases the observations on y are missing. You believe y is a function of x justifying filling in the missing values by linear interpolation: 1ist 1. 2. 3. 4. 5. 6. 7.
x 0 1
y 3
1.5 2 3 3.5 4
6
18
ipolate y x, gen(yl) (I missing value generated) • ipolate y x, gen(y2) epolate • list i. 2. 3. 4. 5. 6. 7.
x 0 1 1.5 2 3 3.5 4
y
yl
3
3 4.5 6 12 15 18
6
18
y2 0 3 4.5 6 12 15 18
128
8
!
!
i_i
l l
ipolate-- i_inearlyinterpolate(extrapolate)values
129
]
> E_amNe i
i
You have a dataset of circulation of a magazine from 1970 through 1993. Circulation is recorded in a Variablecalled circ and the year in year. in a few of the years, the circulation is not known
i
_o you want to fill it in by linear interpolation: . ipo!ate
!
cite year, gen(icirc)
Now assume you have data on the circulations of 50 magazines; the identity of the magazines is i recorded in magazine (which might be a string variable--it does not matter): i
by magazine: ipolate circ year, gen(ic_rc)
!i
! if the by ... : prefix is specified, interpolation is performed separately for each group.
]
> E_ample You have data on y and z although some of the y values are missing. You wish to smooth y(x) using lowess (see [R] ksm) and then fill in missifigwdues of y using interpolated values: I
. ksm y x, gen(yls) lowess 'ipolate yls x, gen(iyls)
i i i
q
"
:i]
/
]
MethodsandFormulas
:
i
ipolateis implemented as an ado-file. The value Y at x is found by finding the closest points (xo, yo) and (xt.yl), and xl > x, where Yo and 91 are observed, and +alculating
such that x0 < x
Y] - Yo y= _ (x-- Xo) +yo X 1 -- X 0
if epoZate is specified and if (xo, Yo) and (xt,yl) cannot be found on both sides of x, the two closest points on the same side of x are found arid the same formula is applied.
t
!
AlSO _e Complementary:
[t_]ksm
i
I I
i
ii
I Itle
[ ivreg -- Instrumental ,, variablesmidtwo-stageleast " squaresregression
i
Syntax ivreg
depvar
[vartish
] (varlist2
= varlistiv)
[weight]
[if
exp]
[in
range]
[, level(#) _beta_hasconsnoconstant _robustcl.uster(varname) firs_ noheader eform(string)depname(varname) msel 1 by ... : may be used with ivreg; see [R] by. aweights, freights, iweights, and pweights are allowed; see [U] 14.1.6 weight. depvar, varlistt, varlist2, and varlistiv may contain time-series operators; see [U] 14.4.3 Time-series varlists. ivreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntaxfor predict predict
[l-ype] newvarname
[if
exp]
[in range!
[, statistic j
where statistic is
xb re s iduals
xjb, fitted values (the default) residuals
p_r(a,b)
Pr(a < Yi < b)
e(a,b)
E(yjla < Yi < b)
ystar(a,b) stdp stdf
(Yj-), yj = max{a, min(yj,b)} standard error of the prediction standard error of the forecast
where a and b may be numbers or variables; a equal to '.' These aresample. available both in and out of sample; type predict the statistics estimation
means -2; ...
b equal to ' ' means -roe.
if e(sample)
...
if wanted only for
Description ivreg estimates a linear regression model using instrumental variables (or two-stage least squares) of depvar on varlistl and varlist2 using varIisti,, (along with varlish) as instruments for varlist2. In the language of two-stage least squares, varlist_ and varlistiv are the exogenous varlist2 the endogenous variables.
130
variables and
ivreg -- Instrumentalvariablesand two-stage least squares regression
131
OptiOns level (#) specifies the confidence level, in percent, for confidence intervals. The default is level or as set by set level: see [U] 23.5 Specifying the width of confidence intervals.
(95)
i
beta asks that normalized beta coefficients be reported instead of confidence intervals,
i I
hascons indicates that a user-defined constant or its equivalent is specified among the independent variables. Some caution is recommended when specifying this option as resulting estimates may not be as accurate as they otherwise would be. See [R] regress for more information. noconstant
suppresses the constant term (interCept) in the regression.
robust specifies that the Huber/White/sandwieh estimator of variance is to be used in place of the traditional calculation (White, 1980). This alternative variance estimator produces consistent standard errors even if the data are weighted or if the residuals are not identically distributed. robust combined with cluster() further allows residuals which are not independent within ctu_ter (although the3, must be independent between clusters). See [u] 23.11 Obtaining robust variance estimates.
i i
i t
If you specie, pweights,
I
robust
is implied[ see [u] 23.13 Weighted estimation.
cluster (varname) specifies that the observations are independent across groups (clusters) but not necessarily independent within groups, varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals, cluster () affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficientsL see [U] 23.11 Obtaining robust variance estimates, cluster () can be used with pweights to produce estimates for unstratified cluster-sampled data, see [U] 23.13 Weighted estimation. Also see [R] svy estimators for a command designed especially for sur_'ey data. c!us't;er() by itself. first
implies robust:
specifying robust
cluster()
is equivalent to typing cluster()
requests that the first-stage regression results be displayed.
noheader suppresses the display of the ANOVAtable and summary statistics at the top of the output, di:splaying only the coefficient table, This option is often used in programs and ado-files,
'
eform(string) is used only in programs and ad0-files that employ ivreg to estimate models other than instrumental variable regression, eformO specifies the coefficient table is to be displayed in "exponentiated lbrm" as defined in [R] maximize and that string is to be used to label the exponentiated coefficients in the table. depname (varname) is used only in programs and ado-files that employ ivreg to estimate models other than instrumental variable regression, depname () may be specified only at estimation time. varname is recorded as the identity of the dependent variable even though the estimates are ca'tculated using depvar. This affects the labeling of the output--not the results calculated_but could affec_ subsequem calculations made by predict, where the residual would be calculated as deviations from varname rather than depvar, depname() is most typically used when depvar is a temporary variable (see [el macro) used as a proxy for varname. msel is used only in programs a_d ado-files that employ ivreg to estimate models other than instrumental variables regression, msel sets the mean square error to 1, thus forcing the variancecovariance matrix of the estimators to be (X'DX) -1 (see [R] matsize Methods and Fonnulas) and so affects calculated standard errors. Degrees of freedom for t statistics are calculated as _ ra_her than n - k.
t
132
ivreg -- Instrumental variables and two-stage least squares regression
Options for predict xb, the default, calculates
the linear prediction.
res±duals calculates the residuals; that is, gj -xjb. These are based on the estimated equation when the observed values of the endogenous variables are used--not the projections of the instruments onto the endogenous variables. pr(a,b) calculates interval (a, b).
Pr(a < xjb
+ uj < b), the probability
that yjlxj
would be observed
in the
a and b may be specified as numbers or variable names; tb and ub are variable names; pr(20,30) calculates Pr(20 < xjb + uj < 30); pr(lb,ub) calculates Pr(/b < xjb -+.uj < ub); and pr(20,ub) calculates Pr(20 < xjb + uj < ub). a =. means -_zxz; pr(. ,30) calculates Pr(xjb + uj < 30); pr(/b,30) calculates Pr(xjb + uj <: 30) in observations for which It) =. (and calculates Pr(/b < xjb + uj < 30) elsewhere). b =. means +c_; pr(20, .) calculates Pr(xjb + uj > 20); pr(20,ub) calculates Pr(xjb + uj > 20) in observations for which ub =. (and calculates Pr(20 < x3b + uj < ub) elsewhere). e(a,b) calculates E(xjb + uj I a < xjb + uj < b), the expected value of 9jlxj yj[xj being in the interval (a,b), which is to say, 9jlxj is censored. a and b are specified as they are for pr ().
conditional
on
ystar(a,b) calculates E(y_) where y_ = a if xjb + uj < a, y_ = b if xjb + uj > b, and 9j" = xjb + uj otherwise, which is to say, #_ is truncated, a and b are specified as they are for prO. strip calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covarlate pattern. This is also referred to as the standard error of the fitted value. sgdf calculates the standard error of the forecast. This is the standard error of the point prediction for a single observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdl are always larger than those by strip; see [R] regress Methods and Formulas.
Remarks ivreg performs instrumental variable_ regression (or two-stage least squares), and weighted instrumental variables regression. For a general discussion of two-stage least squares, see Johnston and DiNardo (1997), Kmenta (1997), and Wooldridge (2000). While computationally identical. Davidson and MacKinnon (1993, 209-224) present their discussion using instrumental variables terminology. Some of the earliest work on simultaneous systems can be found in Cowles Commission monographs--Koopmans and Marschak (1950) and Koopmans and Hood (1953)--with the first development of two-stage least squares appearing in Theil (1953) and Basmann (1957). The syntax for ±vreg assumes you want to estimate a single equation from a system of equations, or an equation for which you do not want to specify the functional form of the remaining system, if you want to estimate a full system of equations, either using two-stage least squares equation-by-equation or using three-stage least squares, see [R] reg3. An advantage of ±vreg is lhat you can estimate a single equation of a multiple-equation equations.
system without speci_ing
the functional form of the remaining
i;] ivreg -- Instrumentalvariables and two-stage least squares regression
133
EXample i
Let us assume that you wish to estimate hsngval = so + _lfaminc + _2reg2 + _3reg3 + _4reg4 rent = _0 + _lhsngval+ fi2pcturban + u
'
1
i 5%u have state data from the 1980 Census. housing, and rent is median monttfly gross income (famine) and region of the country flmction of hsngval and the percentage of
} i
+ E
hsng#al is the mextian dollar value of owner-occupied rent. You postulate that hsngval is a function of family (reg2 through reg4). You also postulate that rent is a the p_ulation living in urban areas (pcturban). i
If you are familiar with multi-equation modelS, you have probably already noted the triangular structure of our model. This triangular structure ]is not required. In fact, given the triangular (or i
recursive) structure of the model, if we were to assume that c and u were uncorrelated, either of the equations could be consistently estimated by ordinaD' least squares. This is strictly a characteristic of triangular systems and would not hold if hsngval were assumed to also depend on rent,regardless of assumptions about e and u. For a more detailed discussion of triangular systems see Kmenta (1997. _19-720). "
i
You tell Stata to estimate the rent equation by specifying the structural equation and the additional exogenous variables in a specific form. The depeddenT variable appears first and is followed by the exogenous variables in the structural model for rent, These are followed by a group of variables in parentheses separated by an equal sign. The variables to the left of the equal sign are the endogenous regressors in the structural model for rent and those to the right are the additional exogenous variables that will instrument for the endogenous variables. Only the additional exogenous variables need to be specified to the right of the equal sign: those already in the structural model wilt be automatically included as instruments, • As the following command shows, this is more difficult to describe than to perform. In this example, rent is the endogenous dependent variable, hsngvat is an endogenous regressor, and
i
• ivreg
i
I
rent
pcturban
(hsng_al = famine
:
i
'
]
i
r_g2-reg4)
famine,Instrumental reg2, reg3,variables reg4, and peturban are th_ exogenous variables. (2SLS) regression
R-squared = Number of obs = Adj R-squared =
0.5989 50 0.5818
1249.851)59
Root MSE
221862
18338.7_)17
Repidual Source
24565.7167 SS
47 df
Total
61243.12
49
F( 2, Prob > F
i
h_ngval pc_urban _cons Ins¢runented: Instrunents:
i
522,6741_23 MS
2
rent
1
42.66 0.0000
30677.4033
i .
47) = =
Model
Coef.
Std. Err.
t
P>lt[
.0022398 .081516
.0003388 '. 3081528
_. 61 _. 26
0.000 O. 793
120.7065
15.70688
_.68
O.000
=
[957,Conf. Interval] .0015583 -. 5384074 89.10834
.00292i3 .7014394 152.3047
hsngval i pc_urban fam_nc reg2 rag3 #eg4
i
I,
134
ivreg -- Instrumental variables and two-stage least squares regression
> Example Given the triangular nature of the estimated system, we might wonder if there is sufficient correlation between the disturbances to warrant estimation by instrumental variables. (We might have a similar question, even if the system were fully simultaneous.) Stata's hausman command (see [R] hausma.) will allow us to test whether there is sufficient difference between the coefficients of the instrumental variables regression and standard OLS to indicate that OLS would be inconsistent for our model. To perform the Hausman test, we use hausman to save the ivreg estimates, perform an OLS regression, and compare the two using hausmma. •
hausma11,
save
• regress
renl
hsngval
Source
pcturban SS
Model Residual
df
MS
Number
of obs
50
40983.5269
2
20491.7635
20259.5931
47
431.055172
K-squared
=
0.6692
1249.85959
Adj R-squared Root MSE
= =
0.6551 20.762
Total
61243,12
rent
49
Coef.
Std.
Err.
t
47)
=
F( 2, Prob > F
P>It _
[95_, Conf.
= =
47.54 0.0000
Interval]
hsngval
.0015205
.0002276
6.68
O. 000
.0010627
.0019784
pcturban _cons
.5248216 125.9033
.2490782 14. 18537
2.11 8.88
O. 040 O. 000
.0237408 97. 36603
i.025902 154.4406
hausman,
constant
sigmamore Coefficients Prior
Current
Difference
S.E.
hsngval
.0022398
.0015205
.0007193
pcturban _cons
.081516 120.7065 (b)
.5248216 125.9033 (B)
-.4433056 -5. 196801 (b-B)
I
b = less
efficient
B = fully Test:
Ho:
efficient
difference
estimates
in coefficients
chi2(I)
obtained
estimates not
The Hausman test clearly indicates that
=
OLS
previously
obtained
from
from
ivreg
regress
systematic
= (b-B)" [(V_b-V_B)'(-I)] = 12.08 Prob>chi2
.000207 .1275655 I. 49543 sqrt (diag(Vib-V_B))
(b,B)
O. 0005
is an inconsistent
estimator for this equation.
As opposed to a direct test of hsngval's endogeneity, Davidson and MacKinnon (1993. 236-242) have noted that this Hausman test is best interpreted as evaluating whether O[S is a consistent estimator for the model. The null hypothesis is that the model was generated by an OLS process and the test is performed under the assumption that the instrumental variables estimates are consistent. As an alternative to the Hausman test, Davidson and MacKinnon suggest an augmented regression test that is based on the same asymptotic requirements as the Hausman test. We can easily fornl the augmented regression by including the predicted values of each endogenous right-hand-side (rhs) variable, as a function of all exogenous variables, in a regression of the original model. For our hsngval model, we regress hsngval on all exogenous variables and include the prediction from this regression in an OLS regression of the hsngval equation. • regress hsngval (outputormttcd ) predict (option
faminc
reg2-reg4
hsng_hat xb assumed;
fitted
values)
pcturban
r
ivreg -
Instrumentalvadables and two-stage least squaresregression
.prediCt
hsng_res,
. regress
rent
res
hsng_al
Source
pcturbknhsng_hat SS
df
Number of obs F( 3 46) Prob > F R-squared Adj R-squared Root MSE
MS
: Model Residual
46189.152 15053.968
Total
61243.12
rent
Coef.
hs_val pctttrban hsns_hat _cons
135
.0006509 .0815159 .0015889 120.7065
3 46
15396.384 327.260173
49
1249.85959
Std. Err.
t
P>ttl
.0002947 .2438355 .0003984 12.42856
2.21 0.$3 3.99 9._1
0.032 0.740 0.000 0.000
= = = = = =
50 47.05 0.0000 0.7542 0.7382 18.09
[95_ Cong. Interval] .0000577 -.4092994 .000787 95.68912
.00124_2 .57233113 .0023908 145.72_9
Since we have only a single endogenous rhs variable, our test statistic is just the t statistic for the hsng__hat variable. If there were more than one endogenous rhs variable, we would need to perform a joint test of _illtheir predicted value regressors being zero. For this simple case, the test statement w_ld be • ,test _sng_hat (1) ,
Itsng_hat= 0.0 _( 1, 46) = Prob > F =
15.91 0.0002
While the p-value from the augmented regression test is somewhat lower than the p-value from the Hausman test, both tests clearly show that OLS is nor indicated for the rent equation (under the assumption that the instrumental variables estimator is a consistent estimator for our rent modeD.
_!Example Robust sta_ard • ivreg
rent
errors are availab_ with ±vreg: pcturban
(hsngval
= famine
reg_-reg4),
robust
IV (2SLS) regression with robust standard errors
--_
rent
Coef.
hsngval pcturban _cons
.0022398 .081516 120.7065
Robust Std. Err.
t
P>It I
.0006931 .4585635 15.7348
3.23 O.18 7.67
O.002 O.860 O.000
Number of obs = F( 2, 47) = Prob > F =
50 21._4 O.O(YO0
R-squared Root MSE
O.5989 22.882
= =
[95_ Conf. Interva_l] .0008455 -. 8409949 89.05217
.0036342 i.004027 152.3609
T
InstzRlmented: hsngval In_tra_ments: pcturban famine reg2 reg3 re$4
The robust star_darderror for the coefficiem on housing value is double what was previously estimated.
_
_
13u
wreg -- mstrumemal variables and two-stage least squares regression
Q Technical Note You may perform weighted two-stage instrumental variables qualifier with irreg. You may perform weighted or unweighted variable estimation, suppressing the constant, by specifying the constant is excluded from both the structural equation and the
estimation by specifying the [weight] two-stage least squares or instrumental noconstant option. In this case, the instrument list. Cl
Acknowledgments The robust estimate of variance with instrumental Mead Over, Dean Jolliffe, and Andrew Foster (1996).
variables was first implemented
in Stata by
Saved Results ivreg saves in e() : Scalars e (N) e (ross) e(df_m) e(rss) e(df.._r) Macros
number of observations mode] sum of squares mode] degrees of freedom residual sum of squares, residual degrees of freedom
e(r2) e (r2_a) e(F) e(rmse) e(N_clust)
e(cmd)
ivreg
e(clustvar)
e(version)
version number of ivreg name of dependent variable iv weight type weight expression
e(vcetype)
e(depva.r)
e(model) e(wtype) e (wexp) Matrices e (b)
coefficientvector
e(instd)
e(insts) e(predict)
e (V)
R-squared
adjusted R-squared F statistic root mem_square error number of clusters name of cluster variable covariance estimation method instrumented variable instruments program used to implement predict
variance-covanance matrix of the estimators
Functions e (sample)
marks estimation sample
Methods and Formulas ivreg
is implemented
as an ado-file.
Variables printed in lowercase and not boldfaced (e.g., x) are scalars. Variables printed in lowercase and boldfaced (e.g., x) are column vectors. Variables printed in uppercase and boldfaced (e.g., X) are matrices. Let v be a column vector of weights specified by the user. If no weights are specified, then v -- 1. Let w be a column vector of normalized weights. If no weights are specified or if the user specified fweights or iweights, w= v. Otherwise, w = {v/(I'v)}(ltl).
i
i
1 i
J
!
ivreg -- Instrumentalvariablesand two-stageleast squares regression
137
The number of observations, n, is defined as l'w. In the case of iweights, this is truncated to an integer. The sum of the weights is l'v. Define e = t if there is a constant in the regression and zero otherwise. Define k as the number of right-hand-side (rhs) variables (including the constant). Let X denote the matrix of observations on the ths variables, y the vector of observations on the left-hap,d-side (lhs) variable, and Z the matrix of observations on the instruments. In the following formulas, if the user specifies weights, then X'X, X ! y, y'y, Z'Z, Z'X, . and Z'y are replaced by X'DX; X'Dy, y'Dy, Z'DZ, Z'DX, and Z'Dy, respectively; where D is a diagonal matrix whose diagonal elements are the elements of w. We suppress the D below to simpli_ the notation. Define A as X'Z(Z'Z)-I(x'z)
j
i
' and a as X'Z(Z'Z)-IZ'y.
The coefficient vector b is defined as A-la. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies toall statistics listed below. The total sum of squares, ySS, equals y'y if there is no intercept and y'y - { (l'y)2/n The degrees of freedom are n - c. The error sum of squares, ESS, is defined as y'_ - 2bX'y
k.
i
aren
The mode/sum
+ b'X'Xb.
} otherwise.
The degrees of freedom
of squares, MSS, equals TSS- ESS. The degrees of freedom are k - c.
The mean square error, s2. is defined as ESS/(n 2 k). The root mean square error is s, its square root. If c - 1, then F is defined as F = (b - c)iA(b - c) (k _ 1)s 2 where e is a veclor of k - 1 zeros and h'th element l'y/n. this case, you may use the test
Otherwise, F is defined as missing. (In
command to construct any F test you wish )
]
The R-squared, R 2, is defined as R 2 = 1 - ESS/TSS.
i
The. adjusted R-squared, R2a, is 1 - (1 - R2)(n- c)/(n - k). If robust is not specified, the conventional estimate of variance is s2A -1.
i ]
For a discussion of robust variance estimates in the context of recession and regression with instrumental Variables. see [R] regress. Methods and Formulas. See this same section for a discussion of: the formulas for predict after irreg,
i i i]
References Baltagi,B. H. 1998. Econometrics.New York: Springer-Verlag. Basmann. R. L. t057. A generalizedclassical method of iinear estimationof coefficientsin a structural equation. -
Econometrica25: 77-83. Davidson.R, and J. G. MacKinnon.t993. Estimationand Inferencein Econometrics.New York:OX%rdUniver_iU Press:
Johnston.J. and J. DiNardo.I99Z EconometricMethods.4lh ed New York:McGraw-Hill. Kmenta,J. 1997. Elememsof Economemcs.2d ed. Ann Arbor:Universityof MichiganPress. Koopmans,T. C. and W C, Hood. I953. Studiesin EconometrkMethod,New York:John Witey& Son_. Koopmans.T. C and J. Marschak.I950. StatisticalInferencein DynamicEconomic Models. New York:John Wiley & SOns.
I
138
ivreg -- Instrumental variables and two-stage least squares regression
Over, M.. D. Jolliffe. and A. Foster. 1996. sg46: Huber correction for two-stage least squares estimates. Stata Technical Bulletin 29: 24-25. Reprinted in Sta_aTechnical Bulletin Reprints, vo]. 5. pp. 140-142. Theil, H. 1953. Repeated Least Squares Applied _oCompleteEquation Systems. Mimeographfrom the Central Planning Bureau, Hague. White, H. 1980. A heteroskedasticity-consistentcovariance matrix estimator and a direct test for heteroskedasticity. Econome_ca 48: 817-.838. Wooldridge, J. M. 2000. Introductory Econome;_cs: A Modern Approach. Cincinnati. OH: South-Western College Publishing.
Also See Complementary:
[R] adjust, [R] lincom, JR] linktest, [R] mfx, [R] predict. [R] testnl, [R] vce, [R] xi
Related:
[R] anova, [R] areg, JR] ensreg,
[R] mvreg, [R] qreg, [R] reg3, JR] regress,
[R] rreg, [R] sureg, [R] svy estimators, [P] _robust Background:
[U] [U] [U] [U]
[R] test,
[R] xtreg, [R] xtregar,
16.5 Accessing coefficients and standard errors. 23 Estimation and post-estimation commands, 23.11 Obtaining robust variance estimates, 23.13 Weighted estimation
1
!
Title
i jknife_-- J_kknife ]
i i
i 'i
,i
estimation
'
i
I
i
i ,I
_
exp_list [if exp] [in ra,ge]
[,
i
N I
,, f
i
Syntax jtmife
"cmd"
[r_class
1 e_class
t n(exp) ]
!
level(#) keep ] expJist
Contains
] i i
newvarnarne = (exp) (exp) eexp
] i i
eexp is specname [eqno]specname specnarne is _b
i I
_b[]
1
_se _se []
eqno is ## /lan'td
Distinguish between [ ], which are to be typed, and [], which indicate optional arguments.
Iscription jknife
performs jack&nile estimation.
cmd defines the statistical command to be executed, cmd must be bound in double quotes. Compound double quotes ("" and "") are needed if the command itself contains double quotes exp_[ist specifies the statistics to be retrieved after the execution of cmd. on which jackknife statistics will lie calculated.
Qptions rclass, eclaSS, and n(exp) specify where crnd saves the number of observations on which it based the calculated results. You are strongly advised tO specify one of these options.
i i
rclass
specifies that cmd saves the number of dbservations in r(N).
ectass
specifies that cmd saves the number of observations in e(N).
n(exp) allows you to specify an}, other expression that evaluates to the number of obser_'ations used. Specifying n(r(N)) is equivalent to spedifying option rclass. Speci_'ing n(e(N)) is equi'falent to specifying option eclass. If cmd saved the number of observations in r(Nl), specify n(_(Ni) ). Y
139
140
Jknife-- Jackknifeestimation
If you specify none of these options, jknife assumes that all observations in the dataset contribute to the calculated result. If that assumption is incorrect, the reported standard errors wilt be incorrect. For instance, say you specify • jknife "regress y xl x2 x3" coef=_b[x2]
and pretend that observation 42 in the dataset has x3 equal to missing. The 42nd observation plays no role in obtaining the estimates, but jknife has no way of knowing that and will use the wrong N. If, on the other hand, you specify jknife "regress y xl x2 x3" coef=_b[x2], e
will correctly notice that observation 42 plays no role. Option e is specified because regress is an estimation command and saves the number of observations used in e (N). When jknife runs the regression omitting the 42nd observation, jknife will observe that e(N) has the same value as when jknife previously ran the regression using all the observations. Thus, jknife will know that regress did not use the observation. jknife
In this example, it does not matter whether you specify option eclass eclass is easier.
or n (e (N)). but specifying
level(#) specifies the confidence level, in percent, for the confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. keep specifies that new variables are to be added to the dataset containing the pseudo-values of the requested statistics. For instance, if you typed . jknife "regress y xl x2 x3" coef=_b[x2], e keep
new variable cool would be added to the dataset containing the pseudo-values for _b [x2]. Let b be defined as the value of _b [x2] when all observations are used to estimate the model, and let b(j) be the value when the jth observation is omitted. The pseudo-values are defined as pseudovaluej = N • {b- b(j)} + b(j) where N is the number of observations used to produce b.
Remarks While the jackknife--developed in the late 1940s and earl}, 1950s--is of largely historical interest today, it is still useful in searching for overly influential observations. This feature is often forgotten. In any case, the jackknife is 1. an alternative, first-order unbiased estimator for a statistic; 2. a data-dependent way to calculate the standard error of the statistic and to obtain significance levels and confidence intervals; and 3. a way of producing measures called pseudo-values for each observation, reflecting the observation's influence on the overall statistic. The idea behind the simplest form of the jackknife the one implemented here is to calculate the statistic in question N times, each time omitting just one of the dataset's observations. Write S for the statistic calculated on the overall sample and S(j) for the statistic calculated when the jth observation is removed. If the statistic in question were the mean, then S=
(N - 1)5'(3) + sj N
i
_
jknife-- J_ffe where
estinl_.i_
141
is the value of the data in the jth observation. Solving for sj, we obtain
sj
sj = NS - (N-
1)S(j)
These are the pseudo-values the jackknife calculates, even though the statistic in question is not the mean. The jac_nife estimate is _, the average of the sj's, and its estimate of the standard error of the statistic is the corresponding standard error of the mean (Tukey 1958). The jackknife estimate of variance has been largely replaced by the bootstrap (see [R] bstrap), which is widely viewed as more efficient and robust. The use of iackknife pseudo-values to detect outliers is too o[ten forgotten and is something the bootstrap is unable to provide. See M0stetler and ,Tukey (1977, 133-t63) and Mooney and Duval (1993., 22-27) for more information.
I
JaCkknifedStandard deviation Example Moste!ler and Tukey (1977, 139-140) request a 95% confidence interval for the standard deviation of the eleven v_ilues: 0.t,
0.1,
0,1,
0.4,
0.5,
t.0,
1.1,
1.3,
1.9,
1.9,
4.7
Stata's summarize command calculates the mean and standard deviation and saves them as r (mean) and r (sd), To obtain the jackknifed standard deviation of the eleven values and to save the pseudovalues asia new variable sd, type • i input
x X
1.0.1 2. 0.1 3. 0.1 4. 0.4 5.0.5 6. i.O 7.!.1 8. t.3 9. 1.9
lo. 1.9 11. 12.
4.7 end
j:knife"summarize x" sd=r(sd), r keep command: statistic: n():
summarize x sd=r(sd) r(N)
Variable
Obs
jknife sd
overall
Statistic
i.489364 II
Std. Err.
[95Y,Conf. Interval]
.6244049
.0981028
2.880625
I.343469
Interpreting the, oulpu[, the standard deviation repoded by snmmaxize mpg is 1.34. The jackknife estimate is 1.49 with standard error 0.62. The 95% confidence interval for the standard deviation is .10 to 2.88. By spectfyii_g the keep option, jknife pseudo-valu_es.
creates a new variable in our dataset, sd. for the
• list -' ,
142
jknife-- Jackknife estimation x sd 1. 2. 3. 4. 5. 6. 7. 8. 9, 10. 11.
•I •1 •1 •4 •5 1 i.1 I.3 1.9 1.9 4.7
1.1399778 1.1399778 1,1399778 . 88931508 .8242672 • 63248882 •62031914 •62188884 .8354195 . 8354195 7.7039493
The jackknife estimate is the average of sd,so sd contains the individual "values" of our statistic. We can see that the last observation is substantially larger than the others, The last observation is certainly an outlier, but whether that reflects the considerable information it contains or indicates that it should be excluded from analysis is a decision that must be based on the context of the problem. In this case, Mosteller and Tukey created the dataset by sampling from an exponential distribution. so the observation is quite informative. ,q
> Example Let us repeat the above example using the automobile dataset, obtaining the standard error of the standard deviation of mpg. • use auto, clear (1978 Automobile Data) jknife "summarize mpg" sd=r(sd), r keep command: statistic: n() :
smmmarize mpg sd=r(sd) r(N)
Variable
Obs
Statistic
74
5.785503
Std. Err.
[95_ Conf. Interval]
sd overall jknife
5.817373
,607251
4.607124
Looking at sd more carefully, summarize sd, detail r(sd) pseudovalues
i_ 57 107 25_ 50Z 75_ 90_ 95_ 997
Percentiles 2.870485 2.870485 2.906249 3•328494
Smallest 2.870485 2.870485 2.870485 2.870485
3.948327 6.844408 9.597005 17.34316 38.60905
Obs Sum of Wgt. Mean
Largest 17.34316 19.76169 19.76169 38.60905
Std. Dev. Variance Skewness Kurtosis
74 74 5.817373 5.22377 27.28778 4.07202 23.37822
7.027623
_
jknife-- Jackknifeest_ • ]/istimake mpg
sd if sd > 30
,.pg
_ake 71•
_
143
Diesel
41
sd 38.60@052
Inthi s case,the_v Dieselistheonlydiesel carinourdataset.
q
!Collectingmultiple statistics i>Example : jknife is not limited to collecting just one ;tatistic. For instance, you can use s-T.marize, detail and then obtain the jackknife estimate of _e standard deviation and skewness, m_rnmarize, detail saves the standard deviation in r(sd) and the skewness in r(skewness), so you might type i
• _se (I_78
auto, clear _utomobile
• jkni_e
"summarize
comm_: statistic n():
Data) mpg, detail"
summarize :
sd=r(sd)skew=r(skewness),
r
mpg, detail
sd=r (sd) skew=r (skewness) r(N)
Variable
Obs
Statistic
74
5.78550_
Std. Err.
[95_, Conf.
Interval]
sd overall jknif
e
5. 817379
•607251
4. 607124
7• 027623
.3525056
1. 694686
skew 74
overall
.948717_ 1. 023596
jknife
.3367242
q
!Collectingcoefficients and standard errors Example
, jkni_eCan also collect coefficients and standard errors from estimation commands. For instance, using auto. klta we wish to obtain the jackknife e,_timate of the coefficients and standard errors from a regression in which we model the mileage of a _ar by its weight and trunk space. To do this. we refer io the Coefficients and the standard errors as _b [weight], _b [trunk], _se [weight], and _se [ttrumk] in the exp_list, or we could simplify them by using the extended expressions _b and
i 1
_SO. • Use _uto, clear (1978 iAutomobile Data) • _kniife "reg mpg weight
trunk"
co_iman61:
reg mpg weight
statistic:
b_weight=_b
_b _se, e
trunk
[weight]
se_weight=_se [weight] b_trunk=_b [trunk] b cons=
n() :
b[_cons]
se
Zrtmk=_se
se
cons=_se
_(_)
[trunk] [_cons]
] i
144
jknife -- Jackknife estimation Variable
1
Obs
Statistic
74
-.0056527
Std. Err.
[95X Conf. Interval]
b_weight overall jknife
-.0056325
.O010216
-.0076684
-.0035965
se_weight overall
74
.0007016
jkaife
.0003496
.000111
.0001284
.0005708
b_trunk overall
74
-.096229 -.0973012
jknife
.1486236
-.3935075
.1989052
b_cons overall
74
39.68913
jknife
39.65612
1.873323
35.92259
43.38965
.0218525
.0196476
.1067514
.2690423
.2907921
1.363193
se_trunk overall
74
.1274771
jknife
.0631995
se_cons overall
1.65207
74
jknife
.8269927
q
Saved Results jknife saves in r(): of observations
used in calculating
#
r(N#)
number
r(stat#)
value of statistic # using all observations
statistic
f
r(me_n#) r(se#)
jackknife estimate of statistic # (mean of pseudo-values) standard error of the mean of statistic #
|
Methods and Formulas jknife is implemented
as an ado-file.
References Gould, W. 1995. sg34: Jackknife estimation. Reprints, vol. 4, pp. 165-170.
Stata Technical
Bulletin
24: 25-29.
Mooney, C. Z. and R. D. Duval. Park, CA: Sage Publications.
1993. Bootstrapping:
A Nonparametric
Mosteller, E Company.
1977. Data Amdysis
and Regression.
and J. W. Tukey.
Tukey, J. W. 1958. Bias and confidence 614.
in not-quite
large samples.
Reprinted
Approach Reading.
Abstract
to Statistical
Related:
[R] bstrap, [R] statsby
Background:
[U] 16.5 Accessing coefficients and standard errors, [u] 23 Estimation and post-estimation commands
Inference.
MA: Addison-Wesley
in Annals
Also See
in Stata Technical
of Mat,_ematical
Bulletin
Newbury Publishing
Statistics
29:
rT!tle jointly --- l_orm all p_rwise combinations within groups
Syntax joinby
[varIist] using
nqlabe_. ....... , update
filena,ne
replace
[, _atehed({none
_merge(varname)
] b_oth [ re_aster
using
})
]
DescripUon j oinby joiqs, within groups formed by varlist, observations of the dataset in memory withfiiename, a Stata-format dataset. By join we mean "form all parwise combinations", filename is required to be sorted by varti_t. Iffilename is specified without an extension, '. dta' is assumed. If rarlist isnot specified, joinby memory antt in filename.
takes as varligt the set of variables common to the dataset in
Observations unique to one or the other dataset are ignored unless unmatched () specifies differently. Whether you load one dataset and join the other or vice versa makes no difference in terms of the number of resalting observations. If there ar_ common variables between the two datasets, however, the combined dataset will contain the va]ues from the master data for those observations. This behavior can be modified with the update and replace options.
Options
z
unmatched({llone I both !master I using }) specifies whether observations unique to one of the datasets are to be kept, with the variables from the other dataset set to missing. Valid values are none both m_.stier using
all unmatched observations are ignored (default) unmatched observations from the master and using data are included unmatched obse_'ations from ihe master data are included unmatched observations from the using data are included
}
no!abe! Nevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset tephce label definitions already in memory. update i
varies the action that joinby
data_et is tield inviolate--values
takes when an observation is matched. By default, the master from the master data are retained when the same variables are
found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing. i '-
replace,
fillowed with update only, specifies that even when the master dataset contains nonmissin_ values, the_' are to be replaced with corresponding values from the using dataset when the corresp0ndjng values are not equal. A nonmissi_g value, however, will never be replaced with a
missing value. -merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _.merge (_merge) . To preserve compatibility with earlier versions of joir_b_, ___erge is only generated if unmatched is specified. 145
l
Remarks
146 joinby -- Form all pairwise combinations within groups The following, admittedly artificial, example illustrates j oinby.
> Example You have two datasets: child, dta and parent .dta, identifies the people who belong to the s_ne family. .
use
Both contain a family_id
child on
(Data
Children)
describe Contains
data
from
child.dta
obs:
5
Data
vats:
4
13
size:
50
(99.9_
storage variable
name
type
of memory
on Children
Jul
display
value
format
label
variable
family_id
int
_8.0g
Family
child_id
byte
_8.0g
Child
xl
byte
_8.0g
x2
int
_8.0g
by:
Sorted
2000
15:29
free)
label Id Number
Id Number
family_id
list family~d 1025
I.
child_id 3
xl II
x2 320
2.
1025
1
12
300
3.
1025
4
10
275
4.
1026
2
13
280
5.
1027
5
15
210
. use (Data
parent, clear on Parents)
describe Contains obs:
data
from
parent.dta 6
Data
vats:
4
13 Jul
size:
108 storage
(99.9_
of memory
on Parents 2000
15:31
free)
display
value
type
format
label
family_id
int
_8.0g
Family
Id Number
parent_id
float
_9.0g
Parent
Id Number
xl
float
_9.0g
x3
float
_9.0g
variable
Sorted
name
variable
by:
list
1.
fsmily-d 1030
2. 3.
1025 1025
4. 5. 6.
parent_id I0
xl 39
x3 600
11 12
20 27
643 721
1026
13
30
Z60
1026
14
26
668
1030
15
32
684
label
variable which
joinby-- Form ail pairwisecombinationswithin groups
147
You want tO "join" the information for the parents and their children. The data on parents are in memory;the data on children are on disk. dhild.dta has been sorted by family_id, but parerit._ti has not, so first we sort the parent _data on famity_id: • Sort
i family_id
• joinby
family_id
using
child
• describe Co,tails
data
o.bs:I
8
vats:
6 168
Data (99.4_, of memory
free)
i
i
storage
on Parents
,
display
value
type
format
label
family¢id
int
Y,8.0g
Family
Id Number
paz_nt_id
float
Y,9.0g
Parent
Id Number
Xl
float
%9.0g
x3
float
Y.9.0g
child__d
byte
XS.0g
x2
int
Y,8.0g
variable
name
Sorted iby : Npte:
dataset
has changed
variable
Child
since
last
label
Id Number
saved
l_st 1.
family-d 1025
parent_id 12
xl 27
x3 721
child_id 3
2;.
1025
11
20
643
3
320
3. 4.
1025 1025
12 11
27 20
721 643
1 1
300 300
5,
1025
li
20
643
4
275
6. 7.
1025 1026
12 13
27 30
721 760
4 2
275 280
8.
1026
14
26
668
2
280
x2 320
Notice that
I , I
1. fami_y__d of I027, which appears only in child.dta, and family_id only in Narent. dta, are not in the combined dataset. Observations variable(s) are not in both datasets are omitted.
of 1030, which appears for which the matching
2. The x_ v_riable is in both datasets. Values for this variable in the joined dataset are the values from par_nt.dta--the dataset in memory when we issued the joinby command. If we had cMld.d_a in memory and parent.dta on di_k when we requested joinby, the values for xl wouldiha_'e been from child.dta. Values from the dataset in memory take precedence over the datasel o_i disk. q
Methods joinby
Formulas ii implemented as an ado-file.
t
148
joinby-- Formall pairwisecombinationswithin groups
Acknowledgment joinbywas written by Jeroen Weesie, Department of Sociology, Utrecht University, Netherlands.
Also See Complementary:
[R] save
Related:
JR] append, [R] cross, JR] fillin, JR] merge
Background:
[U] 25 Commands for combining data
1
!
-¥itle kappa,
interrater agreement 4
1
i Syntax :
}
kap va_ai_el varname2 varnarne3 [...] [weigh3] [if exp] [in range] kappa i,ariist [if exp] [in range] fweights
i
a_e aliowed; see [U] 14,1.6 weight.
DescriptiOn kap (first s_,ntax)calculates the kappa-statistic measure of interrater agreement when there are two unique raters and two or more ratings. /
kapwgt: defines weights for use by kap in measuring the importance of disagreements. kap (secoqd syntax) and kappa calculate the kappa-statistic measure in the case of two or more (nonuniqu¢) r_atersand two outcomes, more than two outcomes when the number of raters is fixed, and more thah two outcomes when the number of raters varies, kap (second syntax) and kappa produce the same results: they merely differ in how they expect the data to be organized. kap assurrie's that each observation is a subject, varnamel contains the ratings by the first rater, varname2 'by ihe second rater, and so on. kappa also_assumesthat each obse_'ation is a subject. The variables,however, record the frequencies with which r@ngs were assigned. The first variable records the number of times the first rating was assigned, the gecond variable records the number of times the second rating was assigned, and so on.
Options tab displays a tabulation of the assessmentsby the two raters. wgt(wgtid) _pecifies that wgtid is to be used to weight disagreements. User-defined weights can be created using kapwgt: in that case, wgt() specifies the name of the user-defined matrix. For instance, you might define . kapwg_ mine i \, .8 1 \ 0 .8 I \ 0 0 .8 I'
and them . k_p
rgta
ratb,
wgt(mine)
14g
i i
150
kappa -- lnterrater agreement
In addition, two prerecorded
weights are available.
wgt(w) specifies weights 1 - [i -jl/(k - 1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings. wgt(w2)
specifies weights 1-{(i-
j)/(k-
1)} 2.
absolute is relevant only if wgt () is also specified; see wgt () above. Option absolute modifies how i, j, and k in the formulas below are defined and how corresponding entries are found in a user-defined weighting matrix. When absolute is not specified, i and j refer to the row and column index, not the ratings themselves. Say the ratings are recorded as {0, 1, 1.5, 2}. There are 4 ratings; k = 4 and i and j are still 1, 2, 3, and 4 in the formulas below. Index 3, for instance. corresponds to rating = 1.5. This is convenient but can, with some data, lead to difficulties. When absolute is specified, all ratings must be integers and they must be coded from the set {1,2, 3,...}. Not all values need be used; integer values that do not occur are simply assumed to be unobserved.
Remarks The kappa-statistic measure of agreement is scaled to be 0 when the amount of agreement is what would be expected to be observed by chance and 1 when there is perfect agreement. For intermediate values, Landis and Koch (1977a, 165) suggest the following interpretations: below 0.0 0.00-0.20 0.21-0.40 0.41-0.60 0.61- 0.80 0.81- 1.00
Poor Slight Fair Moderate Substantial Almost Perfect
The case of 2 raters > Example Consider the classification by two radMogists of 85 xeromammograms as normal, benign disease. suspicion of cancer, or cancer (a subset of the data from Boyd et al. 1982 and discussed in the context of kappa in Altman 1991, 403-405). . tabulate Radiologist A's assessment
rada
radb
Radiologist Normal
B's
benign
suspect
cancer
Total
12
0
0
33
benign
4
17
1
0
22
suspect cancer
3 0
9 0
15 0
2 1
29 1
38
16
3
85
Normal
Total
21
assessment
28
Our dataset contains two variables: Each observation is a patient.
rada,
radiologist A's assessment: radb.
We can obtain the kappa measure of interrater agreement
by typing
radiologisl B's assessment.
kap
-- lnterrat_ agreement
151
• kap rada radb ; Agreement i
Expected Agreement
Kappa
Std. Err,
30. 2z 0.472a 0.0694 !
Prob>Z
6.81
0.0000
Had each radiologist made his determination randomly (but with probabilities equal to the overall proportions), _we would expect the two radiologist_ to agree on 30.8% of the patients. In fact, they agreed on 6}.5% of the patients, or 47.3% of the way between random agreement and perfect a_reemenL _he amount of agreement indicates that we can reject the hypothesis that the?, are making their detetrni lations randomty.
Example l
Z
,I
i
There is a difference between two radiologists disagreeing whether a xeromammogram indicates cancer or th_ ,_uspicion of cancer and disagreeing whether it indicates cancer or is normal. The weighted kappa attempts to deal with this. kap provides two "prerecorded" weights, w and w2: . k_p _ada radb, wgt(_;) Ratlng_ weighted by: 1.0¢00 O, 666? 0.61_67 1.0000 O. 3_33 O. 6667 O. od,oo o. 3333
O.3333 0.6667 1.0000 o. 6667
i Expected /{gr¢_em_nt Agreement
l i i I
i !
O. 0000 0.3333 O. 6667 1. 0000
Kappa
Std. Err.
Z
Prob>Z
The w vJe_ghts are given by 1 - li - jt/(k - 1) where i and j index the rows of columns of the ratings by th_ two raters and k is the maxinmm _umber of possible ratings. The weighting matrix
i i
ratings normal, benign, suspicious, and cancerous. i the table. In our "case, the rows and columns of the 4 × 4 matrix correspond to the is prin_ed above A weight Of 1 indicates an observation should count as perfect agreement. The matrix has l s down the dia_ofials!--when both radiologists make the s_me assessment, they are in agreement. A weight of, say_0J66_7 means they are in two-thirds agreement. In our matrix they get that score if they are one aparl -+one radiologist assesses cancer and the other is merely suspicious, or one is suspicious and the o_herisays bemgn, and so on, An emry of 0.3333 means they are in one-third agreement or, } if you prefer,!two-thirds disagreement. That is the gcore attached when they are "two apart". Finally, they are in c_mplete disaueement when the weighi is zero, which happens only when the3, are three apart--one says cancer and the other says normal.: <1
i!
Example The other prerecorded weight is w2 where the weights are given by 1 - {(i• kap ttating_
i
_ada radb,
wgt(w2)
weighted by:
l.oqoo
0.8889
0.5556
0.0000
.s 89 oooo o.88s9 o.sss6 55_56 (),.oclooo. O. 8889 1. 0000 O.8889 s56 o.8889 oooo
j)/(_-
1)}2:
152
kappa -- lnterrater agreement Expected Agreement
Agreement 94.77X
The w2weight
Kappa
84.09X
0.6714
makes the categories
Std.
Err.
0.1079
even morealike
Z 6.22
Prob>Z 0.0000
and is probably inappropriate
here.
> Example In addition to prerecorded weights, you can define your own weights with the kapwgt command. For instance, you might feel that suspicious and cancerous are reasonably similar, benign and normal reasonably similar, but the suspicious/cancerous group is nothing like the benign/normal group: • kapwgt xm i \ .8 1 \ 0 0 1 \ 0 0 .8 1 • kapw_ xm 1.0000 0.8000 1.0000 O. 0000 O.0000 i.0000 O. 0000 O.0000 O.8000 1.0000
You name the weights--we named ours xm and after the weight name, you enter the lower triangle of the weighting matrix, using \ to separate rows. In our example we have four outcomes and so continued entering numbers until we had defined the fourth row of the weighting matrix. If you type kapwgt followed by a name and nothing else, it shows you the weights recorded under that name. Satisfied that we have entered them correctly, we now use the weights to recalculate kappa: . kap rada radb, wgt (xm) Ratings weighted by: I.0000 O.8000 O.8000 I.0000 O.0000 O.0000 O.0000 O.0000
Agreement 80.47)',
O.0000 O.0000 I.0000 O.8000
O.0000 0.0000 O. 8000 1.0000
Expected Agreement
Kappa
52.677.
O. 5874
Std, Err, O.0865
Z
Prob>Z
6.79
O.0000
4
[3 Technical Note In addition to weights for weighting the differences in categories, you can specify Stata's traditional weights for weighting the data. In the examples above, we have 85 observations in our dataset one for each patient. If all we knew was the table of outcomes that there were 21 patients rated normal by both radiologists, etc. it would be easier to enter the table into Stata and work from it. The easiest way to enter the data is with tabi; see [R] tabulate.
) ) ) ( ,
kappa-- lnterrateragreement
!
153
. tabi 21 12 0 0 \ 4 17 I 0 \ 3 9 15 2 \ 0 0 0 I, replace col row
1
2
3
4
Total
1 2
21 4
12 17
0 1
0 0
33 22
3 4
3 0
9 0
15 0
2 1
:29 1
3
85
,(
T_tal
28
Pearson cM2(9)
38 =
16
77.8111
Pr _ 0.000
tabi
felt obligated to tell us the Pearson X2 for this table, but we do not care about it. The important thing is tlaat,with the replace option, tabi left the table in memory: • list )in I/5 row 1 1
col 1 2
pop 21 12
3_
I
3
0
4_ 5,
1 2
4 1
0 4
1, 2.
The variable row is radiologist A's assessment: so assesse_ _ both. Thus, •:kap _ow col [freq=pop] ; Expected Agreement Agreement Kappa : j
Std. Err.
!
63.53'/,
30.827,
col,. radiologist B's assessment: and pop the number
O.4728
O.0694
Z
Prob>Z
6.81
O.0000
If we are going to keep these data, the names row and col are not indicative of what the data reflects. We could (seb [U] 15.6 Dam.set, variable, a,d value labels) • rename row rada • rename col radb . label var rada "Radiologist A's assessment" label var radb "Radiologist B's assessment" . label define assess I normal 2 benign 3 suspect 4 cancer l&be] values rada assess label values radb assess l&be] data "Altman p. 403"
kap's
tab
option, which can be used with or withont weighted data. shows the table of assessments: i
• kap _ada radb
[freq=pop],
Radiolqgist iA's assessment
Radiologist B's assessment normal benign suspect cancer
!
tab
Total
21
n
0
o
33
bez_ign
4
17
1
0
22
Suspect
3
9
15
2
29
cancer ) T_tal
0
0
0
I
1
28
38
18
3
85
_o_mal
)
]:_
Kappa -- mmrramr agreement Expected Agreement
Agreement 63.53_
Kappa
30.82_
Std.
0.4728
Err.
Z
0.0694
Prob>Z
6.81
0.0000
0
Q Technical Note You have data on individual patients. There are two raters and the possible
ratings are I, 2, 3,
and 4, but neither rater ever used rating 3: . tabulate ratera raterb raterb •
ratera
I
2
4
Total
1 2 4
6 5 1
4 3 1
3 3 26
13 11 28
12
8
32
52
Total
In this case, kap would determine the ratings are from the set {1,2, 4} because those were the only values observed, kap would expect a use_defined weighting matrix to be 3 x 3 and, were it not, kap would issue an error message. In the formula-based weights, the calculation would be based on i,j -- I, 2, 3 corresponding to the three observed ratings {1,2, 4}. Specifying the absolute option would make it clear that the ratings are 1, 2, 3, and 4; it just so happens that rating = 3 was never assigned. Were a use_defined weighting matrix also specified, kap would expect it to be 4 × 4 or larger (larger because one can think of the ratings being 1, 2, 3, 4, 5, ... and it just so happens that ratings 5, 6, ... were never observed just as rating -- 3 was not observed). In the formula-based weights, the calculation would be based on i,j -- I, 2, 4. • kap ratera raterb, wgt(w) Ratings weighted by: 1.0000 0.5000 0.0000 0,5000 1.0000 0.5000 0.0000 0.5000 1.0000
Agreement 79.81_
Expected Agreement 57.17Z
Kappa
Z
Prob>Z
4.52
0.0000
Z
Prob>Z
Std. Err.
0.5285
0.1169
. kap ratera raterb, wgt(w) absolute Ratings weighted by: 1.0000 0.6667 0.0000 0.6667 1.0000 0.3333 0.0000 0,3333 1.0000
Agreement
Expected Agreement
81.41Z
55.08X
Kappa 0.5862
Std. Err. 0.1209
4.85
0.0000
If all conceivable ratings are observed in the data, then whether absolute is specified makes no difference. For instance, if rater A assigns ratings { 1,2, 4} and rater B assigns {1,2, 3, 4}, then the complete set of assigned ratings is {1,2, 3, 4}, the same as absolute would specify. Without absolut e, it makes no difference whether the ratings are coded { 1,2, 3, 4}, {0.1.2, 3 }, {1,7, 9, 100}, {0, 1, t.5, 2.0}, or otherwise.
O
kappa-- lnterrateragreement
The case
155:,
more than two raters
In the c,se of more than two raters, the matha aatics are such that the two raters are not considered unique.!Fol " instance, if there are three raters, there is no assumption that the three raters who rate I
the are the the three ratersraters that rate thanflrst_suSject two r_iters case, it same can beasused with two whenthe thesecond. raters' Although identities we vary.call this the more The 'norlunique rater case can be usefully broken down into three subcases: (a) there are two possible raiings which we will call positive and negative; (b) there are more than two possible ratings but _thenumber of raters per subject is the same for all subjects; and (c) there are more than two possiblle ratings and the number of raters per subject varies, kappa handles all these cases. To emphasize that there is no assumption of constant identity of raters across subjects, the variables specified contain counts of the number of raters rating the subject into a particular category.
!
{ i
_ Example (Two; ratings.) Fleiss (1981, 227) offers the following hypothetical ratings by different sets of raters on 25}subjects:
Subject 1 2 3 4 5 6 7 8 9 t0 11 i
l
NO.of No. of raters pos. ratings 2 2 2 0 3 2 4 3 3 3 4 1 3 0 5 0 2 0 4 4 5 5
12 13
34
34
No. of No. of Subject raters pos. ratings 14 4 3 15 2 0 16 2 ' 2 17 3 1 18 2 t 19 4 t 20 5 4 21 3 2 22 4 0 23 3 0 24 3 3 25
2
2
We have entered these data into Stata and the variables are called subject, raters, and pos. kappa, however, re@ires that we specify variables containing the number of positive ratings and negative ratings; that i_s,pos and raters-pos: gen
_eg
kapp_
= raters-pos
pos neg
TWo4ou_comes, Kappa 0.5415
multiple
raters: Z 5.28
Prob>Z 0.0000
We wouldlha_e obtained the same results if we had typed kappa neg pos.
Example (More thanitwo ratings, constant number of raters,) Each of ten subjects is rated into one of three categories by five raters (Fleiss 1981, 230): li_t
I i
156
kappa-- Interrateragreement subject 1. 2. S. 4. 5. 6, 7. 8. 9. 10.
cat1 1 2 3 4 5 6 7 8 9 10
cat2 1 2 0 4 S 1 5 0 1 3
cats 4 0 0 0 0 4 0 4 0 0
0 S 5 1 2 0 0 1 4 2
We obtain the kappa statistic: • kappa earl-cat3 Outcome
Kappa
Z
Prob>Z
catI cat2 cat3
O.2917 0.6711 0.3490
2.92 6.71 3.49
O. 0018 0.0000 O. 0002
combined
0.4179
5.83
O.0000
The first part of the output shows the results of calculating kappa for each of the categories separately against an amalgam of the remaining categories. For instance, the cat I line is the two-rating kappa where positive is carl and negative is cat2 or catS. The test statistic, however, is calculated differently (see Methods and Formulas). The combined kappa is the appropriately weighted average of the individual kappas. Note that there is considerably less agreement about the rating of subjects into the first category than there is for the second. q
> Example Now suppose that we have the same data as in the previous example, but that it is organized differently: • list 1. 2. 3. 4. 5. 6. 7. 8. 9. i0.
subject 1 2 3 4 5 6 7 8 9 i0
raterl 1 1 3 I 1 1 1 2 1 1
In this case, you would kap
use
kap
rater2 2 I 3 1 1 2 I 2 3 1
rater3 2 3 3 1 1 2 1 2 3 1
rather than
kappa:
raterl rater2 raterS rater4 rater5
There are 5 raters per subject: Outcome
Kappa
Z
Prob>Z
1 2 3
0.2917 0.6711 O. 3490
2.92 6.71 3.49
0.0018 0.0000 O. 0002
combined
O. 4179
5.83
O. 0000
rater4 2 3 3 1 3 2 1 2 3 3
rater5 2 3 3 3 3 2 1 3 3 3
_
,
kappa -- Interrater agreement
157
Note that thfe information of which rater is which is not exploited when there are more than two raters.
q
_, Example (More' tha_ two ratings, vmo,ing number of raters!) In this unfortunate case, kappa can be calculated, but there is _o test statistic for testing against _ > 0. You do nothing differently--kappa calculates the total nun{bet of raters for each subject and, if it is not a constant, it suppresses the calculation of test statisttics[ .
1,ist
1,
subject 1
cat 1 1
cat 2 3
2.
2
2
0
3
3.
3
0
0
5
4.
4
4
0
1
5.
5
3
0
2
6.
6
1
4
0
7.
7
5
0
0
8_
8
0
4
1
9;
9
1
0
2
10.
10
3
0
2
• k_pp_
0
catl-cat3 Outcome
Kappa
cat i
O. 2685
cat2
O. 64,57
cat3
O. 2938
combined note:
cat3
Z
Prob>Z
O. 3816
}Number of ratings
per
subject
vary;: cannot
calculate
test
Istatistics,
q
Example This case _s similar to the previous example, but the data are organized differently: • list
I.
_ubject i
raterl I
rater2 2
rater3 2
2.
2
3.
3
4.
1
1
3
3
3
3
3
3
3
3
4
1
1
t
1
3
5-
5
1
1
1
3
3
6. 7.
6 7
1 1
2 1
2 1
2 1
2 1
8.
8
2
2
2
2
3
9.
9
1
3
10.
10
1
t
1
3
In this
cas_,
|
we
specify
kap,
instead
of
kappa:
rater4
rater5 2
3 3
i i
158
kappa -- Interrater agreement • kap raterl-rater5 There are between 3 and 5 (median = 5.00) raters per subject: 0utcome
Kappa
1 2 3
0.2685 0.6457 0.2938
Prob>Z
0.3816
combined note:
Z
Number of ratings per subject vary; cannot calculate test statistics.
Saved Results kap and kappa save in r(): Scalars r(N)
number
of subjects
(kap only)
r(prop_o) observed proportion of agreement (kap only) r(prop_e)expected proportion of agreement (kap only)
r(kappa)
kappa
r(z) r(se)
z statistic standard error for kappa statistic
Methods and Formulas kap, kapwgt,
and kappa
are implemented
as ado-files.
The kappa statistic was first proposed by Cohen (1960). The generalization for weights reflecting the relative seriousness of each possible disagreement is due to Cohen (1968). The analysis-of-variance approach for k = 2 and ra _> 2 is due to Landis and Koch (1977b). See Altman (1991, 403-409) or Dunn (2000. chapter 2) for an introductory treatment and Fleiss (198t, 212--236) for a more detailed treatment. All formulas below are as presented in Fleiss (1981). Let rn be the number of raters and let k be the number of rating outcomes.
kap: m = 2 Define wij (i = 1.... , k, j = 1,..., k) as the weights for agreement and disagreement (wgt ()) or, if not weighted, define wiz = 1 and wij = 0 for i ¢ j. If wgt (_r) is specified, u'ij -- 1-l i-jt/(k1). If wgt (_r2) is specified, wij -- 1 - {(i-j)/(k The observed proportion of agreement
1)} 2.
is k
k
Po = _ _ wijPij i=1 3=1
where Pij is the fraction of ratings i by the first rater and j by the second. The expected proportion of agreement is k 'De-
_ i=1
k _wijPi-P.j j=l
Ii
_
-- Interrater agreement
159
where Pi. = ___jPij and p.j = Y'_-iP/J" f Kappa is _iven by _ = (Po - Pe) / (I - Pe). The standard error of _ for testing against 0 is
s0 (1-
j
where n is th, number of subjects being rated and Ni. = _j statistic Z= _/'so is assumed to be distributed N(0, 1).
p.jwij
and ¥.j = _i Pi':wiJ" The test
kappa: m > 2,!k = 2 Each sUbjeCt i, i = 1,...,
n, is found by' xi of mi raters to be positive (the choice as to what is
labeled:positiVe being arbitrary). The overail proportion of positive ratings is _ = _i xi/(nN), between-s_bj@s mean square is (approximately)
B
where _
= _-_i rni/n.
_
The
!1 1
--
n
t
i
r°'i
and the w_thla-subject mean square is
W = n(_--
1 1i
E i
xi(mimi
xi) ] i
Kappa is thent defined
i
i
The standard !error for testing against 0 (Fleiss arid Cuzick 1979) is approximately calculated as 1 _'0 = (N.-
1)_
/
(_-
(2(_H z 1)+
_H)(1
equal to and
- @-_)
mpq
ii
i
where Nh, is the harmonic mean of rni and _ = 1 - _. The test st',ttistic Z = "/^_/Sois assumed to be distributed N(0, 1).
i
kappa: m >2,ik> 2 Let .rij be ithe number or ratings on subject i, i = 1,...,
n, into category j, j = 1,...,
k. Define
i
_j as the overall proportion of ratings in category j, _j = 1 - _._, and let _j be the kappa statistic given above f_r k = 2 when category j is compared with the amalgam of all other categories. Kappa
I
is (Landis an_ Koch 19778)
_=
A____PJ-qjr'JJ
t
160 kappa m lnterrater agreement In the case where the number of raters per subject _j xij is a constant m for all i, Fleiss, Nee, and Landis (1979) derived the following formulas for tile approximate standard errors. The standard error for testing
_j against
0 is
/ V /
and the standard
error
for testing
1)
g is
_= _. _f_jx/nm(m-J
2
i)
PJqJ
j
PJ'qJ('qJ-
_j)
References Altman, D. G. t991. Practical Statistics for Medical Research. London: Chapman & Hall. Boyd, N. F., C. Wolfson, M. Moskowitz, T. Carlile, M. Petitclerc, H. A. Ferri, E. Fishell. A. Gregoire, M. Kieman. J. D. Longley, I. S. Simor, and A. B. Miller. 1982. Observer variation in the interpretation of xeromammograms. lournaI of the National Cancer Institute 68: 357-63. Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement Psychological Bulletin 70: 213-220.
or partial credit.
Dunn. G. 2000. Statistics in Psychiatry. London: Arnold. Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2d ed. New York: John Wiley & Sons. Fleiss, J. L. and J. Cuzick. 1979. The reliability of dichotomous judgments: unequal numbers of judges per subject. Applie 4 Psychological Measurement 3: 537-542. Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological BuIletin 86: 974-977. Gould, W. 1997. stata49: Interrater agreement. Stata Technical Bulletin 40: 2-8. Reprinted in Stata TechnicaJ Bulletin Reprints, rot. 7, pp. 20-28. Landis, J. R. and G. G. Koch. 1977a. The measurement of observer agreement for categorical data. Biometrics 33: 159-174. 1977b. A one-way components of variance model for categorical data. Biometdcs 33: 671-679. Steichen, T. J. and N. J. Cox. 1998a. sg84: Concordance correlation coefficient. Stata Technical Bulletin 43: 35-39. Reprinted in Stata Technical Bultetin Reprints, vol. 8, pp. 137-143. 1998b. sg84.t: Concordance correlation coefficient, revisited. Stata Technical Bulletin 45: 21-23. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 143-145. 2000. sg84.2: Concordance correlation coefficient: update for Stata 6. Stata Technical Bulletin 54: 25-26. Reprinted in Stata Technical Bulletin Reprints, vol. 9. pp. 169-170.
Also See Related:
[R] tabulate
i
Title !
i
• kdenslty
Univariate kernel density estimation
i
Syntax kdensity varname [weight] [ifexp][inrange] [, nJ _raphgenerate(neuwarznewvardinsity) n(#) _width(#) [ ibiweight I cQsineI eppanlgausl] parzenl rectangle I triangle ] ! n(_rmal stud(#) at(varz) s_ymbol(...) _connect(...) title(string) [ [
fweigh_s
ggoph_optwns ? i J i and _weights are allowed;see [U] 14.1.6weighi.
Descriptien kdensity!produces
kernel density estimates and graphs the result. /
Options i
nograph suppresses the graph. This option is often Used in combination with the generate () option. generate(n_wvarz newvardensitv) stores the results of the estimation, newvardensity will contain
i t_ I
the densit>l estimate, newvarz will contain the pbints at which the density is estimated. n(#) specifie_ the number of points at which the d_nsity estimate is to be evaluated. The default is min(N; 501, where N is the number of observations in memory.
i i ! i
width(#) sNcifies the halfwidth of the kernel, the width of the density window around each point. If w() is hot specified, then the "optimal" width is calculated and used. The optimal width is the width [hat would minimize the mean integrated square error if the data were Gaussian and a Gaussiar_ kernel were used, so is not optimal in any global sense. In fact, for multimodal and highly skeived densities, this width is usually too wide and oversmooths the density (Silverman
i
1986). bi_reight,
I cbsine,
default, e_t_, [ l i
I i
i
gauss,
parzen,
rectangle,
and triangle
specify the kernel. By
specifying the Epanechnikov kernel, is used.
normal requd?ts that a normal density be overlaid on the density estimate for comparison. stud(#)
i.
speOfies that a Student's _ distribution with # degrees of freedom be overlaid on the density
estimate f_r comparison, at(varz) specifies a variable that contains the v_lues at which the density should be estimated. This optiot0 allows you to more easity obtain density estimates for different variables or different subsamples of a variable and then overlay the e_t_mated densmes for comparison. symbol(...)
i
epan,
!is graph,
is symbollo);
two,ray's symbol()
see [G]graph options.
°pti°h for specifying the plotting symbol. Tile default :
i
(
connect(...)isgraph, twoway'sconnect ()estimation optionforhow pointsareconnected. The default is 162 kdensity -- Univariate kernel density connect (1), meaning points are connected with straight lines: see [G] graph options. title(string) is graph, twoway's title() option for speci_'ing the title. The default title is "Kernel Density Estimate"; see [G] graph options. graph_options
are any of the other options allowed with graph,
twoway; see [G] graph
options.
Remarks Kernel density estimators approximate the density f(z) from observations on z. Histograms do this too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts the bar is centered at the midpoint of each intervalwand its height reflects the average number of data points in the interval. In more general kernel density estimates, the range is still divided into intervals and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap, One can think of sliding the intervalPcaUed a window along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a weight between 0 and 1 is assigned--based on the distance from the center of the window and the weighted values are summed. The function that determines these weights is called the kernel. Kernel density estimates have the advantages of being smooth and of being independent choice of origin (corresponding to the location of the bins in a histogram).
of the
See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data analysis tools.
Example Goeden investigate histogram. is roughly
(1978) reports data consisting of 316 length observations of coral trout. We wish to the underlying density of the lengths. To begin on familiar ground, we might draw a In [G] histogram, we suggest setting the bins to min(v/-_. 10-loglcn ). which for n = 316 18:
graph length, xlab ylab bin(18)
2-
15"
05"
m 0
length
kdensity -- Univariatekernel density estimation
163
The kernel density estimate, on the other hand, is smooth. . kdens_ty
length,
xlab
ylab
006 -_
004
121
£,02
i
\
1
o
Kernel
Density
tdngth Estimate
Kernel den_ity)stimators are. however, sensitive to an assumption just as are histograms. In histograms, we specify a _umber of bins. For kernel density estimators, we specify a width. In the graph above, we used the d_fault width, kdensity is smarter than graph, histogram in that its default width is not a fixed _:onstant. Even so, the default width is not necessarily bei. i kder_sity !ayes the width in the return scalar width, so typing display Doing this, wd discover that the width is approximately 20.
r(width)
reveals it.
! i
Widths are(ketail. isimilarThe to units the inverse of thearenumber of ofz, bins in histogram in analyzed. that smaller provide more of the width the uhits the avariable being The widths width is specified as ia halfwidth, meaning that the kernel density estimator with halfwidth 20 corresponds to sliding a w!ndow of size 40 across the data.
I
We can specify halfwidths for ourselves using the t the density as imuch. • kdens_ty
length,
epan
width()
i ]
option. Smaller widths do not smooth
w(lO)
xlab ylab
i
I
]
OOB I
oo6_I
\
/
,004
/
e
_oo
5
_Jo
,oo I_ng(h
Kernel
Density
Estimate
s_o
do
• kdensity length, epam xlab ylab w(15)
164
kdensity -- Univariate kernel density estimation
•
006
/_
.004 >.
j
0 200
"\ 3(_0
Kernel
4(_0 length
Density
560
6;0
Estimate
q
> Example When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate. We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of we £ght, the weight of the vehicles. Below, we compare the Epanechnikov and Parzen kernels. kdensity weight, epan nogr g(x epan) kdensity weight, parzen nogr g(x2 parzen) • label vat epan "Epamechnikov Density Estimate" • label vat parzen "Parzen Density Estimate" • gr epan parzen x, xlab ylab c(ll)
o Epanechnikov .0008
Density
Estimate
_ Parzen
Density
Estimate
"_
oooo
i'll!
0
1ooo
2o'oo
3ooo Weight
(l_s.)
4otoo
5ooo
!
kdensRy-_ Univariatekerneldensityestimation
165
!
We did not s_ecify a width and so obtained the d_fault width. That width is not a function of the selected kerneil but of data. See the Methods and Formulas section for the calculation of the optimal
I
width.
q
> Example In examining the density estimates, we may wi_h to overlay a normal density or a Student's t • ari_" Mng automobile weights, we can get an idea of the distance from normality density for col_p _u,,. U__ with the normal option. , kdens_ty
weight,
epam
normal
xlab ylab
,0006
i
.ooo4
,ooo2 t
°t 1 1000
il 2000
I 3_00 Weigh!
Kernel
Density
I 4000
5000
(}bs,)
IEstimate
Example Another conmon desire in examining density estimates is to compare two or more densities. In this example, _,e will compare the density estimatesof the weights for the foreign and domestic cars. I
kdensi_y
i
.• kdensi_y kdens@y
i
weight,
negr
weight weight
gen(x
fx)
if gen(f_O) if foreign==O, foreigxl==l, nogr nogr gen(fXl)
label
_ar fxO
"Domestic
label
_ar fxl
"Foreign
at(x) at(x)
cars" cars"
(Continued on twxt page)
"
166
• gr
fxO fxl c(ll) s(TS) xlab ylab kdensity --x, Univariate kernel density estimation
i :
Domestic
cals
D Foreign
cars
OOl.
{
I !
.0005
l
_a_
fX
'
0" 1000
20100
3000' Weight
40t00
5000r
(Ibs.)
<1
0 Technical Note Although all the examples we included had densities
less than I. the density may exceed t.
The probability density f(z) of a continuous variable z has the units and dimensions of the reciprocal of z. If z is measured in meters, f(x) has units 1/meter. The density is thus not measured on a probability scale, so it is possible for f(x) to exceed I. To see this, think of a uniform density on the interval 0 to 1. The area under the density curve is 1: this is the product of the density, which is constant at 1, and the range, which is 1. If then the variable is transformed by doubling, the area under the curve remains 1. and is the product of the density, constant at 0.5, and the range, which is 2. If conversely the variable is transformed by halving, the area under the curve also remains at 1, and is the product of the density, constant at 2, and the range, which is 0.5. (Strictly, the range is measured in certain units, and the density in the reciprocal of those units, but the units cancel on multiplication.) D
Saved Results kdensity saves in r(): Scalars r(width) kernel bandwidth r(n) number of points at which the estimate was evaluated r(scale) density bin width Macros r(kernel)
name of kernel
g
i
kdensity_ Univariatekerneldensityestimation
167
Methods iar,d Formulas kdensit_r is implemented as an ado-file. A kernel density estimate is formed by summihg the weighted values calculated with the kernel
}
function K is in
= nh"i=1 --
l
:
where we may define various kernel functions, kdens±ty includes seven different kernel functions, The Epanec_nikov is the default function if no Otherkernel is specified and the most efficient in minimizing _he mean integrated squared error. Kernel
l i
FOrmula
Biweight
K[z] =
ig 0ta(1
Cosine
K[z] = {1 + eds(27rz) 0
z2) _
= i
i !
l i
1I
g 0 { 3(1
lz2]/x/_
if lzl < 1/2 otherwise iflzl
Epanechnikov
K[z]
Gaussian
K[z]=
Parzen
K[z] =
8(1 _ [z[)3/3 { 04 _ 8_zz + 8[z13
if 1/2 < !z[ < 1 if [z[ _<1/2 otherwise
Rectangular
K[z]
,,f1/2 t 0
if Iz! < 1 otherwise
0
otherwise
l
_
,,
if Izl < 1 otherwise
_-z_ /2
Triangular K[z] = { 1 -[z I if Izi < 1 From the definitions given in the table one can see that the choice of h will drive how many values are included in estimating the density at each point. This value is called the window width or bandwidth I_fthe window width is not specified, then it is determined as
m = rain i
h=
varian_,
interquartile 1.349 rang%
0.9m
1
i i
.. i
)
i
where :r is tile variable tbr which we wish to estimate thekernel and n is the number of observations. Most researchers agree that the choice of kernel is not as important as the choice of bandwidth,
i
i
There is a grea_deal of literature on choosing bandwidths under various conditions: see. for example, Parzen (196_) or Tapia and Thompson (1978}. Als6 see Newton (1988) for a comparison with sample
i i
i
spectral den, itv estimation in time-series applications.
i
i
Acknowledgments 168 kdensity -- Univariate kernel density estimation We gratefully acknowledge the previous work by Isa/as H. Salgado-Ugarte Autonoma de Mexico. and Makoto Shimizu and Tom Taniuchi Ugarte, Shimizu. and Taniuchi (1993), Their article provides subject of univariate analysis.
kernel
density
estimation
and presents
of Universidad
Nacional
of the University of Tokyo; see Salgadothe reader with a good overview of the arguments
for its use in exploratory
data
References Fox, J. 1990. Describing univariate distributions. In Modem Methods of Data Analysis. ed. J. Fox and J• S. Long, 58-125. Newbury Park, CA: Sage Publications. Goeden, G. B. t978. A monograph of the coral trout, Plectropomus leopardus (Lac6pbde). Res. Bull. Fish• Serv. Queens1. 1:42 p. Newton, H. J. t988, TIMESLAB: A Time Series Analysis Laboratory. Belmont, CA: Wadsworth & Brooks/Cole. Parzen, E. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 32: 1065-1076. Salgado-Ugarte, I. H., M. Shimizu, and "12,Taniuchi. 1993. snp6: Exploring the shape of univariate data using kernel density estimators. Stata Technical Bulletin 16: 8-19. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 155-173. 1995. snp6•t: ASH, WARPing, and kernel density estimation for univariate data. Stata Technical Bulletin 26: 23-31. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 161-172. 1995. snp6.2: Practical rules for bandwidth selection in univariate density estimation. Stata Technical Bulletin 27: 5-19. Reprinted in Stata Technical Bulletin Reprints, vot. 5, pp, 172-190. • 1997. snpl3: Nonparametric assessment of multimodality for univariate data. Stata Technical Butletin 38: 27-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 232-243. Scott, D. W. I992. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons, Silverman, B. W. 1986. Density Estimation for Statistics and Data AnaIysis. London: Chapman & Hall. Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer-Verlag, Steichen, T. J. 1998. gr33: VioIin plots. Smm Technical Butletin 46: 13-18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 57-65, Tapia, R. A. and J. R. Thompson. 1978. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press. Wand, M. P and M, C. Jones. 1995. Kernel Smoothing. London: Chapman and Hall.
Also See Related:
[R] hist
Background:
Smta Graphics
Manual
Title I
ksm
--
Sm°°thing
iJ_cluding
l°v_;ess
i
Syntax ksm 3,v,tr xvar [ifexp] ]
[inrange]
[, line weight !o,wessb_wwidth(#) loogit
a_dj_stgenerate(newvar)nograph graph_options ]
i
Description
! ii
ksm car'ies out unweighted and locally weighted smoothing of yvar on .n,ar, displays the graph, and optionally saves the smoothed variable. Among ksm's capabilities are lowess (robust locally weighted r,:gression, Cleveland 1979). See Cleveland (1993, 94-101) for a discussion of lowess.
i i
%Vsmini: ksm is computationally intensive and may therefore take a tong time to run on a slow computer. Lowess calculations on 1,000 observations, for instance, require estimating 1.000
l
regressions_.
i
Options line speci_es running-line least-squares smoothing; the default is running mean. weight
specifies use of Cleveland's (1979) tricube weighting function: the default is unweighted.
lowess is_equivatent to specifying line I smootheir. bwidt;h(#)i
weight
and requests Cleveland's
specifies the bandwidth. Centered subsets of bwidth.
lowess running-line
3\r obse_'ations
are used for
} i
calcula@g smoothed values for each point in the data except for the end points, where smaller, uncente_d subsets are used. The greater the bwidth, the greater the smoothing. The default is
i i
0.8, logit transforms the smoothed war into logits. Predicted values less than .000t or greater than .9999 ar_ set to I/N and 1 - l/N. respectively, before taking logits, adjust adiusts the mean of the smoothed war to equal the mean of yvar by multiplying by an appropriate factor. This is useful when smoothing binary (0/1) data. i generate (inewvar) creates newvar containing the smoothed values of yvar in addition to or instead of displaying the graph. i t
t
i
nograph s_ppresses displaying the graph. ,eraph_optiSns are any of the options allowed wiih graph, 1{69
twoway; see [G] graph options.
] i
i ] i i ]
170
,,
ksm-
Smoothing including Iowess
Remarks The most common use of ksm is to provide lowess--locally weighted scatterplot smoothing. The basic idea is to create a new variable (newvar) that, for each yvar Yi, contains the corresponding smoothed value, The smoothed values are obtained by running a regression of war on xvar using only the data (zi, yi) and a small amount of the data near the point. In lowess, the regression is weighted so that the central point (zi,Yi) gets the highest weight and points farther away (based on the distance tzj - zil) receive less weight. The estimated regression is then used to predict the smoothed value _'i for Yi only. The procedure is repeated to obtain the remaining smoothed values, which means that a separate weighted regression is estimated for ever 3, point in the data. Lowess is a desirable smoother because of its localitymit tends to follow the data. Polynomial smoothing methods, for instance, are global in that what happens on the extreme left of a scatterplot can affect the fitted values on the extreme right.
Example The amount of smoothing is affected by the bwidth different values. For instance. • ksm hl
depth,
lowess
ylab
Lowess
xlab
and you are warned
s(Oi)
smoother,
bandwidth
= .8
14 o co
o
o
oo 0
12
o
o
-
o
0 o
000
°o
% 10
O0
0
0 0
0
_:
0
/i,
o
O0
_/
cO cOo
o o
o
8 o
o o
o
o
co
o
6
0
,60
26° depth
Now compare that with
(Continued
on next page)
360
,;o
to experiment
with
r i
,.o_
,,,
]
. ksmlhl depth, lowess ylab xlab s(Oi) bWidth(.4) Lowess
smoother,
bandwidth=
.4
i
o cl_
o
___oo
0
10
O0
_
0
0 0
O 0
0
o o oo fc_
o
eo
o°o
0
0
0%00//" o_ ooo
/ 0
C
0
O
0
oo
0
8
I c
c
o
o
co
6
I 300
i
l
4{_0
depth
In the first c _se,the default bandwidthof 0.8 is used, meaning that 80% of the data is used in smoothing each ,p°int- In the second case, we explicitly specifieda bandwidth of 0.4. Smaller bandwidths follow the ofigmalldata more cteselv.
Example !
Two ks_ options are especially useful with binary (0/1) data: adjust and logit, adjust adjusts the resultin_ curve (by multiplication) so that themean of the smoothed values is equal to the mean
l l
of the unsm_oothedvalues, logit specifies the smoothed curve is to be in terms of the log of _he odds ratio: i
i
. ksmiforeign mpg, lowess ylab xlab jitter(5) Lowess
smoother,
bandwidth
adjust
= .8
75
_//
& • o u.
5 /
i
// 25
/P--/
,
c
i i
1
,
X(o ,o ,o
o 2;
_ o
M_)leage (mpg)
_'o
,;
I
• ksm
f
172
foreign
mpg,
lowess
ylab
xlab
logit
ksm -- Smoothing LowesS including smoother, Iowess bandwidth
yline(O)
= .8
)l !
_g,
o
/ /t-
_
/
o,
/
/
/ -4
lo
2'0
3_o Mileage
4_
(mpg)
With binary data, if you do not use the logit option, it is a good idea to specify graph's jitter() option; see [G] graph options. Since the underlying data (whether the car was manufactured outside the United States in this case) take on only two values, raw data points are more likely to be on top of each other, thus making it impossible to tell how many points there are. graph's jitter() option adds some noise to the data to shift the points around. This noise affects only the location of points on the graph, not the lowess curve. When you do specify the logit
option, the display of the raw data is suppressed. <1
Q Technical Note ksm can be used for other than lowess smoothing. Lowess can be usefully thought of as a combination of two smoothing concepts: the use of predicted values from regression (rather than means) for imputing a smoothed value and the use of the tricube weighting function (rather than a constant weighting function), ksm allows you to combine these concepts freely. You can use line smoothing without weighting (specify line), or mean smoothing without weighting (specify no options), or mean smoothing with tricube weighting (specify weight). Specifying both weight and line is the same as specifying lowess. Q
Methodsand Formulas ksm is implemented
as an ado-file.
Let Yi and z_ be the two variables and assume the data are ordered i = 1.... , N 1. For each Yi, a smoothed value _t_ is calculated.
_'
so that z_ _< zi+l
for
The subset used in calculation of fl_ is indices i_ -- max(l, i k) through i.__= min(i + k, N), where k - L(N.bwidth-0.5)/2j. The weights for each of the observations between j -- i ...... i+ are either 1 (default) or the tricube (weight):
ksm -- Smoothingincludingi_
i
1_
where A = 1.000t max(z+ - x_, x, - z_). The smoothed value y] is then the (weighted) mean or the (weighted) regression prediction (line). !
Ackledgrnent
]
ksm was written by Patrick Royston of tile MRC Clinical Trials Unit, London.
i
ReferenCes ! Chamber_,J. M., W. S. Clevi_land,B. Kleiner,and E A. Tukey.I983. GraphicalMethodsfor Data AnalySis.Belmont, CA: WadsworthInternatiOnalGroup. Clevetan&W. S. 1979. Robust locally weightedregressionand smoothingscatterplots.Journal of t_ Amer/can Staris_catAssocia6on 74i 829-836. , 1993. VisualizingData. Summit,NJ: HobartPress. ............ . 1994. The Elementsdf GraphingData. Summit,NJ! HobartPress. Goodatl,i'C. 1990.A surveyOfsmoothingtechniques.In ModemMethodsof Data Analysis. ed. L Fox and J. S. Long, 126-!76. NewburyPark;CA: Sage Publications. Royston, P. 199t. gr6: Lowess smoothing.Stata TechnicalBulletin 3: 7-9. Reprinted in Stata TechnicalBulletin
i
Re ,s, voL1,or.41- 4.
Salgado'Ugarte,1. H. and M. Shimizu.1995.snp8: Robusiscatterplotsmoothing:enhancementsto Stata's ksm. Staia TechhicalBulletin25: 23-26. Reprintedin Stata TechnicalBulletin Reprints,vol, 5, pp. 190-194. Sasieni.P. 1994. snpT:Naturalcubic splines.Stata TechnicalBulletin22: 19-22. Repnntedm Stata TechnrcalBulletin Reprints,vol. 4, pp. 17b-174.
AlSo [
!
Related:
[R] ipOlate, [lk] smooth
BackgrOund:
Stata Graphics Manual
I i(le
ksmirnov -- Kolmogorov-Smimov I
equality of distributions I test
II
]
Syntax ksmirnov
varname = exp [if exp] [in range]
ksmkrnov varname [if exp] [in range], by(groupvar)
[ e_xact ]
Description ksmirnovperforms one- and two-sample Kolmogorov-Smirnov tests of the equality of distributions. In the first syntax, varname is tile variable whose distribution is being tested and exp must evaluate to the corresponding (theoretical) cumulative. In the second syntax, groupvar must take on two distinct values. The distribution of varname for the first value of groupvar is compared with that of the second value. When testing for normality, please see [R] sktest and [R] swilk.
Options by (groupvar) is not optional, in the second syntax. It specifies a binaD' variable that identifies the two groups. exact specifies the exact p-value is to be computed. This may take a long time if n > 50.
Remarks D Example You have data on x that resulted from two different experiments, labeled as group==1 group==2. Your data contain
and
list group 1. 2. 3. 4. 5. 6, 7.
x 2 0 3 4 5 8 10
2 1 2 1 1 2 2
You wish to use the two-sample Kolmogorov-Smirnov in the distribution of z for these two groups:
test to determine if there are any differences
- ksmirnov x, by(group) Two-sample Kolmogorov-Smirnov Smaller group 1: 2: Combined K-S :
D O.5000 -0. 1667 0,5000
test for equality of distribution P-value
Corrected
O.424 O.909 O. 785
O.735
174
functions:
ksmimov-- Kolmogorov-Smirnov equalityof distributions test
175
Thefirst lineteststhe hypothesis thatx for group] containssmaJtervaluesthangroup2. Thelargest difference between the distribution functions is 0.5. The approximatep-value for this is 0.424, which is not fignificant. The second line tests the hypothesis that x for group 1 contains Iarger values than group 2. The largest_ difference between the distribution p-vaLuefor this small difference is 0.909. functions in this direction is 0.1667. The approximate Finally,the approximate p-value for the combined test is 0.785, corrected to 0.735. The p-values ksmirnov calculates are based on the asymptotic distributions derived by Smimov (1939). These approximations are not very good for small samples (n < 50). They are too con_servative--real p-values tend to be substantially smaller. We have also included a less conservative approximationfor the nonidirectionalhypothesis based on an empirical continuity correction. That is the 0.734 reported in the third column• That number, too, is only an approximation. An exact value can be calculated using the exact option: • ksmirnov x, by(group) exact Two-sample Kolmogorov-Smirnov test for eo_ualityof distribution functions: Smaller group D P-value Exact 1;: 2: Combined K-S :
O.5000 -0.1667 O.5000
O. 424 0.909 O.785
O.657
I> Example Lefs now test whether x in the example above is distributed normally. Kolmogorov-Smirnov is not a particularly powerful test in testing for normality and we do not endorse such use of it; see [R] sktest and [R] swilk for better tests. In any case. we will test against a normal distribution with the same mean and standard deviation: • Summarize x Variable
Obs
Mean
Std. Dev.
x 7 4.571429 3.457222 _smirnov x = norm((x-4.571429)/3.457222)
Min
Max
0
10
One-sample Kolmogorov-Smirnov test against theoretical distribution norm( (x-4.571429)/3.457222) Smaller group x: Cumulative : Combined K-S:
D O. 1650 -0.1250 O.1650
P-value
Corrected
0.683 O. 803 O,991
O. 978
Since Stata has no way of knowing that you based this calculationon the calculated mean and standard deviation of x. the test statistics will be slightly conservative in addition to being approximations, Nevertheless. they cleartv indicate that the data cannot be distinguished from normally distributed data. q
]
, .v
i
rtoJls.
Jwvw
--
._v_lv_jvzvv--_.TlUnllUlVV
V_lMClllty
UI
Ul_itrlDU[lOrl$
[eS[
Saved Results ksmirnov
saves in r():
Scalars _
r(D_l)
D from line l
r(D)
combined D
r(p_l)
p-value from line 1
r(p)
combined ;,-value
r(D._2) r(p_2)
D from line 2 p-value from line 2
r(p_exact)
combined significance (x 2 or exact)
name of group from line 1
r(group2)
name of group from line 2
Macros r(groupl)
Methodsand Formulas ksmirnov is implemented as an ado-file. In general, the Kolmogorov-Smimov test (Kolmo_orov 1933; Smirnov 1939; also see Conover 1999, "Statistics of the Kolmogorov-Smimov type", 42)-465) is not very powerful against differences in the tails of distributions. In return for this, it is f_irly powerful for alternative hypotheses that involve lumpiness or clustering in the data. The directional hypotheses are evaluated with the statistics
where Fix ) and G(x) are the empirical distribution t_unctions for the sample being compared. The combined statistic is
The p-value for this statistic may be obtained by evaluating the asymptotic limiting distribution. Let m be the sample size for the first sample, and let rt be the sample size for the second sample. Smirnov (1939) shows that C_
lira rr_, _z_--40G
Pr{v/mn/(m.
n)Dra,n< z} = i-
2'
i=1
(-
1) i-1 exp (-
2i2z 2)
The first 5 terms form the approximation Pa used b_ Stata. The exact p-value is calculated by a counting algorithm; see Gibbons (1971, 127-131). A Corrected p-value wa_ obtained by modifying the asymptotic p-value using a numerical approximatic_n technique Z = ¢-1 (Pa)+
1.04/rain(m,
n)+ 2.09i/max(re,
p-value =
n)-
].35/v/_m/(m
+ n)
IL ksmimov
•
Kolmogorov-Smirnov
equality of dJstdbution,s test
177
References Conover, W. J. 1999. Pr_cOcaINonparameNc Statistics. 3d ed. New York: John Wiley & Sons. Gibb0n_, J. D. Y971.Nonparametric S_6srical Inference. New York: McGraw-Hill. Kolrn_g0rov, A. N. 1933. Sulla determinazione empirica di una legge di distribuzione. Giomale dell' fstituto Italiano degli Am_ari 4:83-91 Smirr_v, N. V. 1939. Estimate of deviation between ernt_irical distribution functions m two independeat samples (in R_sst_). Bulletin Moscow University 2(2): 3-16.
I
i
AlSo See
'
Related:
[R]runtest,[R]sktest,[R]swilk
ll_
! ltle [ kwallis -- [ Kruskal-Waltis equality of population,_•i rank test [ I /Ill
]
I
i
Syntax kwallis
varname [if exp] [in rangeJ, by(grou_var)
Description kwallistests the hypothesis that several sample_ are from the same population. In the syntax diagram above, varname refers to the variable recortling the outcome and groupvar refers to the variable denoting the population. Note that the by () "option" is not optional.
Remarks > Example You have data on the 50 states. The data contain tlhemedian age of the population medage and the region of the country region for each state. You _,vishto test for the equality of the median age distribution across all four regions simultaneously: • kwallis medage, by(region) Test: Equality of populations region NE N Cntrl South West chi-squared = probability =
_Obs 9 12 16 13
(Kruskal-Wallis _test)
_KankSum 376.50 294.O0 398.00 206.50
17.041 with 3 d.f. 0.0007
chi-squared with ties = probability = 0.0007
17.062 with 3 d.f.
From the output we see that we can reject the hypotl_esis that the populations are the same at any level below 0.07%.
Saved Results kwallissaves in r(): Scalars r(df)
degrees of _freedom
r (chi2)
X2
r(chi2__adj)
X 2 adjustedfor ties
178
kwallis -- Kruskal-Wallis equality of populationsrank test
179
MMhodSand Formulas kUallisis implemented as an ado-file. The Kruskat-Wallis test (Kruskat and Wallis 1952; atso see Conover 1999. 288-297 or Airman 1991, 213-21'5) is a multiple-sample generalization of the two-sample Wilcoxon (also called Mann-Whitney) rank sum test (Wilcoxon 1945: Mann and Whitney 1947). Samples of sizes n3, 3 = I,..., m, are combined and ranked in ascending order of magnitude. Tied values are assigned the average ranks. Let n denote the overall sample size and let Rj denote the sum of the ranks for the jth sample. The Kruskat-Wallis one-way analysis-of-variance test H is defined as
H=
12 _., P_ - 3(n + 1) n(n + l ) j_l' nj
The sampling distribution of H is approximately X2 with m - 1 degrees of freedom.
tl raneas A_ttman. D. G. I991. Practical Satistics
for Medical Research.
Conover. W. J. 1099. Practical Nonparametric
London: Chapman & Hall.
Statistics. 3d ed. New York: John Wiley & Sons.
Ktuskal, W. H and W A. Wallis. 1952. Use of ranks in one-criterion Stalistical Association 47: 583-621.
variance anah,sis. Journal
of the American
Mann. H?B. and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically the other. Annals of Mathematical Sta6stics 18: 50-60. Wilcoxon. F. 1945. Individual
comparisons by ranking methods. Biometrics
Also Sea Related:
[R] nptrend, [R] oneway, [R] runtest, [R] signrank
1: 80-83.
larger than
Title label -- Label manipulation I I O
II
g_
I
i
Syntax la_bel data
["label"!
label define Iblname # "label"
[# "label"...
] [, _addmodify nofix ]
label dir label drop {Ibtname label
list
[lblname
... ]l-all
[Iblname [Iblname ... ] ]
label save [Iblname [Iblname.,. label
}
values
varname
[lblname]
label variable varname
]] using [, nofix
filen_me
[, replace ]
]
["label"]
Description label data attaches a label (up to 80 characters) to the dataset in memory. Dataset labels are displayed when you use the dataset and when you describe it. If no label is specified, any existing label is removed. label define defines a list of up to 65,536 (1,000 for Small Stata) associations of integers and text called a value label. The value label is attached te variables by label values. label
dir
lists the names of value labels stored in memory.
label
drop
eliminates value labels.
label
list
lists the names and contents of value labels stored in memory.
label
save
saves value labels in a do-file.
label values attaches a value label to a variable. If no value label is specified, any existing value label is detached. The value label, however, is r_ot deleted. label variable attaches a label (up to 80 characters) existing variable label is removed.
to a variable.
If no label is specified, any
Options add allows additional #++ label correspondences to be added to Iblname. If add is not specified. only new Iblnames may be created. If add is specified, you may create new lblnames or add new entries to existing Iblnames.
i
180
i
allows modification or deletion of existing #_ label correspondences and also allows additional correspondences to be added, Specifying modify implies add even if you do not type tl_eiadd option.
modify
no:fix prevents display formats from being widened according to the maximum length of the value label. Consider label values myvar mylab and pretend that myvar has a Y.9.0gdisplay format right now. Pretend that the maximum length Ofthe strings in mylab is t2 characters Then label values would change the format of myvar from Y.9.0g to %t2.0g. no:fix prevents this, nofix is also allowed with label define, but it is relevant only when you are modifying an existing value label. Withoutthe no:fix option, label define finds all the variables that use this value label and considers widening their display formats, no:fix prevents this. replace
allows filenalne to be replaced even if it already exists.
Remarks See:[U] 15.6 Dataset, variable, and value labels for a complete description of labels. This entry deals Onlywith details not covered there. label dir lists the names of all defined value labels, The label list contents of_a value label.
command displays the
Example Although describe shows the names of the value labels, those value labels may not exist. Stata does not consider it an error to label the values of a variable with a nonexistent label. When this occurs. Stata still shows the association on describe but otherwise acts as if the variable's values are Unlabele&This way, you can associate a value label name with a variable before creating the corresponding label, Similarly, you can define labels that you have not yet use& label dir shows you the labels that you have actually defined: label dir y_sno _exlbl
We have two value labels stored in memory: one called yesnoand the other called sexlbl We can display the contents of a value label using the labellistcommand: label list yesno
y_sne: 1 yes 2 no
The value label yesno labels the values l as yes and 2 as no. If you do not specify the name of the value label on the label list value labels is produced: label list y,esno_: 1 yes 2 no sexlb_
: 0 Male 1 Pemaie
command, a listing of all
r
182
label -- Label manipulation
[] Technical Note Since Stata can have more value labels stored in memory than are actually
used in the dataset,
you may wonder what happens when you save the dat_set, tn that case. Stata stores with the dataset only those value labels actually associated with variables. When you use a dataset, Stata eliminates the dataset.
all the _mlue labels stored in memory before loading []
You can add new codings to an existing value label _sing the add option with the label define command. You can modify existing codings using the modify option.
_>Example The label yesno codes 1 as yes and 2 as no. Perhaps at some later time you wish to add a third coding: 3 as maybe. Typing label define without any options results in an error: label label
define
yesno
yesno
already
3 maybe defined
r(llO) ;
If you do not specify the add or modify options, label define can only be used to create new value labels. The add option lets you add codings to _n existing label: . label label
define list
yesno
3 maybe,
add
yesno
yesno : 1 yes 2 no 3 maybe
Perhaps you have accidentally mislabeled a value. For instance. 3 may not mean "maybe" may instead mean "don't know", add will not allow you to change an existing label: label
define
yesno
invalid attempt r(180) ;
3 "don't
to modify
know",
but
add
label
Instead, you specify the modify option: label
define
label
list
yesno
3 "don't
know",
modify
yesno
yesno: I yes no 3 don_t
know
In this way, Stata attempts to protect you from yourself. If you type label define without any options, you can only create a new value label--you cannot accidentally mutilate an existing one. If you specify the add option, you can add new labelings to a label, but you cannot accidentally change one of the existing labelings. If you specify the modify optiom which you may not abbreviate, you can do whatever you want. You can even use the modify option to eliminate numeric code to a null string, that is. '....
existing labelings.
To do this. you map the
label-
Label m_mipul_lo.
183
1
. label define yesno 3 .... , modify label list yesno yesno: i yes 2 no
You can eliminate entire value labels using the label
drop command.
Example We currentlY have two value labels stored in memory--sexlbl and yesno.The label dir comman_ reports that: • label,dir
y_sno sexlbl
The da_a_t that we have in memory uses only one of the labels
sexlbl,
describe
reports that:
describe Centains data from emp.dta obs: vats: size:
7 4 224 (99.87,o_ memory free)
variable name
storage type
displ&y format
n_me
str16
7,16s
empno sex salary
float float float
X9.0g X9.0g Y,9.0g
S_rted
value label
sexlbl
1992 Employee Data 14 Jul 2000 14:28
variable label
Employee mumber O--male;l=female Annual salary, exclusive of bonus
by:
We can:eliminate the yesno label by typing label drop yesno: • ilabelidrop yesno • !la_el dir sexlbl
We could elin_inate all the value labels in memory by typing • .label_rop _ail label Idir
Remember that the value label sexlbl,which no longer exists, was associated with the variable sex. E_en after dropping the value label, sexlbl is still associated with The variable:
_04
taoe_-- Lane_manlpulatton . describe
i i
obs: Contains vats: size:
data
variable
name
7 :from emp.dta 4 224 (99.8_
1992 of memory
2000
Data 14:28
free)
storage
display
value
t_e
format
label
name
sir16
empno
float
7.16s _9.0g
sex
float
7,9.0g
salary
float
_,9.0g
Sorted
Employee
14 Jul
sexlbl
variable
label
Employee
number
O--_nale; 1=re, hale Annual salary, bonus
exclusive
of
by :
As stated earlier, Stats does not mind if a nonexistent value label is associated with a variable. When Stats uses such a variable, it simply acts as if it is not labeled: • list
1. 2.
in
1/4 name
empno
sex
salary
Hank Rogers Pat Welch
57213 47229
0 1
24000 27000
57323
0
24000
57401
0
24500
3.
Bob
4.
Richard
Underhill Doyle
q The label save command creates a do-file containing label define commands for each label you specify. If you do not specify the Iblnames, all value labels are stored in the file. If you do not specify the extension for fitename, . do is assumed.
_, Example Labels are automatically stored with your dataset when you save it. Conversely, the use command drops all labels before loading the new dataset. You may occasionally wish to move a value label from one dataset to another. The label save command allows you to do this. For example, assume we currently have the value label yesno label
list
in memory:
yesno
yesno: I yes 2 no 3 maybe
You have a dataset stored on disk called survey, dta to which you wish to add this value label. One alternative is to use survey and then retype the label define yesno command. Retyping the label would not be too tedious in this case, but if the value label in memory mapped, say, the 50 states of the union, retyping it would be irksome, label save provides an alternative: label save yesno using file ynfile.do saved
ynfile
Typing label save yesno using ynfile the definition of the yesno label.
caused Stata to create a doqilc called ynfile,
If we want to see the contents of the file, we can use the Stata type
command:
do containing
label-
Label manip?iation
IFK
type ynfile.do l_be_ define yesno 1 ""yes'", modify labe_ define yesno 2 "no"", modify label define yesno 3 ""maybe"', modify
Weean:now use our new dataset, survey.dta: • :us_ survey (]Ioudeholdsurvey data) • la_el dir
Using the new dataset causes Stata to eliminate all value labels stored in memory. _ label yesno isnow gone.Sincewe saveditinthefile ynfile,do,however,we cangetitbackby typingeither do runexecute ynfile.Ifwe execute. Ifwe type run,y_file:,or _he file will silently: typedo.we willsecthecommandsinthefile , runiynfile . -lab_l yeSno
dir
The libel islnow just as if we had typed it from the keyboard. q
0 Techni_aiNbte Yola can also use the label save command to make the editing of value labels easier. You can save a: label in a file. leave Stata and use your word processor or editor to edit the label, and then return ,to Stata. Using do or run, you can load the edited values,
Gleason, J. R. J998. dm56: labels editor in Stata Technical ButletmA Reprints, vol, for 8, Windows pp. 5-t0. and Macintosh. Stata Technical Bulletin 43: 3-6.
--'vo1.1999"9, p.dm_6"l:]_.Update to labedk
Stata Technical Bulletin 5I: -.') Reprinted in Stata Technical Bulletin Reprints,
Weesie, L 1997. din47:vot. Verifying value label mappings. Stats Technical Bulletin 37: 7-8. Bull_tin Reprints, 7, pp. 39-40.
Also:See Background:
Reprinted
[u] 15.6 Dataset, variable, and value labels
Reprinted in Stata Technical
ladder -- Ladder of powers
Title []
Syntax ladder
varname [if
gladder
varname
qladder
varname
symbol(string)
exp]
[in range]
[if exp] [in range] [if
exp]
[in
margin(string)
range]
[" g_enerate(newvar) [, bin(#)
graph_options
noadjust
]
]
[, grid scale(#)
saving([ilename[,
replace])
]
by ... : may be used with ladder; see [R] by.
Description ladder searches a subset of the ladder of powers (Tukey 1977) for a transform that converts varname into a normally distributed variable, sktes_: is used to test for normality; see [R] sktest. Also see [el boxcox. gladder
displays nine histograms
of transforms Of varname according
to the ladder of powers.
qladder displays the quanfites of transforms of vamame according to the ladder of powers against the quantiles of a normal distribution.
Options generate(newvar) saves the transformed values co_esponding to the minimum chi-squared value from the table. Its use is not, in general, recommended since generate() is quite literal in its interpretation noadjust
of minimum, thus ignoring nearly equal but perhaps more interpretable
is the noadjust
option to sktest;
see [R] sktest.
bin (#) specifies the number of bins for the histograms. for you (see Methods and Formulas below). graph_options grid scale
transforms.
If not specified, an intelligent choice is made
are any of the options allowed with graph,
histogram;
see [G] histogram.
adds grid lines at the .05, .I0, .25, .50, .75, .90, and .95 quantiles. (#) specifies the size of text used to label the graphs, scale(1.25)
symbol(string)
is the default.
specifies the symbol used in the graph.
margin(string) specifies the margin to be placed around each graph as a percenl of graphical area. The default is 0. saving_Iename[,
replace
]) saves the graph.
i
186
,
lad_ler-- Ladder of powers
18/
IrrrkS Example You have data on the mileage rating of 74 automobiles and wish to find a transform that makes the varihble n_rmally distributed:
.-ila e =pg T_an_Io_mation
formula
chi2(2)
P(chi2)
cube square raw squaxe-_oot log reciprocal root r_icil_rodal
mpg'3 mpg'2 mpg sqrt(mpg) log (_g) I/sqrt(mpg) llmpg
43.59 27.03 10.95 4.94 O.87 0.20 2.36
(}. 000 O. 000 O.004 O.084 O. 647 0.905 O.307
recipro_al square rdci_rodal cube
i/(mpg-2) I/(mpg'3)
1I. 99 24.30
O. 002 O, 000
Had we t>ped '_adder mpg, gen (mpgx). the variable mpgx would have been automatically generated for us containing 1/_ m,'_pg.This is the perfect example of why you should not. in general, specify the generate() option. Note that we also cannot r_iect the hypothesis that the reciprocal of mpg is normally distributed and i/mpg gallons per mile has a better interpretation. I! is a measure of energy ¢onSun_ption. q
Example glad_lerexplores the same transforms as ladderbut presents results graphically: • gladd_
mpg
Histograms
Mi{eage(@pg) by Transformation
q
188
ladder -- Ladder of powers
Q Technical Note
!_
gladder is useful pedagogically, but some caution must be exercised when using it for research work, especially with large numbers of observations; For instance, consider the following data on the average July temperature in degrees Fahrenheit for 1954 U.S. cities:
ii+.i
. ladder tempjuty Transformation
formula
chi2 (2)
cube square raw square-root log reciprocal root reciprocal reciprocal square reciprocal cube
tempjuly_3 tempjuly'2 tempj uly sqrt (tempiuly) log (tempjuly) 1/sqrt (tempjuly) I/tempjuly 1/(tempjuly'2) I/(tempjuly'3)
47.49 19.70 3.83 1.83 5.40 13.72 26.36 64.43
P (chi2) O.000 O.000 0.147 O.400 O. 067 O.001 O. 000 O. 000 O. 000
The period in the last line indicates that the ;g2 is very large; see [R] sktest. From the table, we see that there is certainly a difference, normality-wise, between the square and square-root transform. If, however, you can see the difference between the transforms in the diagram below, you have better eyes than we do: • gladder tempjuly
eubo
squir*
_9_23
B_eo28
menl_+_/
337_
sqn
5B,_
,og
_
_q+l
+_ .+?044
+ + 7 S+2_14
-+
+ i ++7471
o 4.0+++P'
m¥@rll
'
*
• ++1+m3
+
l/Ioum+l
_S89_
i . °,+++,,
+++055
.2704
t
t I +o,o,,,+,
. +_+3+31+p
!tcm_l
:+J..... oo_.,+
Average
Histograms
duty
+,_°, +J
tem_eralure
by Trafisformation
CI
Example
+i
A better graph for seeing normality is the quantile-normal
graph which can be produced by qladder.
ladder-- Ladder of powers
,1
qladder tempjuly
g2_}0_6 I
--a_
1381B1 15 _ 13B'_31
B7 BO._6
.... 721058
r
3110.81 311:0 81
sqrl
7 68
58,1 5B.1481
8215.65
Jog ....
53
9.63357
4.0J_963
_nverse
-.017212 _ ..016441
93,6
tlsqrt
4.54141
-.128761
I tsquere
$.
t -000296 - 01035
-.00{_263
Average
Ouantile-Normat
-,000098
91.9594
-.102562
1/cube
-5.1e.06 -4.2e-06
-7 4e-07
Ju y temperature
PlOts by Transformation
This graph shows that for the square transform, the upper tail, and only the upper tail, diverges from,what would be expected, This is detected bs_sktest as a problem with skewness, as we would learn from using sktest to examine tempjuly squared and square-rooted, ,3
:SavedtResults l_dd_r saves in r(): Scalars r(l_)
number of observations
r(cube)
X_ for cube transformation
r(P_cube)
significance level for cube transformation
r(square)
_2 for square transformation
r(P_square)
significance level for square transformation
r (raw)
X_ for untransfom_ed: data
r(P_raw)
significance level for iuntransformed data
r(sqrt)
_:e for square-root
r(P_sqrt)
significance level for square-root
r(log)
.x: for log transformation
r(P_log)
significance level for log transformation
r(invsqrt)
:ts for reciprocal root transformation
r(P_invsqrt)
significance level for reciprocal
r(inv)
X2 for reciprocal transformation
r(P_inv)
significance level for reciprocal
r(invsq)
;_2 for reciprocal square transformation
r(P_invsq)
significance
r(invcube)
k 2 [br reciprocal cube transformation
r(P_invcube)
s_gnificance level for reciprocal cube transformation
transformation transformation
root transformation transfommtion
level tbr reciprocal square transformation
190
ladder -- Ladder of powers
Methods and Formulas ladder,gladder,and qladder are implemented For ladder, results are as reported by sktest; transform with the minimum X 2 value is chosen.
as ado-files. see [R] sktest. If generate
() is specified, the
gladder sets the number of bins to min(v/-_ , 10 log_o n), rounded to the closest integer, where n is the number of unique values of varname. See [G] histogram for a discussion of the optimal number of bins. Also see Findley (1990) for a ladder-of-powers variable transformation program that produces one-way graphs with overlaid box plots, in addition to histograms with overlaid normals. Buchner and Findley (1990) discuss ladder-of-powers transformations as one aspect of preliminary data analysis. Also see Hamilton (1992, 18-23).
Acknowledgment qladder
was written by Jeroen Weesie, Utrecht University,
Netherlands.
References Buchner, D. M. and T. W. Findley. 1990. Research in physical medicine and rehabilitation: viii preliminary data analysis. American Journal of Physical Medicine and Reliabilitation 69:154-169. Findley, T. W. 1990. sod3: Variable transformation and evaluation. Smta TechnicalBu!letin 2: t5. Reprinted in Smm TechnicalBulletin Reprints, vol. 1, pp. 85-86. Hamilton. L. C. 1992. Regression with Graphics. Pacific Gro_'e.CA: Brooks/Cole Publishing Company. Tukey, J. W. 1977. Exploratory Dam Analysis. Reading, MA: Addison-Wesley Publishing Company.
Also See Related:
IR] hoxcox, [R] diagplots,
Background:
Stata Graphics Manual
[R] Insl_ew0, [R] Iv, [R] sktest
Title
level _ Set default confidence level
Syntatx S_t!leVel#
DescriptiOn •
i
set It#el specifies the default confidence level for confidence intervals for all commands that repoia Confidence intervals. The initial value is 95. meaning 95ck confidence lnter_a],. " , _ # may . be betweeri 10 and 99.
Remarks i
I
TO change the width of confidence intervals reported by a particular command, it is not necessary to re_et!theldefault confidence level. All commands that report confidence intervals have a level (#) option. 'W_en you do not specify the option, the confidence intervals are calculated for the default level sei b_ set level
or 95% if you have not reset it.
> Example You tlse the c± command to obtain the confidence interval for the mean of mpg: . ,ci
Impg
mpg _ariable
I
74 Obs
21.2973 Mean
.6725511 Std. Err.
19.9569 [95Y, Conf.
22.63769 Interval]
[90]/,Conf.
Interval]
To obtain _0% confidence intervals, you could type • {el_pg, level(90) _ariable
Obs
mpg
74
Mean 21.2973
Std.
Err.
.6725511
20.17683
Std.
[90Y, Conf.
22.41776
Or iset i level
90
• :ci _pg g,_riable
Obs
mpg
74
Mean 21.2973
Err.
,6725511
20. 17683
Interval] 22.41776
If y'ou opt for the second alternative, the next time you estimate a model (.say with regress). OOe/_ confidence intervals will be reported. If you wanted 95% confidence intervals, you could specify' level (95) I on the estimation command or you dould reset the default by typing set level 95. 1
191
Also See
192 level -- Set default confidence level Complementary: [R] query
ii
Related:
[R] ci
Background:
[U] 23 Estimation and post-estimation commands. [O] 23.5 Specifying the width of confidence intervals
•
,
!
' J i
-- Quick reference for limits ,i i, ri i I i r i_lll
H
I,,,,lliHillllll,
i
i
!
Descdp{tidn ' i T!fisientry provides a quick reference for the size limits in Stata. Note that most of these limits are So _igt that you will never encounter them.
Remarks Maximum size limits for Small Stata and Intercooled Stata
_
l
_
,
Small Stata
Nu_be[ of observations Nu_bei of variables WiSh 0f a dataset Valde df matsize : I Numberof characters _n a command
about t,000 99 200 40
67,800
50
50
t,600 100 8
1,600 100 8
66 50
200 t 50
256 80 5
512 80 5
3,400
67,784
32
32
1,000 37,296
3,500 135,600
_ngthlof variable name Length tof ado-command name
32 32
32 32
Eer_gthtof global macro name _ngth!ofi local macro name
32 31
32 3t
Lengthi of a string variable
80
80
Numberf 1)ilirr ited°fconditionSby memory.inan if statement (Confi,ued on nero page)
30
100
Nu,be t of elements in a numlist Nurbbe i of unique time-series operators in a command Numbe_cof seasonal suboperators per time-series operator Nur0be [i of dyadic operators in an expression Nurpbe I of numeric literals in an expression Nur0'be_-,of string literals in an expression _ngth of string in string expression Numbe[ of sum functions in an expression Numbe_ of characters in a macro Nu_be_ of nested do-files Nu,be_ of lines in a program Nu_be[ of characters in a program
I
2.147,483,647 (1) 2,047 8,192 8(30
3,500
Nu_be i of options for a command
[i
Intercooled Stata
1
]
1 ' 1
i
1 ] t
Z
193 i i
!
194
limits -- Quick reference for limits Maximum size limits for specific commands Small State
Intercooled Stata
8 4
8 4
anova
Number of terms in anova model test statement Number of terms in the repeated() option char
Maximum length of a single characteristic
3.400
67,784
constraint Number of constraints encode
1,000
1,000
1,000
65,536
and decode
Number of uniquevaluesfora string variable estimates hold Number of stored estimation results
10
10
13
13
_.N/2
_.NI2
graph (See State Graphics Manual for graph limits) greigen Number of eigenvalues
plotted by greigen
grmeanby Number of unique values in varlist hist Number of unique values in varname impute Number of variables in varlist
(Table continued
on next page)
50
50
31
3t
!
V
limits -- Quick reference for lhnits
195
Maximum size limits for _pecific commands, continued
Small Intercooted St,am Stata iRecord length without data dictionary Record length with data dictionary (_)
none 7.998
none 7,998
none 7,998
none 7.998
80 80 80 32 1.000
80 80 80 32 65.536
Infix !Record length without data dictionary Record length with data dictionary (2) labal Length of dataset label Length of variable tabel Length of value label Length of name of value label Number of codings within a single Value label mat :ix Size of a single matrix
40x40
tnajimize options Number of iterations specified with iterate()
16.000
16.000
10
10
20
50
8
8
t.000 9.999 9.999
67.784 9.999 9.999
1.600
1.600
_l_git Number of outcomes in model
20
50
Op_obit Number of outcomes in model
20
50
800×800
mer_e
, ,
Number of variables that you can specify in a match-merge ml_git Number of outcomes in model nl4git and nlogittree Number of levels in model i_o_es
,
! Maximum length of a single note _ Number of notes attached to _dta Number of notes attached to each variable
nu_list i Number of elements in the numeric list
(2) :or Stata for Unix. the maximum record length is 19.998.
i'l
i _!
Maximum size limits for specific commands, limits -- Quick reference for limits
196
continued
plot Number of columns specified with column() option Number of lines specified with lines() option reg3, set
sureg, and other system estimators Number of equations
Small Stata
[ntercoolcd Stata
133 83
133 83
40
800
500K
500K
5
5
4
4
3.000
3,000
500 160 20
3,000 300 20
375
375
40
800
adosize
Maximum
amount of memory that ado-files may consume
sts graph Number of by variables (3) tabdisp and table Number of by variables Number of margins; that is, the sum of the rows, columns, supercolumns, and by groups tabulate Number of rows for a one-way table (4) Number of rows for a two-way table (4) Number of columns for a two-way table tabulate,
summarize
Number of cells xt estimation commands Number of time periods
(3) May be restricted to fewer depending on other options specified. (4) For lntercooled Stata for the Macintosh, limits are 2.000 for the number of rows for a one-way table and 180 for number of rows for a two-way table.
Also See
[I
Related:
[R] matsize,
[R] memory
Background:
[U] 7 Setting the size of memory
Title
,
[
i
I
II
0
I
U
I[
I
lin_o,exp [,level
iF
I
IU i
i
I
[I "
(#) or hr irr rrr e__form]
exp is apy 1:near combination of coefficients that is valid syntax for test: riot _ont_fin any additive constants or equal signs.
S¢¢ [R] test, Note, however, that exp must
Descrilti(m
'
li_o_, computes point estimates, standard errors, t or z statistics, p-values, and confidence intervals )r a linear combination of coefficients after any estimation command except a.nova. Results can optionally be displayed as odds ratios, hazard ratios, incidence rate ratios, or relative risk ratios. I
The. s_ y estimation commands for survey data have their own special command, svylc, estimating linear combinations: see [R] svylc.
OptiOnS leveli( ) specifies the confidence level, in percent, for confidence intervals. I
'
i
"
l
,
l i
[
Syntax
I
i
c m -- Linear combinations of estimators
or ;Is s_t by set level:
The default is level
for
(95)
see [U] 23.5 Specifying the width of confidence intervals.
or. hr_, ilr, rrr, and eform all do the same thing:, they all report coefficient estimates as exp(3) rather than ft. Standard errors and c?nfidenc_ intervals are similarly transformed. Note that or is the default after logistic.The onl 3 differehce in these options is how the output is labeled.
, :
Option
Label
Exl_lanation
Example commands
or
Odds Ratio
Odds ratio
logistic,
hr
_Iaz. Ratio
Ha_d
stcox,
irr
IRR
Incidence Rate Ratio
poisson
rrr
_
Relmive Rate Ratio
mlogit
eform
exp(b)
Gerieric label
ratio
logit streg
Remarks After itting a model and obtaining estimate_ for coefficients/3t,/32, .... 3_-, one often wants to view esti! aates for linear combinations of the 3i!, such as 31 32. lincomcan display estimates for an)" lilaem combination of the form ¢131 + C2/_2 + " '- * Ck'3k. Any valid works. estimation command for which iinco: works after ans' test except anova expressio_ tbr test Syntax l (see [R] test-_is a valid expression for lincom,There is only one exception to this rule: lincom does not allow ddditive constants: i.e.. it cannot display estimates for co -t- C131_-" "+ ck/3_,-when co _ O. line@
is useful for viewing odds ratios, lhazard ratios, etc.. for one group ti.e., one set _f
covariatel)1 relative to another group (i.e.. another set of covariates). See examples below 197
D Example We estimate a linear regTession: i!I_
- regress y xl x2 x3
l Source I
!" ili_ !i
Model Residual
,
Total
l
SS
.
df
MS
3259,3561 1627.56282
3 144
i086,45203 II .3025196
4886.91892
147
33.24434_4
Y
Coef.
t
/ i
= = = = =
148 96.12 0.0000 0.6670 0.6600 3.3619
Std. Err.
I
l xt x2 x3 _cons
Number of obs F( 3, 144) Prob > F R-sql/ared Adj R-squared Root MSE
1.457113 2. 221682 -.006139 36.10135
1. 07461 .8610358 .0005543 4.382693
1.36 2.58 -11,08 8,24
P> [t I O. 177 O. 01I O,000 O,000
[95Y,Conf, Interval] -. 6669339 • 5197797 -,0072345 27.43863
3. 581161 3. 923583 -.0050435 44. 76407
To see the difference of the coefficients of x2 and xl, type lincom x2 - xl (1)
- xl
+ x2 = 0.0
y
Coef.
(1)
Std. Err.
.7645682
• 9950282
The expression can be any linear combination
t O. 77
P> ItI O. 444
[95Y,Coal. Interval] -1.20218
2. 731316
without a constant.
lineom 3,xl + 500,x3 (I)
3.0 xl + 500,0 x3 = 0.0
y
Coef.
(1)
1. 301825
Std. Err. 3. 396624
t O. 38
P>It I O. 702
[957,Conf. Interval] -5. 411858
8.015507
Expressions with additive constants are not allowed lincom xl
-
1
additive constant terms not allowed r (198) ;
norarenonlinear expressions. • lincom X2/xl not possible with test r(131) ;
<3 Q Technical Note lincom uses the same shorthands for coefficients as does test (see [R] test). When you type xl, for instance, lincom knows that you mean the coefficient of xl. The formal syntax for referencing this coefficient actually _b [xl], or alternatively _coef [xl]. So, more forma]ly, in the last example we could have istyped 2incom 3*_b[xl] + 500*_b[x3]
l
rl_
iincom-
!
Linear combinationsof estimators
t99
OddsAfter rati(,s andregression, incidence ratios l_gistic the orrate option can be specified I i
, _
with lincom to display odds ratios for any effect.: InCidence rate ratios after commands such as poisson can be obtained in a similar fashion by spe_cif_ing the irr option.
> Example t
Consider the low birth weight dataset from H0smer and Lemeshow (1989, Table 4. I). We estimate a logistic regression of low birth weight (variable low) on the following variables:
i
_
Vail _ble
Description
Coding
age
age in vears
bla,:k
race black
1 if black, 0 otherwise
Oth,;r
race other
1 if race other, 0 otherwise
Smol:e
smoking status
1 if smoker, 0 if nonsmoker
tit
history of hypertension
1 if yes. 0 if no
u±
uterine irritability
t if yes, 0 if no
t'wd
maternal weight before pregnancy
] if weight < t10 lb., 0 otherwise
aige_.wd
age × twd
smol:elwd ptd
smoke history x of lwd premature labor
l if yes.. 0 if no
i . l
We firsI estimate a model without the interaction terms agelwd
and smoketwd (Hosmer and
Lemest_ovq1 1989, Table 4.8) using logit ,
Io [it low
age lwd black
other
smoke p_d
ht ui
-117.336
I%er .tion O:
log likelihood
=
I_er _tion i:
log likelihood
= -99.4311]_4
l_eration
2:
log likelihood
= -98.785718
Ieerltion
3:
log likelihood
=
I%er _tion 4:
log likelihood
= -98.777998
L_gi
-98.7_8
estimates
Number
! i
=
189
LR chi2(8)
=
37.12
> chi2
=
0.0000
:
0.1582
Prob Log
.ikelihood
: -98.777998
Pseudo
of obs
R2
, I
:
,
low
i
age twd
i
-. 0464796 .8420615
Std. Err. .0373888 .4055338
z
P>[zI
-1.24 2.08
0.214 O. 038
[957, Conf. -. 1197603 ,0472299
Interval] .0268011 1,636893
black
I.073456
.5150752
2.08
O. 037
.0639273
2.082985
other smoke
.815367 .8071996
.4452979 .404446
1.83 2. O0
0.067 O. 046
-. 0574008 .0145001
I. 688135 i.599899
ptd
i i
Coef.
i .281678
.4621157
2.77
0,006
.3759478
2. 187408
ht
:1. 435227
.6482699
2.21
0.027
.1646415
2. 705813
ui _cons
.65762S6 -1.216781
.4666192 .9556797
I. 41 _1.27
-. 2569313 -3.089878
1. 572182 .6563!7
O. 159 0.203
To _et the odds ratio for black smokers relative to white nonsmokers (the reference group), type
• lincom (1) i: I
200
black
black
+ smoke,
+ smoke
or
= 0.0
Iincom -- Linear combinations of estimators
ili
low
0dds
Ratio
Std.
z
Err.
il
P>,z,
[957, Conf.
Interval]
o0o, lincom computedcxp(_black+ blacknonsmokers,type lincom (I)
smoke
- black
- black, + smoke
low
Odds
(1)
_smoke)
_-
6.56.To seetheoddsratioforwhitesmokersrelative to
or
= 0.0
Ratio
Bid.
.7662425
Err.
z
.4430176
-0.46
P>IzJ
[95% Conf.
0.645
.2467334
Interval] 2.379603
Now let's add the interaction terms to the model (Hosmer and Lemeshow 1989, Table 4.10). This time we will use logistic rather than legit. By defaulL logistic displays odds ratios. . logistic Logit
Log
low
age
black
other
smoke
ht ui
Iwd
estimates
likelihood
=
low
-96.00616
Odds
Ratio
Std.
Err.
z
ptd
agelwd
smokelwd
Number of obs LR chi2(10) Prob > chi2
= = =
189 42.66 0.0000
Pseudo
=
0.1818
R2
P>|zl
[95_ Conf.
Interval]
,8408967 1.068277
1.005344 8,167462
age black
.9194513 2.95383
.041896 1.532788
-1.84 2.09
0.065 0.037
other
2.137589
.9919132
1.64
0.102
.8608713
5,307749
smoke
3.168096
1.452377
2.52
0.012
1.289956
7.780755
ht
3.893141
2.5752
2.05
0.040
1.064768
14.2346
ui
2.071284
.9931385
1.52
0.129
.8092928
5.301191
lwd
.1772934
.3312383
-0.93
0.354
.0045539
6.902359
ptd
3.426633
1.615282
2.61
0.009
1.360252
8.632086
1.15883 .2447849
.09602 .2003996
1.78 -1.72
0.075 0.086
.9851216 .0491956
1.36317 1.217988
agelwd smokelwd
Hosmer and Lemeshow (1989, Table 4.13) consider the effects of smoking (smoke -:- 1) and low maternal weight prior to pregnancy (lwd = 1). The effect of smoking among non-low-weight mothers (lwd -- 0) is given by the odds ratio 3.17 for smoke in the logistic output. The effect of smoking among low-weight mothers is given by • lincom (I)
smoke
smoke
low (1)
+ smokelwd
+ smokelwd
Odds
= 0.0
Ratio
.7755022
Std.
Err.
.5749508
z
P>Izl
[957 Conf.
-0.34
0.732
.1813465
Note that we did not have to specify the or option
After
logistic,
lincom
Interval] 3.316322
assumes or by default.
The effect of low-weight (Iwd = 1) is more complicated since we fit an age x lwd interaction. We must specify' the age of mothers for the effect. The effect among 30-year old nonsmokers is given by t
i _• !|_
i
'
_
i
lin¢om-- Linearcombinations of estimators
i _incom l_d + 30*agelwd (ii) lwd + 30.0 agelwd
i
i.
low
I
(t)
i
t
=
201
0.0
Odds Ratio 14. ?669
Std,
Err.
13. 56689
z 2.93
P>lz[
[95X Conf.
O.003
2. 439266
Interval] 89. 39625
..........
"
lincom _omputed exp(fllwd+30,_agelwd) = 14_.8.It seemsodd that we entered it as lwd+ 30*agelwd. but remember that lwd and agelwd are just'lincom's (and test'S) shorthand for _b[twd] and
i
_b [age_wd].
i
I
_ i
We could
i
i !
typed
(ii) incomlwd _b[1wd] + 30.0+ agelwd 30*_b[agelwd] = 0.0
low (1)
i
!
have
I
Odds Ratio 14. 7669
Std. Err. 13. 56689
z 2.93
P> Iz I O. 003
[957,Conf. Interval] 2. 439266 89. 39625
,
Multiple- quation models lincpm also works with multiple-equation models. The only difference is how"y'ou refer to the coefficiehts. Recall thai for multiple-equation models, coefficients are referenced using the syntax [e_no] varname where e,_nois the equation number or equation_nameand varname is the corresponding variable name for the cpefficient: see [U] 16.5 Accessing coefficients and standard errors and JR] test for detaih. ExampM, Consider the example from [R] mlogit (Taflov et al. 1989: Wells et al. I989).
!
. _logit insure age male nonwhite site2 site3,
nolog
Mu_tinomial regresszon
Number of obs LR chi2(I0)
= =
615 42.99
Lo i likelihood = -534.36165
Pseudo Prob > R2 chi2
= =
0,0387 O.0000
} !
insure
Coef.
age
-.Ol 1745
.0061946
-I.90
O.058
-.0238862
.0003962
male nonwhite
.5616934 .9747768
.2027465 .2363213
2.77 4.12
O.006 0.000
.1643175 .5115955
.9590693 1,437958
site2 site3 cons
.1130359 -. 5879879 .2697127
•2101903 .2279351 .3284422
O,54 -2.58 O.82
O.591 O. 010 O.412
.2989296 -1.034733 -.3740222
.5250013 -. 1412433 .9134476
I
age male
-.0077961 .4518496
.01t4416 .3674867
-0.68 1.23
0.496 0.219
-.0302217 -.268411
.0146294 I.17211
i
nonwhit e
.2170589
.4256361
O.51
O.610
-.6171725
i.05129
i
site2 site3 _cons
-1.211563 -.2078123 -1.286943
.4705127 .3662926 .5923219
-2.57 -0.57 -2.17
O.010 0.570 0.030
-2.133751 -.9257327 -2.447872
-.2893747 .510108 -. 1260135
i "
'
Std. Err.
z
P> Izl
[95Y,Conf. Interval]
Prepaid
i
Un insure
l (C _tcome insur!==Indem iB the comparison group) i
i' _
202 Linear combinations of estimators To see thelincom estimate-- of the sum of the coefficient of male and the coefficient Prepaid outcome, type • lincom (I)
of nonwhite
for the
[Prepaid]male + [Prepaid]nonwhite
[Prepaid]male + [Prepaid]nonwhite = 0.0
insure I (1)
Coef.
I
Std. Err.
1.53647
.3272489
z
P>Izf
4.70
0. 000
[95Y,Conf. Interval] lr
.
.8950741
2.177866
To view the estimate as a ratio of relative risks (see [R] miogit for the definition and interpretation), specify the rrr option. lincom (1)
[Prepaid]
male
[Prepaid]male
+ [Prepaid]
insure I (i) I
nonwhite,
+ [Prepaid]nonwhite
KPd{
rrr = 0.0
Std. Err.
4.648154
z
1.521103
4.70
P> [zl
[95X Conf. Interval]
0.000
2.447517
8.827451
q
Saved Results lincom saves in r(): Scalars r (estimate) r(se)
point estimate estimate of standard
r(df)
degrees
error
of freedom
Methods and Formulas lincom is implemented
as an ado-file.
References Hosmer, D. W., Jr,, and S. Lemeshow. edition forthcoming in 2001,)
f989. Applied
Logistic
Regression.
Tarlov, A. R,, J. E. Ware. Jr,, S. Greenfield. E. C. Nelson, E. Perrin. study, Journal of the American Medical Association 262: 925-930,
New York: John Wiley & Sons. (Second
and M. Zubkoff.
t989.
The medical
outcomes
Wells. K. E., R. D. Hays, M, A. Burnam, W, H. Rogers, S. Greenfield, and J. E. Ware. Jr. t989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302,
Also See Related:
[R] svylc, [R] svytest, [R] test, [R] testni
Background:
[u] 16.5 Accessing coefficients and standard errors, [U] 23 Estimation and pOst-estimation commands
I
rR r-
Title
II
l[nktes -- Specification link test for single-equation models i
!
i
I [
I
I
J
[
iI
I
i
I [
I
T[
I
I I
Syntax
i
lin.kt,
st [if exp] [in range] [, estimation_options ]
_qaen i: cap and in range are not specified,the link _estis performedon the same sample as the previous estimation.
Descripti(,n iin.kt_ st performs a link test for model specificationafter any single-equationestimationcommand such as 1, .gistic.
i
etc.;
regress,
see [R]
estimation commands.
I
:: ! ;
Options estimation_options must be the same option_ specified with the underlying estimation command.
] :
i
Remarks • i
The fotm of the linkltest implemented here if based on an idea of Tukey (1949) which was further descfibedlby Pregibon !(1980), elaborating on ,/,ork in his unpublished tl_esis (Pregibon t979). See Methods _nd Formulas! below for more details.
We at mpt to explifin Exampletl , the mileage ratings Of cars in our automobile dataset using the weight. engine displacement, a_d whether the car is manufactured outside the U.S.: r,_gress mpg
Source Model ;
w_ight
i !1619,71935
Residual
_ Total
mpg i
weight
23.740114 ; 12443.45946 _ SS
'
! i
foreign _cons
Coef.
-.0067745
displacement I
displ
.0019286 i-1.600631 41.84795
t
foreign
3
539.906448
F( 3, Prob > F
70
Ii.7_77159 33.4720474 MS
73 dI
Std. Err. .0011665 .0100701 1.113648 2.350704
t
P>It I
70)
= =
45.88 0.0000
R-squared
=
0,6629
Adj R-squared Root Number MSE of obs
= = =
0.6484 3.4304 74
[95X Conf.
Interval]
-5.81
0.000
-.0091011
0.19
0.849
-.0181556
.0220129
-1.44 17.80
0.155 O. 000
-3.821732 37.15962
.6204699 46.53628
-.0044479
204
linktest -- Specification link test for single-equation models
Based on the R 2. we are reasonably pleased with this model. If our model really is specified correctly, then were we to regress mpg on the prediction and the prediction squared, the prediction squared should have no explanatory power. This is what link"cost does: linktest Source
_
SS
df
f
Model Residual
] |
mpg
Number of F( 2,
1670.71514
2
835.357572
Prob
772.744316
71
10.8837228
73
33.4720474
I Total
MS
2
443
I
.45946
Coef.
I
Std.
Err.
obs 71)
= =
> F
74 76.75
=
0.0000
R-squared
=
0.6837
Adj K-squared Root MSE
= =
0.6748 3.299
t
P>Itl
[95_
Conf.
-0.63
0,532
-i.724283
2.09 2.16
0.041 0.034
.6211541 .0026664
Interval]
i
_hat _cons _hatsq
]
-.4127198
t
14.00705 .0338198
We find that the prediction good as we thought.
.6577736 6,713276 .015624
.8988434 27.39294 .0649732
squared does have explanatory, power, so our specification
is not as
Although linktest is formally a test of the specification of the dependent variable, it is often interpreted as a test that. conditional on the specification, the independent variables are specified incorrectly. We will follow that interpretation and now include weight-squared in our model: • gen
weight2
regress
= weight*weight
mpg
Source
weight I
Model
weight2 SS
displ
foreign
df
MS
Number F( 4,
of obs 69)
74 39.37
1699.02634
4
424.756584
Prob
=
0.0000
744.433124
69
10.7888859
K-squared
=
0.6953
Total
2443.45946
73
33.4720474
Adj R-squared Root MSE
= =
0.6777 3.2846
mpg
Coef.
Std.
Residual
Err.
t
P>rtl
> F
= =
[95_
Conf.
Interval]
weight
-.0173257
.0040488
-4.28
0.000
-.0254028
-.0092486
weight2
1.87e-06
6.89e-07
2.71
0.008
4.93e-07
3.24e-08
-.0101625
.0106236
-0.96
0.342
-.031356
foreign
-2.560016
1.123506
-2.28
0.026
-4.801349
-.3186832
_cons
58.23575
6.449882
9,03
0.000
45.36859
71.10291
displacement
.011031
And now we perform the link test on our new model: linktest
Model
1699.39489
Kesidual
i
744. 06457
Total
I
Source
[
2443.45946
SS
2
849.697445
Prob
=
0.0000
71
I0.4797827
R-squared
=
O. 6955
33.4720474
Adj R-squared Root MSE F( 2, 71)
= =
0.6869 3.2372 81.08
Number
=
73
df
MS
> F
of obs
74
'
gi
linktest-- specifi_ationlink test for single-equationmodels
i !•
i
mpg
Coef.
1
i,
hat hatsq
I 141987
.7612218
1.50
0.138
-.3758456
2.659821
i
_cons
- .0031916 -i.50305
.0170194 8.196444
-0.19 -0.18
O.852 O.855
-.0371272 -17.84629
.0307441 14.84019
'
t
P>Itl
[957,Conf. Interval]
! We now pass the link!'test.
! i
Std. Err.
205
•
> Exampl Abo_ we followe_ a standard misinterpretation of the link test when we discovered a problem, we focusted on the exl_lanatory variables of our model. It is at least worth considering varying exactly
i
what thi link test testS. The link test told us it]at our dependent variable was misspecified. For those
i
with _engineeringconsurr_tion__gallonsbackground, mpgperis mile--inindeed a terms strangeofmeasure.andIt woulddisptacement:make more sense to modelan@ergy weight ! _egress gpm _eight displ foreign i Source I 85 df i
i
] Model i Residual
i
Prob > F R_squared
=
0.7659
.01t957628
73
.000163803
Root MSE Adj R-squared
: =
.00632 0.7558
weight displacement I foreign
_cons
Std. Err.
t
P>lt I
[957.Conf, Interval]
.0000144
2.15e-06
6.72
O.000
.0000102
.0000187
.0000186 .0066981
.0000186 .0020531
I.O0 3.26
O.319 O. 002
-.0000184 .0026034
.0000557 .0107928
.0008917
.0043337
0.21
O. 838
-. 0077515
.009535
looks eve _ bit as reasonable as our original model.
inkiest _.
[ [
.003052654 .000039995
Coef.
_
,
Source
.
I li
SS
df
Residual
i .002782409 I Total .011957628 Model l ) i .009175219
gpm [i
Coef.
hat hatsq
I i I i
.6608413 3.275857
li
-_cons
))
.008365
irt a m( eparsimonio_s
MS
Number of obs =
F( 2,
!
i
74 76.33 0. 0000
3 70
gpm
This re+el
Number of obs = F( 3, 70) =
: .009157962 ;: .002799666
Total
! !
i
MS
71
.008039189
73 2
.000163803 .00_587609
Std. Err.:
t
P> It{
74
71) :
117.06
R-squared = Adj R-squ_red : Root = Prob MSE > F =
0.7673 0.7608 .00626 0.0000
[95_ Conf. Interval]
.5152751 4.936655 ;
1.28 0.66
0.204 0.509
-.3665877 -6.567553
1.68827 13.11927
.0130468
0.64
0.523
-.0176496
.0343795
speecmcanon " .
206
linktest -- Specification link test for single-equation models
> Example The link test can be used with any single-equatio/q estimation procedure, not solely regression. Let's mm our problem around and attempt to explain whether a car is manufactured outside the U.S. by its mileage rating and weight. To save paper, we will specify logit's nolog option, which suppresses the iteration log: . logit foreign mpg weight, nolog Logit estimates
Number of obs LR chi2 (2) Prob > chi2
= = =
74 35.72 0.0000
Log likelihood = -27.175156
Pseudo R2
=
0.3966
foreign
Coef.
mpg weight _cons
-.1685869 -.0039067 13.70837
Std. Err.
z
P>]z_
.0919174 .0010116 4.518707
-1.83 -3.86 _ 3.03
0.067 O.000 O.002
[95_. Conf. Interval] -.3487418 -.0058894 4.851864
.011568 -.001924 22. 56487
When you run linktest after logit,the result is another logit specification: linktest, nolog Logit estimates
Number of obs LR chi2(2) Pro5 > chi2
= = =
74 36.83 0.0000
Log likelihood = -26.615714
Pseudo R2
=
0.4090
foreign
Coef.
_hat _hatsq _cons
.8438531 -.1559115 .2630557
Std. Err. .2738759 .1568642 .4299598
z 3.08 -0.99 0.61
P>Iz_ 0.002 0.320 0.541
[95_ Conf. Interval] .3070661 -.4633596 -.57965
1.38064 .1515366 1.105761
The link test reveals no problems with our specification. If there had been a problem, we would have been Virtually forced to accept the misinterpretation of the link test we would have reconsidered our specification of the independent variables. When using logit, we have no control over the specification of the dependent variable other than to change likelihood functions. We admit to seeing a dataset once where the link test rejected the logit specification. We did change the likelihood function, re-estimating the model using probit, and satisfied the link test. Probit has thinner tails than logit. In general, however, you will not be so lucky. q
_3Technical Note You should specify exactly the same options with linktest as you do with the estimation command, although you do not have to follow this advice as literally as we did in the preceding example, logit's nolog option merely suppresses a part of the output, not what is estimated. We specified nolog both times to save paper. : '_
If you are testing a cox model with censored observations, however, you must specify the dead() option on linktest as well. If you are testing a tobit model, you must _pecify the censoring points
i
just as you do with the tobit;
command.
T
_
linktest
!
Specification linktest for single-equation models
207
i I . !
If youiare not sure which options are important, duplicate exactly what you specified on the command. estimatio_ _ If youido not specie' if exp or in range :with li_ktest, Stata will by default perform the link test _n the same s_unpleas the previous estimation. Suppose that you omitted some data when performin_ your estimation, but want to calculate the link test on all the data, which you might do if you belleved the moclelis appropriate for all !thedata. To do this, you would type 'linktest if
:{
e(
i
pl4) -=.'.
SavedRemrs linkt,mtsaves in ::(): Scalars r(t)
t statisticon _!aats_
r(df)degreesof freedom
linkt_stis not an estimation command in the sense that it leaves previous estimation results unchangeql.For instan@, one runs a regressiofi and then performs the link test. %,ping regress without a_gumentsarid the link test still replays the original regression.
i
In ternls of integrati g an estimation commafid with linktest, linktestassumes that the name of the estimation com_nand is stored in e(cmtt) and that the name of the dependent variable in e (depval_). After estirhation, it assumes that the number of degrees of freedom for the t test is given
i
by o(df_._)if the ma!ro is defined. If the estimation co_amandreports Z statistics instead of t statistics, then tinktestwill also report Z aatistics. TheiZ statistic, however, is still returned in r(t) and r(df) is set to a missing
i
vai ,e
I i
!
Methods ,nd ForMulas, linkt.=st is implemented as an ado-file. The link test is based on the idea that if a regression or
i , !
regressior-like equatioriis properlyspecified,on_ should not be able to find any additional independent variables :hatare signil_cantexcept by chance. One kind of specificationerr'or is called,a link error. In regression, this means that the dependent vmiable needs a transformation or "link function to
i !
properly relate to the i_dependent variables. Th_ idea of a link test is to add an independent variable to the eqt ation that is l_specialb likely to be significantif there is a link error,
i
Let
l
Y = f (X/3) be the m( del and _ be!the parameter estimatesl linktest
l } I I ! ! ,i
calculates
_hat= Xa and _hat_q= _hat2 The mod_l is then refit with these two variablesl and the test is based on the significance of _hatsa. This is tNe form suggelted by Pregibon (1979/based on an idea of Tukey (t949). Pregibon (1980) su_ests slightly different method that has tome to be known as "Pregibon s goodness-of-link tesf'. We _referredthe!olderversion because it is universally applicable, straightforward, and a good second-or ier approximation. It is universally applicable in the sense that it can be applied Io any' single-eq_,ationestimation technique whereas Pregibon s more recent tests are estimation-technique specific.!
|
i
=__
Pregibon, D. 1979. Data Analytic Methods for Generalized Linear Models. Ph.D. Dissertation 1980. Goodness of link tests for generalized linear modelS. Applied Statistics 29: 15-24. Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5: 232-242.
Also See Related:
[R] estimation
commands,
JR] lrtest,
[R] test, [R] testnl
University of Toronto.
_ !
Title
i
1
I
list --
Sy ntax I!
f
i
I i l
'
_list
Iv_rlist! [i:fe_]
i
[n o]_display
nolabel
noobs doublespace
]
Descrlptic_n di ;plays the v
es of variables, If no v_rlist is specified, the values of all the variables are
_,lso see bro_vsein [R] edit.
displayed.
Options . [nojdisplgy forces th_format into display or tabular (nodisplay) format. If you do not specify one its judgment of which would be most one of t_ese two options, then Stata chooses based on re.adabk nolabel
[
[in range][,
by ... : mai, be used with kist; see [R] by. The second vadist may _ntain_ime-seriesoperators;see [U114.4.3"l'ime.seHesvaNi_s.
list I
I
, s v lues of variables
uses the nur eric codes rather than the label values (strings) to be displayed.
noobs sup _ressesprintiJ g of the observation numbers. doublesp_tce requests mt a blank line be inserted between observations. This option is ignored in displa format.
Remarks l
i° ! i
I ! I
list,t,. )ed by itself lists all the observations and all the variables in the dataset. If you specify
varlist, onl those vafia_tes are listed. Specifyifig one or both of in range and if e_p limits the observatiot listed.
:;
_ Examplei
list h. s two outpu
listing a f_w variables, whereas the display format is suitable for listing an unlimited number of variables. _tata chooses automatically between those two formats: Obse_ :vation 1 li_t in 1/2 make rep78 weight
I
formats, known as tabular and display. The tabular format is suitable for
I _ispla-t
AMC Concord 3 2,930 121
price headroom
4,099 2.5
mpg trunk
22 11
length
186
turn
40
gear_r-o
3.58
foreign
Domestic
Observation ri
'
I ,
--,-
2
.,,o..
_,.r._ vauu_ Ul vitrlODleS make AMC Pacer price rep78 3 headroom
weight
3,350
displa-t . list
make
258 mpg
weight
displ
make I. AMC Concord 2. AMC Pacer ;
3. AMC
The
Spirit
length
_
mpg trunk
17S
gear_r~o rep78
_
4,749 3.0
turn
2.53
40
foreign
Domestic
in I/5
mpg 22 17
weight 2,930 3,350
displa~t 121 258 121
rep78 3 3
22
2,640
4. Buick
Century
20
3,250
196
3
5. Buick
Electra
15
4,080
350
4
first case is an example
of display
17 II
format;
[he second
is an example
of tabular
format.
The
tabular format is more readable and takes less space, but it is effective only if the variables can fit on a single line across the screen. Stata chose to list all twelve variables in display format, but when the varlist was restricted to five variables, Stata chose tabular format. tf you are dissatisfied with Stata's choice, you can make the decision yourself. Specify the display option to force display format and the nodisplay option to force tabular format.
<1 0 Technical Note When Stata lists a string variable in tabular output format, it always lists the variable right-justified. When Stata lists a string variable in display output format, it decides whether to li st the variable rightjustified or left-justified according to the display forrnht for the string variable; see [U] 15,5 Formats: controlling how data are displayed. In our previous! example, make has a display format of 7.-18s. describe
make storage
variable
name
make
display
value
type
format
label
strl8
7.-18s
variable
label
Make
Model
and
The negative sign in the 7'-18sinstructs Stata to left+justify this variable. If the display format had been 7,18s, Stata would have fight-justified the variable. Note that it appears from our listing above that :foreign describe it, we see that it is not: describe
foreign
but if we
foreign storage
variable
is also a string variable,
name
display
value
type
format
label
variable
byte
7.8.Og
origin
Car type
label
foreign is stored as a byte, but it has an associated value label named origin;see [U] 15.6.3 Value labels. Stata decides whether to right-justify or left-justify a numeric variable with an associated value label using the same rule as Stata uses for stnng variables: it looks at the display format of the variable. In this case, the display format of 7.8. Og tells Stata to right-justify the variable. If the display format had been 7,-8.0g, Stata would have left-justified this variable.
[3
i
iI
i
i "
_
_
Xst -- List values of variables
_ Technical _ote
! You car_ list the v_riables in any order that you desire. When you specify the varlist, list makes the ttisplay in the order you specify. You 'may also include variables more than once in the vartist.
Example In some !cases, you m_y wish to suppress the Observation numbers. You do this by specifying the
lie
make
mpg wight
noons make opti,,n:
i
displ
foreign
mpg
weight
in 51/55
noobs
: displa-t
foreign
Pont.
Sunbird
24
2,690
151
Domestic
Audi Pont. Audi
_000 Phoenix ;ox
17 19 23
2,830 3,420 2,070
131 231 97
Foreign Domestic Foreign
BWW
.>Oi
25
2,650
121
Foreign
1 _ Example
1
You can Imake the list easier to read by specifOng the doublespaceoption:
I
i lis_ make make '
i
Pont. iPhoenix
19
3,420
231
Domestic
i
Pont. Igunbird
24
2,690
151
Domestic
Audi_000
17
2,830
131 Foreign
Audi
'ox
23
2,070
97
Foreign
BMW 3:!0i
25
2,650
121
Foreign
}
i } ! i_
mpg weight displ foreign in 51/55, noobs double mpg weight 4ispla~t foreign
21Technical Note
You can !suppress the use of value labels by specifying the nolabel option. For instance, the variable foreign in the _:xamples above really contains numeric codes. 0 meaning Domestic and 1 meaning Foreign.When you list the variable however, you see the corresponding value labels rather than the underlyin_ numeric code: lis_
foreign
I
51
iforeign _omestic
i
52.
_omesl;ic
I
211
53. 54.
iF°reign _Foreign
55.
IForeign
Specifying t e nolabel
in
1/55
i 1 1
ption displays the underlying numeric codes:
list
_!
212
_,
_
5i. 52. 53. 54. 55.
foreign
in
51/55,
nolabel
listforeign -- List values of variables 0 0 1 1 1
0
References Riley, A. R. 1993. dml5: Interactively list values of variable,s.Stata Technical Bulletin 16: 2-6. Reprinted in Stata TechnicalBulletin Reprints. vol. 3, pp. 37-41. Royston, P. and P. Sasieni. 1994. dml6: Compact listing Of a single variable. Stata Technical Bulletin 17: 7-8. Reprinted in Smta Technical Bulletin Reprints, vol. 3, pp_41-43. Weesie, J. t999. din68: Display of variablesin blocks. Stata TechnicalBulletin 50: 3-4. Reprinted in Stata Technical Bulletin Reprints. vol. 9, pp. 27-29.
Also See Related:
[R] edit, [P] display,
[P] tabdisp
i
Ins
! i
; i i i _I I
j i
"_
I
0 -- Find z
iit
1
_l
I
I
ire-skewness log or BoxLCox transform
lnske'_O ,,ewvar = ._xp [if exp] [in range] [, level(#) I
_delta(#)
_zero(#) 1
delta(#)
--
Syntax bcskei_O newvar = _.rp [if 1
e_,7_ ] [ill range] [ .
m
level(#)
--
--
zero(#)
] d
Deseripti n of inske_10creates n@var = ln(+exp - k). choosing k and the sign of exp so that the skewness newvar is zero. bcske_FO creates n 'vat= (exp _ - 1)/A, .the Box-Cox power transformation (B x and Cox 1964), ch_osing A so t_at the skewness of ned,vat is zero. exp must be strictly positi_°c. Also see
[R] boxeo
for maximu_n likelihood estimation of A
Options level (#) specifies the confidence level for a confidence interval for k (lnskewO) or A (bcskewO). Unlike usual, the ccnfidence interval is calculated only if level() is specified. As usual, # is _ecified as an integ,.'r; 95 means 95% confidence intervals. The level() option is honored onl>_ if the umber of observations exceeds 7. delta(#) specifies the increment used for calculating the derivative of the skewness function with respect to k (lnske'gO) or A (bcskewO). The default values are 0.02 for lnskewO and 0.0I for bcske_O. zero(#) s_ecifies a vah Eefor skewness to determine convergence that is small enough to be considered zero arld is by defau it 0.001.
Remarks
Example
1
Using dur automobih_ dataset (see [U] 9 Statai's on-line tutorials and sample datasets), we want to generatd a new variab le equal to ln(mpg- k) t6 be approximately normally distributed, mpg records the miles r gallon for _ach of our cars. One feature of the normal distribution is that it has skewness
• in_kewO lnmpg Transfor_
mpg k
[95Y,Cdnf. Interval]
Skewness
(not calculated)
-7.05e-06
....
in(mpg-k)
5.383659
214
InskewO-- Find zero-skewness log or Box-Cox transform
This created the new variable lnmpg = ln(mpg - 5.384): describe Inmpg
variable name
storage type
Inmpg
display format
float
value label
X9.0g
Since we did not specify the level we could have typed
variable label in(mpg-5.383659)
() option, no confidence
interval was calculated.
At the outset,
InskewO inmpg = mpg, level(95) Transform
I
In(mpg-k)
[
k 5.383659
[95_
Conf. Interval]
-17. 12339
Skewness
9.892416
-7.05e-06
The confidence interval is calculated under the assumption that In(mpgk) really does have a normal distribution. It would be perfectly reasonable to use Inskew0 even if we did not believe the transformed variable would have a normal distribution--if we literally wanted the zero-skewness transform--although in that case the confidence inte_'al would be an approximation of unknown quality to the true confidence interval. If we now wanted to test the believability of the confidence interval, we could also test our new variable lnmpg u!sing swilk with the !nnormal option. q
El Technical Note lnskewO (and bcskewO) reports the resulting skewness of the variable merely to reassure you of the accuracy of its results. In our above example, lnskew0 found k such that the resulting skewness was -7- 10-6, a number close enough to zero for all practical purposes. If you wanted to make it even smaller, you could specify the zero () option. Typing lnskewO new=mpg, zero (le-8) changes the estimated k to -5.383552 from -5.383659 and reduces the calculated skewness to -2.10 -11 When you request a confidence interval, it is possibl+ that lnskew0 will report the lower confidence interval as '. ', which should be taken as indicating the lower confidence limit kL = -oc. (This cannot happen with bcskew0.) As an example, consider a sample of size n on z and assume the skewness of z is positive, but not significantly so at the desired significance level, say 5%. Then no matter how large and negative you make kz,, there is no value extreme enough to make the skewness of ln(x - kL) equal the corresponding percentile (97.5 for a 95% confidence interval) of the distribution of skewness in a normal distribution of the same sample size. You cannot because the distribution of ln(z - kL) tends to that of zpapart from location and scale shift--as z --+ oc. This "problem" never applies to the upper confidence limit ku because the skewness of ln(:c - ku) tends to -co as k tends upwards to the minimum value of z.
Example In the above example, using lnskewO a natural
zero
and
we
are
shifting
variable such as temperature zero is indeed arbitrary.
that
with a variable like mpg is probably zero
arbitrarily.
measured in Fahrenheit
On
the
other
hrmd,
use
undesirable,
mpg has
of tnskew0
with
or Celsius would be more appropriate
a
as the
i i
_
lnskewO-- Find zerO-skewnesSlog or Box-Cox transform 215 For a var able like mpg it makes more sense touse the Box-Cox power transform (Box and Cox 1964): y(_)=
Y_-I
A is free to :ake on any ,,alue. but note that y(1) _ y_ bcskew0 works like 1:1skew0:
1. y(0) _ In(y), and yf-l_
= 1 - 1/y.
• bcs_ewO bcmpg = ipg, level(95) 4 i Transform (_pg'L-1)/L
L
[95_,Conf. Interval]
-. 3673283
-1. 212752
Skewness
,4339645
0001898
,i
It is worth n( ting that the _ _,%confidence interval i,cludes ), = - ] (A is labeled L in the output), which has a rather nore pteasin_ interpretation--gallons iper mile--rather than (mpg-'3673 - 1)/(-3673).• The confide;_ce interval, however, is calculated under the assumption that the power transformed
_
variable is rormally distributed. It makes perfect sense to use bcskewO even when one does not believe that the transforrred variable will be normallv distributed, but in that case the confidence interval is an approximaticn of unknown quality, ffone believes thai the transformed data are normally <1
i
distributed, me can alterwatively use boxcox to egtimate 3,: see [R] boxeox.
Saved ReslJits .
lnskewO and bcskew(, save in r() Scalars
!
mma)
h-(InskewO)
r(g_ !} },.
r (I__bda) r(ll_) r(ul_)
)_(bcskewO) lower bound of tonfidence interval upper lxmnd of confidence interval
r(sl:ewness)
resulting skewness of transformed
variable
Methods armdFormulas lnskewOiand
bcskew(
are implemented as ad0-files.
i
Skewness is as calcul_ :ed by summarize: see [R] summarize. Newton's method with numeric. uncentered cerivatives is _sed to estimate k (lnskew0) and )_ (bcskew0?. In the case of lnskew0.
i 1
lhe initial value i_ chosen so that the minimum of :r - k is l and thus ln(z with )_ = 1
Acknowle lnskewOiand
k) is 0. bcskewO star1_
ment bcskewC were written by Patrick Royston of the
MRC
Clinical Trial< [_nit. London.
216
Inskew0 m Find zero-skewness log or Box-Cox transform
References Box. G. E. E and D, R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26: 211-243.
Also See Related:
[R] boxcox, JR] swilk
Complementary:
[R] ladder
•
Title log -- E_:ho copy of
!
Syntax:
_
l_og
!
_
usi,g
to file or device
filename
, append
{ol
!
_eplace
[ Zext
smcl ] ]
}
cmdl og cmdlog _sing filenal,e [, append replace ] I
cmdlog
!
set
I
k
{onlofflcl
log_ype
{text_
set lin_size
lffilename
se}
smcl}
#
is ._pecified withoul an extension, one of the suffixes .smcl..log.
The extensibn
.smct
or .txt
is added.
or ,] og is added by log depending on whether the file format is SMCL or ASCII text
The extensibn .txt is add,',dby .cmdlog. In addition to 'ommandlog. _oumay accessthe capabilitie,_of log by pullingdow_ File and choo_ingLog.
Description
[
log allowis you to mal]e a full record of your Stata session. A log is a file containing what you !_
type a_d Staia's output. 1 cmdlog 4lows you to!make a record of what you type during your Stata session. A command
; i
log contains bnly what yo type and so is a subse! of a full log. You can rOake full logs md command logs simttltaneously, one or the other, or neither. Neither is
!
produced un_l you tell St;.ta to start I¢>gging. Command logs are ah,,ays straight ASCII text files and this makes them easy to convert into
, !
do-files. (In ihis respect, it would make more sen.4e if the default extension of a command log file was .do be&use commaz]d lo_osare do-files. The default is .txt. nOt .do. howe_er, to keep you
i i
from accidenialty overwriting your important do-files.) Full logs !are recordedlin one of two formats: SMCL (Stata Markup and Control Language) or
_° i
text (meaning ASCII). The default is SMCL. but s_t logtype can change that, or you can specify an option to state the forrrm you wish. We recommend SMCL because it preserves fonts and colors. SMCL logs c_n be convert,_d to ASCII text or to other formats using the translate command: see [R] translate; translate can also be used to produce printable versions of SMCL IO_S.or you can print SMCL l_gs by pullin_ down File and choosing Log. SMCL logs can be viewed in the viewer, as can any file: !see [R] view.
: ! i
21_
_
zl _
tog -- Ec.o copy of session to file or device
log or cmdlog,
typed without arguments, reports the status of logging.
log using and cmdlog using open a log file. log close and cmdlog close close the file. Between times, log off and cmdlog off, and log on and cmdlog on can temporarily suspend and resume logging. set logtype specifies the default format in which full logs are to be recorded. Initially, full logs are set to be recorded in SMCL format. set linesize specifies the width of the screen currently being used (and so really has nothing to do with logging). Few Stata commands currently respect linesize, but this will change in the future.
Options Optionsfor use with both log and logcmd append specifies that results are to be appended onto the end of an already existing file. If the file does not already exist, a new file is created. replace specifies that filename, if it already exists, is to be overwritten, and so is an alternative to append. When you do not specify either replace or append, the file is assumed to be new and an error message is issued and logging is not started.
Options for use with log text and smcl specify the format in which the log is to be recorded. The default is complicated describe but is what you would expect: If you specify the file as filename.smcl, (regardless of the value of set logtype).
to
then the default is to write the log in SMCL format
If you specify the file asfilename, log, then the default is to write the log in text format (regardless of the value of the set logtype). If you type filename without an extension and specify neither the smcl or text options, the default is to write the file according to the value of set logtype. If you have not reset set logtype, then that default is SMCL. In addition, the filename you specified will be fixed to read filename, sracl if a SMCL log is being created or fiiename, log if a text log is being created. If you specify either of the options text or smcl, then what you specify determines how the log is written. Iffilename was specified without an extension, the appropriate extension is added for you.
Remarks For a detailed explanation
of logs, see [U] 18 Printing
and preserving
output.
Note that when you open a fulI log, the default is to show the name of the file and a time and date stamp: log
using
log
log: type :
opened
L
on:
myfile
C: \data\proj smcl 12 Dec
2000,
Ikmyfile. 19:28:23
smcl
log _ Echo copy of sessionto file or device
i
219
The above information ' ,'ill appear in the log. If you do not want this information to appear, precede
i
the comm_ nd by quiet . qu etly
l
quietly
!
Ly:
log using myfile
'_ill not suppr,;ss an}, error messages qr anything else you need to know.
i
Simila@ when you :lose a fuel log, the default is to show the full information:
i
. lo*
I
i
close
i log- c:\_t_\proj l\_y_ile, s_l
clo!ed on
12
c 2000,
12:32:41
and lhat information wili appear in the log. If you want to suppress that, type quietly log close,
i I
SavedReSults log
and cmdlog sav_ in r ()" Macros
i
r (filename) name of file I
AlsoSee
} _ i {
I
! i
r(s_atus)
on or off
r(type)
text or smcl
i
Complemehtary:
[Ri translate; [R] more, [R] query
Baekgrounh:
17 Logs: Printing and saving output [G:;W] 17 Logs: Printing and saving output, [G:;U] 17 Logs: Printing and saving output, [G!M]
[U 10 -more-- conditions, [Ui 14.6 File-naming conventions, [UI 18 Printing and preserving output
' 1
Title [ I°gistic
-- L°gisfic , regressi°n
,
i
t
Syntax logistic
depvar varlist [weight]
cluster
(varname)
maximize_options lfit
[depvar] all
lstat lroc all
[depvar]
genprob
asis
[if
exp]
[in
range]
[if exp] _in range]
[weight]
[if
(varname)
coef
[, group(#)table
outsample
expl
[in
[. cutoff(*)all
range]
[, nograph
beta(mamame)
]
graph_options
]
[weight]
beta(matname)
offset
robust
]
[weight]
(varname)
[, _level(#)
]
[weight]
beta(mamame)
Isens [depvar]
all
score (newvarname)
beta(matname) [depvar]
[i£ exp] [in range]
[if
gensens
expl
[in
(varname)
range]
[. nograph
genspec
(varname)
graph_options replace
]
by ... : may be used with logistic; see [R] by. logistic allows fweights andpweights; lfit, lstat, lro¢, and lsens allow only fweights; see [U] 14.1.6weight. logistic shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands. logistic may be used with sw to perform stepwise estimation; see [R] sw.
(Continued
on next page)
220
yntax fir predict predict
[type] ,ewvarname [if exp] [in range] [, statistic rules
asif
nooffset
]
where slatistic is p xb strip * d_eta * deviance
, ___2 , ddeviemce , hat , number r,esiduals , rstandard
probability of a positive outcome (the default) xib, fitted values standard error of the prediction Pregiborl(198t) A 13influence statistic deviance residual Hosmer and Lemeshow (1989) A X2 influence statistic Hosmer and Lemeshow (1989) A D influence statistic Pregibon (1981) leverage sequential number of the covariate pattern Pearson residuals; adjusted for number sharing covafiate pattern standardized Pearson residuals: adjusted for number sharing covariate pattern
Unstarred statistics are available both in and out of sample; type predict ... if e(sataple) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.
DeScription logisticesumates a logistic regression of _lepvaron vartist, where depvar is a 0tl variable lot, more precisely, a 0/non-0 variable). Withoutarguments, logistic redisplays the last logistic estimates, logistic displays estimatesas odds ratios; to view coefficients,type logit after running logistic. To obtain odds ratios for any covariate pattern relative to another, see JR] lineom. ].fi_ displays either the Pearson goodness-of-fit test or the Hosmer-Lemeshow goodness-of-fit test is'_at displays various summary statistics, including the classification table, lroc graphs and calculates the area under the ROe curve. lsens graphs sensitivity and specificity versus probability cutoff and optionally creates new variables containing these data lfit, lstat, lroc, and lsens can produce Statisticsand graphs either for the estimation sample or for;any set of observations. However, they always use the estimation sample by default. When weights, if, or in are used with logistic, it ig not necessary to repeat them with,these commands when you want statistics computed for the estimition sample. Specify if, in. or the all option onb,' whe_nyou want statistics computed for a set of obsen_ationsother than the estimation sample. Specify wmghts (only fweights are allowed with these commands) only when you want to use a different set oftweights. Bydefault, if it. lstat, lroc, and lsens use the fastmodelestimated by logistic. Alternatively, the model can be specified by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to ttie commands, The lfit,
lstat,
lroc. and lsens commands may also be used after logit
or probit.
Here is a list of other estimation commands that may be of interest. See |R] estimation commands for a complete list. See Gould (2000_for a discussion of the interpretation of logistic regression.
I)
222
lOgistic --
LOgiStiC regression
blogit
[R] glogit
Maximum-likelihood logit regression for grouped data
bprobit
[R] glogit
Maximum-likelihood probit regression for grouped data
clogit
[R] ciogit
Conditional (fixed-effects) logistic regression
cloglog
[R] cloglog
Maximum-likelihood complementary log-log estimation
glra
[R] glm
Generalized linear models
glogit
[R] glogit
Weighted least-squares togit regression for grouped data
gprobit
[R] glogit
Weighted least-squares probit regression for grouped data
heckprob
[R] heekproh
Maximum-likelihood probit estimation with selection
hetprob
[R] hetprob
Maximum-likelihood heteroskedastic probit estimation
logit
IR] logit
Maximum-likelihood logit regression
mlogit
[R] mlogit
Maximum-likelihoo_l multinomial (polytomous) logistic regression
nlogit
[R] nlogit
Maximum-likelihood nested logit estimation
ologit
[R] ologit
Maximum-likelihood ordered logit regression
oprobit
[RI oprobit
Maximum-likelihood ordered probit regression
probit
[R] probit
Maximum-likelihood probit regression
scobit
[R] scobit
Maximum-likelihood skewed logit estimation
svylogit
[R] svy estimators
Survey version of logit
svymlogit
[R] svy estimators
Survey version of mlogit
svyologit
[R] svy estimators
Survey version of ologit
svyoprobit
[R] svy estimators
Survey version of oprobit
svyprobit
[R] svy estimators
Survey version of probit
xtclog
[R] xtclog
Random-effects and population-averaged ctoglog models
xtlogit
[R] xtlogit
Fixed-effects, random-effects, and population-averaged
xtprobit
[R] xtprobit
Random-effects and population-averaged
xtgee
[R] xtgee
GEE population-averaged generalized linem: models
logit models
probit models
Options Optionsfor logistic level
(#) specifies
the confidence
or as set by set
level;
level, in percent,
for confidence
see [U] 23.5 Specifying
the width
intervals.
Tile default
of confidence
is level
(95)
intervals,
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates, robust combined with cluster() be independent
allows between
If you specify
pweights,
cluster(varname) not necessarily cluster(personid) estimated estimated
observations clusters).
which
are not independent
robust
is implied;
see [U] 23.13
specifies within
that
groups, in data
the
observations
Weighted
independent
(although
they must
estimation. across
groups
varname specifies to which group each observation with repeated observations on individuals, cluster()
standard errors and variance-covariance coefficients; see [U] 23.11 Obtaining
used with pweights to produce command in [R] svy estimators
are
within cluster
matrix of the estimators robust variance estimates,
(clusters)
but
belongs; e.g., affects the
(VCE) but not the cluster() can be
estimates for unstratified cluster-sampled data, but see the svylogit for a command designed especially for survey data.
logistic-- Logisticregremsion c_hister () implies robust' by itself.
specifying robust cluster()
is equivalent to typing cluster
223 ()
scorei(newvarname) creates newvar containing uj = 01nLj/0(xjb) for each observation j in the sample. The score vector is _ 01nLj/ab = _ujxj; i.e., the product of ne_'var with each covariate summed over observations. See [U] 23.12 Obtaining scores. asis forces retentionof perfectpredictor variables and their associatedperfectly predictedobservations and may produce instabilities in maximization; see [R] probit (sic). offset (varname) specifies that varname is to be included in the model with coefficientconstrained tolJe 1. coef causes logistic to report the estimated coefficients rather than the ratios (exponentiated coefficieJas),coef may be specified when the _odel is estimated or used later to redisplay results. c0ef affects only how resuks are displayed ahd not how they are estimated. marimize_options control the maximization process; see [RI maximize. You should never have to specify Uhem.
'
Options!forlilt, Istat,troc,andIsens group(#) (ifit onl_y)specifies the number of quantiles to be used to group the data for the Hosmer-Lemeshow goodness-of-fit test. groqp(lO) is typically specified. If this option is not _iven, ttie Pearson goodness-of-fit test is computed using the covariate patterns in the data as groups.
I_
table (If it only) displays a table of the groups used for the Hosmer-Lemeshow or Pearson goodness-of-fit test '_,ithpredicted probabilitieS,observed and expected counts for both outcomes. anldtotal_ for each group, oulzsample (lfit only) adjusts the degrees of freedom for the Pearson and Hosmer-Lemeshow goodness-of-fittests for samples outside of the estimation sample. See the section Samples other thsn_estimation sample later in this entry. all requests that the statistic be computed for all observations in the data ignoring any if or in restrictions specified with logistic. beta(matn_lme) specifies a row vector containing coefficients for a logistic model. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See the section Models o/her than last estimated model later in this entry. cutoff (#) (1star only) specifies the value for determining whether an observation has a predicted positive outcome. An observation is classified as positive if its predicted probability is > #. The default is_0.5. nograph (1roe and lsens) suppresses graphical output. eraph_options (1roe and lsens_ are any of the options allowed with graph, lzwoway;see [G] graph options. genprob (va'rname). gensens (varname), and gelaspec (varname) (lsens only) specily the names of new variables created to contain, respectively, the probability cutoffs and the corresponding ser_sitivityand specificity. replace (lsens only) requests tha) if existing variables are specified for genprob(), or geaspec (), they should be ovem'ritten.
'i
gensens(),
1
Optionsfor predict p, the default, calculates the probability
of a positive outcome.
xb calculates the linear prediction. std_p calculates the standard error of the linear prediction. dbeta calculates the Pregibon (1981) A_ influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer and Lemeshow (1989)jargon, this statistic is M-asymptotic, that is, adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer and Lemeshow (1989) AX2 influence statistic reflecting the decrease in the Pearson X2 due to deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer and Lemeshow (1989) AD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern. hat calculates the Pregibon (1981) leverage or the diagonal elements of the hat matrix adjusted for the number of observations that share the same covariate pattern. number numbers the covariate patterns observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered l, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer and Lemeshow for the number of observations that share the same covariate pattern.
(1989) and adjusted
rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow and adjusted for the number of observations that share the same covariate pattern.
(1989)
rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing fot excluded observations. See JR] legit for an example. asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameter from the model. See [R] logit for an example. nooffset is relevant only if you specified offset (vamame) for logistic. It modifies the calculations made by predict so that they ignore the Offset variable: the linear prediction is treated as x3b rather than x ab + offset a.
Remarks Remarks are presented under the headings logistic and logit Robust estimate of variance lilt lstat lroc lsens Samples other than estimation sample Models other than last estimated model predict after logistic
]
1 225
- uxjmtm lOgisciadd Iogit i
logistic provides an alternative and preferr_ way to estimate maximum-likelihood logit models, the other Choice being logit described in [R] iogit, First, let us dispose of some confusing terminology. We use the words logit and logistic to mean the same thing: maximum likelihood estimation. To some, one or the other of these words connotes trarisfOrming the dependent variable and using weighted least squares to estimate the model, but that is riot ho'& we use either word here. Thus, the logit and logistic commands produce the same res_tlts, The logistic command is generally preferred to logit because logistic presents the estimates in terms 6f odds ratios rather than coefficients. To a few, this may seem a disadvantage, but you can type logb:t without arguments after logistic to see the underlying coefficients. Nevertfieless. [R] log'it is still worth reading because logistic shares the same features as logit. incl_ud_ngomitting variables due to collinearity or one-way causation. For an introduction to logistic regression, see Lemeshow and Hosmer (1998) or Pagano and Gauvreau (2000, 470-487); for a thorough discussion, see Hosmer and Lemeshow (t989: second edition_ foghcoming in 2001).
> Example Colisidtr the following dataset from a study of risk factors associated with low birth weight des¢ribed ]n Hosmer and Lemeshow (1989, appendix 1). ., ddscribe Contains data from Ibw.dta ob_: 189 ,vaz]s: Ii :size:
_ari_ble name . ]
3,402 storage type
Hosmer _ Lemeshow data 18 Jul 2000 16:27
(95.!% of memory f_ee) di6play format
valu_ label
variable label
race
fd
int
_8,0g
identification code
]_bw v(ge l%_t Z_tce
byte byte int byte
_8.Og XS.0g _8.Og _8.0g
birth weight<25004 age of mother weight at last menstrual period race
s_nok_ 1_2 _t f_v
b_e byte byte byte byte
_8.04 _8.04 _8.04 _8.04 _,8.04
b_t
int
_8.04
, !
smoked during pregnancy prematttrelabor history (count) has history of hypertension presence, u_erine irritability number of visits to physician during let trimester birth weight (grams)
Sbrt_d by :
They _ant!to investigate thecausesoflow bi_hWeightInthis dataset, race isa categorical variable indicating _vhether a person is white (race = 1), black (race --- 2), or other (race -- 3). The authors want irldichtor_(dummy) variables for race included in the regression. (One of the dummies, of course, 'mu)st be omitted.) Thus, before we can _stimate the model, we must create the race dummy The_e ale a number of ways we could do this.: but the easiest is to let another Stata command, xi. do i! fdr uI. we type xi: in front of our logistic command and in our varlist include not race ) l
;; ._i:i
_o i. race,ug,suc -- cogmuc regression but to indicate we want the indicator
variables for this categorical
variable;
see [R] xi for
the full details. . xi: logistic low age lwt i.race smoke ptl ht ui i.race _Irace_l-3 (naturally coded; _Irace_l omitted) Legit estimates
Log likelihood =
-100.724
Number of obs LK chi2(8) Prob > chi2
= = =
189 33.22 0.0001
Pseudo K2
=
0.1416
,r
low
Odds Katie
Std. Err.
age lwt _Irace_2 _Irate_3
.9732636 .9849634 3.534767 2.368079
.0354759 .0068217 1.860737 1.039949
ptl smoke ht ui
1.719161 6.249602 2.517698 2.1351
.5952579 4.322408 1.00916 .9808153
z
P>lz[
[95_ Conf. Interval]
-0.74 -2.19 2.40 1.9_
0.457 0.029 0.016 0.050
.9061578 .9716834 1.259736 1.001356
1.045339 .9984249 9.918406 5.600207
1.5: 2.36i 2. 1.6_
0.118 0.021 0.008 0.099
.8721455 1.611152 1.147676 .8677528
3.388787 5.523162 24.24199 5.2534
The odds ratios are for a one-unit change in the variable. If we wanted the odds ratio for age to be in terms of 4-year intervals, we would gen age4 = age/4 . xi: logistic Io_ age4 lwt i.race smoke ptl ht ui (ou_utomit_d)
After logistic,
we can type logit
to see the mode in terms of coet_cients
and standard errors:
logit Legit estimates
Log likelihood =
-100.724
low
Coef.
age lwt _Irace_2 _Irace_3 smoke
-.0271003 -.0151508 1.262647 .8620792 .9233448
ptl ht ui cons
.5418366 1.832518 .7585135 .4612239
Std. Err. .0364504 .0069259 .5264101 .4391531 .4008266
z
Number of obs LRchi2(8) Prob > chi2
= = =
189 33.22 0.0001
Pseudo R2
=
0.1416
P>Jzl
[95_ Conf. Interval]
-0.74 -2.t9 2.40 1.96 2.30
0.457 0.029 0.016 0.050 0.021
-.0985418 -.0287253 .2309024 .0013548 .1377391
.0443412 -.0015763 2.294392 1.722804 1.708951
.346249 .6916292 .4593768
1.56 2.65 1.65
0.118 0.008 0.099
-.136799 .4769494 -.1418484
1.220472 3.188086 1.658875
1.20459
0.38
0.702
-t.899729
2.822176
If we wanted to see the logistic output again, we would type logistic without arguments. <3
> Example You can specie' the confidence interval for the odds ratios with the level () option, and you can do this either at estimation time or when you replay the model. For instance, to see our previous models with narrower, 90% confidence intervals,
_
logistic-- Logistic reg_
!
• lqgistic, Logft
227
level(90)
estimates
Log likelihood
=
-100.724
Number of obs LR chi2 {8)
= =
Prob
=
0,0001
=
O. 1416
> chi2
Pseudo
R2
189 33.22
Robust low
Odds
age lwt _Irate
2
Ratio
Std.
Err.
.9732636 .9849634
.0329376 .0070209
z -0.80 -2.13
P>Izl
[95_, Conf.
Interval]
O. 423 O. 034
.9108015 .9712984
1.040009 .9988206
3.534767
I.793616
2.49
O.013
I.307504
9.556051
_Irace_3
2.368079
I.026563
1.99
O.047
i.012512
5. 538501
smoke
2. 517698
,9736416
2.39
O. 017
1.179852
5,372537
ptl ht
1. 719161 6.249602
.7072902 4.102026
1.32 2.79
O. 188 O. 005
.7675715 1. 726445
3. 850476 22. 6231
ul
2.1351
I. 042775
1.55
O. 120
.8197749
5. 560858
<]
RobuStestimateof variance If you specify robust. Stata reports the robust estimate of variance described in [U]23,11 Obtaining rob_ist variance estimates Here is the model previously estimated with the robust estimate of variance: xi: logistic
LOgi_
smoke _ptl ht
low age lwt i.race
i.rate
_trace_l-3 estimates
i_)g likelihood
ui, robust
(liaturaally coded;
_Irace_1
omitted)
Number of obs _ald chi2 (8)
=
-100.724
= =
189 29.02
Proh > chi2
=
0. 0003
Pseudo
=
0.1416
R2
Robust low
Odds Ratio
Std.
Err.
z
P>}zl
[957, Conf.
Interval]
0.423 0.034
.9108015 .9712984
1.040009 ,9988206
r
age lwt iIrace_2 JIrace_3 smoke ptl ht ui
.9732636 .9849634
.0329376 .0070209
-0.80 -2, 13
3. 534767 2. 368079
I.793616 1.026563
2.49 i.99
0.013 O.047
I.307504 1.012512
9.556051 5.538501
2. 517698
.9736416
2.39
0. 017
1.179852
5.372537
1.7t9161
.7072902
1.32
O. 188
.7675715
3. ff50476
6,249602 2. 1351
4. 102026 1.042775
2,79 I. 55
0.005 O. 120
1.726445 .8197749
22.6231 5.560858
Additionally, robust allows you to specify cluster() and is then able, within cluster, to relax the assumpiion of independence. To illustrate this, we have made some fictional additions to the low-birth-Weight data. Pretend [hat these data are not a random sample of mothers but instead are a random sample of mothers+from a random sztmple of hospitals. In fact, that may be true--we do not know the history of these_dam but we can pretend in any case.
i i
H0spital$ specialize and it would not be too incorrect to say that some hospitals specialize in more difficult cases. We are going to show two extremes. In one, all hospitals are alike but we are going to estimate bnder the possibility that they might differ. In the other, hospitals are strikingly different In bc/_hCases, we assume patients are drawn from 20 hospitals. In both examples, we will estimate the same model and we will type the same command to estim_ate!it. !Below are the same data we have been using but with a new variable hospid, which ident_fie_ frbm which of the 20 hospitals each patient was drawn (and which we have made up):
_r.o
" F_
,ug,_uu-
Logm.c
regressl0N
. xi: logistic low age lwt i.race smoke ptliht ui, robust cluster(hospid) i.race _Irace_1-3 (naturally coded; _Irace_l omitted) Logit estimates
Log likelihood =
-100,724
Number of obs Wald chii(8) Prob > chi2
= = =
189 49.67 0.0000
Pseudo R2
=
0.1416
(standard errors adjusted for clustering on hospid) Robust low
Ratio
Std. Err.
age lwt _Irace_2 _Irace_3 smoke
.9732636 .9849634 3,534767 2.368079 2.517698
.0397476 .0057101 2.013285 .8451325 .8284259
ptl ht ui
1.719161 6.249602 2,1351
.6676221 4.066275 1o093144
z
P>_zl
[957 Conf. Interval]
-0.66 -2.6!1 2.2_ 2.42 2.81
0,507 0,009 0.027 0.016 0.005
.898396 .9738352 1,157563 1.176562 1.321062
1.05437 .9962187 10,79386 4.766257 4.79826
1.40 2.82 1.48
0.163 0.005 0,138
.8030814 1,745911 .7827337
3.680219 22.37086 5.824014
The standard errors are quite similar to the standard etTors we have previously obtained, whether we used the robust or the conventional estimators. In this example, we invented the hospital ids randomly. Here are the results of the estimation with the same data but with a different set of hospital ids: . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid) i.race _Irace_l-3 (naturally coded; _Irace_1 omitted) Logit estimates
Log likelihood =
-100.724
Number of obs Wald chii(8) Prob > chi2
= = =
189 7,19 0.5167
Pseudo R2
=
0.1416
(standard errors adjusted for clustering on hospid) Robust Std. Err.
low
0dds Ratio
age lwt _Irace 2 _Irate 3 smoke
.9732636 .9849634 3.534767 2.368079 2.517698
.0293064 .0106123 3.120338 1.297738 1.570287
ptl ht ui
1.719161 6.249602 2.1351
.6799153 7.165454 1.411977
z
P>[zJ
[957 Conf. Interval]
-0,90 -1,41 1.43 1.57 1.48
0.368 0.160 0.153 0,116 0.139
.9174862 ,9643817 .6265521 .8089594 .7414969
1.032432 1.005984 19.9418 6.932114 8.548654
1.37: 1.60 1.15
0.171 0.110 0.251
.7919046 .660558 .5841231
3.732161 59.12808 7.804266
Note the strikingly larger standard errors. What happened? In these data, women most likely to have low-birth-weight babies are sent to certain hospitals and the decision on likeliness is based not just on age, smoking history, etc., but on other things that doctors can see but that are not recorded in our data. Thus, merely because a woman is at one of the centers identifies her to be more likely to have a low-birth-weight baby. So much for our fictional example. The rest of this' section uses the real low-birth-weight data. To remind you, the last model we left off with was
r"
logistic-- Loglstlcregression i
i •
229
,
• Xi: logistic low age lwt i.race smoRe ptl ht ui i._ace _Irace_1-3 (naturally coded; _Irace_l omitted) Logit estimates
Log likelihood =
Number of obs LR chi2(8) Prob > chi2 Pseudo R2
-100.724
low
Odds Ratio
age l'.*t _!race_2 _Irace_3 smoke
.9732636 .9849634 3.534767 2.368079 2. 517698
.0354759 .0068217 1.860737 1.039949 1. 00916
1.719161 6.249602 2.1351
.5952579 4.322408 .9808153
ptl ht ui
Std. Err.
z -0.74 -2.19 2.40 1.96 2.30 I.56 2.65 1,65
= = = =
189 33.22 0.0001 0.1416
P> iz I
[95_ Conf. l_terval]
O.457 O.029 O.O16 O.050 0.021
.9061578 .9716834 1.259736 1.001356 1. 147676
t. 045339 .9984249 9.918406 5.600207 5. 523162
O.118 0.008 O. 099
.8721455 1.611152 .8677528
3,388787 24.24199 5. 2534
lilt Ifit Computes i goodness-of-fit tests, either the Pearson X2 test or the Hosmer-Lemeshow i i
test.
By _de(ault, lfit. Istat,lroc, and lsenS compute statistics for the estimation sample using the llast rdodel estimated by logistic. However, samples other than the estimation sample can be spetifibd;t'l.seethe section Samples other than esflmation sample later in this entry. Models other thanthe )last mbdel estimated by logistic can also be specified; see the section Models other than last estimated )model
> Example
i 1
lfi_c, fyped without options, presents the Pearson X2 goodness-of-fit test for the .estimated model. The: Pdarsbn k 2 goodness-of-fit test is a test of:the observed against expected number of responses usir_g ¢elI_ defined by the covafiate patterns; see predict with the numbers option below for the defiiaitibn bfcovariate patterns. ._if_.t L_gi_tic model for low, goodness-of-fit test I number of observations = 189 _umi_r of covariate patterns = Pearson chi2(t73) = Prob > chi2 =
182 i_9.24 0.3567
Our mddell fits reasonably well. We should note, however, that the number of covafiate patterns is close td the{number of observations, making the applicability of the Pearson X2 test questionable, but not riec_ssa_ily inappropriate. Hosmer and Lemeshow (1989) suggest regrouping the data by ordering on the predicted probabilities and then forming, say, I0 nearly equal-size groups. 1fit with the group(_
o_tion does this:
.!l_i_, group(to) L_gis_ic model for low, goodness-of-fit test -_(_abl_collapsed on quantiles of estimated probabilities) i ) number of observations = 189 number of groups = :_ iHosmer-Lemeshow ohJ2(8) = i
Prob > chi2 =
10 @.65 0.2904
,
;
230 Logistic regression Again, welogistic cannot --reject our model. If you specify the tableoption, Ifit displays the groups along with the expected and observed number of positive responses (low-birth-weight babies):
_'
Ifit, group(lO) table Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) _Group
_Prob
_Obs_ i
_Exp_l
_Obs_O
_Exp_O
_Total
1 2
0.0827 O. 1276
0 2
1.2 2.0
19 17
17.8 17.0
3
0.2015
6
3.2
13
15.8
19
4
0.2432
1
4.3
18
14.7
19
5
O. 2792
7
4.9
12
14.1
19
6
O. 3138
7
5.6
12
13.4
19
7 8
O. 3872 0.4828
6 7
6.5 8.2
13 12
12.5 i0.8
19 19
9
0.5941
10
10.3
9
8.7
19
0.8391
13
12.8
5
5.2
18
10
number
of observations
number Hosmer-Lemeshow
=
189
of groups = chi2 (8) =
Prob
> chi2
19 19
i0 9.65
=
0.2984
q
Q Technical Note ifit with the group() option puts all observations with the same predicted probabilities into the same group. If, as in the previous example, we request 10 groups, the groups that lfit makes are [P0,Plo], (Pl0_P20], (P20,P30], -.-, (P90_Pl00], where Pk is the kth percentile of the predicted probabilities, with Po the minimum and Ploo the maximum. If there are large numbers of ties at the quantile boundaries, as will frequently happen if all independent variables are categorical and there are only a few of them, the sizes of the groups will be uneven. If the totals in some of the groups are small, the X 2 statistic for the Hosmer-Lemeshow test may be unreliable. In this case, either fewer groups shtutd be specified or the Pearson goodness-of-fit test may be a better choice. El
> Example The tableoption can be used without the group()option. We would not want to specify this for our current model because there were 182 covafiate patterns in the data. caused by the inclusion of the two continuous variables age and lwt in the model. As an aside, we estimate a simpler model and specify table with lfit: logistic Logit
Log
low
_Irate_2
_Irate_3
smoke
ui
estimates
likelihood low
= -107.93404 Odds
Ratio
Std.
Err.
z
Number of obs LR chi2(4) Prob > chi2
= = =
189 18.80 0.0009
Pseudo
=
0.0801
R2
P>Iz_
[95_
Conf.
1.166749
Interval]
_Irace_2
3.052746
1.498084
2.27
0.023
_Irace_3
2.922593
1.189226
2.64
0.008
1.31646
6.488269
2.945742
1.101835
2.89
0.004
1.41517
6.131701
2.419131
1.047358
2.04
0.041
1.03546
5.651783
smoke ui
7.987368
_
.
i
i
tf_t,
logistic-- Logistic regression
_Exp_O
Total
l
_1 12
O. 1230 0.2533
I3
4.9 1.0
373
35.1 3.0
404
!
!4, ':5
O. 2923 0.2997
15 3
12.6 3.9
28 10
30.4 9.1
43 13
i8 i9
O. 5087 O. 5469
2 2
1.5 4.4
1 6
I. 5 3.6
3 8
_0
O.5577
6
5.6
4
4.4
10
_I
0.7449
3
3.0
1
1.0
4
16 0.4978 !7 0.4998
__ro_p | _I i2 i3 14 i5 !6 17 i8 19 dO _I
I
231
tab
LCgi_tic model for low, goodness-of-fit test __rodp Prob _Obs_l _Exp_I _Obs_O
! I
4 4
4.0 4.5
_Prob O. 1230 O. 2533 O.2907 O. 2923 O. 2997 O. 4978 O. 4998 O.5087 O. 5469
_Irace_2 0 0 0 0 1 O 0 1 0
_irace_3 O O 1 O O 1 0 O 1
O.5577 O.7449
1 0
0 1
number of observations _umter of covariate patterns Pearson chi2(6) Prob > chi2
= = = =
4 5
4.0 4.5
smoke O 0 0 1 0 0 I O 1
ui 0 1 O O O 1 1 I 0
1 1
0 1
i
8 9
18_ 5.71 0.4569
3 Technical l_ote i I
,
tog_st_c and lfit keep track of the estimation sample. If you type logistic if x==l. then when y6u t)pe lfit the statistics will be calculated on the x==l subsample of the data automatically. t
You isho_hldspecify if or in with lfit only when you wish to calculate statistics for a set of : " i observaiion_ other than the estimation sample. Sde the section Samples other than estimation sample later m _h_ entry. If _.thez l_gistic model was estimated with iweights, 1fit properly accounts for the weights . { in it_ cdtcu[ations. (Note: ifit does not allow pweights.) You do not have to specify the weights when y6u Nn ifit.Weights should only be sp_ified with ifit when you wish {o use a different set of v_eig_ts.
(Continued on next page)
,
i J
FI
istat232
logistic -- Logistic regression
> Example istat presents the classification
statistics and classification
table after logistic.
• Istat Logistic
model
for
low True
Classified
I
Total +
l
Total
38
118
156
59 21
130 12
189 33
>=
Sensitivity
Pr(
+l D)
35,59_
Specificity
Pr(-I-D)
.5
90.77Z
+)
63.64_
-)
75.64_
-I D)
64.41_
Positive
predictive
value
Pr(
Negative
predictive
value
Pr(-DI
DI
False
+ rate
for
true
~D
Pr(+[~D)
False
- rate
for
true
D
Pr(
False
+ rate
for
classified
+
Pr(~DI
+)
36.36_
False
- rate
for classified
-
Pr(
-)
24.36_
default, Istat
Isens
-D
Classified + if predicted Pr(D) True D defined as low -= 0
Correctly
By
D
command
D[
9.23_
classified
uses can
Z3.54_
a cutoff of 0.5, although
be used
to review
you
the potenti_
can
vm-y
this with
cutoffs; see isens
the cutoff
() option. The
below.
q
iroc For other receiver operating characteristic
(ROC) commands and a complete description,
see [R] roe.
lroc graphs the (ROC) curvema graph of sensitivity versus one minus specificity as the cutoff c is varied and calculates the area under it. Sensitivity is the fraction of observed positive-outcome cases that are correctly classified; specificity is the fraction of observed negative-outcome cases that are correctly classified. W-hen the purpose of the analysis is classification, one must choose a cutoff. The curve starts at (0, 0), corresponding to c = 1, and continues to (l, t ). corresponding to c -= 0. A model with no predictive power would be a 45° line. The greater the predictive power, the more bowed the curve, and hence the area beneath the curve is often used as a measure of the predictive power. A model with no predictive power has area 0.5: a perfect model has area 1. The ROC curve was first discussed in signal detection theory (Peterson. Birdsall. and Fox 1954) and then was quickly introduced into psychology (Tanner and Swets I954) It has since been applied in other fields, particularly medicine (for instance, Metz 1978). For a classic text on ROC techniques, see Green and Swets (1974).
i
i
logistic-
Logisticregression
233
•_ Exampl ROC;cuVves are typically used when the poinf; of the analysis is classification, which it is not in _
our io_,-bi_th-welght model. Nevertheless, the R0C curve is • i lr!c L_gi_tic i
-
model
for low
n_mb_r of observations a_ea Iunder ROC curve
, I
I
:
189 0.7462
: ;
Area
i
under
ROC curve
= 0.7_62
i O0
/
/
/
6.75
t"
/
0,50
We see ithal the area under the curve is 0.7462.
isens '
q
_ i
lseris also plots sensitivity and specificity; it plots both sensitivity and specificity versus probability cuto_.fc_ T_e graph is equivalent to what you would get from Istat
if you varied the cutoff probability
from! 0io_. • lse_s
Sensi'_ivify
i
i001
"6
0.75
,
o Specificity
i
I
!
I
1
m t_ 050 0,25
i-
i
t
o.00!
t P_obabihty
1 Cutoff
;_ I
Isens will optionally create new variables specificity. 234 logistic -- Logistic regression isens,
genprob(p)gensens(sens)
containing
genspec(sp_c)
the probability
cutoff,
sensitivity,
and
nograph
Note that the variables created will have M + 2 distinct nonmissing covariate patterns plus one for c = 0 and another fore = I.
values, one for each of the M
Samples other than estimation sample Ifit, Istat,Iroc, and Isens can be used withl samples other than the estimation sample. By default, these commands remember the estimation sample used with the last logistic command. To override this, simply use an if or in restriction to select another set of observations, or specify the all option to force the command to use all the observations in the dataset. If you use lfit with a sample that is completely different from the estimation sample (i.e., no overlap), you should also specify the outsample option so that the X 2 statistic properly adjusts the degrees of freedom upward. For an overlapping sample, the conservative thing to do is to leave the degrees of freedom the same as they are for the estimation sample.
> Example Suppose that we wish to develop a model for predicting low-birth-weight babies. One approach to developing a prediction model would be to divide our data into two groups, a developmental sample and a validation sample. See Lemeshow and Le Gall (1994) and Tilford et al. (1995) for more information on developing prediction models and severity scoring systems. We will do this with the low-birth-weight the data into two samples. .
use
lbw,
(Hosmer . set
data)
1
r = uniform()
gen • sort gen
r group
(95 mlssing
= I if _n <= _N/9 values
. replace
group
(95 real
changes
generated)
= 2 if group == . made)
Then we estimate a model using the first sample (group==l), • xi: logistic i.race Legit
Log
First, we randomly divide
clear
_ Lemeshow
seed
data we considered previously.
low
age lwt i.race _Irace_l-3
estimates
likelihood
= -44.293342
smoke
our developmental
ptl ht ui if group==l (naturally coded; _Irace_l
sample.
omitted)
Number of obs LR chi2 (8) Prob > chi2
= = =
94 29.14 0.0003
Pseudo
=
0.2475
R2
°
;
logistic -- Logisticregression
! i
low
Odds Ratio
age
.91542
Std°
z
P>lzl
[95Z Conf.
Interval]
.0553937
-t. 46
O, 144
.8130414
1.03069
3.78442 .01t2295
2.17 -_.25
O. O. 030 025
1. 170327 .9526649
21.90913 .9966874
I
_I:[ace_21 i lwt smoke
.909912
.5252898
-0. I6
O. 870
.2934966
2.820953
l °_ f
i ptl _i_ace_3 i ht ! ui
3. 033543 2.606209 21.07656
1.507048 1.657608 22.64788
_, 23 t. 51 2.84
.988479
.6699458
O. 025 O. 132 O. 005 O. 986
1 • 145718 .7492483 2. 565304 .2618557
8.03198 9.065522 173. t652 3.731409
i!
i i [
To test
_
5.063678 .9744276
Err.
-0,02
_" " " c_hbratlon m the developmental sample, the Hosmer-Lemeshow
goodness-of-fit test is
I group (10) calcul!te l_it d u_ng ifit. for
Log_ist_c model low, goodness-of-fit t_st (T_bleicollapsed on quantiles of estimated probabilities) Inumber of observations = 94 I
10
number of groups = osmer-Lemeshow chi2 (8) = Prob > chi2 =
6_67 0_5721
Note tha we did not specify an if statement _-lth Ifit since we wanted to use the estimation sample. ;inqe ! the test is nonsignificant, we are satisfied with the fit of our model. Running _roc gives a measure of the discrimination: .
oc,
nograph
Logistic model for low 94 0.8158
number of observations = ar_a u.%derROC curve =
No_: we lest the calibration of our model by lJerforming a goodness-of-fit test on the validation sample. We _pecify the outsampleopUon so that the degrees of freedom are 10 rather than 8, Lo@istic . !fitI ifmodel group==2, for low, group(iO) goodness-of-fit table outsample t_st (Table collapsed on quantiles of estimateflprobabilities) _G_ou_ ' II
_Prob 0.0725 O. 1202 O. 1549 O.1888
_Obs_l 1 4 3 1
_ExpI 0.4 O.8 1.3 1.5
_Obs_O 9 5 7 8
_Exp_O 9.6 8.2 8.7 7.5
_Total 10 9 I0 9
O. 3258
4
2.7
5
I
I
O.2609 O.42t7
3 2
2.2 3.7
7 8
6.3 6. 3 7.8
9 10 10
I
181
O. 6265 0.4915 0.9737
4 3 4
5.5 4.1 7.1
6 65
4.5 4.9 1.9
10 9 9
I
235
I number of observations nu.iIlber groups [ of osmer-Lemeshow chi2(lO) Prob > chi2
=
95
= = =
I0' 28',03 0.0018
.._
We must acknowledge that our model does not fit well on the vaIidation 236 logistic -- Logistic discrimination in the validation regression sample is appreciably lower as well. • iroc
if group==2,
Logistic number i
area
model
nograph
for
low
of observations
under
ROC
sample. The model's
curve
=
95
=
O. 5839
,
q
Models other than last estimated model By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. specify other models using the beta() option.
One can
i> Example Suppose that someone publishes the following logistic model of low birth weight: Pr(low
= 1) = F(-0.02
age - 0.01 lwt + 1.3 black
where F is the cumulative logistic distribution. are the equivalent of what logit produces.
+ 1.1 smoke + 0.= ptl
Note that these coefficients
-t 1.8 ht + 0.8 ui + 0.5) are not odds ratios: they
We can see whether this model fits our data. First, we enter the coefficients as a row vector and label its columns with the names of the independent variables plus _cons for the constant (see [el matrix define and [P] matrix rowname). matrix
input
• matrix
b =
colnames
C-.02
-.01
b = age
lwt
1.3 black
1.1
.5 1.8
smoke
.8 .5)
pt]. ht
ui
_cons
We run lfit using the beta() option to specify b. The dependent variable is entered right after the command name, and the outsample option gives the proper degrees of freedom. . ifit
low,
Logistic (Table
beta(b)
model
for
collapsed number
group(lO) low,
goodness-of-fit
on quantiles
of observaZions
number Hosmer-Lemeshow
outsample
=
of groups = chi2 (I0) =
Prob
> chi2
:
Although the fit of the model is poor, lroc •iroc
low,
Logistic number area
beta(b)
model of
under
for
probabilities)
189 i0 27.33 0.0023
shows that it does exhibit some predictive
ability.
nograph low
observations ROC
test
of estimated
curve
= =
189 0.7275
q
logistic-- Logisticregression
237
predict after logistic p#edictis used after logisticto obtain predicted probabilities, residuals, and influence statistics for tt_e ostimation sample. The suggested diagnostic graphs below are from Hosmer and Lemeshow (1989). Where they are more elaborately explained. Also see Collett (1991. 120-t60) for a thorough discussion cjf model checking.
predict _wiihout options Typing p_edict p after estimation calculates _he predicted probability of a positive outcome. We ptevibusly ran the model logisticlow age Ivt _Irace_2 _Irace_3 smoke ptl ht ui. We obtain tl_epredicted probabilities of a positive outcome by typing • _re4ict
P
(o]_ti_n
p assumed;
• dum_arize
Pr(tow))
p low Obs
V_iable p low
189 189
Mean .3121693 .3121693
Std. Dev. .1918915 .4646093
Max
Hin .0272559
.8391283 0
I
predibt _vit_ the xb and stdp options predict with the xb option calculates the linear combination xjb, where xj are the independent variaNes! in he jth observation and b is the estimated parameter vector. This is sometimes known as the incle_ fu _ction since the cumulative distribution function indexed at this value is the probability of a _siiive outcome. With the _tdp option, predict calculates the standard error of the prediction, which is not adjusted tbr replidated covariate patterns in the data. The itifluence statistics described below are adjusted for replicated c(_variate patterns in the data.
predict Wit_ the residuals option predict _can calculate more than predicted probabilities. The Pearson residual is defined as the square root bf the contribution of the covariate pattern to the Pearson X2 goodness-obfit statistic. signed adcor_ing_to whether the observed number of positive responses within the covanate pattern is less th_n Or greater than expected. For instance, lz_red_ct N_rize
r,
residuals r, detail Pearson
i '
residual
_ercentiles
Smallest
IZ_
-1.750923
-2.283885
5_
-1.129907
-1.750923
IOZ
-,9581174
-1.636279
Obs
25_:
-,6545911
-1.636279
Sum of
50Z
-.3806923
189 Wgt.
Mea_ Largest 2.23879
Std.
189 -.0242299
Dev.
,9970949
75_
.8162894
90ZI
1.510355
2.317558
Variance
.9941981
95Z 99Z
1.747948 3.002206
3,002206 3,126763
Skewness Kurtosis
.8618271 3.038448
238 notice logistic-Logistic We the prevalence of aregression few, large positive residuals: t
'"
• sort list
r id r 10w
p age
race
in -5/1
185.
id 33
r
low 1
186.
57
2.23879
1
187.
16
2.317558
1
188.
77
3.002206
1
189.
36
3.126763
1
2.224501
p
age 19
race white
,166329
15
white
.1569594
27
other
.0998678
26
white
.0927932
24
white
.1681123
predict with the number option Covariate patterns play an important role in logistic regression, Two observations are said to share the same covariate pattern if the independent variables for the two observations are identical. Although one thinks of having individual observations, the statistical information in the sample can be summarized by the covariate patterns, the number of observations with that covariate pattern, and the number of positive outcomes within the pattern. Depending on the model, the number of covariate patterns can approach or be equal to the number of observations or it can be considerably less. All the residual and diagnostic statistics calculated by Stata are in terms of covariate patterns, not observations. That is, all observations with the same covariate pattern are given the same residual and diagnostic statistics. Hosmer and Lemeshow (1989) argue that such "M-asymptotic" statistics are more useful than "N-asymptotic" statistics. To understand the difference, think of an observed positive outcome with predicted probability of 0.8. Taking the observation in isolation, the "residual" must be positive--we expected 0.8 positive responses and observed 1. This may indeed be the "correct" residual, but not necessarily. Under the M-asymptotic definition, we ask how many successes we observed across all observations with this covariate pattern. If that number were, say, 6, and there were a total of 10 observations with this covariate pattern, then the residual is negative for the covariate pattern we expected 8 positive outcomes but observed 6. predict makes this kind of calculation and then attaches the same residual to all observations in the covariate pattern. Thus, there may be occasions when you want to find all observations number allows you to do this: predict
pattern,
• summarize
We
number
pattern
Variable
Obs
pattern
189
previously
estimated
Mean
89.2328
the model
ui over 189 observations.
predict
sharing a covariate pattern.
logistic
Std.
Dev.
Min
53. 16573
low
age
1
lwt
_Irace_2
Max
182
_Irace_3
smoke
ptl
ht
There are 182 covariate patterns in our data.
with the deviance
option
The deviance residual is defined as the square rooT:of the contribution to the likelihood-ratio test statistic of a saturated model versus the fitted model. It has slightly different properties from the Pearson residual (see Hosmer and Lemeshow, 1989): predict
d,
de_iance
•
......
........
logistic-- Logistic regression Summarize
Percentiles
5_
residual
Smallest
-1.843472
1_
-1.911621
-i. 33477
-1.843472
10_ '_
-!. 148316
-I .843472
Dbs
25_
-.8445325
-1.674869
Sum of _/gt.
50_
-.5202702
Mean Largest
189 189 -. 1228811
Std. Dev.
I.049237
175_ : 90_
.9129041 1,541558
1.894089 1. 924457
Variance
1. 100898
!95_ !99_
i.673338 2. 146583
2. 146583 2. 180542
Skewness Kurtosis
.6598857 2. 036938
predict With the rstandard option z PearsOn residuals do not have a standard deviation equal to t. a fine point, rstandard Pearson _esiduals normalized to have expected: standard deviation equal to !. i redict
rs,
ummarize
generates
rstandard r rs
Variable
Obs
Mean
Std. Dev,
Mix
Max
r
189
-.0242299
.9970949
-2.283885
3. 126763
rs
189
-.0279135
i,026406
-2.4478
3.149081
I
'
• +rrelate
i(o=189>
r rs r
r
t, 0000
rs
O. 9998
rs
1. 0000
Rememblr that we previously created r containing the (unstandardized) Pearson residuals, In these data, wh&her you use standardized or unstandardized residuals does not much matter, I
pred_t
ith the hat option
, : _
hat @culates the leverage of a covariate pattern a scaled measure of distance in terms of the in_tep_ndent variables. LaNe values indicate covariate patterns "far" from the average covariate patlern--_patterns that can have a large effect on the estimated model even if the corresponding residual i[ small. This suggests the following:
[
}
239
d, detail deviance
[
i
i
(Continuefl on next page)
,_
240
.
predict logistich, graph h r,
hatLogis_c regression border yline(O) ylab xlab
°g 15
o
_,
o o000
0
o
o
cO0
oooo
oo
I
o
o
_,
o °°
o
o
o
o
_
vj °
oo
O-
Pearson
residual
The points to the left of the vertical line are obserx,_ed negative outcomes: in this case, our data contain almost as many covariate patterns as observatiens, so most covariate patterns are unique. In such unique patterns, we observe either 0 or 1 success and expect p, thus forcing the sign of the residual. If we had fewer covariate patterns, which is to say, if we did not have continuous variables in our model, there would be no such interpretation arid we would not have drawn the vertical line at O. Points on the left and right edges of the graph represent large residuals--covariate patterns that are not fitted well by our model. Points at the top of our graph represent high leverage patterns. When analyzing the influence of observations on the model, we are most interested in patterns with high leverage and small residuals patterns that might otherwise escape our attention. predict with the dx2 option There are many ways to measure influence of which hat is one example, dx2 measures the decrease in the Pearson X 2 goodness-of-fit statistic that would be caused by deleting an observation (and all others sharing the covafiate pattern):
(Continued
on next page)
{
}
• _re_ict graph
dx2,
dx2
dx2 p, border
ylab
xlab
Paraphrasing Hosmer and Lemeshow (1989), the points going from the top left to the bottom right, correspond to covariate patterns with the number of positive outcomes equal to the number in the group; the points on the other curve correspond to 0 positive outcomes. In our data. most of the covariale patterns are unique, so the points tend to lie along one or the other curves: the points that are off' the curves correspond to the few repeated covariate patterns in our data in which all the outcomes a_e not the same.
! i
We exa_ainethis graph for large values of dx2--there are two at the top left. I i
i
predct w th the ddeviance option Anothel measure of influenceis the change in _thedeviance residuals due to deletion of a covarJate pattern:,
!
pr_dict As
with
d_2,
dd, ddeviance one
typically graphs ddevi_uce:
against the
probabi]ir}, of a positive outcome.
We
direct you ito Hosmer and Lemeshow (I989) foran example and the interpretation of this graph. predi_ With the dbeta option One_of the more direct measures of influence of interest to model fitters is the Pregibon (1981} ,me tsure, a measure of the change in the!coefficientvector that would be caused by deleting an observ_tion (and all others sharing the covartate pattern): dbe_a
i
(Continued on next page}
. predict
242
db, dbeta
logistic -- Logistic regression
ilt!
graph
db p, border
ylab
xlab
I
.75
{
I
J
I
"
-o p
o
_ _
o o o o
o
o
oo
o
o
c_mlmm_
.25 t
o
Q
o
o
_o
a_l_J_,::_o
o_Oo
o
o
o
J
o
eOo_
_
''
o 1
"T
Pr(low)
One observation .
sort
• list
has a large effect on the estimated coefficients.
in I
Observation
189
id lwt
188 95
ptl
3
ht
0
fry
0
bwt
_3637
0 117
p d
.839i1283 -1. 9111621
r rs
-2. 283885 -2.4478
5.99_726
dd
4. 197658
_Irace_3 pattern h db
dx2
low race
dx2
.1294439 ,8909163
Hosmer and Lemeshow graph
We can easily find this point:
db
p [w=db],
0 _ite
age smoke
25 1
ui
I
_Irace_2
0
(1989) suggest a graph that _combines two of the influence measures: border
Symbol
ylab
xlab
size proportional
tl("Symbol
size
proportional
to dSeta
I
Pro'fowl,
We can easily spot the most influential points by the dbeta
and dx2 measures.
to dBeta")
-
] i i
•
.
'_
logistic -- Logistic
regression
243
SavedReSults i
ti ! ! ! I
_og_s Scalars ic saves e(N) e df_.m) e r2_p)
in e(): number of observations model de_ees of freedom pseudo R-sqeared
tog likelihood, constant-only model number of clusters X2
e(ll_O) e(N_clust) e(chi2)
log likelihood
e Ii) :Macro_ e_ e_depvar) el wtype)
logistic name of dependent variable weight type
e(clustvar) name of cluster variable e(vcetype) covariance estimation method e(chi2type) Wald or LR: type of model X2
e(wexp)
weight expression offset
e(predict)
program used to implement predict
coefficient vector
e (V)
variance-covariance matrix of the estimators
eloffset)
test
Matrices e_b)
Functichls e sample)
marks estimation sample
_fi_s:_vesin r(): Scalars
_st_t i
rmmber of observations
r(df)
degrees of freedom
r(m)
number of covariate patterns _r groups
r(chi2)
X2
_aves in r ():
Scalars r(P_c_rr)
i r(P-n
I
r(N)
4)
r(P_p(_) r(P_.n_)
!roc
percent correctly classified
r(P_lp)
putative predictive value
sensitivity
r(P_ln)
negative predictive
specificity
r(P_Op)
false
false positive rate given true negative false negative rate given true positive'
r(P_On)
false negative rote given classified negalivc
positive
Vall}e
rate given
classified
positive
s_ves in r(): Scalars
i
r(N)
number of observations
r(area)
area under the ROC curve
in i
r():
Isens
Scalars r(N) number of observations
_
,,
Methods 3ndFormulas logis';ic,
lfit,
lstat,
lroc.
and lsens
are implemented
as ado-files. l! o
l
Define xj as the (row) vector of independent Variables, augmented by 1. and b as the correspond _v estimated 3arameter (column) vector, The logistic regression model is estimated bv logit: see {RI loRit for demih
!
of estimation.
....
The odds ratio corresponding to the ith coefficient is ¢i --- exp(bi). The standard error of the odds ratio = g'_si,-- where si isregression the standard error of bi estimated by logit. zo_ is sS Iogmuc Logistic
_!
Define lj = xj b as the predicted index of the jth observation. positive outcome is
The predicted probability
of a
exp,±i) 1+ exp(I )
Pj
If it Let 2_r be the total number of covariate patterns among the N observations. View the data as collapsed on covariate patterns j = 1, 2.... , M and define mj as the total number of observations having covariate pattern j and yj as the total number of positive responses among observations with covariate pattern j. Define pj as the predicted probability of a positive outcome in covariate pattern
j.
The Pearson X2 goodness-of-fit
statistic is M
x2=
(uj mjpj)2 j=, mjpj(i-
This X2 statistic has approximately M - k degrees of: freedom for the estimation sample, where k is the number of independent variables including the constant. For a sample outside of the estimation sample, the statistic has M degrees of freedom. The Hosmer-Lemeshow
goodness-of-fit
X 2 (Hosmer and Lemeshow
1980; Lemeshow and Hosmer
1982; Hosmer, Lemeshow, and Klar 1988) is calculated similarly, except rather than using the M covariate patterns as the group definition, the quanti!les of the predicted probabilities are used to form groups. Let G = # be the number of quantiles requested with group (#). The smallest index 1 <_q(i) <_ M such that q(i)
-G
j=l
gives Pq(i) as the upper boundary of the ith quantile for i = 1, 2..... first index.
G. Let q(0) -- 1 denote the
The groups are then !Pq(o),Pq(1)],
(Pq(1),Pq(2)],
...,
If the table option is given, the upper boundaries Pq(1) .... group number on the output.
(Pq(G-1),Pq(c,)] , PqfG) of the groups appear next to the
The resulting X 2 statistic has approximately G- 2 degrees of freedom for the estimation For a sample outside of the estimation sample, the statistic has G degrees of freedom.
sample.
m
logistic -- Logistic regression
predicitafk_r logistic =_ !,. i_
Index j will now be used to index observations, not covariate patterns. Redefine mj for each observatiol as the total number of observations sharing j's covariate pattern. Redefine yj as the total number of positive responses among observations sharing j's covariate pattern_
}
a'son residual for the jth observation is defined
rj= i
For mj > l, the deviance residual dj is defined
, i
dj==t=
_
i ;
i i
dy =
i i
!
i
0
(mj - yj)In
( )/ mj(1-pj)
In the limiting cases, the deviance residual
{-v/2mjll_(l-pj)[ ",/2mjllnp_l
if yj yj=O if = mj
The unldjusted diagonal elements of the hat matrix huj are given by hv_ = (XVXt)jj, where V is the e _timated covafiance matrix of parameters. The adjusted diagonal elements hj created by hat are th,m hj = mjpj (1 - pj) hu_. The-stalldardized Pearson residual rsj is rj/V/I ThePregibon
(1981)
- hi.
_-_j influence statistic is rj _hj
t
I _
yjl
is gi_venb3_
li ; : _:
II ( l
mj - yj
where file dgn is the same as the sign of (yj - mjpj). i
i!
245
(1- hj)
i The corresponding change in the Pearson X_ is r 2 is _Dj
=-Idj/(1 - hi).
sj. The corresponding change in the deviance residual
Istat and Isens
Again, et j index observations. Define c as the cutoff () specified by the user or. if not specified, as 0.5. Lel 7)3be the predicted probability of a positive outcome and yj be the actual outcome, which we will treat as 0 or I. although Stata treats it as 0 and non-0, excluding missing observations. A prediction is classified as positive if Pa >- c and otherwise is classified as negative. The classificati_m is correct if it is positive and yj -I or if it is negative and yj = O. .S'en._lttv_t3; is the fraction of yj = I observmions that are correctly classified. Specificity is the percent of !
!
= 0 obser_ations that are correctly classified.
Iroc
246
logistic --
Logistic regression
1
The ROC curve is a graph of specificity againstl (I - sensitivity), This is guaranteed to be a monotone nondecreasing function, since the number :of correctly predicted successes increases, and
,
the number
i
The area under the ROC curve is the area on the bottom of this graph, and is determined by integrating the curve. The vertices of the curve are determined by sorting the data according to the predicted index, and the integral is computed using the trapezoidal rule.
of correctly
predicted
failures
decreases
as the classification
cutoff
c decreases.
References Brady, A. R. 1998. sbe21: Adjusted population attributable fractions from logistic regression. State Technical Bulletin 42: 8-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 137-143. Cleves, M. and A. Tosetto, 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. State Technical Bulletin 55: 20-23. Collett, D, 1991. Modelling Binary Data. London: Chapman & Hall. Garrett, J. M. 1997. sbe14: Odds ratios and confidence intervat_ for logistic regression models with effect modification• State Technical Bulletin 36: 15-22. Reprinted in State Technical Bulletin Reprints, vol. 6, pp. 104-114. Gould. W. W. 2000. sgI24: Interpreting logistic regression in all its forms. State Technical Bulletin 53: 19-29. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 257-270. Green. D. M. and J. A. Swets. t974. Signal Detection Theorj and Psychophysics. rev. ed. Huntington, NY: Krieger. Hilbe, J. t997. sg63: Logistic regression: standardized coefficients and partiai correlations. State Technical Bulletin 35: 21-22. Reprinted in State Technical Bulletin Reprints, voI.' 6, pp. 162-163. Hosmer, D. W., Jr., and S. Lemeshow. 1980. Goodness-of+fit tests for the multiple logistic regression Communications in Statistics A9: 1043-1069.
model.
t989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.) Hosmer. D. W.. Jr.. S. Lemeshow, and J. Klar. 1988. Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometric Journal 30:: 911-924. Irala-Est_vez. J. de and M. A. Mart{nez. 2000. sg125: Automatic estimation of interaction effects and their confidence intervals. State Technical BuIletin 53: 29-31. Reprinted in State Technical Bulletin Reprints, vol. 9, pp. 270-273. Lemeshow. S. and D. W. Hosmer, Jr. 1982. A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology 115:: 92-106. • 1998. Logistic regression. In Encyclopedia of Biostatistics. ed. R Armitage and T. Colton. 2316-2327. John Wiley & Sons.
New York:
Lemeshow. S. and J.-R. Le Gall. 1994. Modeling the severity of illness of 1CU patients: a systems update. Journal of the American Medical Association 272: 1049-1055. Metz, C. E. 1978. Basic principles of ROC analysis. Seminarsi rn Nuclear Medicine 8: 283-298. Pagano, M. and K. Gauvreau. 2000. Principles of Biostatistics. 2d ed. Pacific Grove. CA: Brooks/Cole. Paul, C. 1998. sg92: Logistic regression for data including muitiple imputations. State Technical Bulletin 45: 28-30. Reprinted in State Technical Bulletin Reprints, vol. 8, pp. 180-183. Pearce, M. S. 2000. sgt48: Profile likelihood confidence intervals for explanatory variables in logistic regression. Stata Technical Bulletin 56: 45--47, Peterson, W. W.. T. G. Birdsalt, and W. C. Fox. 1954. The theory of signal detection. Trans. ItLE Professional Group on Intbrmalion Theory', PGIT-4: t71-212. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705-724. Tanner. W. P., Jr., and J. A. Swets. 1954. A decision-making theory of visual detection. P._ychological Review 61: 401-409_ TdIbrd. J. M., P. K. Roberson, and D. H. Fiser. 1995. sbel2: Using lilt and lroc to evaluate mortality prediction models. State Technical Bulletin 28: |4-18. Reprinted in 5;rata Technical BulletiJ7 Reprints. vol. 5. pp. 77-81.
[,
i
i
t I
I
logistic-- Logistic regression
247
i
i Tobias.A. 000. she36:Summarystatisticsreport for diagnostictests. StataTechnicalBulletin56: 16-18. Z Tobias.A. ,md M, J, Campbell.t998. sgg0: Akaike's |nformationcriterionand Schwarz'scriterion.State TechtTica! i Bullern 45: 23-25. Reprintedin Steta TechnicalBulletinReprints,vol. 8, pp. 174-177. Wee_ie,J. 998. sg87: Windmeijer'sgoodness-of-fittest for logistic regession. Stat_ TechnicalButle6n 44:_2-27 Reprinte) in State TechnicalBulletinReprints,vol. 8, pp. 153-t60. _
i
:i
Also See i Complem rotary:
[R] adjust, [R] lincom, [_] linktest, [R] Irtest JR] mfx JR] predict [R] roc, [R] sw, [R] test, JR] testnt, JR] vce, [R] xi
Related:
[R] brier, [R] dogit, [R] dloglog, [R] cusum, [R] glm, [R] giogit, [R] logit, [R] nlogit, [R] _obit, [R] scobit, [R] svy estimators
Bad,ground:
[U] 16.5 Accessing coet_ients and standard errors [U] 23 Estimation and p6st-estimafion commands,
i
[U] 23.11 Obtaining robdst variance estimates, [U] 23,12 Obtaining sco_s, [R] maximize
i
]
l
i
i
i
/
i
,is
Iogit -- Maximum-likelihood
I
III
logit estimation
I I
i
Syntax logit
depvar
noconstant asis by ....
[indepvars] or robust
moaimize_options
[weight] cluster
[if
exp]
( varname)
[in range] s¢ore
[, level(#)nocoef
(newvamame
) off set ( vanrame)
]
may be used with logit; see [R] by. iweights, and pweights are allowed; see [U] 14.1;6 weight. shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands. may be used with sw to perform stepwise estimation; see [R] sw.
fweights,
logit logit
Syntax for predict [type] newvarname
predict
[if
exp]
[in range]
[, statistic rules
asif
nooffset
]
where statistic is p xb stdp • dbeta , deviance • dx2 , ddeviance • _hat • number residuals • rstandard
probability of a positive outcome (the default) x/b, fitted values standard error of the predioIion Pregibon (1991) A/3 influence statistic deviance residual Hosmer and Lemeshow (1989) A X 2 influence statistic Hosmer and Lemeshow (1989) A D influence statistic Pregibon (1981) leverage sequential number of the c0variate pattern Pearson residuals; adjusted _for number sharing covariate pattern standardized Pearson residues: adjusted for number sharing covariate pattern
Unstarred statistics are available both in and out of sample; type predict .,. if e(sample) ... if wanted only for the estimation sample, Starred statistics are calculated only Torthe estimation sample even when if e(sample) is not specified.
Description logit estimates a maximum-likelihood
logit model,
Also see [R] logistic; logistic displays estimates as odds ratios. A number of auxiliary commands that can be run after Iogit, probit, or logistic estimation are described in [R] logistic. A list of related estimation commands is given in [R] logistic.
248
i T
IogB-- Maximum-fikelihoodlogit estimation
249
options level (#) specifies the confide_e level, in percent, for confidence intervals. The default is level
(95)
or as s_,tby set level: see [V] 23.5 Specif!ing: the width of confidence intervals. nocoef siecifies that the coefficient table is not to be displayed. This option is sometimes used by
i l
progran writers but is of no use interactively.
l
noconsta
i i
_ , i
_t suppresses the constant term (intercept) in the logit model.
or report the estimated coefficients transformeff to odds ratios, i.e., eb rather than b. Standard errors and:confidence inter_,als are similarly transfohned. This option affects how results are displayed, not ho_, they are efitimated, or may be specified at estimation or when replaying previously estimated results. robust s_ecifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditiopal calculatioh; see [u] 23.11 Obtai0ing robust variance estimates, robust combined with c3iuster () allows observations which a_e not independent within cluster (although they must be independent betv_en clusters). If you _pecify pweights, robustis implied_ see [U] 23.13 Weighted estimation. See: fR_logistic for txamptes using this opti0n.
!
clu_ster(Ivamame)
s_cifies
that the observations are independent across groups (clusters_ but
not necessarily within groups, vamame specifies to which group each observation belongs: e.g., c tust#r(personid) in data with repeated iobservations on individuals, cluster() affects the estimated standard 6rrors and variance-covariance matrix of the estimators (VCE), but not the estimatkd coefficients; see [u] 23.11 Obtaining robust variance estimates, cluster () can be
! i
used wlth pweightsto produce estimates for unstratified cluster-sampled data, but see the svylogit
!
commind in [R] sv_(.estimators for a comm_nd designed especially for survev, data.
i l
by itself. () implies robust; specifying robust ¢lust+r See [Rj logistic for examples using this optitn.
I
score(n ,
I
i
cluster
() is equivalent to typing cluster()
|wvarname) creates newvar containing uj = OlnL ./O(xjb)
sample. The score Vector is _
OlnLj/Ob
= _ ujxj',
for each observation j in the
i.e., the product of newrar w_th each
covari_te summed o;¢er observations. See [U] 23.12 Ol_taining scores. offset(_larname) specifies that varname is to be included in the model with coefficient constrained to be 1_. as is fore ._sretention of perfect predictor variables and their associated perfectly predicted observations and m y produce instabilities in maximization: see [R] probit. maximize.options _pecif3 them.
i
,
control the maximization process: see [R] maximize. You should never have t_
Optionsfcbrpredict p, the default,_ calculates the probability of a positive outcome.
I
xb calcul ttes the linea_ prediction,
l
strip cal, ulates the standard error of the linear prediction I
250 :_
Iogit -- Maximum-likelihood Iogit estimation
dbeta calculates the Pregibon (198 l) Aft influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same eovariate pattern. In Hosmer and Lemeshow (1989) jargon_ this statistic is M-asymptotic, that is, adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer and Lemeshow (1989) ?<X2 influence statistic, reflecting the decrease in the Pearson X2 due to the deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer and Lemeshow (t989) £xD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern. hat
calculates the Pregibon (1981) leverage or the diagonal elemems of the hat matrix adjusted for the number of observations that share the same cOvariate pattern.
number numbers the covariate patterns--observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered 1, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer and Lemeshow for the number of observations that share the same covariate pattern.
(1989) and adjusted
rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow and adjusted for the number of observations that share the same covariate pattern.
(1989)
rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing fop excluded observations. asif requests that Stata ignore the rules and the exclusion criteria, and calculate predictions for all observations possible using the estimated parameter from the model. nooffset is relevant only if you specified offset(v_trname) for logit. It modifies the calculations made by predict so that they ignore the offset _ariable: the linear prediction is treated as x_b rather than xjb + offsetj.
Remarks logit performs maximum likelihood estimation of models with dichotomous hand-side) variables coded as 0/l (or. more precisely, coded as 0 and not-0).
dependent
(left-
Example You have data on the make. weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a logit model explaining whether a car is foreign based on its weight and mileage. Here is an overview of your data:
Io_lit-- Maximum-likeVnoodlogit estimation • d _scribe Contains o3s :
data
from
auto.dta 74
vats: i
7 Jul 2000
1,998
(99.7_, of memory
storage
display
value
format
label
.
mak
strl8
_,-18s
Make
mpg
int
_8. Og
Mileage
i
_ei ht for _ign
int byte
_,8.Ogc _,8.Og
name
I
_or _ed by: Note:
i
, ilspect
foreign dataset
Car _ype
Number
Integers
Zero [ve Posit
52 22
82 22
Total Missing
74
74
#
Negative
i•
_ I
#
# #
i: :
1
'
(2
unique
Integers
74
values)
foreign
The vari
of Observations Non-
Total
i
# #
(mpg)
saved
:
i
and Model
foreign
foreign:
i
label
Weight (ibs. ) Car type
last
i
is labeled
amd all values are documented
in the label.
le foreign takes on two unique values. 0 and 1 The value (1 denotes a domestic car
and 1 de_otes a foreign car. i
The njodel you wish to estimate is
!
Pr(foreign= I)= F(flo -*-_lweight4-/32mpg)
i
i
i
.,here Flz) = e=/(1 + e':) is the cumulative logistic distribution. To estlmate" this model, you type I
l_git
i
I i f
foreign!weight
leg likelihood l%g likelihood
= -45.03_21 = -29.898_68
Iteration
2:
lSg
likelihood
= -27.495771
Iteration Iteration !teration
3: 4: 5:
l@g log log
likelihood likelihood likelihood
= -27.184D06 = -27.175_66 = -27. 175156
Lo
it
estimates:_
foreign
Number
I I
Coef.
of obs
LR chi2(2) Prob > chi2 Pseudo R2
_ _ _ -27.175156
Lo_l | likelihood
:
mpg
Ite "ation O: Ire "ation I:
I
i
variable
origin
has cha_tged since
Data
13:5i
free)
type
variable .
1978 Automobile
4
si.,e:
i
i
25t
Std. Err.
weight mpg
i_-.0039067 -.1685869
.0919174 .00t0116
_cons
! t3.70837
4. 518707
z
P>Iz]
=
74
= = =
35.72 0.0000 0.3966
[95Y, Conf.
-3.86 -I .83
O. 000 0.067
-. 3487418 0058894
3.03
O. 002
4,851864
Interval]
-. .011568 001924 22.56487
You find that heavier cars are less likely to be foreign and that cars yielding also less likely to be foreign, at least holding the weight of the car constant. 252 Iogit- Maximum-likeUhood Iogit estimation
F l
i
See [R] maximize
for an explanation
better gas mileage are
<1
of the outpui.
D Technical Note Stata interprets a value of 0 as a negative outcome (failure) and missing) as positive outcomes (successes). Thus, if your dependent and 1, 0 is interpreted as failure and 1 as success. If :your dependent 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated If you prefer a more formal mathematical the model
statement,
treats all other values (except variable takes on the values 0 variable takes on the values 0, as successes.
when you type logit
g x, Stata estimates
: exp(xjf_)
p (yj# oIxj)= I+ rl
Model identification The logit command has one more feature, and it is probably the most useful, logit will automatically check the model for identification and. if it is underidentified, drop whatever variables and observations are necessary for estimation to proceed.
> Example Have you ever estimated a logit model where one or more of your independent predicted one or the other outcome.'?
variables perfectly
For instance, consider the following data: Outcome y 0 0 0 1
Independent
Variable x 1 1 0 0
Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent vari_tble is one. In our data Pr(y = 0 ] x = 1) = 1, which in turn means that the logit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem, Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization, The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that. at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either (1) the whole thing comes crashing tO the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meantime, you have been waiting. In addition, the estimates that you finally receive, if you receive any at all, may be nothing more than numerical roundoff.
i
i _
"
lOgit-- Maximum-flkelihoodlogit estimation 253 Stata watches for these sorts of problems, alerts you, fixes them. and properly estimates the model. Let's return to our automobile data. Among the variables we have in the data is one called repair that take ; on three values. A value of t indicates that the car has a poor repair record, 2 indicates an avera :e record, antl 3 indicates a better-than-average record. Here is a tabulation of our data:
[}
tabulate
'.
foregtgn
repair repair
foreign
1
2
3
Total
)omestic
10
27
9
46
i
Foreign
0
3
9
12
I
Total
I0
30
18
58
i_ _
i
! !
Notice t:mt all the cai's with poor repair recor:ds (repair==1) are domestic. If we were to attempt to predk foreign on the basis of the repair records, the predicted probability for the repair==l category _;ould have io be zero. This in turn means that the Iogit coefficient must be minus infinity, andLet's that try; would b_zzing. Statasetonmost this computer problem, programs First, we _ake up two new variables, rep_is_l that indi :ate the repair category. , _enerate repjs_l generate rep_is_2 I
= (repair==1) (repair==2)
The stat#ment generate rep_is_l = (repa_r==l)creates a new variable, rap_is_l, that takes on the v_lue l when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_isI+2t that takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate!our logit model. See [R] probit for tl_e corresponding probit model.
}
_ogit for rep_is_l rap_is_2
I }
No_e: rep_is_l_=O predicts failure pezfectly rep_is_l _droppedand i0 obs not used
i i
It,.'ration 0: It,_ration 1: It,;ration2:
llog likelihood = -26.99_087 _og ]_og likelihood likelihood = -22.48_187 -22.230498
i i !
It,;ration3: It_:ration4: It,._ration 5: It,._ration 6: It,._ration 7:
_og _og ]_og _og _og
i I i
_{ !
and rap_is_2.
likelihood likelihood likelihood likelihood likelihood
= = = = =
-22.22g139 -22.229138 -22.22_138 -22.22g138 -22.229138
Lo_it estimates
Number of obs LR chi2(1)
= =
48 9.53
Lo
Pseudo R2
=
O.1765
likelihood = -22.229138 i
i foreign rep_is_2 _cons
Coef. -2.197225 ! 3.89e- 16
Std. Err. .7698003 .4714045
z -2.85 O.O0
Prob > chi2 = 0.0020 P> [z[ [95% Conf. Interval3 O.004 1. 000
-3. 706006 -. 9239359
-. 6884436 .9239359
Remember that all thd cars with poor repair rdcords (rep_is_1)are domestic, so the model cannot be estimated, or at lehst it cannol be estimated if we restrict ourselves to finite coefficients. State noted theft lact. It sad i Note: rep..as_l-=0 predicts fmlure perfectly . Th_s ,s Statas mathemat_cal-y precise ay of saying _:hat we said in English. When rep_is_l i._not equal toO. the car is domestic.
_l !
Stata then went ll_llllVlll on to say,lli_VlllllVVv "rep_is_l iv_||dropped and t0 obs not used". This is Stata eliminating the iv1 I¥_IL l_Lll||(l[Jl_J| problem. First. the variable rep_is_l had to be removed from the model because it would have an infinite coefficient. Then, the I0 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated. q
[] Technical Note Stata is pretty smart about catching problems like this. It will catch "one-way causation by a dummy variable", as we demonstrated above. Stata also watches for "two-way causation"; that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated. Stata also checks your data for collinear variables; it will say "'so-and-so dropped due to collinearity". No observations need to be eliminated in this case, arid model estimation will proceed without the offending variable. It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after im operation, and if we included in our model age, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model. logit
(and logistic
note:
4 failures
and probit) and
0 successes
will also occasionally display messages such as completely
determined.
There are two causes for a message like this. Let us deal with the most unlikely case first. This case occurs when a continuous variable (or a combination df a continuous variable with other continuous or dummy variables) is simply a great predictor of the dependent variable. Consider Stata's auto. dta dataset with 6 observations removed. . use
auto
(1978
Automobile
Data)
drop if foreign==O _ gear_ratio>3.1 (6 observations deleted) • logit Logit
Log
foreign
mpg
gear_ratio,
nolog
likelihood
Number
= -6.4874814
foreign mpg weight gear
note:
weight
estimates
Coef.
Err.
=
68
=
Prob
> chi2
=
0.0000
R2
=
0,8484
Pseudo
Std.
of obs
LR chi2 (3)
z
P>Izl
[95% Conf,
72.64
Interval]
-.4944907
.2655508
-1.86
0,063
-i,014961
.0259792
-.0060919
.003i01
-1.96
0.049
-.0121696
-.000014
ratio
15,70509
8.166234
1.92
0.054
-.3004359
31.71061
_cons
-21.39527
25.41486
-0.84
0.400
-71.20747
28.41694
4 failures
and
0 successes
completely
determined.
Iogit -- Maximum-likelihood Ioglt estimation :+ •
I , i
Note that t[ ere are no missing standard errors in the output. If you receive the "completely
determined"
message ar d have one or more missing standard errors in your output, see the second case discussed
;
below. Note g_ar_ratao +large coefficient,logit thought that the 4 observationswith the smallest predictedprobabilitieswereessentiallypredictedperfectly. 1
.(option predict p passumed; !
Pr(foreign))
. so_t"p . li_t p in i/4
!
+
255
i
p
1. 2. 3.
1.34e-I0 6.26e-09 7.84e-09
4. |
!.49e-08
If this hlappensto you,there is no need to dd anything.Computationally,the modelis sound.It
+
i
is the seco d case discussed
+
The se@ndcase occurswhenthe independenttermsare all dummyvariablesor continuousones with repea_edvalues (e.g.. age). In this case, one or more of the estimatedcoefficientswill have missingst_dard errors.:Forexample,considerthis datasetconsistingof 5 observations.
?
• li_t i
y 0 0 1 0 i
1. 2. 3, 4. 5.
I ¢
below that requires
xl 0 1 i 0 0
careful examination.
x2 0 0 0 i i
. lo_it y xl x2, nolog. Logi_ 7 estimates ! i 1 i
Numberof obs LR chi2(2) Prob > chi2 Pseudo R2
-2.7725887
Log likelihood • 1 Coef-. 8td. Err.
++
i8. 26157 t8.26157
{
co
-_8.26157
2 1.414214
P>Izl
[95Y, Conf. Interval]
9 13
0.000
14.34164
-i12.91
0.000
note: i failureand 0 successescompletelydetermined.
i
. predict p (optionp assumed Pr(y)) xl
y
-15,48976
0
0
0
+
2. 3. 4. 5.
0 t 0 1
1 1 0 0
0 0 1 I
Two thiSgs are happeaing
i
covariate
i
dropped,
p
x2
i.
i+
-21.03338
22.1815
, li. _ _
•
5 I,18 0.5530 O. 1761
z
+
+
= = = =
1',. 17e-08
.5 .5 .5 .5
here. The first is tl_at logit is able to fit the outcome
(y = 0) for the
p_ttern+ xl = 0 and x2 = 0 (i.e., the first observation) perfectly. It is this observation that letel' dete + is the "1 f__ilure ...con_ } rm'ned". The second thing to note is that if this observation is t,n
!+
xl, x2, arid the constant
'i
are colli_ar.
....................-_oo---........,o_J,_ "Wmxmmum-.KellnOO0Ioglt estimation
This is the cause of the message "completely determined" and the missing standard errors. It happens when you have a covariate pattern (or patterns) with only one outcome, and there is collinearity when the observations corresponding to this covariate pattern are dropped. If this happens to you, confirm the causes. First identify the covariate pattern with only one outcome. (For your data, replace xl and x2 with the independent variables of your model.) • egen pattern = group(x1 x2) quietly logit y xl x2 predict p (option p assumed; Pr(y)) • snmraarize p Variable
Obs
Mean
p
5
.4
Std. Dev. .22360_8
Min
Max
I. 17e-08
.5
If successes were completely determined, that means there are predicted probabilities 1. If failures were completely determined, that means there are predicted probabilities O. The latter is the case here, so we locate the corresponding value of pat_;ern:
that are almost that are almost
• tab pattern if p < le-7 x2) 1 Total group (xl 1.
Freq.
Percent
Cum.
1
i00.O0
i00.O0
1
100.O0
Once we omit this covafiate pattern from the estimation sample, logit can deal with the collinearity: logit y xl x2 if pattern-=l, nolog note: x2 dropped due to collinearity Number of obs LR chi2 (I) Prob > chi2 Pseudo R2
Logit estimates
Log likelihood = -2.7725887
= = = =
4 0.00 1.0000 0.0000
|
y [
Coef.
xl _cons
0 0
Std. Err. 2 I.414214
z O.O0 O.O0
P>lz{ 1.000 I.000
[95'/, Conf. Interval] -3.919928 -2.771808
3.919928 2 _771808
We omit the collinear variable. Then we must decide whether to include or to omit the observations with pattern
= 1. We could include them
logit y xl, nolog Logit estimates
Number of obs LR chi2(1) Prob > chi2
= = =
5 O. 14 0.7098
Log likelihood = -3.2958369
Pseudo R2
=
0.0206
Y I _cons xl j
Coef. -. 6931472 .6931472
Std. Err. I.224742 1. 870827
or exclude them: • logit y xl if pattern~=l, nolog
z -0.57 O. 37
P> lz[ O.571 O. 711
[957,Conf. Interval] -3.093597 -2.973605
I.707302 4. 3599
Iog_ -- Maximum-likelihoodIoglt estimation ,
Logi
estimates _ i
Log _ikelihood '
i
= =
Prob
=
1.0000
=
0.0000
> chi2
Pseudo
R2
4 0.00
1
Y
i
!
= -2.7725887
Number of obs LR chi2 (I)
257
Coef,
xI
0
_cons
0
Std.
Err. 2
1,4142/4
z
P> ]zt
[95_ Conf.
Interval]
O. O0
1.000
-3.919928
3.919928
0.00
1.000
-2.771808
2.771808
If the _ovariate pattern that predicts outcome perfectly is meaningful, you may want to exclude these observations from the model. In this case_ one would report covariate pattern such and such predicted 6utcome perfectly and that the best model for the rest of the data is .... But. more likely. the perfec_ prediction was simply the result of h_ving too many predictors in the model. In this case. one wouldI omit the extraneous variable(s) from further consideration and report the best model for all the datA. 23
'
i
Obtaining redicted values
i
Once y_u have estimated a logit model, you can obtain the predicted probabilities using the predict )remand for both the estimation sample and other sampI_s: see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.
i
predict without arguments calculates the predicted probability of a positive outcome: i.e.. Pr'd/j = .) = F(xjb), With the xb option, it calculates the linear combination xjb, where x_
!
are the in, ependent variableSasin the jth observation and b is the estimated parameter vectOr.atThis is sometin_es known the index function since the ,mmulative distribution function indexed this
i
value is thI probability of a positive outcome. In bothtcases, State remembers any "rules'" used to identify the model and calculates missing for excluded dbservations unless rules or asif is Specified. This is covered in the following example.
e
!
i
!
For inf_rmation about the other statistics available after predict,
> Example In the _revious example, we estimated the 10git model logit foreign
Pr(foreign))
i
(_0 _issing
values
generated)
i
s_arize
foreign
! i
rep_is_l
rep_is_2.
To obtain _redicted probabilities: . predict p (option p assumed;
,
see [R] logistic.
_ariable
!
:foreign p
p 0bs
Mean
58 48
,2068966 .25
Std. Dev. .4086186 .1956984
Min
Max
0 .t
! .5
State rem_ mbers any "'rules" used to identify ihe model and sets predictions to missing for any excluded , bservations, in the previous examplel logit dropped the variable rep_is_l from our model anc excluded l0 observations. Thus. when we typed predict p. those same I0 observations were a_ai_ excluded an_t their predictions were _et to missing.
-
predict'srules option will use the rules in file prediction. During estimation, we were told "rep__is_t-=O predicts failure perfectly", so the rule is that when rep_is_l is not zero, one should zoe _oglt-- MaxJmum-l|lmlthoodIogit predict 0 probability of success or a positive estimation outcome: predict p2,
rules
• summarize foreign p p2 Variable [
)
foreign p p2
predict's asif for all observations
Obs 58 48 58
Mean .2068966 .25 .2068966
Std. Dev. .4086186 .1956984 •2016268
Min
Max
0 .I 0
1 .5 .5
option will ignore the rules and the exclusion criteria and calculate predictions possible using the estimated parameters from the model:
• predict p3, asif . summarize foreign p p2 p3 Variable
Obs
Mean
foreign p p2 p3
58 48 58 58
.2068966 .25 .2068966 .2931034
S%d, D_v.
Min
Max
.4086186 .195698_4 .2016268 .2016268
0 .1 0 .1
1 .5 .5 .5
Which is right? What predict does by default is the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case. however, you should re-estimate the model to include the excluded observations. q
Saved Results logit
saves in
e():
Scalars e (N)
number
e ('2.1)
log likelihood
e(df_m) e (r2_p) e (N_clust)
model degrees of freedom pseudo R-squared number of clusters
of observations
e(_l_O) e (chi2)
log likelihood, X2
e(cmd)
logit
e(vcetype)
covariance
e(depvar) e (wtype)
name of dependent variable
e(chi2type) e(offset)
Wald offset
e(wexp) e(clustvar)
weight expression name of cluster variable
e(predict)
program
coefficient
e(V)
variance-covariance estimators
constant-only
model
Macros
weight
type
estimation
method
or LR: type of model X_ test used to implement
predict
Matrices e(b)
vector
Functions e(sample)
marks estimation
sample
matrix
of the
t
Iogitj- Maximum-likelihoodIoglt estimation
259
V
; i
.
Methods
Formulas
The wo_ logit is due to Berkson (1944) and is by analogy with the word probit, For an introduction to probit a_d logit, see, for example. Aldrich and Nelson (1984), Hamilton (1992Z Johnston and DiNardo (1_97), Long (1997), or Powers and Xie (2000). The likelihood function for logit is 1
InL=
Ewj
lnF(xjb)
+ Zwiln{1-F(xjb)}
jES
i !
j_S
where S is!the set of all observations j such that yj _ O, F(z) = eZ/(l optional wdights. In L is maximized as described in [R] maximize.
+ eZ), and wj denotes the •
If robusl standard errors are requested, the dalculation described in Methods and Formulas of [R] regresstis carried forward with uj = !1 - F(xjb)}xj for the positive outcomes and -F(xjb)xj for the neghtive outcomes, qc is given b5 its asymptotic-like formula,
Reference
.Aldrich.J. 0' and F. D. Nelson. 1984. Linear Probab_lit);Logit, and Probit Models. Newbury Park. CA: Sage
i o
Publicatiohs. Berkson,J. @44. Applicationof the logisticfunctionto l'iio-assay.Journalof the AmericanStatisticalA._ociation39: 357-365.' Cleves,M. a}d A. Tosetto 2000. sg139:Logisticregressionwhenbinary outcomeis measuredwith uncertainty.Stata T_chmcallBulletin55: 20-23.
',
Hami['ton.L.!C 1992. f_egres_ionwith Graphics.PacificGrove.CA: Brooks/ColePublishingCompany.
i
ltosmer. D +'.. Jr.. and S. Lemeshow.1989. AppliedLOgisticRegression.New York:John Wiley & Sons. (Second editionforthcoming"in 200I.) Johr_ston.J. ]nd J. DiNardo. I997. EconometricMethods.4th ed. New York:McGraw-Hill. Judge,G. G..!W.E Griffiths,R. C. Hill.H. L/itkepohl.andT.-C.Lee. 1985. The Theoryand Practiceof Econometrics. 2d ed. Nlw York:John Wiley& Sons. Long. J. S_|997. RegressiobModels for Cate_,oricaland Limited Dependent Variables.ThousandOaks. CA: Sage
i
, i
Publicatic_as. Powers.D, _, and Y. Xie. 2000. StatisticalMezhodsfor CategoricalDataAnah,sis, San Die__o.CA: AcademicPress Pre_ilyon.D.!1981. Logisticregressiondiagnostics.Annals of Statistics9: 705-72&
Also Compleme
atary:
i
[R] clogit, [R] cloglog, [R] cusum, [R] glm, [R] giogit, [R] logistic. [Ri nlogit, [R] probit, [R] scobit. [R] svy estimators. [R] xtelog.
I
[R]
i
i
Related:
[R] adjust, [R] lincom. [R] linktest. [R] lrtest. [R] mfx. [R] predict, [R] roe. [R] sw, [R] test, [R] testnl, [R] vce. [R] xi
Backgrounch
xtgee, [R] xtlogit, [R] _tprobit
[u] 16.5 Accessing coefficients and standard errors, [U_]23 Estimation and post-estimation commands, [U_23.11 Obtaining robu_ variance estimates. [U_23,12 Obtaining scores, [R_maximize
_
Ioneway
:,
--
Large
one-way
ANOVA,
random
effects,
I I
and reliability
I
Syntax loneway
response_var
group_var
[weight t [if
exp]
[in
range I [, mean median exact
l_evel(#) ] by ...
: may be used with loneway; see [R] by.
aweights
are allowed; see [U] 14.1.6 weight.
Description loneway estimates of levels
one-way
of group_var
analysis-of-variance
and presents
different
(ANOVA) models
ancillary,
statistics
Feature
from
on datasets one,ray
with a large number (see [R] oneway):
oneway loneway
Estimate one-way model on fewer than 376 levels on more than 376 levels Bartlett's test for equal variance Multiple-comparison tests Intragroup correlation and S.E. Intragroup correlation Confidence interval Est. reliability of group-averaged score Est. S.D. of group effect Est. S.D. within group
x x
x x x
x x x x x x x
Options mean specifies that the expected value of the Fk-l.N,-k distribution Fm in the estimation of p instead of the default value of 1. median specifies that the median of the Fk-l_N-k distribution in the estimation of p instead of the default value of 1. exact;
requests
confidence not used. level
(#)
that exact intervals.
specifies
default is level(95) intervals.
confidence
This option
the confidence
intervals is allowed
level,
or as set by set
be computed,
level;
be used as the reference
as opposed
only if the groups
in percent,
be used as the reference
for confidence
are equal intervals
see [U] 23.5 Specifying
260
to the default
point
of the coefficients. width
Fm
asymptotic
in size and weights
the
point
are The
of confidence
r
i
_
Re.m
loneway-
i
Large one-wayANOVA,random effects,and reliability
261
> Example lonewa't's output looks like that of oneway except that, at the end, additional information is presented. Jsing our automobile dataset (see [U]'9 Stata's on-line tutorials and sample datasets), we have eated a (numeric) variable called ma_u:facturer_grp identifying the manufacturer of each car an within each manufacturer we have retained a maximum of four models, selecting those with',the h Jcest mpg. We can compute the intradlass correlation of mpg for all manufacturers with at least You models as follows: . 'loneway mpg manufacturer_grp if nummake == 4 One-way Analysis of VarianCe for mpg: Mileage (mpg)
S_urce
SS
df
Number of obs =
36
R-squared :
0. 5228
MS
F
Between|manufactu/~p Withi_ manufactur_p
621.88889 567.75
8 27
77,736111 21.027778
Total
1189. 6389
35
33.989683
Intraclass correlation 0.402T0
Asy. S.E O.18770
Prob > F
3.70
O.0049
[957 Conf. Interval] O.03481
0.77060
Estimatec_SD of manufactur_p effect 3.765247 Estimated SD within manufactur-p 4.585605 Est. reliability of a manufactur-p mean .7294979 (evau%uatedat 11=4.00)
q
In additi(,n to the standard one-way ANOVAoutput, lonewayproduces the R-squared, estimated standardde,,iation of the group effect, estimated standard deviation within group, the intragroup correlation he estimated reliability of the group-a_eraged mean, and, in the case of unweighted data. the asyrr/ptc :ic standard error and confidence interval for the intragroup correlation.
R-squared The R-squared is, of course, simply the underlying R 2 for a regression of response_var on the levels of ¢rqlup_var. or mpg on the various manu(acturers in this case.
The random effects ANOVA model loneway assumes that we observe a variable Yij measured for r_, elements within k groups or classes such that Yij
::=
,tZ+ Ct" i -I-
6ij,
i = 1,!2,...,
k.
3 = 1.2 .....
ni
and %. and _]ij are independent zero-mean randon3 variables with variance a,] and cr2, respectively. This is the random-effects ANOVAmodel, also kno_'n as the components of variance model, in which it is t}_picall31assumed thak the Yij are normally d_stributed.
!
The interpretation '!
with respect to our example is that the observed value of our response
variable,
mpg, is created in two steps. First, the ith manufacturer is chosen and a value c_i is determined--the !o,Large one-way reliability typical mpgtoneway for that --manufacturer less ANUVA, the overallrandom mpg/_. effects, Then aand deviation, eij, is chosen for the jth model within this manufacturer. This is how much that particular automobile differs from the typical mpg value for models from this manufacturer. For our sample of 36 car models, the estimated standard deviations are cr,_ = 3.8 and cr, -- 4.6. Thus, a little more than half of the variation in mpg between cars is attributable to the car model with the rest attributable to differences between manufacturers. These standard deviations differ from those that would be produced by a (standard) fixed-effects regression in that the regression would require the sum within each manufacturer of the eij, ei. for the ith manufacturer, to be zero while these estimates merely impose the constraint that the sum is expected to be zero.
Intraclass correlation There are various estimators of the intraclass correlation, such as the pairwise estimator, which is defined as the Pearson product-moment correlation computed over all possible pairs of observations that can be constructed within groups. For a discussion of various estimators, see Donner (1986). loneway computes what is termed the analysis of variance, or ANOVA, estimator. This intraclass correlation is the theoretical upper bound on the variation in response_var that is explainable by group_var, of which R-squared is an overestimate because of the serendipity of fitting. Note that this correlation is comparable to an R-squared you do not have to square it. In our example, the intra-manu correlation, the correlation of mpg within manufacturer, is 0.40. Since aweights weren't used and the default correlation was computed, i.e., the mean and median options were not specified, loneway also provided the asymptotic confidence interval and standard error of the intraclass correlation estimate.
Estimatedreliability of the group-averagedscore The estimated reliability of the group-averaged score or mean has an interpretation similar to that of the intragroup correlation; it is a comparable number if we average response_var by group_vat, or rapg by manu in our example. It is the theoretical upper bound of a regression of manufactureraveraged mpg on characteristics of manufacturers. Why would we want to collapse our 36-observation dataset into a 9-observation dataset of manufacturer averages? Because the 36 observations might be a mirage. When General Motors builds cars, do they sometimes put a Pontiac label and sometimes a Chevrolet label on them, so that it appears in our data as if we have two cars when we really have only one. replicated? If that were the case, and if it were the case for many other manufacturers, then we would be forced to admit that we do not have data on 36 cars: we instead have data on 9 manufacturer-averaged characteristics.
Saved Results loneway
saves in r O :
Scalars r(N) r(rho) r(lb) r(ub)
number of observations intraclass correlation lower bound of 95% CI for rho upper bound of 95% CI for rho
r(rho_t) r(se) r(sd_w) r(sd_b)
estimated reliability asymp. SE of intraclass correLati_m estimated SD within group estimated SD of group effect
!
'loneway-- Large one-wayANOVA,random effects, and reliability
263
Metl!ods and Formulas is implemented as an ado-file.
lo_e_ !
The r_ean squares in the lone_ay's
ANOVAtable are computed as follows:
Mso= _i wi(_,.-9.)_/(k- t) an_
MS,= _ _ _,j(y,j- _,.)2/(u-k) •
1"
j
in which
j i
i
j
i
= E expected wij w.. values = wi. these Yi. m_an = E squares wiiyij/wi, The c0rre_:i. ;ponding of are
t
and
_.. =
wi.ff_./w..
2 + go% and E(MS_)= _2
E(MS_,) = a 2 in Which
_..- Z,wUw k-1
Note that in the unweighted case, we get
N- Z,-_/N g=
k-1
i
As expecti d, g = rn for the case of no weights _mdequal group sizes in the data, i.e., n_ = m for all i. l_ep[acilLgthe expected values with the obse_,ed values and solving yields the ANOVAestimates of a_ and cry. Substituting these into the defini[ion of the intraclass correlation 2
P= _ + G_ yields the _NOVA estimator of the intraclass correlation: IFobsPA
=
_bbs
--
1 1+ 9
Note that 7obs is the observed value of the F statistic from the ANOVAtable. For the case of no weights ar d equal hi, PA = roh, the intragroup correlation defined by Kish (1965). Two slightly different e:timators are available through the mean and median options (Gleason 19971. If either of these optioas is specified, the estimate of p becomes
•
0= Fob_ _-(__ i-)Fm
i
}
) ' :
For _he:rae..n option, Fm= E(Fk-1,._'-K) = (_}r_ k)/(N - k - 2), i.e., the expected value of the ANO\__,tab e's F statistic. For the median optioh. Fm is simply the median of the F statistic. Note thal setting F,, to I gives PA, so for large samples these different point estimators are essentially the samd. Als_, since the iniyaclass correlation of the random-effects model is by definition nonnegative.
I
:
for any of he three possible point estimators p is truncated to zero if Fobs is less than F,_.
r_ ' it ,:i
For the case of no weighting, interval estimators for PA are computed. If the groups are equal-sized 264 ni equal) Ionewayeffects, exact and reliability (all and the Large exact one-way option isANOVA, specified,random the following (under the assumption that the Yij are normally distributed) 100(1 a)% confidence interval is computed:
-
[
Fobs - FmF_,
Fobs -- FmFz
Fobs + (9 - 1)FmFu'
Fobs + (9 - 1)FmFt
]
with F,_ - 1, Fl = Fa/2,k_l,N_k, and Fu - Fl_a/2,k_l,N_k, F.,k--l,N--k being the cumulative distribution function for the F distribution with k - 1 and N - k degrees of freedom. Note that if mean or median is specified, Fm is defined as above. If the groups are equal-sized and exact is not specified, the following asymptotic 100(1 - a)% confidence interval for PA is computed: [PA --ZaI2V(pA),PA + zaI2V(pA)] where Zal2 is the t00(1 -a/2) percentile of the standard normal distribution and V(pA) is the asymptotic standard error of p defined below. Note that this confidence interval is also available for the case of unequal groups. It is not applicable, and therefore not computed, for the estimates of p provided by the mean and median options. Again, since the intraclass coefficient is nonnegative, if the lower bound is negative for either confidence interval, it is truncated to zero. As might be expected, the coverage probability of a truncated interval is higher than its nominal value. The asymptotic standard error of PA, assuming the Yij are normally distributed, is also computed when appropriate, namely for unweighted data and when PA is computed (neither the mean nor the median options are specified):
V(pA)
= 2(1_P)2i
(A + B + C)
with A = {1 + p(gN-k
1)} 2
B = (1 - p){1 + p(2gk-1 2
1)}
2
C = p {_-_ ni " 2N-1E:nf
(k- 1)2
n_)2}
and PA is substituted for p (Donner 1986). The estimated reliability of the group-averaged score, known as the Spearman-Brown formula in the psychometric literature (Winer, Brown. and Michels t991, 1014), is
prediction
tO Pt--
1 -t- (tfor group size t. loneway
1)p
computes Pt for t -= 9.
The estimated standard deviation of the group effect is aa -- v/(MSa - MSe)/g. This comes from the assumption that an observation is derived by adding a group effect to a within-group effect. The estimated standard deviation within group is the square root of the mean square due to error, or x/--M--Se.
Ioneway -- Large one-wa_ ANOVA, random effects, and reliability
265
AcknOWledgment We wo_lld like to thank John Gleason of Syracuse vernon
University
for his contributions
to this improved
of loneway.
Referencts Donner, A. 1986. A review of inference procedures for'the intraclass correlation coefficient in the one-way random effects ITodel. International Statistical Review 54: 67;-82. Gteason, L !_. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25-3t Reprinte_ in Stata Technical Bulletin Reprints. vol. 6, pp I67-176. Kish, L.; 19_5. Survey Sampling. New York: John Wiley & Sons. Win¢r, B. L D. R. Brown. and K. M Michels. 199I. Statistical Principles in Experimental Design. 3d ed. New York: McOraw -Hill.
Also See Related:
[R] onewa_d
'io
lorenz -- Inequality measures
............
[;i"
; _
1
II
IIII H I
i
i
ii
III
II
i
Remarks Stata should have commands for inequality measures, but at the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands, many of which have been published in the Smm Technical Bulletin (STB),
Issue
insert
author(s)
command
description
STB-48
gr35
N.J. Cox
psr., qsm, pdagum._ qdagum
Diagnostic plots for assessing Singh-Maddala and Dagum distributions fitted by MLE
STB-23
sg30
E, Whitehouse
lorenz, inequal_ atkinson, relsginl
Measures of inequality in Stata
STB-23
sg31
R. Goldstein
rspread
Measures of diversity: absolute and relative
STB-48
sgl04
S.P. Jenkins
sumdist, xfrac,
Analysis of income distributions •
ineqdeca, geivars, i ineqfac, povdeco STB-48
sgl06
S. R Jenkins
smfit, dagumfiti
Fitting Singh-Maddala and Dagum distributions by maximum likelihood
STB-48
sgl07
S. R Jenkins, E Van Kerm
glcurve
Generalized Lorenz curves and related graphs
STB-49
sgl07.1
S. E Jenkins, P. Van Kerrn
glcurve
update; install this version
STB-48
sgl08
P. Van Kerm
poverty
Computing poverty indices
STB-51
sgtI5
D. Jolliffe, B. Krushelnytskyy
ineqerr
Bootstrap standard errors for indices of inequality
STB-51
sgll7
D. Jolliffe, A. Semykina
sepoy
Robust standard errors for the Foster-GreerThorbexke class of poverty indices
Additiona] commands may be available; enter Stata and typ_ search
inequality
measures.
To download and install from the Internet the Jenkins isumdistcommand, for instance, you could 1. Pull down Help and select STB and User-written Programs. 2. Click on http://www.stata.com. 3. Click on stb, 4. Click on stb48. 5. Click on sg 104. 6. Click on click here to install. or you could instead do the following: 266
i
- '
-
lorenz -- Inequality measures
1, !Na_lgate
to the appropriate
_. Type net
267
STB issue:
_rom http://vwv,
stata,
com
stata,
com/stb/stb48
| Type net cd stb Type net cd stb48 or
). Type net
from http://_w,
2. Ty] e net
describe
3. Ty
±nsta_[1
net
sgl04 sgl04
Refemncq s Cox, N. J. I999, gr35: Diagnostic plots for assessing Singh_Maddala and Dagum distributions fitted by MLE, Stata Technic_1 Bulletin 48: 2-4, Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 72-74. Goldstei_, _. 1995. sg3]: Measures of diversity: absolute and relative. Stata Technical Bulletin 23: 23-26. Reprin'ted in Stata Technical Bulletin Reprints, voL 4, pp. 150_-154. Jenkins. S. _. 1999a. sgl04t Analysis of income distributions. Stata Technical Bulletin 48: 4-18. Reprinted in Stata Tec1_ica' Bulletin Reprints, vol. 8, pp. 243-260. -. 19991 sg]06: Fitting Singh-Maddal_ & Dagum distributions by maximum likelihood. Stata Technical Bulletin 48: t9-5. Reprinted in Stata Technical Bulletin Reprints. rot. 8, pp. 26t-268. Jenldns. _S. • and P. Van Kdrm. 1999a, sgl07: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 48: 25- 9. Reprinted in Stata Technical Bulletin Re_qrints,vol, 8, pp. 269-274. --
lff°J9 sgl07.t: Generalized Lorenz cur'¢es and related graphs. Stata Technical Bulletin 49: 23, Reprinted in S_ata Tetfinical Bulletin • epnnts, voL 9, p. 171.
Jolliffe, D. nd B. Krushelrtytskyy. 1999 sgll5: Bootstrap standard errors for indices of inequality, Stata Technical 13_lletin ;1: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 191-196, Jotliffe, D. nd A. Semykintt. 1999 sgll7: Robust stant_u'd errors for the Foster-Greer-Thorbecke class of poverty i_ices. 'tara Technical Bulletin_51: 34-36. Reprinted in Stata Technical Bulletin Reprints. vol. 9. pp, 200-203." Van Kerm. 1999. sg]08: Computing povert_ indices. St_ta Technical Bulletin 48: 29-33. Reprinted in Stata Technical Bulletin _eprints, vol. 8,_pp. 274-278. Whitethouse,_E. 1995, sg30: !Measures of inequality in Sltata. Stata Technical Bulletin 23: 20-23. Reprinted in Stata Te_chnicallBulletinReprir_s. vol. 4, pp. 146-150.
! ,
I
Irtest -- Likelihood-ratio I
test after model estimation i
I
II
I
I
Syntax irtest [, saving(name) using(name)
m_odel(name)dr(#) ]
where name may be a name or a number but may not exceed 4 characters.
Description irtest saves information about and performs lil_elihood-ratio tests between pairs of maximum likelihood models such as those estimated by cox, ]_ogit, logistic, poisson, etc. lrtest may be used with any estimation command that reports a tog-likelihood value or, equivalently, displays output like that described in [R] maximize. lrtest, typed without arguments, performs a likelihood-ratio test of the most recently estimated model against the model previously saved by lrtest ,i saving(0). It is your responsibility to ensure that the most recently estimated model is nested within the previously saved model. lrtest
provides an important alternative
to test'for
maximum likelihood
models.
Options saving(name) specifies that the summary statistics as:;ociated with the most recently estimated model are to be saved as name. If no other options are pecified, the statistics are saved and no test is performed. The larger model is typically saved by typing lrtest, saving(0). using(name) specifies the name of the larger mode_ against which a model is to be tested. If this option is not specified, using(O) is assumed. model (name) specifies the name of the nested model (a constrained specified, the most recently estimated model is used.
model) to be tested. If not
df (#) is seldom specified: it overrides the automatic degrees-of-freedom
calculation.
L
Remarks The standard use of Irtest is 1. Estimate the larger model using one of Stata's estimation saving(O). 2. Estimate an alternative,
nested model (a constrained
commands
and then type lrtest,
model) and then type lrtest.
Example You have data on infants born with low birth weights along with characteristics of the mother (Hosmer and Lemeshow 1989 and more fully described in JR] logistic). You estimate the following model: 268
_
irtest -- LiWelihood-ratio test after model estimation
269
. logistic low age lwt race2 race3 smoke ptl ht ui

Logistic estimates                              Number of obs   =        189
                                                LR chi2(8)      =      33.22
                                                Prob > chi2     =     0.0001
Log likelihood = -100.724                       Pseudo R2       =     0.1416

------------------------------------------------------------------------------
     low   Odds Ratio   Std. Err.       z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------------------------
     age     .9732636   .0354759    -0.74    0.457      .9061578    1.045339
     lwt     .9849634   .0068217    -2.19    0.029      .9716834    .9984249
   race2     3.534767   1.860737     2.40    0.016      1.259736    9.918406
   race3     2.368079   1.039949     1.96    0.050      1.001356    5.600207
   smoke     2.517698    1.00916     2.30    0.021      1.147676    5.523162
     ptl     1.719161   .5952579     1.56    0.118      .8721455    3.388787
      ht     6.249602   4.322408     2.65    0.008      1.611152    24.24199
      ui       2.1351   .9808153     1.65    0.099      .8677528      5.2534
------------------------------------------------------------------------------

You now wish to test the constraint that the coefficients on age, lwt, ptl, and ht are all zero (or, equivalently in this case, that the odds ratios are all 1). One solution is

. test age lwt ptl ht
 ( 1)  age = 0.0
 ( 2)  lwt = 0.0
 ( 3)  ptl = 0.0
 ( 4)  ht = 0.0

          chi2(  4) =   12.38
        Prob > chi2 =  0.0147
This test is based on the inverse of the information matrix and is therefore based on a quadratic approximation to the likelihood function; see [R] test. A more precise test would be to re-estimate the model applying the proposed constraints and then calculate the likelihood-ratio test. lrtest assists you in doing this. You first save the statistics associated with the current model:

. lrtest, saving(0)
The"nam_" 0 was not _h°sen arbitrarily, although we could have chosen any name. Why we chose 0 will bec _me clear sb+rtly. Having saved the information on the current model, we now estimate the constrained model, ,_,hich in this case is themodel omitting age, l_,,,"t:,ptL and ht: Io istic low r_ce2 race3 smoke ui L_gi
estimates
Number of obs LR chi2(4) Prob > chi2
Dog Likelihood = '-107.93404 low race2 race3 smoke ui
Pseudo
R2
= -=
189 18.80 0.0009
=
0.0801
Std. Err.
z
P>Izt
[957,Conf. Interval]
3.052746
I.498084
2.27
O.023
I.166749
7.987368
12.922593 12.945742 2.419131
I.189226 I.101835 1.047358
2.64 2.89 2.04
0. 008 O.004 0.041
1.31646 i.41517 i.03546
6.488269 6.131701 5.651783
Odds Ratio
That done. typing Irteit will compare this model with the model we previously saved: Ir zest Logi 3tic:
likelJhood-ratio test
chi2(4) = Prob > chi2 _
14.42 0.0061
¢"# 'J LqI_OL -I.,.II_V_IIIIVq.PU--i CIILI_I IIIUUI_I _,_ILI||lidIIqQIFI The more !!precise syntax for theCILIU test |ql_Ol is Irtest, usihg(O),meaning that the current model is to be compared with the model saved as 0. The name 0, a_ we previously said. is special when you do not specify the name of the using() model, using(b) is assumed. Thus. saving the original model as 0 saved us some typing when we performed the test.
Comparing results, test reported that age, lwt, ptl, and ht were jointly significant at the 1.5% level; lrtest reports that they are significant at the 0.6% level. lrtest's results should be viewed as the more accurate.
Example
Typing lrtest, saving(0) and later lrtest by itself is the way lrtest is most commonly used, although here is how we might use the other options:

. logit chd age age2 sex          (estimate full model)
. lrtest, saving(0)               (save results)
. logit chd age sex               (estimate simpler model)
. lrtest                          (obtain test)
. lrtest, saving(1)               (save logit results as 1)
. logit chd sex                   (estimate simplest model)
. lrtest                          (compare with full model)
. lrtest, using(1)                (compare with model 1)
. lrtest, model(1)                (repeat test against full model)
Example
Returning to the low birth weight data in the first example, you now wish to test that the coefficient on race2 is equal to that on race3. The base model is still stored under the name 0, so you need only estimate the constrained model and perform the test. Letting z be the index of the logit, the base model is

    z = \beta_0 + \beta_1 \mathrm{age} + \beta_2 \mathrm{lwt} + \beta_3 \mathrm{race2} + \beta_4 \mathrm{race3} + \cdots

If \beta_3 = \beta_4, this can be written

    z = \beta_0 + \beta_1 \mathrm{age} + \beta_2 \mathrm{lwt} + \beta_3 (\mathrm{race2} + \mathrm{race3}) + \cdots

To estimate the constrained model, we create a variable equal to the sum of race2 and race3 and estimate the model including that variable in their place:

. gen race23 = race2 + race3
. logistic low age lwt race23 smoke ptl ht ui

Logistic estimates                              Number of obs   =        189
                                                LR chi2(7)      =      32.67
                                                Prob > chi2     =     0.0000
Log likelihood = -100.9997                      Pseudo R2       =     0.1392

------------------------------------------------------------------------------
     low   Odds Ratio   Std. Err.       z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------------------------
     age     .9716799   .0352638    -0.79    0.429      .9049649    1.043313
     lwt     .9864971   .0064627    -2.08    0.038      .9739114    .9992453
  race23     2.728186   1.080206     2.53    0.011      1.255586    5.927907
   smoke     2.664498   1.052379     2.48    0.013      1.228633    5.778414
     ptl     1.709129   .5924775     1.55    0.122      .8663666    3.371691
      ht     6.116391   4.215585     2.63    0.009       1.58425    23.61385
      ui      2.09936   .9699702     1.61    0.108      .8487997    5.192407
------------------------------------------------------------------------------

Comparing this model with our original model, we obtain

. lrtest
Logistic:  likelihood-ratio test                  chi2(1)     =       0.55
                                                  Prob > chi2 =     0.4577
By comparison, typing test race2=race3 after estimating our base model results in a significance level of .4572.
Saved Results
lrtest saves in r():

Scalars
    r(p)       two-sided p-value
    r(df)      degrees of freedom
    r(chi2)    chi-squared

Programmers desiring that an estimation command be compatible with lrtest should note that lrtest requires the following macros to be defined:

    e(cmd)     name of estimation command
    e(ll)      log-likelihood value
    e(df_m)    model degrees of freedom
    e(N)       number of observations
Methods and Formulas
lrtest is implemented as an ado-file.

Let L_0 and L_1 be the log-likelihood values associated with the full and constrained models, respectively. Then \chi^2 = -2(L_1 - L_0) with d_0 - d_1 degrees of freedom, where d_0 and d_1 are the model degrees of freedom associated with the full and constrained models (Judge et al. 1985, 216-217).
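The calculation is easy to reproduce by hand from the saved results. The following sketch uses the low birth weight models from the first example; the scalar name ll_full is ours, introduced only for illustration:

. logistic low age lwt race2 race3 smoke ptl ht ui
. scalar ll_full = e(ll)
. logistic low race2 race3 smoke ui
. display "chi2(4)     = " -2*(e(ll) - ll_full)
. display "Prob > chi2 = " chiprob(4, -2*(e(ll) - ll_full))

The displayed values should agree with what lrtest reported above: chi2(4) = 14.42 with a significance level of 0.0061.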
References
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Pérez-Hoyos, S. and A. Tobias. 1999. sg111: A modified likelihood-ratio test command. Stata Technical Bulletin 49: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 171-173.
Wang, Z. 2000. sg133: Sequential and drop one term likelihood-ratio tests. Stata Technical Bulletin 54: 46-47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 332-334.
Also See
Related:  [R] estimation commands, [R] linktest, [R] test, [R] testnl
Title
    ltable -- Life tables for survival data
Syntax
    ltable timevar [deadvar] [weight] [if exp] [in range] [, by(groupvar)
        survival failure hazard intervals(interval) test level(#)
        tvid(varname) noadjust notab graph graph_options noconf ]

fweights are allowed; see [U] 14.1.6 weight.
Description
ltable displays and graphs life tables for individual-level or aggregate data and optionally presents the likelihood-ratio and log-rank tests for equivalence of groups. ltable also allows examining the empirical hazard function through aggregation. Also see [R] st sts for alternative commands.

timevar specifies the time of failure or censoring. If deadvar is not specified, all values of timevar are interpreted as failure times; otherwise, timevar is interpreted as a failure time where deadvar ≠ 0 and as a censoring time otherwise. Observations with timevar or deadvar equal to missing are ignored.

Note carefully that deadvar does not specify the number of failures. An observation with deadvar equal to 1 or 50 has the same interpretation: the observation records one failure. Specify frequency weights for aggregated data (e.g., ltable time [freq=number]).
Options
by(groupvar) creates separate tables (or graphs within the same image) for each value of groupvar; groupvar may be string or numeric.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [R] level.

survival, failure, and hazard indicate the table to be displayed. If not specified, the default is the survival table. Specifying failure would display the cumulative failure table. Specifying survival failure would display both the survival and the cumulative failure table. If graph is specified, multiple tables may not be requested.

intervals(interval) specifies the time intervals into which the data are to be aggregated for tabular presentation. A single numeric argument is interpreted as the width of the interval. For instance, interval(2) aggregates data into the time intervals 0 ≤ t < 2, 2 ≤ t < 4, and so on. Not specifying interval() is equivalent to specifying interval(1). Since in most data failure times are recorded as integers, this amounts to no aggregation except that implied by the recording of the time variable, and so produces Kaplan-Meier product-limit estimates of the survival curve (with an actuarial adjustment; see the noadjust option below). Also see [R] st sts list. Although it is possible to examine survival and failure without aggregation, some form of aggregation is almost always required for examining the hazard.
When more than one argument is specified, time intervals are aggregated as specified. For instance, interval(0,2,8,16) aggregates data into the intervals 0 ≤ t < 2, 2 ≤ t < 8, 8 ≤ t < 16, and (if necessary) the open-ended interval t ≥ 16.
interval(w) is equivalent to interval(0,7,15,30,60,90,180,360,540,720), corresponding to one week, (roughly) two weeks, one month, two months, three months, six months, 1 year, 1.5 years, and 2 years when failure times are recorded in days. The w suggests widening intervals.

test presents two χ² measures of the differences between groups when by() is specified. test does nothing if by() is not specified.
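For instance, with the two-group rat data used in the examples below, the homogeneity tests could be requested like this (a sketch; see the Examples):

. ltable t died, by(group) test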
tvid(varname) is for use with longitudinal data with time-varying parameters as processed by cox; see [R] cox. Each subject appears in the data more than once, and equal values of varname identify observations referring to the same subject. When tvid() is specified, only the last observation on each subject is used in making the table. The order of the data does not matter, and "last" here means the last observation chronologically.

noadjust suppresses the actuarial adjustment for deaths and censored observations. The default is to consider the adjusted number at risk at the start of the interval as the total at the start minus (the number dead or censored)/2. If noadjust is specified, the number at risk is simply the total at the start, corresponding to the standard Kaplan and Meier assumption. noadjust should be specified when using ltable to list results corresponding to those produced by sts list; see [R] st sts list.

notab suppresses displaying the table. This option is often used with graph.
graph requests that the table be presented graphically as well as in tabular form; when notab is also specified, only the graph is presented. When specifying graph, only one table can be calculated and graphed at a time; see survival, failure, and hazard above.

graph_options are any of the options allowed with graph, twoway; see [G] graph options. When noconf is specified, twoway's connect() and symbol() options may be specified with one argument; the default is connect(l) symbol(O). When noconf is not specified, the connect() and symbol() options may be specified with one or three arguments. The default is connect(l||) and symbol(Oii), drawing the confidence band as vertical lines at each point. When you specify one argument, you modify the first argument of the default. When you specify three, you completely control the graph. Thus, connect(lll) would draw the confidence band as a separate curve around the survival, failure, or hazard.

noconf suppresses graphing the confidence intervals around survival, failure, or hazard.
Remarks
Life tables date back to the seventeenth century; Edmund Halley (1693) is often credited with their development. ltable is for use with "cohort" data and, although one often thinks of such tables as following a population from the "birth" of the first member to the "death" of the last, more generally such tables can be thought of as a reasonable way to list any kind of survival data. For an introductory discussion of life tables, see Pagano and Gauvreau (2000, 489-495); for an intermediate discussion, see, for example, Armitage and Berry (1994, 470-477) or Selvin (1996, 311-355); and for a more complete discussion, see Chiang (1984).
Example
In Pike (1966), two groups of rats were exposed to a carcinogen and the number of days to death from vaginal cancer was recorded (reprinted in Kalbfleisch and Prentice 1980, 2):
            Group 1                             Group 2
  143   164   188   188   190        142   155   163   198   205
  192   206   209   213   216        232   232   233   233   233
  220   227   230   234   246        233   239   240   261   280
  265   304   216*  244*             280   296   296   323   204*
                                     344*

The * on a few of the entries indicates that the observation was censored as of the recorded day: the rat had still not died due to vaginal cancer but was withdrawn from the experiment for other reasons.
Having entered these data into Stata, the first few observations are

. list in 1/5

        group       t    died
  1.        1     143       1
  2.        1     164       1
  3.        1     188       1
  4.        1     188       1
  5.        1     190       1
That is, the first observation records a rat from group 1 that died on the 143rd day. The variable died records whether that rat died or was withdrawn (censored):

. list if died==0

        group       t    died
 18.        1     216       0
 19.        1     244       0
 39.        2     204       0
 40.        2     344       0
Four rats, two from each group, did not die but were withdrawn. The survival table for group 1 is

. ltable t died if group==1

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  143    144        19        1      0     0.9474   0.0512     0.6812   0.9924
  164    165        18        1      0     0.8947   0.0704     0.6408   0.9726
  188    189        17        2      0     0.7895   0.0935     0.5319   0.9153
  190    191        15        1      0     0.7368   0.1010     0.4789   0.8810
  192    193        14        1      0     0.6842   0.1066     0.4279   0.8439
  206    207        13        1      0     0.6316   0.1107     0.3790   0.8044
  209    210        12        1      0     0.5789   0.1133     0.3321   0.7626
  213    214        11        1      0     0.5263   0.1145     0.2872   0.7188
  216    217        10        1      1     0.4709   0.1151     0.2410   0.6713
  220    221         8        1      0     0.4120   0.1148     0.1937   0.6194
  227    228         7        1      0     0.3532   0.1125     0.1502   0.5648
  230    231         6        1      0     0.2943   0.1080     0.1105   0.5070
  234    235         5        1      0     0.2355   0.1012     0.0751   0.4459
  244    245         4        0      1     0.2355   0.1012     0.0751   0.4459
  246    247         3        1      0     0.1570   0.0931     0.0312   0.3721
  265    266         2        1      0     0.0785   0.0724     0.0056   0.2864
  304    305         1        1      0     0.0000
The reported survival rates are the survival rates at the end of the interval. Thus, 94.7% of rats survived 144 days or more.
Technical Note
If you compare the table just printed with the corresponding table in Kalbfleisch and Prentice (1980, 14), you will notice that the survival estimates differ beginning with the interval 216-217, the first interval containing a censored observation. ltable treats censored observations as if they were withdrawn halfway through the interval. The table printed in Kalbfleisch and Prentice treated censored observations as if they were withdrawn at the end of the interval, even though Kalbfleisch and Prentice (1980, 15) mention how results could be adjusted for censoring.

In this case, the same results as printed in Kalbfleisch and Prentice could be obtained by incrementing the time of withdrawal by 1 for the four censored observations. We say "in this case" because there were no deaths on the incremented dates. For instance, one of the rats was withdrawn on the 216th day, a day on which there was also a real death. There were no deaths on day 217, however, so moving the withdrawal forward one day is equivalent to assuming the withdrawal occurred at the end of the day 216-217 interval. If the adjustments are made and ltable is used to calculate survival in both groups, the results are as printed in Kalbfleisch and Prentice, except that for group 2 in the interval 240-241 they report the survival as .345 when they mean .354.

In any case, the one-half adjustment for withdrawals is generally accepted, but it is important to remember that it is only a crude adjustment that becomes cruder the wider the intervals.
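If instead you want the pure Kaplan-Meier treatment of censored observations, specify noadjust; the results then correspond to those of sts list. A sketch, assuming the rat data are in memory (the stset step is our own addition; see [R] st):

. ltable t died if group==1, noadjust
. stset t, failure(died)
. sts list if group==1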
Example
When you do not specify the intervals, ltable uses unit intervals. The only aggregation performed on the data was aggregation due to deaths or withdrawals occurring on the same "day". If we wanted to see the table aggregated into 30-day intervals, we would type

. ltable t died if group==1, interval(30)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    150        19        1      0     0.9474   0.0512     0.6812   0.9924
  150    180        18        1      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    270         4        2      1     0.1063   0.0786     0.0139   0.3090
  300    330         1        1      0     0.0000

The interval printed 120 150 means the interval including 120 and up to but not including 150. The reported survival rate is the survival rate just after the close of the interval. When you specify more than one number as the argument to interval(), you specify not the widths but the cutoff points themselves.
. ltable t died if group==1, interval(120,180,210,240,330)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    180        19        2      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    330         4        3      1     0.0354   0.0486     0.0006   0.2245
If any of the underlying failure or censoring times are larger than the last cutoff specified, they are treated as being in the open-ended interval:
t died
interval(_20,180,210,240)
Beg. Total
Deaths
Lost
210
17
6
0
0.5789
0.1133
0.3321
0.76_6
240
11
6
1
0.2481
0.1009
0.0847
0.4552
4
3
1
0.0354
0.0486
0.0006
0.2245
_nterval
1 0
if group==l,
i! Io
Survival
Std. Error
[95Z Conf.
Int,]
ooo00o4 00o
Whether the last interval is treated as open-ended or not makes no difference for survival and failure tables, but it does affect hazard tables. If the interval is open-ended, the hazard is not calculated for it.
Example
The by(varname) option specifies that separate tables are to be presented for each value of varname. Remember that our rat dataset contains two groups:

. ltable t died, by(group) interval(30)

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
group = 1
  120    150        19        1      0     0.9474   0.0512     0.6812   0.9924
  150    180        18        1      0     0.8947   0.0704     0.6408   0.9726
  180    210        17        6      0     0.5789   0.1133     0.3321   0.7626
  210    240        11        6      1     0.2481   0.1009     0.0847   0.4552
  240    270         4        2      1     0.1063   0.0786     0.0139   0.3090
  300    330         1        1      0     0.0000
group = 2
  120    150        21        1      0     0.9524   0.0465     0.7072   0.9932
  150    180        20        2      0     0.8571   0.0764     0.6197   0.9516
  180    210        18        2      1     0.7592   0.0939     0.5146   0.8920
  210    240        15        7      0     0.4049   0.1099     0.1963   0.6053
  240    270         8        2      0     0.3037   0.1031     0.1245   0.5057
  270    300         6        4      0     0.1012   0.0678     0.0172   0.2749
  300    330         2        1      0     0.0506   0.0493     0.0035   0.2073
  330    360         1        0      1     0.0506   0.0493     0.0035   0.2073
Example
A failure table is simply a different way of looking at a survival table; failure is 1 − survival:

. ltable t died if group==1, interval(30) failure

                  Beg.                      Cum.      Std.
  Interval       Total   Deaths   Lost   Failure     Error    [95% Conf. Int.]
------------------------------------------------------------------------------
  120    150        19        1      0     0.0526   0.0512     0.0076   0.3188
  150    180        18        1      0     0.1053   0.0704     0.0274   0.3592
  180    210        17        6      0     0.4211   0.1133     0.2374   0.6679
  210    240        11        6      1     0.7519   0.1009     0.5448   0.9153
  240    270         4        2      1     0.8937   0.0786     0.6910   0.9861
  300    330         1        1      0     1.0000
Example
Selvin (1996, 332) presents follow-up data from Cutler and Ederer (1958) on six cohorts of kidney cancer patients. The goal is to estimate the 5-year survival probability.

  Year    Interval   Alive   Deaths   Lost   Withdrawn
  1946      0-1          9        4      1
            1-2          4        0      0
            2-3          4        0      0
            3-4          4        0      0
            4-5          4        0      0
            5-6          4        0      0           4
  1947      0-1         18        7      0
            1-2         11        0      0
            2-3         11        1      0
            3-4         10        2      2
            4-5          6        0      0           6
  1948      0-1         21       11      0
            1-2         10        1      2
            2-3          7        0      0
            3-4          7        0      0           7
  1949      0-1         34       12      0
            1-2         22        3      3
            2-3         16        1      0          15
  1950      0-1         19        5      1
            1-2         13        1      1          11
  1951      0-1         25        8      2          15

The following is the Stata dataset corresponding to the table:

. list

        year       t    died    pop
  1.    1946      .5       1      4
  2.    1946      .5       0      1
  3.    1946     5.5       0      4
  4.    1947      .5       1      7
  5.    1947     2.5       1      1
  etc.
As summary data may often come in the form shown above, it is worth understanding exactly how the data were translated for use with ltable. t records the time of death or censoring (lost to follow-up or withdrawal), died contains 1 if the observation records a death and 0 if it instead records lost or withdrawn patients, and pop records the number of patients in the category. The first line of the table stated that, in the 1946 cohort, there were 9 patients at the start of the interval 0-1 and, during the interval, 4 died and 1 was lost to follow-up. Thus, we entered in observation 1 that at t = .5, 4 patients died and, in observation 2, that at t = .5, 1 patient was censored. We ignored the information on the total population because ltable will figure that out for itself.
The second line of the table indicated that, in the interval 1-2, 4 patients were still alive at the beginning of the interval and, during the interval, 0 died or were lost to follow-up. Since no patients died or were censored, we entered nothing into our data. Similarly, we entered nothing for lines 3, 4, and 5 of the table. The last line for 1946 stated that, in the interval 5-6, 4 patients were alive at the beginning of the interval and that those 4 patients were withdrawn. In observation 3, we entered that there were 4 censorings at t = 5.5.
The fact that we chose to record the times of deaths or censoring as midpoints of intervals does not matter; we could just as well have recorded the times as 0.8 and 5.8. By default, ltable will form intervals 0-1, 1-2, and so on, and place observations into the intervals to which they belong. We suggest using 0.5 and 5.5 because those numbers correspond to the underlying assumptions made by ltable in making its calculations. Using midpoints reminds you of the assumptions.
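For instance, the three 1946 records could have been entered with input; this is only one of several ways the dataset might have been created:

. input year t died pop
  1946   .5  1  4
  1946   .5  0  1
  1946  5.5  0  4
  end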
To obtain the survival rates, we type

. ltable t died [freq=pop]

                  Beg.                               Std.
  Interval       Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
------------------------------------------------------------------------------
    0      1       126       47     19     0.5966   0.0455     0.5017   0.6792
    1      2        60        5     17     0.5386   0.0479     0.4405   0.6269
    2      3        38        2     15     0.5033   0.0508     0.4002   0.5977
    3      4        21        2      9     0.4423   0.0602     0.3225   0.5554
    4      5        10        0      6     0.4423   0.0602     0.3225   0.5554
    5      6         4        0      4     0.4423   0.0602     0.3225   0.5554
We estimate the 5-year survival rate as .4423 and the 95% confidence interval as .3225 to .5554.

Selvin (1996, 336), in presenting these results, lists the survival in the interval 0-1 as 1, in 1-2 as .597, in 2-3 as .539, and so on. That is, relative to us, he shifted the rates down one row and inserted a 1 in the first row. In his table, the survival rate is the survival rate at the start of the interval. In our table, the survival rate is the survival rate at the end of the interval (or, equivalently, at the start of the next interval). This is, of course, simply a difference in the way the numbers are presented and not in the numbers themselves.
Example
The discrete hazard function is the rate of failure: the number of failures occurring within a time interval divided by the width of the interval (assuming there are no censored observations). While the survival and failure tables are meaningful at the "individual" level, with intervals so narrow that each contains only a single failure, that is not true for the discrete hazard. If all intervals contained one death and if all intervals were of equal width, the hazard function would be 1/Δt and so appear to be a constant!

The empirically determined discrete hazard function can only be revealed by aggregation. Gross and Clark (1975, 37) print data on malignant melanoma at the M. D. Anderson Tumor Clinic between 1944 and 1960. The interval is the time from initial diagnosis:
  Interval    Number    Number lost     Number
  (years)     dying     to follow-up    withdrawn alive
    0-1         312          19               77
    1-2          96           3               71
    2-3          45           4               58
    3-4          29           3               27
    4-5           7           5               35
    5-6           9           1               36
    6-7           3           0               17
    7-8           1           2               10
    8-9           3           0                8
    9+           32           0                0
For our statistical purposes, there is no difference between the number lost to follow-up (patients who disappeared) and the number withdrawn alive (patients dropped by the researchers); both are censored. We have entered the data into Stata; here is a small amount of it:

. list in 1/6

           t     d    pop
  1.      .5     1    312
  2.      .5     0     19
  3.      .5     0     77
  4.     1.5     1     96
  5.     1.5     0      3
  6.     1.5     0     71

We entered each group's time of death or censoring as the midpoint of the intervals and entered the numbers of the table, recording d as 1 for deaths and 0 for censoring. The hazard table is

. ltable t d [freq=pop], hazard interval(0,1,2,3,4,5,6,7,8,9)
               Beg.     Cum.      Std.                Std.
  Interval    Total   Failure    Error    Hazard     Error    [95% Conf. Int.]
------------------------------------------------------------------------------
   0     1      913    0.3607   0.0163    0.4401    0.0243     0.3924   0.4877
   1     2      505    0.4918   0.0176    0.2286    0.0232     0.1831   0.2740
   2     3      335    0.5671   0.0182    0.1599    0.0238     0.1133   0.2064
   3     4      228    0.6260   0.0188    0.1461    0.0271     0.0931   0.1991
   4     5      169    0.6436   0.0190    0.0481    0.0182     0.0125   0.0837
   5     6      122    0.6746   0.0200    0.0909    0.0303     0.0316   0.1502
   6     7       76    0.6890   0.0208    0.0455    0.0262     0.0000   0.0969
   7     8       56    0.6952   0.0213    0.0202    0.0202     0.0000   0.0598
   8     9       43    0.7187   0.0235    0.0800    0.0462     0.0000   0.1705
   9            32    1.0000
We specified the interval() option as we did, and not as interval(1) (or omitting the option altogether), to force the last interval to be open-ended. Had we not, and if we had recorded t as 9.5 for observations in that interval (as we did), ltable would have calculated a hazard rate for the "interval". In this case, the result of that calculation would have been 2 but, no matter what the result, it would have been meaningless since we do not know the width of the interval.

You are not limited to merely examining a column of numbers. With the graph option, you can see the result graphically:

. ltable t d [freq=pop], hazard i(0,1,2,3,4,5,6,7,8,9) graph notab
>     xlab(0,2,4,6,8,10) border
 (figure omitted: graph of the estimated hazard against time (years), with the 95% confidence intervals drawn as vertical lines)
The vertical lines in the graph represent the 95% confidence intervals for the hazard; specifying noconf would have suppressed them. Among the options we did specify, although it is not required, notab suppressed printing the table, saving us some paper. xlab() and border were passed through to the graph command; see [G] graph options.
Example
You can graph the survival function the same way you graph the hazard function: just omit the hazard option.
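For instance, a survival-curve version of the hazard graph shown earlier would be (a sketch; the command is identical except that hazard is omitted):

. ltable t d [freq=pop], i(0,1,2,3,4,5,6,7,8,9) graph notab
>     xlab(0,2,4,6,8,10) border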
Methods and Formulas
ltable is implemented as an ado-file.

Let \tau_i be the individual failure or censoring times. The data are aggregated into intervals given by t_j, j = 1, \ldots, J, and t_{J+1} = \infty, with each interval containing counts for t_j \le \tau < t_{j+1}. Let d_j and m_j be the number of failures and censored observations during the interval and N_j the number alive at the start of the interval. Define n_j = N_j - m_j/2 as the adjusted number at risk at the start of the interval. If the noadjust option is specified, n_j = N_j.

The product-limit estimate of the survivor function is

    S_j = \prod_{k=1}^{j} \frac{n_k - d_k}{n_k}

(Kalbfleisch and Prentice 1980, 12, 15). Greenwood's formula for the asymptotic standard error of S_j is

    s_j = S_j \sqrt{ \sum_{k=1}^{j} \frac{d_k}{n_k (n_k - d_k)} }

(Greenwood 1926; Kalbfleisch and Prentice 1980, 14, 15). s_j is reported as the standard deviation of survival but is not used in generating the confidence intervals since it can produce intervals outside 0 and 1. The "natural" units for the survival function are \log(-\log S_j), and the asymptotic standard error of that quantity is

    \hat{s}_j = \sqrt{ \frac{ \sum_k d_k / \{ n_k (n_k - d_k) \} }{ \bigl[ \sum_k \log\{ (n_k - d_k)/n_k \} \bigr]^2 } }

(Kalbfleisch and Prentice 1980, 15). The corresponding confidence intervals are S_j^{\exp(\pm z_{1-\alpha/2}\,\hat{s}_j)}.

The cumulative failure time is defined as G_j = 1 - S_j, and thus the variance is the same as for S_j and the confidence intervals are 1 - S_j^{\exp(\pm z_{1-\alpha/2}\,\hat{s}_j)}. For purposes of graphing, both S_j and G_j are graphed against t_{j+1}.

Define the within-interval failure rate as f_j = d_j / n_j. The maximum likelihood estimate of the (within-interval) hazard is then

    \lambda_j = \frac{f_j}{(1 - f_j/2)(t_{j+1} - t_j)}

The standard error of \lambda_j is

    s_{\lambda_j} = \lambda_j \sqrt{ \frac{ 1 - \{ (t_{j+1} - t_j)\lambda_j/2 \}^2 }{ d_j } }

from which a confidence interval is calculated. For graphing purposes, \lambda_j is graphed against (t_j + t_{j+1})/2.

If the noadjust option is specified, the estimate of the hazard is

    \lambda_j = \frac{f_j}{t_{j+1} - t_j}

and its standard error is

    s_{\lambda_j} = \frac{\lambda_j}{\sqrt{d_j}}

The confidence interval is

    \left[ \; \frac{\lambda_j}{2 d_j}\,\chi^2_{2 d_j,\,\alpha/2}, \;\; \frac{\lambda_j}{2 d_j}\,\chi^2_{2 d_j,\,1-\alpha/2} \; \right]
where \chi^2_{2 d_j, q} is the qth quantile of the \chi^2 distribution with 2 d_j degrees of freedom (Cox and Oakes 1984, 53-54, 38-40).

For the likelihood-ratio test for homogeneity, let d_g be the total number of deaths in the gth group. Define T_g = \sum_{i \in g} \tau_i, where i indexes the individual failure or censoring times. The \chi^2 value with G - 1 degrees of freedom (where G is the total number of groups) is

    \chi^2 = 2 \left\{ \Bigl( \sum_g d_g \Bigr) \log\Bigl( \frac{\sum_g T_g}{\sum_g d_g} \Bigr) - \sum_g d_g \log\Bigl( \frac{T_g}{d_g} \Bigr) \right\}

(Lawless 1982, 113). The log-rank test for homogeneity is the test presented by sts test; see [R] st sts.
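As a numeric check of these formulas (our own arithmetic, not part of the manual's output), take the first row of the group 1 survival table above, where n_1 = 19 and d_1 = 1:

    S_1 = \frac{19 - 1}{19} = 0.9474, \qquad
    s_1 = S_1 \sqrt{\frac{1}{19 \times 18}} = 0.0512

both of which match the displayed values. Similarly, for the first melanoma interval, n_1 = 913 - (19 + 77)/2 = 865, so f_1 = 312/865 = 0.3607 (the cumulative failure shown) and \lambda_1 = 0.3607/(1 - 0.3607/2) = 0.4401, agreeing with the hazard table.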
Acknowledgments
ltable is based on the lftbl command by Henry Krakauer and John Stewart (1991). We also thank Michel Henry-Amar, Centre Regional Francois Baclesse, Caen, France, for his comments.

References
Armitage, P. and G. Berry. 1994. Statistical Methods in Medical Research. 3d ed. Oxford: Blackwell Scientific Publications.
Chiang, C. L. 1984. The Life Table and Its Applications. Malabar, FL: Krieger.
Cox, D. R. and D. Oakes. 1984. Analysis of Survival Data. London: Chapman and Hall.
Cutler, S. J. and F. Ederer. 1958. Maximum utilization of the life table method in analyzing survival. Journal of Chronic Diseases 8: 699-712.
Greenwood, M. 1926. The natural duration of cancer. Reports on Public Health and Medical Subjects 33: 1-26. London: His Majesty's Stationery Office.
Gross, A. J. and V. A. Clark. 1975. Survival Distributions: Reliability Applications in the Biomedical Sciences. New York: John Wiley & Sons.
Halley, E. 1693. An estimate of the degrees of mortality of mankind, drawn from curious tables of the births and funerals at the city of Breslau; with an attempt to ascertain the price of annuities upon lives. Philosophical Transactions 17: 596-610. London: The Royal Society.
Kahn, H. A. and C. T. Sempos. 1989. Statistical Methods in Epidemiology. New York: Oxford University Press.
Kalbfleisch, J. D. and R. L. Prentice. 1980. The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.
Krakauer, H. and J. Stewart. 1991. ssa1: Actuarial or life-table analysis of time-to-event data. Stata Technical Bulletin 1: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 200-202.
Lawless, J. F. 1982. Statistical Models and Methods for Lifetime Data. New York: John Wiley & Sons.
Pagano, M. and K. Gauvreau. 2000. Principles of Biostatistics. 2d ed. Pacific Grove, CA: Brooks/Cole.
Pike, M. C. 1966. A method of analysis of a certain class of experiments in carcinogenesis. Biometrics 22: 142-161.
Selvin, S. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.
Also See
Related:     [R] cox, [R] st, [R] weibull
Background:  Stata Graphics Manual
Title
    lv -- Letter-value display
Syntax
    lv [varlist] [if exp] [in range] [, generate tail(#) ]

by ... : may be used with lv; see [R] by.
Description
lv shows a letter-value display (Tukey 1977, 44-49; Hoaglin 1983) for each variable in varlist. If no variables are specified, letter-value displays are shown for each numeric variable in the data.
Options
generate adds four new variables to the data: _mid, containing the midsummaries; _spread, containing the spreads; _psigma, containing the pseudosigmas; and _z2, containing the squared values from a N(0,1) corresponding to the particular letter value. If the variables _mid, _spread, _psigma, and _z2 already exist, their contents are replaced. At most, only the first 11 observations of each variable are used; the remaining observations contain missing. If varlist specifies more than one variable, the newly created variables contain results for the last variable specified. The generate option may not be used with by ... :.

tail(#) indicates the inverse of the tail density through which letter values are to be displayed: 2 corresponds to the median (meaning half in each tail), 4 to the fourths (roughly the 25th and 75th percentiles), 8 to the eighths, and so on. # may be specified as 4, 8, 16, 32, 64, 128, 256, 512, or 1,024 and defaults to a value of # that has corresponding depth just greater than 1. The default is taken as 1,024 if the calculation results in a number larger than 1,024. Given the intelligent default, this option is rarely specified.
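For instance, to restrict the display to the median and fourths only, one might type (a sketch, assuming a variable named mpg is in memory):

. lv mpg, tail(4)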
Remarks
Letter-value displays are a collection of observations drawn systematically from the data, focusing especially on the tails rather than the middle of the distribution. The displays are called letter-value displays because letters have been (almost arbitrarily) assigned to tail densities:

    Letter   Tail Area        Letter   Tail Area
      M        1/2              B        1/64
      F        1/4              A        1/128
      E        1/8              Z        1/256
      D        1/16             Y        1/512
      C        1/32             X        1/1024
Example
You have data on the mileage ratings of 74 automobiles. To obtain a letter-value display:

. lv mpg

        #  74        Mileage (mpg)
   -------------------------------------------------------------
   M    37.5               20             spread    pseudosigma
   F    19          18    21.5    25           7       5.216359
   E    10          15    21.5    28          13       5.771728
   D     5.5        14    22.25   30.5      16.5       5.576303
   C     3          14    24.5    35          21       5.831039
   B     2          12    23.5    35          23       5.732448
   A     1.5        12    25      38          26       6.040635
         1          12    26.5    41          29        6.16562

                                          # below        # above
   inner fence       7.5        35.5            0              1
   outer fence      -3          46              0              0
The decimal points can be made to line up, and thus the output made more readable, by specifying a display format for the variable; see [U] 15.5 Formats: controlling how data are displayed.

. format mpg %9.2f
. lv mpg

        #  74        Mileage (mpg)
   -------------------------------------------------------------
   M    37.5              20.00           spread    pseudosigma
   F    19        18.00   21.50   25.00     7.00           5.22
   E    10        15.00   21.50   28.00    13.00           5.77
   D     5.5      14.00   22.25   30.50    16.50           5.58
   C     3        14.00   24.50   35.00    21.00           5.83
   B     2        12.00   23.50   35.00    23.00           5.73
   A     1.5      12.00   25.00   38.00    26.00           6.04
         1        12.00   26.50   41.00    29.00           6.17

                                          # below        # above
   inner fence      7.50       35.50            0              1
   outer fence     -3.00       46.00            0              0
At the top, the number of observations is indicated as 74. The first line shows the statistics associated with M, the letter value that puts half the density in each tail, or the median. The median has depth 37.5 (that is, in the ordered data, M is 37.5 observations in from the extremes) and has value 20. The next line shows the statistics associated with F, or the fourths. The fourths have depth 19 (that is, in the ordered data, the lower fourth is observation 19 and the upper fourth is observation 74 - 19 + 1), and the values of the lower and upper fourths are 18 and 25. The number in the middle is the point halfway between the fourths, called a midsummary. If the distribution were perfectly symmetric, the midsummary would equal the median. The spread is the difference between the lower and upper summaries (25 - 18 = 7). For fourths, half of the data lies within a 7-mpg band. The pseudosigma is a calculation of the standard deviation using only the lower and upper summaries and assuming that the variable is normally distributed. If the data really were normally distributed, all the pseudosigmas would be roughly equal.

After the letter values, the line labeled with depth 1 reports the minimum and maximum values. In this case, the halfway point between the extremes is 26.5, which is greater than the median, indicating that 41 is more extreme than 12, at least relative to the median. Also note that, with each letter value, the midsummaries are increasing: our data are skewed. The pseudosigmas are also increasing, indicating that the data are spreading out relative to a normal distribution although, given the evident skewness, this elongation may be an artifact of the skewness.
At the end is an attempt to identify outliers, although the points so identified are merely outside some predetermined cutoff. Points outside the inner fence are called outside values or mild outliers. Points outside the outer fence are called severe outliers. The inner fence is defined as (3/2)IQR and the outer fence as 3 IQR above and below the F summaries, where the IQR is the spread of the fourths.
Technical Note
The form of the letter-value display has varied slightly with different authors. lv displays appear as described by Hoaglin (1983) but as modified by Emerson and Stoto (1983), where they included the midpoint of each of the spreads. This format was later adopted by Hoaglin (1985). If the distribution is symmetric, the midpoints will all be roughly equal. On the other hand, if the midpoints vary systematically, the distribution is skewed.

The pseudosigmas are obtained from the lower and upper summaries for each letter value. For each letter value, they are the standard deviation a normal distribution would have if its spread for the given letter value were to equal the observed spread. If the pseudosigmas are all roughly equal, the data are said to have neutral elongation. If the pseudosigmas increase systematically, the data are said to be more elongated than a normal, i.e., have thicker tails. If the pseudosigmas decrease systematically, the data are said to be less elongated than a normal, i.e., have thinner tails.

Interpretation of the number of mild and severe outliers is more problematic. The following discussion is drawn from Hamilton (1991):

Obviously, the presence of any such outliers does not rule out that the data have been drawn from a normal; in large datasets, there will most certainly be observations outside (3/2)IQR and 3 IQR. Severe outliers, however, comprise about two per million (.0002%) of a normal population. In samples, they lie far enough out to have substantial effects on means, standard deviations, and other classical statistics. The .0002%, however, should be interpreted carefully: outliers appear more often in small samples than one might expect from population proportions due to sampling variation in estimated quartiles. Monte Carlo simulation by Hoaglin, Iglewicz, and Tukey (1986) obtained these results on the percentages and numbers of outliers in random samples from a normal population:

           percentage of outliers       number of outliers
    n         any       severe            any      severe
    10       2.83        .362            .283       .0362
    20       1.66        .074            .332       .0148
    50       1.15        .011            .575       .0055
    100       .95        .002            .95        .002
    200       .79        .001           1.58        .002
    300       .75        .001           2.25        .003
    inf       .70        .0002           inf         inf

Thus, the presence of any severe outliers in samples of less than 300 is sufficient to reject normality. Hoaglin, Iglewicz, and Tukey (1981) suggested the approximation .00698 + .4/n for the fraction of mild outliers in a sample of size n or, equivalently, .00698n + .4 for the number of outliers.
Example
The generate option adds the variables _mid, _spread, _psigma, and _z2 to your data, making possible many of the diagnostic graphs suggested by Hoaglin (1985).

. lv mpg, generate
 (output omitted)
. list _mid _spread _psigma _z2 in 1/12

         _mid   _spread    _psigma        _z2
  1.       20         .          .          .
  2.     21.5         7   5.216359   .4501955
  3.     21.5        13   5.771728    1.26828
  4.    22.25      16.5   5.576303   2.188846
  5.     24.5        21   5.831039    3.24255
  6.     23.5        23   5.732448   4.024532
  7.       25        26   6.040635   4.631499
  8.        .         .          .          .
  9.        .         .          .          .
 10.        .         .          .          .
 11.     26.5        29    6.16562    5.53073
 12.        .         .          .          .

Observations 12 through the end are missing for these new variables. The definition of the observations is always the same. The first observation contains the M summary, the second the F, the third the E, and so on. Observation 11 always contains the summary for depth 1. Observations 8 through 10, corresponding to letter values Z, Y, and X, contain missing because these statistics were not calculated: we have only 74 observations and their depth would be 1.
Hoaglin (1985) suggests graphing the midsummary against z². If the distribution is not skewed, the points in the resulting graph will be along a horizontal line:

. graph _mid _z2, border ylabel xlabel

 (figure omitted: scatterplot of the midsummaries of mpg against Z squared)

The graph clearly indicates the skewness of the distribution. One might also graph _psigma against _z2 to examine elongation.
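That companion graph, by analogy with the command just shown, would be (a sketch):

. graph _psigma _z2, border ylabel xlabel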
Saved Results
lv saves in r():

Scalars
    r(N)         number of observations     r(u_C)    upper 32nd
    r(min)       minimum                    r(l_B)    lower 64th
    r(max)       maximum                    r(u_B)    upper 64th
    r(median)    median                     r(l_A)    lower 128th
    r(l_F)       lower 4th                  r(u_A)    upper 128th
    r(u_F)       upper 4th                  r(l_Z)    lower 256th
    r(l_E)       lower 8th                  r(u_Z)    upper 256th
    r(u_E)       upper 8th                  r(l_Y)    lower 512th
    r(l_D)       lower 16th                 r(u_Y)    upper 512th
    r(u_D)       upper 16th                 r(l_X)    lower 1024th
    r(l_C)       lower 32nd                 r(u_X)    upper 1024th

The lower/upper 8ths, 16ths, ..., 1024ths will be defined only if there are sufficient data.
Methods and Formulas
lv is implemented as an ado-file.

Let N be the number of (nonmissing) observations on x, and let x_{(i)} refer to the ordered data when i is an integer. Define x_{(i+.5)} = (x_{(i)} + x_{(i+1)})/2; the median is defined as x_{((N+1)/2)}.

Define x[d] as the pair of numbers x_{(d)} and x_{(N+1-d)}, where d is called the depth. Thus, x[1] refers to the minimum and maximum of the data. Define m = (N+1)/2 as the depth of the median, f = (\lfloor m \rfloor + 1)/2 as the depth of the fourths, e = (\lfloor f \rfloor + 1)/2 as the depth of the eighths, and so on. Depths are reported on the far left of the letter-value display. The corresponding fourths of the data are x[f], the eighths x[e], and so on. These values are reported inside the display. The middle value is defined as the corresponding midpoint of x[\cdot]. The spreads are defined as the difference in x[\cdot].

The corresponding point z_i on a standard normal distribution is obtained as (Hoaglin 1985, 456-457)

    z_i = F^{-1}\{ (d_i - 1/3)/(N + 1/3) \}    if d_i > 1
    z_i = F^{-1}\{ 0.695/(N + 0.390) \}        otherwise

where d_i is the depth of the letter value. The corresponding pseudosigma is obtained as the ratio of the spread to -2 z_i (Hoaglin 1985, 431).

Define (F_l, F_u) = x[f]. The inner fence has cutoffs F_l - (3/2)(F_u - F_l) and F_u + (3/2)(F_u - F_l). The outer fence has cutoffs F_l - 3(F_u - F_l) and F_u + 3(F_u - F_l).

The inner-fence values reported by lv are almost exactly equal to those used by graph, box to identify outside points. The only difference is that graph uses a slightly different definition of fourths: namely, the 25th and 75th percentiles as defined by summarize.
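To tie these definitions to the example above (our own arithmetic): with N = 74, m = 75/2 = 37.5, f = (37 + 1)/2 = 19, e = (19 + 1)/2 = 10, and d = (10 + 1)/2 = 5.5, exactly the depths displayed. For the fourths, z_F = F^{-1}\{(19 - 1/3)/(74 + 1/3)\} = F^{-1}(0.2511) \approx -0.671, so the pseudosigma is 7/(2 \times 0.671) \approx 5.216, agreeing with the displayed 5.216359.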
References
Emerson, J. D. and M. A. Stoto. 1983. Transforming data. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 97-128. New York: John Wiley & Sons.
Fox, J. 1990. Describing univariate distributions. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 58-125. Newbury Park, CA: Sage Publications.
Hamilton, L. C. 1991. sed4: Resistant normality check and outlier identification. Stata Technical Bulletin 3: 15-18. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 86-90.
Hoaglin, D. C. 1983. Letter values: a set of selected order statistics. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 33-57. New York: John Wiley & Sons.
--. 1985. Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 417-460. New York: John Wiley & Sons.
Hoaglin, D. C., B. Iglewicz, and J. W. Tukey. 1981. Small-sample performance of a resistant rule for outlier detection. In 1980 Proceedings of the Statistical Computing Section, 144-152. Washington, DC: American Statistical Association.
--. 1986. Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association 81: 991-999.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company.
Also See
Related:  [R] diagplots, [R] stem, [R] summarize
Title
    matsize -- Set the maximum number of variables in a model
Syntax
    set matsize #

where 10 ≤ # ≤ 800
Description
set matsize sets the maximum number of variables that can be included in any of Stata's model-estimation commands. The command may not be used with Small Stata; there, matsize is permanently frozen at 40. For Intercooled Stata, the initial value is 40, but it may be changed upward or downward. The upper limit is 800.
Remarks
set matsize controls the internal size of matrices that Stata uses. The default of 40, for instance, means that linear regression models are limited to 38 independent variables: 38 because the constant uses one position and the dependent variable another, making a total of 40.

Under Stata for Macintosh, there must be no data in memory when you change matsize, and increasing matsize decreases the amount of memory available for data. Under Stata for Windows and Stata for Unix, you may change matsize with data in memory, but increasing matsize increases the amount of memory consumed by Stata, increasing the probability of page faults and thus of making Stata run more slowly.
Example
You wish to estimate a model of y on the variables x1 through x100. Without thinking, you type

. regress y x1-x100
matsize too small; type -help matsize-
r(908);

You realize that you need to increase matsize; you are using Intercooled Stata and type

. set matsize 150
no; data in memory would be lost
r(4);

If you are using Stata for Macintosh, you must

. save mydata
file mydata.dta saved
. drop _all
. set matsize 150
. use mydata
. regress y x1-x100
 (output omitted)

Under Stata for Windows or Stata for Unix, you do not have to go to that trouble:

. regress y x1-x100
matsize too small; type -help matsize-
r(908);
. set matsize 150
. regress y x1-x100
 (output omitted)
Also See
Related:     [R] memory
Background:  [U] 7 Setting the size of memory
Title
    maximize -- Details of iterative maximization
Syntax
    mle_cmd ... [, [no]log trace gradient hessian showstep iterate(#)
        tolerance(#) ltolerance(#) gtolerance(#) difficult
        from(init_specs) ]

where init_specs is one of
    matname [, skip copy]
    { [eqname:]name=# | /eqname=# } [...]
    # [# ...], copy

Description
Stata has two maximum likelihood optimizers: one is used by internally coded commands and the other is the ml command used by estimators implemented as ado-files. Both optimizers use the Newton-Raphson method with step halving (to avoid downhill steps) and special fixups when nonconcave regions of the likelihood are encountered. The two optimizers are similar but differ in the details of their implementation.

For information on programming maximum likelihood estimators in ado-files, see [R] ml and Maximum Likelihood Estimation with Stata (Gould and Sribney 1999).
Options
log and nolog specify whether an iteration log showing the progress of the log likelihood is to be displayed. For most commands, the log is displayed by default and nolog suppresses it. For a few commands (such as the svy maximum likelihood estimators), it is the opposite; you must specify log to see the log.

trace adds to the iteration log a display of the current parameter vector.

gradient (ml-programmed estimators only) adds to the iteration log a display of the current gradient vector.

hessian (ml-programmed estimators only) adds to the iteration log a display of the current negative Hessian matrix.

showstep (ml-programmed estimators only) adds to the iteration log a report on the steps within an iteration. This option was added so that developers at Stata could view the stepping when they were improving the ml optimizer code. At this point, it mainly provides entertainment.

iterate(#) specifies the maximum number of iterations. When the number of iterations equals iterate(), the optimizer stops and presents the current results. If convergence is declared before this threshold is reached, it will stop when convergence is declared. Specifying iterate(0) is useful for viewing results evaluated at the initial value of the coefficient vector. iterate(0) and from() specified together allow one to view results evaluated at a specified coefficient vector; note, however, that only a few commands allow the from() option. iterate(16000) is the default for both estimators programmed internally and estimators programmed with ml.
tolerance(#) specifies the tolerance for the coefficient vector. When the relative change in the coefficient vector from one iteration to the next is less than or equal to tolerance(), estimates are declared to have converged. If this criterion is satisfied, convergence is declared regardless of the status of the likelihood tolerance ltolerance(). tolerance(1e-4) is the default for estimators programmed internally in Stata. tolerance(1e-6) is the default for estimators programmed with ml.

ltolerance(#) specifies the tolerance for the log likelihood. When the relative change in the log likelihood from one iteration to the next is less than or equal to ltolerance(), estimates are declared to have converged. If this criterion is satisfied, convergence is declared regardless of the status of the coefficient-vector tolerance tolerance(). ltolerance(0) is the default for estimators programmed internally in Stata. ltolerance(1e-7) is the default for estimators programmed with ml.

gtolerance(#) (ml-programmed estimators only) specifies an optional tolerance for the gradient relative to the coefficients. When |g_i b_i| < gtolerance() for all parameters b_i and the corresponding elements of the gradient g_i, the gradient tolerance criterion is met. Unlike tolerance() and ltolerance(), the gtolerance() criterion must be met in addition to any other tolerance. That is, convergence is declared when gtolerance() is met and tolerance() or ltolerance() is met. The gtolerance() option is provided for particularly deceptive likelihood functions that may trigger premature declarations of convergence. The option must be specified for gradient checking to be activated; by default, the gradient is not checked.

difficult (ml-programmed estimators only) specifies that the likelihood function is likely to be difficult to maximize due to nonconcave regions. When the message "not concave" appears repeatedly, ml's standard stepping algorithm may not be working well. difficult specifies that a different stepping algorithm is to be used in nonconcave regions. There is no guarantee that difficult will work better than the default; sometimes it is better, sometimes it is worse. The difficult option should only be attempted when the default stepper declares convergence and the last iteration is "not concave", or when the default stepper is repeatedly issuing "not concave" messages and only producing tiny improvements in the log likelihood.

from() specifies initial values for the coefficients. Note that only a few estimators in Stata currently support this option. The initial values can be specified in one of three ways: by specifying the name of a vector containing the initial values (e.g., from(b0), where b0 is a properly labeled vector); by specifying coefficient names with the values (e.g., from(age=2.1 /sigma=7.4)); or by specifying a list of values (e.g., from(2.1 7.4, copy)). from() is intended for use when doing bootstraps (see [R] bstrap) and in other special situations (e.g., used with iterate(0)). Even when the values specified in from() are close to the values that maximize the likelihood, only a few iterations may be saved. Poor values in from() may lead to convergence problems.

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
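For instance, here is a sketch of evaluating, but not maximizing, a likelihood at chosen starting values; mle_cmd stands for any estimator that supports from(), and the vector b0 and its values are hypothetical:

. matrix b0 = (2.1, 7.4)
. mle_cmd y x1, from(b0, copy) iterate(0)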
Remarks
Only in rare circumstances would a user ever need to specify any of these options, with the exception of nolog. nolog is useful for reducing the amount of output appearing in log files.

The following is an example of an iteration log:
Iteration 0:  log likelihood = -3791.0251
Iteration 1:  log likelihood =  -3761.738
Iteration 2:  log likelihood = -3758.0632  (not concave)
Iteration 3:  log likelihood = -3758.0447
Iteration 4:  log likelihood = -3757.5861
Iteration 5:  log likelihood =  -3757.474
Iteration 6:  log likelihood = -3757.4613
Iteration 7:  log likelihood = -3757.4606
Iteration 8:  log likelihood = -3757.4606
 (table of results omitted)
At iteration 8, the model converged. The only notable thing about this iteration log is the message "not concave" at the second iteration. This example was produced using the heckman command; its likelihood is not globally concave, so it is not surprising that this message sometimes appears. The other message that is occasionally seen is "backed up". Neither of these messages should be of any concern unless they appear at the final iteration.

If a "not concave" message appears at the last step, there are two possibilities. One is that it is a "valid" result, but there is collinearity in the model that the command did not catch. Stata checks for obvious collinearity among the independent variables prior to performing the maximization, but strange collinearities or near collinearities can sometimes arise between coefficients and ancillary parameters. The second cause for a "not concave" message at the final step is that the optimizer entered a very flat region of the likelihood and prematurely declared convergence.

If a "backed up" message appears at the last step, there are also two possibilities. One is that it found a perfect maximum and it could not step to a better point; if this is the case, all is fine, but this is a highly unlikely occurrence. The second is that the optimizer worked itself into a bad concave spot where the computed gradient and Hessian gave a bad direction for stepping.

If either of these messages appears at the last step, do the maximization again with the gradient option. If the gradient goes to zero, the optimizer has found a maximum that may not be unique but is a maximum. From the standpoint of maximum likelihood estimation, it is a valid result. If the gradient is not zero, it is not a valid result, and you should try the following: Try tightening up the convergence criterion. Try ltol(0) tol(1e-7) or gtol(0.1) (with the default ltol() tol()) and see if the optimizer can work its way out of the bad region.

If you get repeated "not concave" steps with little progress being made at each step, try specifying the difficult option. Sometimes difficult works wonderfully, reducing the number of iterations and producing convergence at a good (i.e., concave) point. Other times, difficult works poorly, taking much longer to converge than the default stepper.
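A sketch of those diagnostic re-runs, again with mle_cmd standing in for whatever estimator you are using:

. mle_cmd y x1 x2, gradient           (inspect the gradient at the maximum)
. mle_cmd y x1 x2, ltol(0) tol(1e-7)  (tighten the convergence criterion)
. mle_cmd y x1 x2, difficult          (alternative stepper for nonconcave regions)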
Saved Results
Maximum likelihood estimators save in e():

Scalars
    e(N)          number of observations                 always saved
    e(df_m)       model degrees of freedom               always saved
    e(r2_p)       pseudo R-squared                       sometimes saved
    e(ll)         log likelihood                         always saved
    e(ll_0)       log likelihood, constant-only model    usually saved
    e(N_clust)    number of clusters                     saved when cluster() is specified; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(chi2)       chi-squared                            usually saved
    e(ic)         number of iterations                   always saved
    e(rank)       rank of e(V)                           always saved
    e(rank0)      rank of e(V) for constant-only model   saved when constant-only model is estimated

Macros
    e(cmd)        name of command                        always saved
    e(depvar)     name(s) of dependent variable(s)       usually saved
    e(wtype)      weight type                            saved when weights are specified or implied
    e(wexp)       weight expression                      saved when weights are specified or implied
    e(clustvar)   name of cluster variable               saved when cluster() is specified; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(vcetype)    covariance estimation method           saved when robust is specified or implied; see
                                                           [U] 23.11 Obtaining robust variance estimates
    e(user)       name of likelihood-evaluator program   always saved
    e(opt)        type of optimization                   always saved
    e(chi2type)   Wald or LR; type of model χ² test      usually saved
    e(predict)    program used to implement predict      usually saved
    e(cnslist)    constraint numbers                     saved when there are constraints

Matrices
    e(b)          coefficient vector                     always saved
    e(ilog)       iteration log (up to 20 iterations)    always saved
    e(V)          variance-covariance matrix of the
                    estimators                           always saved

Functions
    e(sample)     marks estimation sample                always saved

See the Saved Results section in the manual entry for any maximum likelihood estimator for a complete list of returned results.
Methods and Formulas

Let L1 be the log likelihood of the full model (i.e., the log-likelihood value shown on the output), and let L0 be the log likelihood of the "constant-only" model. The likelihood-ratio chi-squared model test is defined as 2(L1 - L0). The pseudo R-squared (Judge et al. 1985) is defined as 1 - L1/L0. This is simply the log likelihood on a scale where 0 corresponds to the "constant-only" model and 1 corresponds to perfect prediction for a discrete model (i.e., the predicted probabilities are all 1 and the overall log likelihood is 0).
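Both formulas can be reproduced by hand from the saved results described above; a minimal sketch, again assuming the automobile dataset, where the first display reproduces the reported LR chi-squared statistic and the second the pseudo R-squared:

    . logit foreign mpg weight
    . display 2*(e(ll) - e(ll_0))
    . display 1 - e(ll)/e(ll_0)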
By default, Stata's maximum likelihood estimators display standard errors based on variance estimates given by the inverse of the negative Hessian (second derivative) matrix. If robust, cluster(), or pweights are specified, standard errors are instead based on the robust variance estimator (see [U] 23.11 Obtaining robust variance estimates); in this case, likelihood-ratio tests are not appropriate (see [U] 30 Overview of survey estimation), and the model chi-squared is a Wald test.

Some maximum likelihood routines can report coefficients in an exponentiated form; e.g., odds ratios in logistic. Let b be the unexponentiated coefficient, s its standard error, and b0 and b1 the reported confidence interval for b. In exponentiated form, the point estimate is exp(b), the standard error exp(b)*s, and the confidence interval [exp(b0), exp(b1)]. The displayed Z statistics and p-values are the same as those for the unexponentiated results. This is justified since exp(b) = 1 and b = 0 are equivalent hypotheses, and normality is more likely to hold in the b metric.
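For example, the relationship between the two metrics can be checked after logit, whose exponentiated coefficients are the odds ratios that logistic reports; a minimal sketch, assuming the automobile dataset. The first display reproduces the odds ratio for mpg and the second its displayed standard error, exp(b)*s:

    . logit foreign mpg weight
    . display exp(_b[mpg])
    . display exp(_b[mpg])*_se[mpg]
    . logistic foreign mpg weight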
References

Gould, W. and W. Sribney. 1999. Maximum Likelihood Estimation with Stata. College Station, TX: Stata Press.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Also See

Complementary:   [R] lrtest, [R] ml

Background:      [U] 23 Estimation and post-estimation commands
Title

means -- Arithmetic, geometric, and harmonic means
Syntax

means [varlist] [weight] [if exp] [in range] [, add(#) only level(#) ]

by ... : may be used with means; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

means computes the arithmetic, geometric, and harmonic means, and corresponding confidence intervals, for each variable in varlist or for all the variables in the data if varlist is not specified. If you simply want arithmetic means and corresponding confidence intervals, see [R] ci.
Options

add(#) adds the value # to each variable in varlist before computing the means and confidence intervals. This is useful when analyzing variables with nonpositive values.

only modifies the action of the add(#) option. If specified, the add(#) option only adds # to variables with at least one nonpositive value.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Remarks

> Example

You have a dataset containing 8 observations on a variable named x. The eight values are 5, 4, -4, -5, 0, 0, missing, and 7.

. means x

    Variable    Type             Obs        Mean    [95% Conf. Interval]
    ---------------------------------------------------------------------
    x           Arithmetic         7           1    -3.204405    5.204405
                Geometric          3    5.192494      2.57899    10.45448
                Harmonic           3    5.060241     3.023008     15.5179

. means x, add(5)

    Variable    Type             Obs        Mean    [95% Conf. Interval]
    ---------------------------------------------------------------------
    x(*)        Arithmetic         7           6     1.795595    10.20441
                Geometric          6    5.477226       2.1096    14.22071
                Harmonic           6    3.540984            .           .

    (*) 5 was added to the variable(s) prior to calculating the results.
    Missing values in the confidence interval for the harmonic mean indicate
    that the confidence interval is undefined for the corresponding
    variable(s). Consult the Reference Manual for details.
The number of observations displayed for the arithmetic mean is the number of nonmissing observations. The number of observations displayed for the geometric and harmonic means is the number of nonmissing, positive observations. Specifying the add(5) option results in 3 additional positive observations. Note that the confidence interval for the harmonic mean is not reported; see Methods and Formulas below.
Saved Results

means saves in r():

Scalars
    r(N)         number of nonmissing observations; used for arithmetic mean
    r(N_pos)     number of nonmissing positive observations; used for
                 geometric and harmonic means
    r(mean)      arithmetic mean
    r(lb)        lower bound of confidence interval for arithmetic mean
    r(ub)        upper bound of confidence interval for arithmetic mean
    r(Var)       variance of untransformed data
    r(mean_g)    geometric mean
    r(lb_g)      lower bound of confidence interval for geometric mean
    r(ub_g)      upper bound of confidence interval for geometric mean
    r(Var_g)     variance of ln x_j
    r(mean_h)    harmonic mean
    r(lb_h)      lower bound of confidence interval for harmonic mean
    r(ub_h)      upper bound of confidence interval for harmonic mean
    r(Var_h)     variance of 1/x_j
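For instance, the geometric-mean results can be picked up immediately after means; a minimal sketch, assuming the automobile dataset (price is strictly positive, so all three means exist):

    . use auto
    (1978 Automobile Data)
    . means price
    . display r(mean_g)
    . display r(lb_g)
    . display r(ub_g)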
Methods and Formulas

means is implemented as an ado-file.

See, for example, Armitage and Berry (1994) or Snedecor and Cochran (1989). For a history of the concept of the mean, see Plackett (1958).

When restricted to the same set of values (i.e., to positive values), the arithmetic mean is greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean. Exact equality holds only if all values within a sample are equal to a positive constant.

The arithmetic mean and its confidence interval are identical to those provided by ci; see [R] ci.

To compute the geometric mean, means first creates u_j = ln(x_j) for all positive x_j. The arithmetic mean of the u_j and its confidence interval are then computed as in ci. Let g be the resulting mean, and let [L, U] be the corresponding confidence interval. The geometric mean is then exp(g), and its confidence interval is [exp(L), exp(U)].

The same procedure is followed for the harmonic mean, except in this case u_j = 1/x_j. The harmonic mean is then 1/g, and its confidence interval is [1/U, 1/L] if L is greater than zero. If L is not greater than zero, this confidence interval is not defined, and missing values are reported.

When weights are specified, means applies the weights to the transformed values, u_j = ln(x_j) and u_j = 1/x_j respectively, when computing the geometric and harmonic means. For details on how the weights are used to compute the mean and variance of the u_j, see [R] summarize. Without weights, the formula for the geometric mean reduces to
    exp{ (1/n) * sum_j ln(x_j) }

Without weights, the formula for the harmonic mean is

    n / sum_j (1/x_j)
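The transform-and-back-transform procedure can be reproduced by hand with ci; a minimal sketch, assuming the automobile dataset. The three displayed values should match the Geometric row that means price reports:

    . use auto, clear
    . generate double u = ln(price)
    . ci u
    . display exp(r(mean))
    . display exp(r(lb))
    . display exp(r(ub))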
Acknowledgments

This improved version of means is based on the gci command (Carlin, Vidmar, and Ramalheira 1998) and was written by John Carlin, University of Melbourne, Australia; Suzanna Vidmar, University of Melbourne, Australia; and Carlos Ramalheira, Coimbra University Hospital, Portugal.
References

Armitage, P. and G. Berry. 1994. Statistical Methods in Medical Research. 3d ed. Oxford: Blackwell Scientific Publications.

Carlin, J., S. Vidmar, and C. Ramalheira. 1998. sg75: Geometric means and confidence intervals. Stata Technical Bulletin 41: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 197-199.

Kotz, S. and N. L. Johnson, ed. 1985. Encyclopedia of Statistical Sciences, vol. 1 and vol. 3. New York: John Wiley & Sons.

Plackett, R. L. 1958. The principle of the arithmetic mean. Biometrika 45: 130-135.

Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Also See

Related:   [R] ci, [R] summarize
Title

memory -- Memory size considerations
Syntax

    set memory #[k|m]

    memory

    set virtual { on | off }

where # is specified in terms of kilobytes or megabytes.
Description

set memory, memory, and set virtual are relevant only if you are using Intercooled Stata.

set memory allows you to increase or decrease the amount of memory allocated to Stata while Stata is running. Increases are obtained from the operating system; decreases are returned to the operating system. set memory can be specified only if you are using Stata for Windows or Stata for Unix. Stata for Macintosh users must instead set the amount of memory when they invoke Stata; see [GSM] A.5 Specifying the amount of memory allocated.

memory displays a report on Stata's memory usage. memory is available on all Intercooled Statas regardless of platform.

set virtual controls whether Stata should perform extra work to arrange its memory to keep objects close together. By default, virtual is set off. set virtual is available on all Intercooled Statas regardless of platform.

Remarks

Remarks are presented under the headings

    Resetting the amount of memory
    Obtaining the memory report and how Stata uses memory
    Using virtual memory

If you use Stata for Macintosh, skip the first heading.
Resetting the amount of memory

If you use Stata for Windows or Stata for Unix, you can change the amount of memory Stata has allocated while Stata is running:
. set memory 4m
no; data in memory would be lost
r(4);

You can change the amount of memory, but only when there are no data in memory:

. drop _all
. set memory 4m
(4096k)

You can increase it

. set memory 32m
(32768k)

or decrease it.

. set memory 1m
(1024k)

If you ask for more than your operating system can provide, you will be told so:

. set memory 128m
op. sys. refuses to provide memory
r(909);

The number you type can be specified in megabytes or kilobytes. When you suffix a number with m, it means megabytes. When you suffix a number with k (or nothing), it means kilobytes.

. set memory 4000k
(4000k)
. set memory 1000
(1000k)
Technical Note

(This note is relevant only if you use Stata for Unix.) There is a detail in the operating system's handling of returned memory that we have glossed over. You probably think that the checking out and returning of memory from the operating system is handled like the checking out and returning of a book at a library. With some operating systems, it is handled that way, but others vary. Operating systems handle returned memory in one of three ways:

1. The instant memory is returned, it is marked as returned and is available for other programs to check out.

2. When memory is returned, it is put in a special bin and, five or ten minutes from now, it will be marked as returned for other programs to check out. In the meantime, you could check it out again if you want, but no other program can.

3. When memory is returned, it is put in the special bin and never moved from there. You can have the memory back, but no other program can ever have that memory.

Windows follows policy 1. The various flavors of Unix differ on which policy they follow, and this has implications.
Let's imagine you are pushing your Unix computer to its limits and have allocated lots of memory to Stata. You suddenly want to jump out of Stata and do something in Unix, so you use Stata's shell command to obtain a new shell:

. shell
op. sys. refuses to start new process
r(702);

This can happen if there is no free memory. This reminds you that Stata has all the memory but you no longer need it, so you return most of it:

. set memory 4m
(4096k)

Now you try the shell command again. What will happen?

1. If your system follows policy 1, shell will work.
2. If your system follows policy 2, shell will not work, but it will work five or ten minutes from now.
3. If your system follows policy 3, shell will not work.

The result hinges on whether, and when, the operating system really takes back the memory Stata returns. If your operating system follows policy 3, you must exit and restart Stata. If your operating system follows policy 2 and you are in a hurry, you can also exit and restart.
Obtaining the memory report and how Stata uses memory

Type memory and Stata will give you a memory report. Below, we just started Stata:

. memory
                                          bytes
Total memory                          1,023,992   100.00%
  overhead (pointers)                         0     0.00%
  data                                        0     0.00%
  data + overhead                             0     0.00%
  programs, saved results, etc.           1,392     0.14%
Total                                     1,392     0.14%
Free                                  1,022,600    99.86%

If you perform this experiment on your computer, you will probably see different numbers. Here is our memory report after we load the automobile dataset that comes with Stata:

. use auto
(1978 Automobile Data)
. memory
                                          bytes
Total memory                          1,023,992   100.00%
  overhead (pointers)                       296     0.03%
  data                                    3,182     0.31%
  data + overhead                         3,478     0.34%
  programs, saved results, etc.           2,368     0.23%
Total                                     5,846     0.57%
Free                                  1,018,146    99.43%
Total memory refers to the total amount of memory Stata has allocated to its data areas, the number that can be specified at start-up time or reset by set memory. Well, almost. If you use Stata for Macintosh, total memory refers to a number somewhat smaller than that because Stata has to carve an area out of the total for another purpose. Stata for Macintosh users: just accept that the number is smaller than the number you specified, and know that the larger the number you specify at start-up time, the larger the total memory will be; see the technical note below.
Overhead, data, and data + overhead refer to the amount of memory necessary to hold the dataset currently in memory. Start with the middle number. 3,182 bytes is the total amount of memory necessary to hold the automobile dataset, and you could work this out for yourself from a describe. The automobile dataset has 74 observations, and each observation requires 43 bytes (called the width), and 74 × 43 = 3,182. 296 bytes is the pointer overhead associated with this dataset. Stata needs something called a pointer to keep track of where each observation is stored in memory. On this computer, pointers are 4 bytes (but that varies), and the dataset has 74 observations, so 4 × 74 = 296.

Data + overhead is just the sum of the two numbers: 296 + 3,182 = 3,478 is the total amount of memory Stata needs to store and keep track of these data.

Programs, saved results, etc., is the total amount of memory Stata has used to store just what it says: Stata's programs (ado-files), macros, matrices, value labels, and all sorts of other things. This is sometimes referred to as Stata's dynamic memory. The report shows 2,368 bytes this instant, but this number changes frequently.

Here is a memory report from another session in which we have loaded a dataset with 69,515 observations on 93 variables and are in the midst of analyzing it using xtgee:
. memory
                                          bytes
Total memory                          6,291,446   100.00%
  overhead (pointers)                   278,060     4.42%
  data                                2,363,510    37.57%
  data + overhead                     2,641,570    41.99%
  programs, saved results, etc.          47,792     0.76%
Total                                 2,689,362    42.75%
Free                                  3,602,086    57.25%
Technical Note

Stata for Macintosh: The total amount of memory shown by memory is less than the amount you tell your Macintosh to allocate because we need to use some of that memory for other purposes. How much we need is given by 88*matsize^2 + 8*matsize + k, where k is a constant for you (but varies slightly across Macintoshes). Thus, you will see total memory rise and fall according to the value to which you set matsize.

Let's compare matsize = 40 with 80. For matsize = 80, we need 88*80^2 + 8*80 + k = 563,840 + k. For matsize = 40, we need 88*40^2 + 8*40 + k = 141,120 + k. The difference is then 422,720. Conclusion: if matsize was 40 and you set matsize 80, memory will report that total memory declines by 422,720 bytes. Since it is in "total memory" that Stata stores your dataset, reducing the value of matsize is one way to reallocate your memory.
Using virtual memory

Virtual memory refers to using more memory than is physically present on your computer. This is a feature provided by the operating system, not Stata, and is one that you as a Stata user may find yourself sometimes using.

Virtual memory is slow. You will be unhappy if you need to use virtual memory on a daily basis. On the other hand, virtual memory can get you out of a bind, and that is the right way to use it with Stata.
You do NOT need to set virtual on for Stata to use virtual memory. All set virtual on does is perhaps make Stata run a little faster when the operating system is paging a lot. set virtual on will not make Stata run fast, just faster.

Virtual memory is most efficient (which is not to say efficient) when the program being executed exhibits something called locality of reference. This is the idea that if the program accesses one location in memory, subsequent memory references will be to a location near that. If you set virtual on, Stata's memory-management routines will go to extra work to arrange things so that the idea is true more often. Hence, Stata will run a little faster. If Stata is not using virtual memory, setting virtual on will make Stata run a little slower because Stata will be going to extra work for no good reason.

You set virtual on by typing the command

. set virtual on
You can check whether virtual is on or off using query:

. query
Status
    type         float           linesize     79
    virtual      off             pagesize     23
    more         off             trace        off
    rmsg         off             textsize     100
    matsize      40              adosize      128
    level        95              graphics     off
Files
    log          (closed)        logtype      smcl
    cmdlog       (closed)

virtual is reported on the second line of the left column. To set virtual off, type

. set virtual off
Saved Results

memory saves in r():

Scalars
    r(N)           number of observations
    r(width)       width of dataset
    r(N_cur)       maximum observations (current partition)
    r(N_curmax)    maximum observations (current partition); see note below
    r(k_cur)       maximum variables (current partition)
    r(w_cur)       maximum width (current partition)
    r(M_total)     total memory allocated to Stata (bytes)
    r(M_data)      total memory available to data (bytes)
    r(M_dyn)       total programs, saved results, etc. (bytes)
    r(size_ptr)    size of pointer (bytes)
    r(k)           number of variables
    r(matsize)     matsize
    r(adosize)     adosize
Note that there are four saved results that refer to the current partition. At any instant, Stata has partitioned the memory into observations and variables. The characteristics of the partition can change at any time, including right in the middle of a command, so the first four numbers are really of little interest in that they do not reflect any real constraint. What they do reflect is efficiency. If something should occur that violates any of those limits, Stata will have to silently work to reform the partition, something it is able to do reasonably efficiently and without any disk accesses. Also note that the description of r(N_curmax) is not a typographical error. It records the maximum number of observations in the current partition if the size of total programs, saved results, etc. (what is recorded in r(M_dyn)) were zero.

When Stata is faced with a request that violates the current partition's limits, it considers the possibility of discarding memory copies of ado-files that have not been used recently. Ado-files are loaded automatically on an as-needed basis, so how long they are kept in memory is only an efficiency issue. Stata considers reducing the memory requirement as an alternative to repartitioning.

The output produced by memory can be calculated from the saved results by

    total memory = r(M_data)
    overhead (pointers) = _N × r(size_ptr)
    data = _N × r(width)
    programs, saved results, etc. = r(M_dyn)
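For example, the data line of the report can be reproduced from the saved results; a minimal sketch, assuming the automobile dataset (74 observations of width 43, so 74 × 43 = 3,182):

    . use auto, clear
    (1978 Automobile Data)
    . quietly memory
    . display _N*r(width)
    3182
    . display _N*r(size_ptr)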
References

Sasieni, P. 1997. ip20: Checking for sufficient memory to add variables. Stata Technical Bulletin 40: 13. Reprinted in Stata Technical Bulletin Reprints, vol. 7, p. 86.
Also See

Complementary:   [R] query

Related:         [R] matsize

Background:      [U] 7 Setting the size of memory
Title

merge -- Merge datasets

Syntax

merge [varlist] using filename [, nolabel update replace nokeep _merge(varname) ]

If filename is specified without an extension, .dta is assumed.
Description

merge joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the Stata-format dataset stored as filename (called the using dataset) into single observations.
Options

nolabel prevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset replace label definitions already in memory.

update varies the action merge takes when an observation is matched. By default, the master dataset is held inviolate; values from the master dataset are retained when variables are found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing.

replace, allowed with update only, specifies that even in the case when the master dataset contains nonmissing values, they are to be replaced with corresponding values from the using dataset when the corresponding values are not equal. A nonmissing value, however, will never be replaced with a missing value.

nokeep causes merge to ignore observations in the using dataset that have no corresponding observation in the master. The default is to add these observations to the merged result and mark such observations with _merge = 2.

_merge(varname) specifies the name of the variable that is to be created that will mark the source of the resulting observation. The default is _merge(_merge); that is, if you do not specify this option, the new variable will be named _merge. See The two kinds of merges below for details.
Remarks

Remarks are presented under the headings

    The two kinds of merges
    One-to-one merge
    Match merge
    Updating data
Distinguish carefully between merging and appending datasets, and the corresponding Stata commands merge and append. Appending refers to the addition of new observations on existing variables. If one thinks of a dataset as a rectangle with observations going down and variables going across, appending increases the dataset's length. Merging adds new variables to existing observations, increasing the dataset's width. See [U] 25 Commands for combining data if this is not clear.

Say you have a dataset in which each observation records the characteristics of a particular automobile, such as the car's price, weight, etc. If you have two such datasets, one for domestic and another for imported cars, and you wish to combine them into a single dataset, you are reading the wrong entry; see [R] append.

On the other hand, if you have two datasets, one recording price and the other weight, mileage, etc., and you wish to combine them into a single dataset, continue reading; merge does this.

In addition to merge, another command, joinby, forms all pairwise combinations of observations within group. Say you have one dataset on mothers and fathers and another on their children. If you wish to combine them so that each parent is matched with every one of their children (each child is matched with both parents), so that a 2-parent, 3-child family results in 2 × 3 = 6 observations, see [R] joinby.
The two kinds of merges

merge joins the observations stored in memory with the observations stored in filename. The disk dataset must be a Stata-format dataset; that is, it must have been created with the save command.

Stata performs two kinds of merges. If no varlist is specified, Stata performs a one-to-one merge. In a one-to-one merge, the first observation of one dataset is joined with the first observation of the other dataset, the second observation is joined with the second, and so on. If a varlist is specified, however, Stata uses those variables to perform a match merge. In a match merge, observations are joined only if the values of the variables in the specified varlist match.

Regardless of the style of merge being performed, merge always adds a new variable, called (by default) _merge, to the dataset. This variable takes on the values 1, 2, or 3 to mark the source of the resulting observation. The coding is

    1.  The observation occurred only in the master dataset.
    2.  The observation occurred only in the using dataset.
    3.  The observation is the result of joining an observation from the
        master dataset with one from the using dataset.

When you use the update option, this coding is extended to include

    4.  Same as 3 except that missing values in the master were updated
        with values from the using.
    5.  Same as 3 except that some values in the master disagree with
        values in the using.
One-to-one merge

In a one-to-one merge, the first observation in the master dataset is joined with the first observation in the using dataset, the second observation is joined with the second, and so on. If variables with the same name occur in both the master and the using datasets, the joined observation retains the original values, that is, the values of the variables in the master dataset. When the master and using datasets contain different numbers of observations, missing values are joined with the remaining observations from the longer dataset.
> Example

You have two datasets stored on disk that you wish to merge into a single dataset. The first dataset, called odd.dta, contains the first five positive odd numbers. The second dataset, called even.dta, contains the fifth through eighth positive even numbers. (Our example is admittedly not realistic, but it does illustrate the concept.) The datasets are

. use odd
(First five odd numbers)
. list

        number    odd
  1.         1      1
  2.         2      3
  3.         3      5
  4.         4      7
  5.         5      9

. use even
(5th through 8th even numbers)
. list

        number   even
  1.         5     10
  2.         6     12
  3.         7     14
  4.         8     16

We will join these two datasets using a one-to-one merge. Since the even dataset is already in memory (we just used it above), we type merge using odd. The result is

. merge using odd
number was int now float
. list

        number   even    odd   _merge
  1.         5     10      1        3
  2.         6     12      3        3
  3.         7     14      5        3
  4.         8     16      7        3
  5.         5      .      9        2

The first thing you will notice is the new variable _merge. Every time Stata merges two datasets, it creates this variable and assigns a value of 1, 2, or 3 to each observation. The value 1 indicates that the resulting observation occurred only in the master dataset, 2 indicates the observation occurred only in the using dataset, and 3 indicates the observation occurred in both datasets and is thus the result of joining an observation from the master dataset with an observation from the using dataset. In this case, the first four observations are marked by _merge equal to 3, and the last observation by _merge equal to 2. The first four observations are the result of joining observations from the two datasets, and the last observation is a result of adding a new observation from the using dataset. These values reflect the fact that the original dataset in memory had four observations, and the odd dataset stored on disk had five observations. The new last observation is from the odd dataset exclusively: number is 5, odd is 9, and even has been filled in with missing.

Notice that number takes on the values 5 through 8 for the first four observations. Those are the values of number from the original dataset in memory, the even dataset, and they conflict with the value of number stored in the first four observations of the odd dataset. number in that dataset took on the values 1 through 4, and those values were lost during the merge process. When Stata joins observations and there is a conflict between the value of a variable in memory and the value stored in the using dataset, Stata by default retains the value stored in memory.
When the command merge using odd was issued, Stata responded with "number was int now float". Let's describe the datasets in this example:

. describe using odd

Contains data
  obs:             5                          First five odd numbers
 vars:             2                          5 Jul 2000 17:03
 size:            60

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          float  %9.0g
odd             float  %9.0g                  Odd numbers
-------------------------------------------------------------------------
Sorted by:

. describe using even

Contains data
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:

Note that number is stored as a float in odd.dta but as an int in even.dta; see [U] 15.2.2 Numeric storage types. When you merge two datasets, Stata engages in automatic variable promotion; that is, if there are conflicts in numeric storage types, the more precise storage type will be used. The resulting dataset, therefore, will have number stored as a float, and Stata told you this when it said "number was int now float".
Match merge

In a match merge, observations are joined if the values of the variables in the varlist are the same. Since the values must be the same, obviously the variables in the varlist must appear in both the master and the using datasets.

A match merge proceeds by taking an observation from the master dataset and one from the using dataset and comparing the values of the variables in the varlist. If the varlist values match, the observations are joined. If the varlist values do not match, the observation from the earlier dataset (the dataset whose varlist value comes first in the sort order) is joined with a pseudo-observation from the later dataset (the other dataset). All the variables in the pseudo-observation contain missing values. The actual observation from the later dataset is retained and compared with the next observation in the earlier dataset, and the process repeats.
> Example

The result is not nearly so incomprehensible as the explanation. Let's return to the datasets used in the previous example and merge the two datasets on the variable number. We first use the even dataset and then type merge number using odd:

. use even
(5th through 8th even numbers)
. merge number using odd
master data not sorted
r(5);

Instead of merging the datasets, Stata reports the error message "master data not sorted". Match merges require that the data be sorted in the order of the varlist, which in this case means ascending order of number. If you look at the previous example, you will observe that the data are in such an order, so the message is more than a little confusing. Before Stata can merge two datasets, however, the data must not only be sorted but Stata must know that they are sorted. The basis of Stata's knowledge is the internal information it keeps on the sort order, and Stata reveals the extent of its knowledge whenever you describe the dataset:

. describe

Contains data from even.dta
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:

The last line of the description shows that the data are "Sorted by:" nothing. We tell Stata to sort the data (or to learn that they are already sorted) with the sort command:

. sort number
. describe

Contains data from even.dta
  obs:             4                          5th through 8th even numbers
 vars:             2                          11 Jul 2000 14:12
 size:            40 (99.8% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
number          int    %8.0g
even            float  %9.0g                  Even numbers
-------------------------------------------------------------------------
Sorted by:  number

Now when we describe the dataset, Stata informs us that the data are sorted by number. Now that Stata knows the data are sorted, let's try again:

. merge number using odd
using data not sorted
r(5);
Stata still refuses to carry out our request, this time complaining that the using data are not sorted. Both datasets, the master and the using, must be in ascending order of number before Stata can perform a merge.

As before, if you look at the previous example, you will discover that odd.dta is in ascending order of number but, as before, Stata does not know this yet. We need to save the data we just sorted, use the odd data, sort it, and re-save it:

. save even, replace
file even.dta saved
. use odd
(First five odd numbers)
. sort number
. save odd, replace
file odd.dta saved
Now we should be able to merge the two datasets:

. use even
(5th through 8th even numbers)
. merge number using odd
number was int now float
. list

        number   even    odd   _merge
  1.         5     10      9        3
  2.         6     12      .        1
  3.         7     14      .        1
  4.         8     16      .        1
  5.         1      .      1        2
  6.         2      .      3        2
  7.         3      .      5        2
  8.         4      .      7        2

It worked! Let's understand what happened. Even though both datasets were sorted by number, we immediately discern that the result is no longer in ascending order of number. It will be easier to understand what happened if we re-sort the data and then list the data again:

. sort number
. list

        number   even    odd   _merge
  1.         1      .      1        2
  2.         2      .      3        2
  3.         3      .      5        2
  4.         4      .      7        2
  5.         5     10      9        3
  6.         6     12      .        1
  7.         7     14      .        1
  8.         8     16      .        1

Notice that number now goes from 1 to 8, with no repeated values and no values left out of the sequence. Recall that the odd dataset defined observations for number between 1 and 5, whereas the even dataset defined observations between 5 and 8. Thus, the variable odd is defined for number equal to 1 through 5, and even is defined for number equal to 5 through 8.

For instance, in the first observation, number is 1, even is missing, and odd is 1. The value of _merge, 2, indicates that this observation came from the using dataset, odd.dta. In the last observation, number is 8, even is 16, and odd is missing. The value of _merge, 1, indicates that this observation came from the master dataset, even.dta.
The fifth observation is worth comment. number is 5, even is 10, and odd is 9. Both even and odd are defined, since both the even and the odd datasets had information for number equal to 5. The value of _merge, 3, also tells us that both datasets contributed to the formation of the observation.
> Example

Although the previous example demonstrated, in glorious detail, how the match-merging process works, it was not a practical example of how you will ordinarily employ it. Here is a more realistic application. You have two datasets containing information on automobiles. The identifying variable in each dataset is make, a string variable containing the manufacturer and the model. By identifying variable, we mean a variable that is unique for every observation in the dataset. Values for make, for instance "Honda Accord", are sufficient for identifying each observation. One dataset, autotech.dta, also contains mpg, weight, and length. The other dataset, autocost.dta, contains price and rep78, the 1978 repair record.

. describe using autotech

Contains data
  obs:            74                          1978 Automobile Data
 vars:             4                          11 Jul 2000 13:55
 size:         2,072

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0g                  Weight (lbs.)
length          int    %8.0g                  Length (in.)
-------------------------------------------------------------------------
Sorted by:  make

. describe using autocost

Contains data
  obs:            74                          1978 Automobile Data
 vars:             3                          11 Jul 2000 13:55
 size:         1,924

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
price           int    %8.0g                  Price
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------
Sorted by:  make
We assume that you want to merge these two datasets into a single dataset:

. use autotech
(Automobile Models)
. merge make using autocost

Let's now examine the result:

. describe

Contains data from autotech.dta
  obs:            74                          1978 Automobile Data
 vars:             7                          11 Jul 2000 13:55
 size:         2,442 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
make            str18  %18s                   Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0g                  Weight (lbs.)
length          int    %8.0g                  Length (in.)
price           int    %8.0g                  Price
rep78           int    %8.0g                  Repair Record 1978
_merge          byte   %8.0g
-------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
We have a single dataset containing all the information from the two original datasets, or at least it appears that we do. Before accepting that conclusion, we need to verify the result. We think that we entered data for the same cars in each dataset, so every variable should be defined for every car. Although we know it is unlikely, we recognize the possibility that we made a mistake and accidentally left some cars out of one or the other dataset. We can reassure ourselves of our infallibility by tabulating _merge:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00

We see that _merge is 3 for every observation in the dataset. We made no mistake: for every observation in autocost.dta, there was an observation in autotech.dta, and vice versa.
Now pretend that we have another dataset containing additional information on these automobiles, automore.dta, and we want to merge that dataset as well. Before we can do so, we must sort the data we have in memory by make, since after a merge the sort order may have changed:

. sort make
. merge make using automore
_merge already defined
r(110);

After sorting the data, Stata refused to merge the new dataset, complaining instead that _merge is already defined. Every time Stata merges datasets, it wants to create a variable called _merge (or varname if the _merge(varname) option was specified). In this case, there is a _merge variable left over from the last time we merged. We have three choices: we can rename the variable, we can drop it, or we can specify a different variable name with the _merge() option. In this case, _merge contains no useful information (we already verified that the previous merge went as expected), so we drop it and try again:

. drop _merge
. merge make using automore
Stata performed our request; whatever new variables were contained in automore.dta are now contained in our single, master dataset. Perhaps. One should not jump to conclusions. After a match merge, you should always tabulate _merge to verify that the expected actually happened, as we do below:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        1.33        1.33
          2 |          1        1.33        2.67
          3 |         73       97.33      100.00
------------+-----------------------------------
      Total |         75      100.00
Surprise! In this case, something strange did happen. Some 73 of the observations merged as we anticipated. However, the new dataset automore.dta added one new car to the dataset (identified by _merge equal to 2) and failed to define new variables for another car in our original dataset (identified by _merge equal to 1). Perhaps this is what should happen, but it is more likely that we have a mistake in automore.dta. We probably misidentified one car so that, to Stata, it appeared as data on a new car, resulting in one new observation and missing data on another. If this happened to us, we would figure out why it happened. We would type list make if _merge==1 to learn the identity of the car that did not appear in automore.dta, and we would type list make if _merge==2 to learn the identity of the car that automore.dta added to our data.
Technical Note

It is difficult to overemphasize the importance of tabulating _merge, no matter how sure you are that you have no errors. It takes only a second and can save you hours of grief. Along the same lines, one-to-one merges are a bad idea. In the example above, we could have performed all the merges as one-to-one merges and saved a small amount of typing. Let's examine what would have happened.

We first merged autotech.dta with autocost.dta by typing merge make using autocost. We could have performed a one-to-one merge by typing merge using autocost. The result would be the same; the datasets line up and are in the same sort order, so sequentially matching the observations from the two datasets would have resulted in a perfectly matched dataset.

In the second case, we merged the data in memory with automore.dta by typing merge make using automore. A one-to-one merge would have led to disaster, and we would never have known it! If we typed merge using automore, Stata would sequentially, and blindly, join observations. Since there are the same number of observations in each dataset, everything would appear to merge perfectly.

We speculated in the previous example that we had an error in automore.dta. Remember that automore.dta included data on one new car and lacked data on an existing car. Even if there is no error, things have gone awry. No matter what, the data in memory and automore.dta do not match. For instance, assume that this new car is the first observation of automore.dta and that it is some (perhaps mistaken) model of Ford. Assume that the first observation of the data in memory is a Chevrolet. Stata could and would silently join data on the Chevrolet with data on the Ford, and thereafter data on a Volvo with data on a Saab, and even data on a Volkswagen with data on a Cadillac, and you would never know.

Every dataset should carry a variable or a set of variables that uniquely identifies each observation, and then you should always use those variables when merging data. Ignore this advice at your own peril.
Technical Note

Circumstances may arise when you will merge two datasets knowing there will be mismatches. Say you have an analysis dataset on patients from the cancer ward of a particular hospital, and you have just received another dataset containing their demographic information. Actually, this other dataset contains not just their demographic information but the demographic information on every patient in the hospital during the year. You could

. merge patid using demog
. drop if _merge==2

or

. merge patid using demog, nokeep

The nokeep option tells merge not to store observations from the using data that do not appear in the master. There is an advantage in this. When we merged and dropped, we stored the irrelevant observations and then discarded them, so the data in memory temporarily grew. When we merge with the nokeep option, the data never grow beyond what is absolutely necessary.
In our automobile example, we had a single identifying variable. Sometimes you will have multiple identifying variables, variables that, taken together, are unique for every observation. Let's imagine that, rather than having a single variable called make, we had two variables: manuf and model. manuf contains the manufacturer, and model contains the model. Rather than having a single variable recording, say, "Honda Accord", we have two variables, one recording "Honda" and another recording "Accord". Stata can deal with this type of data. You can go back through our previous example and substitute manuf model everywhere you see make. For instance, rather than typing merge make using autocost, we would have typed merge manuf model using autocost.

Now let's make one more change in our assumptions. Let's assume that manuf and model are not string variables but are instead numerically coded variables. Perhaps the number 15 stands for Honda in the manuf variable, and the number 2 stands for Accord in the model variable. We do not have to remember our numeric codes because we have smartly created value labels telling Stata what number stands for what string of characters. We now go back to the step where we merged autotech.dta with autocost.dta:

. use autotech
(Automobile models)
. merge manuf model using autocost
(label manuf already defined)
(label model already defined)

Stata makes two minor comments but otherwise carries out our request. It notes that the labels manuf and model are already defined. The messages refer to the value labels named manuf and model. Both datasets contain value label definitions that turn the numeric codes for manufacturer and model into words. When Stata merged the two datasets, it already had one set of definitions in memory (obtained when we typed use autotech) and thus ignored the second set of definitions contained in autocost.dta. Stata felt obliged to mention the second set of definitions while otherwise ignoring them, since they might contain different codings. In this case, we know they are the same since we created them. (Hint: You should never give the same name to value labels containing different codings.)
When one is performing a match merge, the master and/or using datasets may have multiple observations with the same varlist value. These multiple observations are joined sequentially, as in a one-to-one merge. If the datasets have an unequal number of observations with the same varlist value, the last such observation in the shorter dataset is replicated until the number of observations is equal.

> Example

The process of replicating the observation from the shorter dataset is known as spreading and can be put to practical use. Suppose you have two datasets. dollars.dta contains the dollar sales and costs of your firm, by region, for the last year:

. use dollars
(Regional Sales & Costs)
. list

       region      sales       cost
  1.       NE    360,523    138,097
  2.  N Cntrl    419,472    227,677
  3.    South    532,399    330,499
  4.     West    310,565    165,348

sforce.dta contains the names of the individuals in your sales force, along with the region in which they operate:

. use sforce
(Sales Force)
. list

       region       name
  1.       NE    Ecklund
  2.       NE     Franks
  3.  N Cntrl     Krantz
  4.  N Cntrl     Phipps
  5.  N Cntrl     Willis
  6.    South   Anderson
  7.    South    Dubnoff
  8.    South        Lee
  9.    South     McNiel
 10.     West    Charles
 11.     West      Grant
 12.     West       Cobb
You now wish to merge these two datasets by region, spreading the sales and cost information across all observations for which it is relevant; that is, you want to add the variables sales and cost to the sales force data. The variable sales will assume the value $360,523 for the first two observations, $419,472 for the next three observations, and so on.
. merge region using dollars
(label region already defined)
. list

       region       name      sales       cost   _merge
  1.       NE    Ecklund    360,523    138,097        3
  2.       NE     Franks    360,523    138,097        3
  3.  N Cntrl     Krantz    419,472    227,677        3
  4.  N Cntrl     Phipps    419,472    227,677        3
  5.  N Cntrl     Willis    419,472    227,677        3
  6.    South   Anderson    532,399    330,499        3
  7.    South    Dubnoff    532,399    330,499        3
  8.    South        Lee    532,399    330,499        3
  9.    South     McNiel    532,399    330,499        3
 10.     West    Charles    310,565    165,348        3
 11.     West      Grant    310,565    165,348        3
 12.     West       Cobb    310,565    165,348        3

Even though there are 12 observations in the sales force data and only 4 observations in the sales and cost data, all the records merged. dollars.dta contained one observation for the NE region; sforce.dta contained two observations for the same region. Thus, the single observation in dollars.dta was matched to both the observations in sforce.dta. In technical jargon, the single record in dollars.dta was replicated, or spread, across the observations in sforce.dta.
Updating data

merge with the update option varies merge's actions when an observation in the master is matched with an observation in the using dataset. Without the update option, merge leaves the values in the master dataset alone and adds the data for the new variables. With the update option, merge adds the new variables, but it also replaces missing values in the master observation with corresponding values from the using. (Missing values mean numeric missing (.) and empty strings ("").) The values for _merge are extended:

    _merge    meaning
    1         obs. from master data
    2         obs. from using data
    3         obs. from both, master agrees with using
    4         obs. from both, missing in master updated
    5         obs. from both, master disagrees with using

In the case of _merge = 5, the master values are retained unless replace is specified, in which case the master values are updated just as if they had been missing.

Pretend dataset 1 contains variables id, a, and b; dataset 2 contains id, a, and z. You merge the two datasets by id, dataset 1 being the master dataset in memory and dataset 2 the using dataset on disk. Consider two observations that match, and call the values from the first dataset id1, etc., and those from the second id2, etc. The resulting dataset will have variables id, a, b, z, and _merge. merge's typical logic is

1. The fact that the observations match means id1 = id2. Set id = id1.
2. Variable a occurs in both datasets. Ignore a2 and set a = a1.
3. Variable b occurs in only dataset 1. Set b = b1.
4. Variable z occurs in only dataset 2. Set z = z2.
5. Set _merge = 3.
With update, the logic is modified:

1. (unchanged.) Since the observations match, id1 = id2. Set id = id1.

2. Variable a occurs in both datasets:

   a. If a1 = a2, set a = a1 and set _merge = 3.

   b. If a1 contains missing and a2 is nonmissing, set a = a2 and set _merge = 4, indicating an update was made.

   c. If a2 contains missing, set a = a1 and set _merge = 3 (indicating no update).

   d. If a1 is not equal to a2 and both contain nonmissing, set a = a1 or, if replace was specified, a = a2 but, regardless, set _merge = 5, indicating a disagreement.

Rules 3 and 4 remain unchanged.
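The modified logic can be traced end to end with two tiny made-up datasets; every filename and value below is hypothetical:

    . clear
    . input id a b
      1 . 5
      2 7 6
      3 9 4
      end
    . sort id
    . save master_demo
    . clear
    . input id a z
      1 4 10
      2 8 11
      3 9 12
      end
    . sort id
    . save using_demo
    . use master_demo, clear
    . merge id using using_demo, update
    . list

Observation 1 is updated (a was missing in the master, so a becomes 4 and _merge = 4); observation 2 is a disagreement (7 versus 8, so a remains 7 and _merge = 5, unless replace is also specified); and observation 3 agrees (_merge = 3).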
> Example

In original.dta you have data on some cars, including the make, price, and mileage rating. In updates.dta you have some updated data on these cars, along with a new variable recording engine displacement. The data contain

. use original, clear
(original data)
. list

                 make    price    mpg
  1.   Chev. Chevette    3,299     29
  2.     Chev. Malibu    4,504      .
  3.       Datsun 510    5,079     24
  4.       Merc. XR-7    6,303      .
  5.     Olds Cutlass    4,733     19
  6.   Renault Le Car    3,895     26
  7.        VW Dasher    7,140     23

. use updates, clear
(updates, mpg and displacement)
. list

                 make    mpg    displac~t
  1.   Chev. Chevette      .          231
  2.     Chev. Malibu     22          200
  3.       Datsun 510     24          119
  4.       Merc. XR-7     14          302
  5.     Olds Cutlass     19          231
  6.   Renault Le Car     25           79
  7.        VW Dasher     23           97
original,
(original • merge
clear
data)
make
using
updates,
update
list make i. Chev.
Chevette
price 3,299
mpg 29
displac~t 231
_merge 3
2. Chev.
Malibu
4,504
22
200
4
3. Datsun
510
5,079
24
119
3
4. Merc. XR-7 5. Olds Cutlass
6,303 4,733
14 19
302 231
4 3
6. Renault
3,895
26
79
5
7,140
23
97
3
Le Car
7. VW Dasher
_,
,
i
_
merge-- Merge datasets
319
All observations merged because all have _merge >= 3. The observations having _merge = 3 have mpg just as it was recorded in the original dataset. In observation 1, mpg is 29 because the updated dataset had mpg = .; in observation 3, mpg remains 24 because the updated dataset also stated that mpg is 24.

The observations having _merge = 4 have had their mpg data updated. The mpg variable was missing in observations 2 and 4, and new values were obtained from the update data.

The observation having _merge = 5 has its mpg just as it was recorded in the original dataset, just as do the _merge = 3 observations, but there is an important difference: there is a disagreement about the value of mpg; the original claims it is 26 and the updated, 25. Had we specified the replace option, mpg would now contain the updated 25, but the observation would still be marked _merge = 5. replace affects only which value is retained in the case of disagreement.
References

Nash, J. D. 1994. dm19: Merging raw data and dictionary files. Stata Technical Bulletin 20: 3-5. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 22-25.

Weesie, J. 2000. dm75: Safe and easy matched merging. Stata Technical Bulletin 53: 6-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 62-77.
Also See

Complementary:   [R] save, [R] sort

Related:         [R] append, [R] cross, [R] joinby

Background:      [U] 25 Commands for combining data
Title

meta -- Meta-analysis
Remarks

Stata should have a meta-analysis command but, as of the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands for performing meta-analysis, many of which have been published in the Stata Technical Bulletin (STB).
Issue     insert     author(s)                                  command
--------------------------------------------------------------------------
STB-38    sbe16      S. Sharp and J. Sterne                     meta
          meta-analysis for an outcome of two exposures or two treatment
          regimens
STB-42    sbe16.1    S. Sharp and J. Sterne                     meta
          update of sbe16
STB-43    sbe16.2    S. Sharp and J. Sterne                     meta
          update; install this version
STB-41    sbe19      T. J. Steichen                             metabias
          performs the Begg and Mazumdar (1994) adjusted rank correlation
          test for publication bias and the Egger et al. (1997) regression
          asymmetry test for publication bias
STB-44    sbe19.1    T. J. Steichen, M. Egger, and J. Sterne    metabias
          update of sbe19
STB-57    sbe19.2    T. J. Steichen                             metabias
          update; install this version
STB-41    sbe20      A. Tobias                                  galbr
          performs the Galbraith plot (1988), which is useful for
          investigating heterogeneity in meta-analysis
STB-56    sbe20.1    A. Tobias                                  galbr
          update; install this version
STB-42    sbe22      J. Sterne                                  metacum
          performs cumulative meta-analysis, using fixed- or random-effects
          models, and graphs the result
STB-42    sbe23      S. Sharp                                   metareg
          extends a random-effects meta-analysis to estimate the extent to
          which one or more covariates, with values defined for each study
          in the analysis, explain heterogeneity in the treatment effects
STB-44    sbe24      M. J. Bradburn, J. J. Deeks, and           metan, funnel,
                     D. G. Altman                               labbe
          meta-analysis of studies with two groups; funnel plot of
          precision versus treatment effect; L'Abbé plot
STB-45    sbe24.1    M. J. Bradburn, J. J. Deeks, and           funnel
                     D. G. Altman
          update; install this version
STB-47    sbe26      A. Tobias                                  metainf
          graphical technique to look for influential studies in the
          meta-analysis estimate
STB-56    sbe26.1    A. Tobias                                  metainf
          update; install this version
STB-49    sbe28      A. Tobias                                  metap
          combines p-values using either Fisher's method or Edgington's
          method
STB-56    sbe28.1    A. Tobias                                  metap
          update; install this version
STB-57    sbe39      T. J. Steichen                             metatrim
          performs the Duval and Tweedie (2000) nonparametric "trim and
          fill" method of accounting for publication bias in meta-analysis

Additional commands may be available; enter Stata and type search meta analysis.
I
meta-- Metaanalysis
f
321
To downl_)adand install from the Interact the Sharp and Stem meta command, for instance. Stata i
yot_cquld
i
i
2. !- Click Pull down on http://www.stat_.com. Help and select STB and Use :-written Programs. I i
3. Click on stb. !
I
4. Click on stM9. 5. Click on she28.
I
6. Clk k on dick here to install i !
or yot1co aid instead do the following: l. Na_igate to the appropriate STBissue: Type net from http://w_, Type net ¢d stb Type net
I
i i
i i
[
cd stb49
or _. Type net from http ://www.sta_a, com/stb/stb49
2. Tyt e net describe sbe28 3.TyFe net installsbe28
References

Begg, C. B. and M. Mazumdar. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics 50: 1088-1101.

Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998a. sbe24: metan--an alternative meta-analysis command. Stata Technical Bulletin 44: 4-15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 86-100.

Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998b. sbe24.1: Correction to funnel plot. Stata Technical Bulletin 45: 21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 100.

Egger, M., G. D. Smith, M. Schneider, and C. Minder. 1997. Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315: 629-634.

Galbraith, R. F. 1988. A note on graphical display of estimated odds ratios from several clinical trials. Statistics in Medicine 7: 889-894.

L'Abbé, K. A., A. S. Detsky, and K. O'Rourke. 1987. Meta-analysis in clinical research. Annals of Internal Medicine 107: 224-233.

Sharp, S. 1998. sbe23: Meta-analysis regression. Stata Technical Bulletin 42: 16-22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 148-155.

Sharp, S. and J. Sterne. 1997. sbe16: Meta-analysis. Stata Technical Bulletin 38: 9-14. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 100-106.

Sharp, S. and J. Sterne. 1998a. sbe16.1: New syntax and output for the meta-analysis command. Stata Technical Bulletin 42: 6-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 106-108.

Sharp, S. and J. Sterne. 1998b. sbe16.2: Corrections to the meta-analysis command. Stata Technical Bulletin 43: 15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 84.

Steichen, T. J. 1998. sbe19: Tests for publication bias in meta-analysis. Stata Technical Bulletin 41: 9-15. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 125-133.

Steichen, T. J. 2000a. sbe19.2: Update of tests for publication bias in meta-analysis. Stata Technical Bulletin 57: 4.

Steichen, T. J. 2000b. sbe39: Nonparametric trim and fill analysis of publication bias in meta-analysis. Stata Technical Bulletin 57: 8-14.

Steichen, T. J., M. Egger, and J. Sterne. 1998. sbe19.1: Tests for publication bias in meta-analysis. Stata Technical Bulletin 44: 3-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 84-85.

Sterne, J. 1998. sbe22: Cumulative meta-analysis. Stata Technical Bulletin 42: 13-16. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 143-147.

Tobias, A. 1998. sbe20: Assessing heterogeneity in meta-analysis: the Galbraith plot. Stata Technical Bulletin 41: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 133-136.

Tobias, A. 1999a. sbe26: Assessing the influence of a single study in the meta-analysis estimate. Stata Technical Bulletin 47: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 108-110.

Tobias, A. 1999b. sbe28: Meta-analysis of p-values. Stata Technical Bulletin 49: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 138-140.

Tobias, A. 2000a. sbe20.1: Update of galbr. Stata Technical Bulletin 56: 14.

Tobias, A. 2000b. sbe26.1: Update of metainf. Stata Technical Bulletin 56: 15.

Tobias, A. 2000c. sbe28.1: Update of metap. Stata Technical Bulletin 56: 15.
Title

mfx -- Obtain marginal effects or elasticities after estimation

Syntax

mfx compute [if exp] [in range] [, dydx eyex dyex eydx at(atlist) eqlist(eqnames) predict(predict_option) nonlinear nodiscrete noesample nowght nose level(#) ]

mfx replay [, level(#)]

where atlist is

        { mean | median | zero [varname=# [, varname=#] [...]] }
        numlist        (for single-equation estimators only)
        matname        (for single-equation estimators only)

Description

mfx numerically calculates the marginal effects or the elasticities and their standard errors after estimation. Exactly what mfx can calculate is determined by the previous estimation command and the predict(predict_option) option. At which points the marginal effects or elasticities are to be evaluated is determined by the at(atlist) option. By default, mfx calculates the marginal effects or the elasticities at the means of the independent variables by using the default prediction option associated with the previous estimation command.

mfx replay replays the results of the previous mfx computation.
Options

dydx specifies that marginal effects are to be calculated. It is the default.

eyex specifies that elasticities are to be calculated in the form of ∂log y/∂log x.

dyex specifies that elasticities are to be calculated in the form of ∂y/∂log x.

eydx specifies that elasticities are to be calculated in the form of ∂log y/∂x.

at(atlist) specifies the points around which the marginal effects or the elasticities are to be estimated. The default is to estimate the effect around the means of the independent variables.

at(mean | median | zero [varname=# [, varname=#] [...]]) specifies that the marginal effects or the elasticities are to be evaluated at the means, at the medians of the independent variables, or at zeros. It also allows users to specify particular values for one or more independent variables, assuming the rest are means, medians, or zeros. For instance,

        . probit foreign mpg weight price
        . mfx compute, at(mean mpg=30)
at(numlist) specifies that the marginal effects or the elasticities are to be evaluated at the numlist. If there is a constant term in the model, add a 1 to the numlist. This option is for single-equation estimators only. For instance,

        . probit foreign mpg weight price
        . mfx compute, at(21 3000 6000 1)

at(matname) specifies the points in a matrix format. A 1 is also needed if there is a constant term in the model. This option is for single-equation estimators only. For instance,

        . probit foreign mpg weight price
        . mat A = (21, 3000, 6000, 1)
        . mfx compute, at(A)
eqlist(eqnames) indirectly specifies the variables for which marginal effects (or elasticities) are to be calculated. Marginal effects (elasticities) will be calculated for all variables in the equations specified. The default is all equations, which is to say, all variables.

predict(predict_option) specifies which function is to be calculated for the marginal effects or the elasticities; i.e., the form of y. The default is the default predict option of the previous estimation command. For instance, since the default prediction for probit is the probability of a positive outcome, the predict() option is not required to calculate the marginal effects of the independent variables for the probability of a positive outcome:

        . probit foreign mpg weight price
        . mfx compute

To calculate the marginal effects for the linear prediction (xb), specify predict(xb):

        . mfx compute, predict(xb)

To see which predict options are available, see help for the particular estimation command.
nonlinear specifies that y, the function to be calculated for the marginal effects or the elasticities, does not meet the linear-form restriction. For the definition of the linear-form restriction, please refer to the Methods and Formulas section. By default, mfx will assume that y meets the linear-form restriction unless one or more independent variables are shared by multiple equations. For instance, predictions after

        . heckman mpg price, sel(for=rep)

meet the linear-form restriction, but those after

        . heckman mpg price, sel(for=rep price)

do not. If y meets the linear-form restriction, specifying nonlinear or not should produce the same results. However, the nonlinear method is generally more time-consuming. Most likely, users do not need to specify nonlinear after a Stata official estimation command. For user-written estimation commands, if you are not sure whether y is of linear form, specifying nonlinear is always a safe choice. Please refer to the Speed and accuracy section for further discussion.

nodiscrete treats dummy variables as continuous ones. If nodiscrete is not specified, the marginal effect of a dummy variable is calculated as the discrete change in the expected value of the dependent variable as the dummy variable changes from 0 to 1. This option is irrelevant to the computation of the elasticities, because all the dummy variables are treated as continuous in computing elasticities.

noesample only affects at(atlist). It specifies that when the means and medians are calculated, the whole dataset is to be considered instead of only those observations marked in the e(sample) defined by the previous estimation command.
nowght only affects at(atlist). It specifies that weights are to be ignored when calculating the means and medians for the atlist.

nose asks mfx to calculate the marginal effects or the elasticities without their standard errors. Calculating standard errors is very time-consuming; specifying nose will reduce the running time of mfx.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
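The options may be combined. For instance, a quick sketch (the probit model is just an illustration) that evaluates elasticities at the medians and skips the time-consuming standard errors:

        . probit foreign mpg weight price
        . mfx compute, eyex at(median) nose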
Remarks

Remarks are presented under the headings

        Obtaining marginal effects after single-equation (SE) estimation
        Obtaining marginal effects after multiple-equation (ME) estimation
        Obtaining three forms of elasticities
        Speed and accuracy

Obtaining marginal effects after single-equation (SE) estimation

Before running mfx, type help estimation_cmd to see what can be predicted after estimation and to see the default prediction.
> Example

We estimate a logit model using the auto dataset:

. logit foreign mpg price
Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -36.694839
Iteration 2:   log likelihood = -36.463894
Iteration 3:   log likelihood =  -36.46219
Iteration 4:   log likelihood = -36.462189

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      17.14
                                                  Prob > chi2     =     0.0002
Log likelihood = -36.462189                       Pseudo R2       =     0.1903

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .2338353   .0671449     3.48   0.000     .1022338    .3654368
       price |    .000266   .0001166     2.28   0.022     .0000375    .0004945
       _cons |  -7.648111   2.043673    -3.74   0.000    -11.65364   -3.642586
------------------------------------------------------------------------------

To determine the marginal effects of mpg and price for the probability of a positive outcome at their mean values, issue the mfx command, because the default prediction after logit is the probability of a positive outcome and the calculation is requested at the mean values.
. mfx compute

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .26347633

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0453773      .01311    3.46   0.001   .019702  .071053   21.2973
   price |   .0000516      .00002    2.31   0.021   7.8e-06  .000095   6165.26
------------------------------------------------------------------------------

The first line of the output indicates that the marginal effects were calculated after a logit estimation. The second line of the output gives the form of y and the predict command that we would type to get y. The third line of the output gives the value of y given X, which are displayed in the last column of the table.
To calculate the marginal effects at particular data points, say, mpg = 20 and price = 6000, specify the at() option:

. mfx compute, at(mpg=20, price=6000)

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .20176601

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0376607      .00961    3.92   0.000   .018834  .056488   20.0000
   price |   .0000428      .00002    2.47   0.014   8.8e-06  .000077   6000.00
------------------------------------------------------------------------------
To calculate the marginal effects for the linear prediction (xb) instead of the probability, specify predict(xb). Note that the marginal effects for the linear prediction are the coefficients themselves.

. mfx compute, predict(xb)

Marginal effects after logit
      y  = Linear prediction (predict, xb)
         = -1.0279779

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .2338353      .06714    3.48   0.000   .102234  .365437   21.2973
   price |    .000266      .00012    2.28   0.022   .000038  .000495   6165.26
------------------------------------------------------------------------------
If there is a dummy variable as an independent variable, mfx will calculate the discrete change as the dummy variable changes from 0 to 1.

. gen record = 0
. replace record = 1 if rep78 > 3
(34 real changes made)

. logit foreign mpg record, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      26.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -31.898321                       Pseudo R2       =     0.2917

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1079219   .0565077     1.91   0.056    -.0028311    .2186749
      record |   2.435068   .7128444     3.42   0.001     1.037918    3.832217
       _cons |  -4.689347   1.326547    -3.54   0.000     -7.28933   -2.089363
------------------------------------------------------------------------------

. mfx compute

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381   21.2973
 record* |   .4272707      .10432    4.09   0.000   .222712   .63163   .459459
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

If nodiscrete is specified, mfx will treat the dummy variable as continuous.

. mfx compute, nodiscrete

Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381   21.2973
  record |   .4163552      .10733    3.88   0.000   .205994  .626716   .459459
------------------------------------------------------------------------------
Obtaining marginal effects after multiple-equation (ME) estimation

If you have not read the discussion above on using mfx after SE estimations, please do so. Except for the ability to select specific equations for the calculation of marginal effects, the use of mfx after ME models follows almost exactly the same form as for SE models. The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of mfx after ME commands should first read the documentation of predict for the estimation command. For a general introduction to the ME models, we will demonstrate mfx after heckman and mlogit.
> Example

. heckman mpg weight length, sel(foreign = displ) nolog

Heckman selection model                         Number of obs      =        74
(regression model with sample selection)        Censored obs       =        52
                                                Uncensored obs     =        22
                                                Wald chi2(2)       =      7.27
Log likelihood = -87.58426                      Prob > chi2        =    0.0264

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg          |
      weight |  -.0039923   .0071948    -0.55   0.579    -.0180939    .0101092
      length |  -.1202545   .2093074    -0.57   0.566    -.5304895    .2899805
       _cons |   56.72567   21.68463     2.62   0.009     14.22458    99.22676
-------------+----------------------------------------------------------------
foreign      |
displacement |  -.0250297   .0067241    -3.72   0.000    -.0382088   -.0118506
       _cons |   3.223625   .8757406     3.68   0.000     1.507205    4.940045
-------------+----------------------------------------------------------------
     /athrho |  -.9840858   .8112212    -1.21   0.225     -2.57405    .6058785
    /lnsigma |   1.724306   .2794524     6.17   0.000     1.176589    2.272022
-------------+----------------------------------------------------------------
         rho |  -.7548292    .349014                     -.9884463    .5412193
       sigma |   5.608626   1.567344                      3.243293    9.698997
      lambda |  -4.233555   3.022645                     -10.15783    1.690721
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.37   Prob > chi2 = 0.2413
------------------------------------------------------------------------------
heckman estimated two equations, mpg and foreign; see [R] heckman. Two of the prediction statistics after heckman are the expected value of the dependent variable and the probability of being observed. To obtain the marginal effects of the independent variables of all the equations for the expected value of the dependent variable, specify predict(yexpected) with mfx.

. mfx compute, predict(yexpected)

Marginal effects after heckman
      y  = E(mpg*|Pr(foreign)) (predict, yexpected)
         =  .56522778

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  weight |  -.0001725      .00041   -0.42   0.675  -.000979  .000634   3019.46
  length |  -.0051953      .01002   -0.52   0.604   -.02483   .01444   187.932
displa~t |  -.0340055      .02541   -1.34   0.181  -.083802  .015791   197.297
------------------------------------------------------------------------------

To calculate the marginal effects for the probability of being observed, since only the independent variables in equation foreign affect the probability of being observed, specify eqlist(foreign) to restrict the calculation.

. mfx compute, eqlist(foreign) predict(psel)

Marginal effects after heckman
      y  = Pr(foreign) (predict, psel)
         =  .04320292

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
displa~t |  -.0022958      .00153   -1.50   0.133  -.005287  .000696   197.297
------------------------------------------------------------------------------
> Example

predict after mlogit has a special feature that most other estimation commands do not. It can predict multiple new variables by issuing predict only once; see [R] mlogit. This feature cannot be adopted into mfx. To calculate the marginal effects for the probability of each outcome, run mfx separately for each outcome.

. mlogit rep78 mpg displ, nolog

Multinomial regression                            Number of obs   =         69
                                                  LR chi2(8)      =      22.83
                                                  Prob > chi2     =     0.0036
Log likelihood = -82.27874                        Pseudo R2       =     0.1218

------------------------------------------------------------------------------
       rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1            |
         mpg |  -.0021573   .2104309    -0.01   0.992    -.4145942    .4102796
displacement |  -.0052312   .0126927    -0.41   0.680    -.0301085    .0196461
       _cons |  -1.566574   6.429681    -0.24   0.808    -14.16852    11.03537
-------------+----------------------------------------------------------------
2            |
         mpg |   .0150954   .1235325     0.12   0.903    -.2270239    .2572147
displacement |   .0020254   .0063719     0.32   0.751    -.0104634    .0145142
       _cons |   -2.09099   3.664348    -0.57   0.568    -9.272981    5.091001
-------------+----------------------------------------------------------------
4            |
         mpg |   .0070871   .0883698     0.08   0.936    -.1661146    .1802888
displacement |  -.0066993   .0053435    -1.25   0.210    -.0171723    .0037737
       _cons |   .7047881   2.704785     0.26   0.794    -4.596492    6.006069
-------------+----------------------------------------------------------------
5            |
         mpg |   .0808327   .0983973     0.82   0.411    -.1120224    .2736878
displacement |  -.0231922   .0119692    -1.94   0.053    -.0466514    .0002671
       _cons |    .652801   3.545048     0.18   0.854    -6.295365    7.600967
------------------------------------------------------------------------------
(Outcome rep78==3 is the comparison group)

. mfx compute, predict(outcome(1))

Marginal effects after mlogit
      y  = Pr(rep78==1) (predict, outcome(1))
         =  .03438017

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0003566      .00679   -0.05   0.958  -.013663  .012951   21.2899
displa~t |  -.0000703      .00041   -0.17   0.864  -.000873  .000732   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(2))

Marginal effects after mlogit
      y  = Pr(rep78==2) (predict, outcome(2))
         =  .12361544

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0008507      .01277    0.07   0.947  -.024183  .025885   21.2899
displa~t |   .0006444      .00067    0.96   0.336  -.000668  .001957   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(3))

Marginal effects after mlogit
      y  = Pr(rep78==3) (predict, outcome(3))
         =  .48578012

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0039901      .01922   -0.21   0.836  -.041682  .033682   21.2899
displa~t |   .0015484      .00108    1.43   0.151  -.000567  .003664   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(4))

Marginal effects after mlogit
      y  = Pr(rep78==4) (predict, outcome(4))
         =  .30337619

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |  -.0003418      .01707   -0.02   0.984  -.033805  .033122   21.2899
displa~t |  -.0010654      .00106   -1.01   0.313  -.003136  .001005   198.000
------------------------------------------------------------------------------
. mfx compute, predict(outcome(5))

Marginal effects after mlogit
      y  = Pr(rep78==5) (predict, outcome(5))
         =  .05284808

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     mpg |   .0038378      .00561    0.68   0.494  -.007167  .014843   21.2899
displa~t |  -.0010572      .00047   -2.24   0.025  -.001984 -.000131   198.000
------------------------------------------------------------------------------
Obtaining three forms of elasticities

mfx can also be used to obtain all three forms of elasticities:

        option      elasticity
        eyex        ∂log y/∂log x
        dyex        ∂y/∂log x
        eydx        ∂log y/∂x

> Example

We estimate a regression model using the auto dataset. The marginal effects for the predicted value y after a regress are the same as the coefficients. To get the elasticities of form ∂log y/∂log x, specify the eyex option:
. regress mpg weight length

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   69.34
       Model |  1616.08062     2  808.040312           Prob > F      =  0.0000
    Residual |  827.378835    71  11.6532230           R-squared     =  0.6614
-------------+------------------------------           Adj R-squared =  0.6519
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4137

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374
------------------------------------------------------------------------------

. mfx compute, eyex

Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

------------------------------------------------------------------------------
variable |      ey/ex    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  weight |  -.5460497      .22509   -2.43   0.015  -.987208 -.104891   3019.46
  length |  -.7023518      .48867   -1.44   0.151  -1.66012  .255414   187.932
------------------------------------------------------------------------------

The first line of the output indicates that the elasticities were calculated after a regress estimation.
The title of the second column of the table gives the form of the elasticities, ∂log y/∂log x: the change in y in percent for a 1 percent change in x.

If the independent variables have been log-transformed already, then we will want the elasticities of the form ∂log y/∂x instead.

. gen lnweight = ln(weight)
. gen lnlength = ln(length)
. regress mpg lnweight lnlength

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   74.00
       Model |  1651.28916     2  825.644581           Prob > F      =  0.0000
    Residual |  792.170298    71  11.1573281           R-squared     =  0.6758
-------------+------------------------------           Adj R-squared =  0.6667
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.3403

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lnweight |   -13.5974   4.692504    -2.90   0.005    -22.95398   -4.240811
    lnlength |  -9.816726   10.40316    -0.94   0.349    -30.56004    10.92659
       _cons |   181.1196   22.18429     8.16   0.000     136.8853    225.3538
------------------------------------------------------------------------------
. mfx compute, eydx

Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

------------------------------------------------------------------------------
variable |      ey/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
lnweight |  -.6384565      .22064   -2.89   0.004   -1.0709 -.206009   7.97875
lnlength |  -.4609376      .48855   -0.94   0.345  -1.41847  .496594   5.22904
------------------------------------------------------------------------------

Note that although the interpretation is the same, the results for eyex and eydx differ since we are estimating different models. If the dependent variable were log-transformed, we would specify dyex instead.
Speed and accuracy

mfx numerically calculates the derivatives and the second derivatives, so half of the digits of the accuracy of the estimation command are expected to be lost. For instance, if the predicted values from an estimation command are of 16 digits accuracy, i.e., they are accurate to 1e-16, the accuracy of the marginal effects calculated by mfx might fall to 1e-8, and the accuracy of the standard errors of the marginal effects might fall to 1e-4 in the worst case.

Users of mfx should also be aware of the speed issue. The linear method is generally much faster than the nonlinear method, but it might still take a while if there are multiple equations and quite a few independent variables. For those cases where y fails to meet the linear-form restriction, such as after mlogit, mfx might take a long time, varying from seconds to hours depending on the number of independent variables. Specifying nose will reduce the running time of mfx considerably.

The table below gives a general idea of the accuracy of the linear and nonlinear methods and how long it takes to produce the results. All the estimations listed in the table were run on the auto dataset, using 9 independent variables.
        Estimation   Method      Speed (sec.)   Accuracy of dydx   Accuracy of stds
        regress      linear          1.32          1.263e-09          1.139e-09
                     nonlinear      13.24          3.289e-10          5.361e-07
        probit       linear          1.35          9.221e-13          1.888e-08
                     nonlinear      65.43          4.336e-12          7.841e-08
        logit        linear          1.31          1.647e-11          1.889e-07
                     nonlinear      85.81          5.921e-12          1.356e-06
        tobit        linear          1.44          2.683e-10          3.059e-10
                     nonlinear      12.39          1.353e-10          9.734e-07
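For the slow nonlinear-form cases, the nose option is the natural economy. A brief sketch using commands that appear in the Remarks above (assuming the auto dataset is in memory):

        . mlogit rep78 mpg displ
        . mfx compute, predict(outcome(1)) nose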
Saved Results

mfx saves in e():

Scalars
        e(Xmfx_y)           value of y given X

Macros
        e(Xmfx_type)        dydx, eyex, eydx, or dyex
        e(Xmfx_discrete)    discrete or nodiscrete option
        e(Xmfx_cmd)         mfx
        e(Xmfx_dummy)       string corresponding to independent variables: 1 means dummy, 0 means continuous
        e(Xmfx_label_p)     label of the predict option
        e(Xmfx_method)      linear or nonlinear

Matrices
        e(Xmfx_dydx)        marginal effects
        e(Xmfx_se_dydx)     standard errors of the marginal effects
        e(Xmfx_eyex)        elasticities of form eyex
        e(Xmfx_se_eyex)     standard errors of elasticities of form eyex
        e(Xmfx_eydx)        elasticities of form eydx
        e(Xmfx_se_eydx)     standard errors of elasticities of form eydx
        e(Xmfx_dyex)        elasticities of form dyex
        e(Xmfx_se_dyex)     standard errors of elasticities of form dyex
        e(Xmfx_X)           points around which the marginal effects or elasticities were estimated
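The saved results can be inspected or reused in programs after mfx runs. A minimal sketch, reusing the logit example from the Remarks:

        . logit foreign mpg price
        . mfx compute
        . matrix list e(Xmfx_dydx)
        . matrix list e(Xmfx_se_dydx)
        . display e(Xmfx_y)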
Methods and Formulas

mfx is implemented as an ado-file.

Nonlinear method

Suppose the function to be calculated for the marginal effects is y = F(X, \beta). The marginal effect of independent variable x_i is

    m_i = \frac{\partial F(X,\beta)}{\partial x_i}

The variance of the marginal effect is calculated as

    \mathrm{Var}(m_i) = \frac{\partial m_i}{\partial \beta}\,\mathrm{Var}(\beta)\,\biggl(\frac{\partial m_i}{\partial \beta}\biggr)'

where

    \frac{\partial m_i}{\partial \beta} = \biggl(\frac{\partial m_i}{\partial \beta_1},\ \frac{\partial m_i}{\partial \beta_2},\ \ldots,\ \frac{\partial m_i}{\partial \beta_n}\biggr)
334
mfx -- Obtain marginal effects or elasticities after estimation
Linear method y meets the linear-form
restriction if
1. y = F(XlOl
, X2_2 ....
, Xs/3s),
2. X1, X2 .....
X8 are mutually exclusive.
Under the linear-form restriction, be calculated as
where s is the number of equations,
tile marginal effects of independent
and
variable j in equation i can
OF(X,_) mij
--
_
"/Jij
The variance of rr,ij is !
var(.',,u) = L-5__I
L o_ J
where
CQ/_
Ornij/c)3kt
\0_11
'
"'"
'
O_Xs
'
"'"
'
Oq_sl
"'"
'
O_ss
]
can be calculated as Omij
02 F
O_kl- o(x_)o(xk_)_7,z where I(i -- k,j
OF
+ Ox_ _r(i= k,j = l)
= l) .- 1 if i = k and j -- l, and = zero otherwise.
Also See

Related:        [R] probit, [R] truncreg

Background:     [U] 23 Estimation and post-estimation commands,
                [R] predict, [P] _predict
Title

mkdir -- Create directory

Syntax

mkdir directory_name [, public]

Double quotes may be used to enclose the directory name, and the quotes must be used if the directory name contains embedded blanks.

Description

mkdir creates a new directory (folder).

Options

public specifies that directory_name is to be readable by everyone; otherwise, the directory will be created according to the default permissions of your operating system.

Remarks

Examples:

Windows
        . mkdir myproj
        . mkdir c:\projects\myproj
        . mkdir "c:\My Projects\Project 1"

Unix
        . mkdir myproj
        . mkdir ~/projects/myproj

Macintosh
        . mkdir myproj
        . mkdir :hdisk:projects:project1
        . mkdir ":Hard Disk:My Projects:Project 1"

Also See

Related:        [R] cd, [R] copy, [R] dir, [R] erase, [R] shell, [R] type

Background:     [U] 14.6 File-naming conventions
Title

mkspline -- Linear spline construction

Syntax

mkspline newvar1 #1 [newvar2 #2 [...]] newvark = oldvar [if exp] [in range] [, marginal ]

mkspline stubname # = oldvar [if exp] [in range] [, marginal pctile ]

Description

mkspline creates variables containing a linear spline of oldvar.

In the first syntax, mkspline creates newvar1, ..., newvark containing a linear spline of oldvar with knots at the specified #1, ..., #k-1.

In the second syntax, mkspline creates # variables named stubname1, ..., stubname# containing a linear spline of oldvar. The knots are equally spaced over the range of oldvar or are placed at the percentiles of oldvar.

Options

marginal specifies that the new variables are to be constructed so that, when used in estimation, the coefficients represent the change in the slope from the preceding interval. The default is to construct the variables so that, when used in estimation, the coefficients measure the slopes for the intervals.

pctile is allowed only with the second syntax. It specifies that the knots are to be placed at percentiles of the data rather than equally spaced based on the range.

Remarks

Linear splines allow estimating the relationship between y and x as a piecewise linear function. A piecewise linear function is just that: a function composed of linear segments, that is, straight lines. One linear segment represents the function for values of x below x0. Another linear segment handles values between x0 and x1, and so on. The linear segments are arranged so that they join at x0, x1, ..., which are called the knots. An example of a piecewise linear function is shown below.
(figure omitted: an example of a piecewise linear function)
> Example

You wish to estimate a model of log income on education and age using a piecewise linear function for age:

    \mathtt{lninc} = b_0 + b_1\,\mathtt{educ} + f(\mathtt{age}) + u

The knots are to be at ten-year intervals: 20, 30, 40, 50, and 60.

        . mkspline age1 20 age2 30 age3 40 age4 50 age5 60 age6 = age, marginal
        . regress lninc educ age1-age6
        (output omitted)

Since you specified the marginal option, you could test whether the age effect is the same in the 30-40 and 40-50 intervals by asking whether the age4 coefficient were zero. With the marginal option, coefficients measure the change in slope from the preceding group. Specifying marginal changes only the interpretation of the coefficients; the same model is estimated in either case. That is, without the marginal option, the interpretation of the coefficients would have been
    \frac{dy}{d\,\mathtt{age}} =
    \begin{cases}
        a_1 & \text{if } \mathtt{age} < 20 \\
        a_2 & \text{if } 20 \le \mathtt{age} < 30 \\
        a_3 & \text{if } 30 \le \mathtt{age} < 40 \\
        a_4 & \text{if } 40 \le \mathtt{age} < 50 \\
        a_5 & \text{if } 50 \le \mathtt{age} < 60 \\
        a_6 & \text{otherwise}
    \end{cases}

With the marginal option specified, the interpretation is

    \frac{dy}{d\,\mathtt{age}} =
    \begin{cases}
        a_1 & \text{if } \mathtt{age} < 20 \\
        a_1 + a_2 & \text{if } 20 \le \mathtt{age} < 30 \\
        a_1 + a_2 + a_3 & \text{if } 30 \le \mathtt{age} < 40 \\
        a_1 + a_2 + a_3 + a_4 & \text{if } 40 \le \mathtt{age} < 50 \\
        a_1 + a_2 + a_3 + a_4 + a_5 & \text{if } 50 \le \mathtt{age} < 60 \\
        a_1 + a_2 + a_3 + a_4 + a_5 + a_6 & \text{otherwise}
    \end{cases}
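A quick way to convince yourself that the two parameterizations fit the same model is to compare fitted values. A sketch, assuming the lninc, educ, and age variables of this example are in memory (m1-m6, s1-s6, fitm, and fits are names we made up):

        . mkspline m1 20 m2 30 m3 40 m4 50 m5 60 m6 = age, marginal
        . quietly regress lninc educ m1-m6
        . predict double fitm
        . mkspline s1 20 s2 30 s3 40 s4 50 s5 60 s6 = age
        . quietly regress lninc educ s1-s6
        . predict double fits
        . assert abs(fitm - fits) < 1e-6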
> Example

As a second example, pretend you have a binary outcome variable called outcome. You are beginning an analysis and wish to parameterize the effect of dosage on outcome. You wish to divide the data into five equal-width groups of dosage for the piecewise linear function.

        . mkspline dose 5 = dosage
        . logistic outcome dose1-dose5
        (output omitted)

mkspline dose 5 = dosage creates five variables, dose1, dose2, ..., dose5, equally spacing the knots over the range of dosage. If dosage varied between 0 and 100, mkspline dose 5 = dosage has the same effect as typing

        . mkspline dose1 20 dose2 40 dose3 60 dose4 80 dose5 = dosage

The pctile option sets the knots to divide the data into five equal sample-size groups rather than five equal-width ranges. Typing

        . mkspline dose 5 = dosage, pctile

places the knots at the 20th, 40th, 60th, and 80th percentiles of the data.
Methods and Formulas

mkspline is implemented as an ado-file.

Let V_i, i = 1, \ldots, n, be the variables to be created; k_i, i = 1, \ldots, n-1, be the corresponding knots; and V be the original variable (the command is mkspline V1 k1 V2 k2 ... Vn = V). Then

    V_1 = \min(V, k_1)
    V_i = \max\{\min(V, k_i),\, k_{i-1}\} - k_{i-1}, \qquad i = 2, \ldots, n

where k_n is taken to be +\infty. If the marginal option is specified, the definitions are

    V_1 = V
    V_i = \max(0,\, V - k_{i-1}), \qquad i = 2, \ldots, n

In the second syntax, mkspline stubname # = V, let m and M be the minimum and maximum of V. Without the pctile option, knots are set at m + (M - m)(i/n) for i = 1, \ldots, n-1. If pctile is specified, knots are set at the 100(i/n) percentiles, for i = 1, \ldots, n-1. Percentiles are calculated by egen's pctile() function.
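The first definition is easy to verify by hand. A small sketch with one knot at 3000, using the auto dataset's weight variable (w1, w2, and the *chk variables are hypothetical names; the tolerance allows for mkspline's float storage):

        . mkspline w1 3000 w2 = weight
        * recompute the spline pieces directly from the definitions
        . generate double w1chk = min(weight, 3000)
        . generate double w2chk = max(weight, 3000) - 3000
        . assert abs(w1 - w1chk) < .01 & abs(w2 - w2chk) < .01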
References

Gould, W. W. 1993. sg19: Linear splines and piecewise linear functions. Stata Technical Bulletin 15: 13-17. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 98-104.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Newson, R. 2000. B-splines parameterized by their values at reference points on the x-axis. Stata Technical Bulletin 57: 20-27.

Panis, C. 1994. sg24: The piecewise linear spline transformation. Stata Technical Bulletin 18: 27-29. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 146-149.

Also See

Related:        [R] fracpoly
Title

ml -- Maximum likelihood estimation

Syntax

ml model method progname eq [eq ...] [weight] [if exp] [in range] [, robust cluster(varname) title(string) nopreserve collinear missing lf0(#k #ll) continue waldtest(#) constraints(numlist) obs(#) noscvars ]

ml clear

ml query

ml check

ml search [ [/]eqname[:] #lb #ub ] [...] [, repeat(#) nolog trace restart norescale ]

ml plot [eqname:]name [# [# [#]]] [, saving(filename[, replace]) ]

ml init { [eqname:]name=# | /eqname=# } [...]

ml init # [# ...], copy

ml init matname [, skip copy ]

ml report

ml trace { on | off }

ml count [ clear | on | off ]

ml maximize [, difficult nolog trace gradient hessian showstep iterate(#) ltolerance(#) tolerance(#) nowarning novce score(newvarnames) nooutput level(#) eform(string) noclear ]

ml graph [#] [, saving(filename[, replace]) ]

ml display [, noheader eform(string) first neq(#) plus level(#) ]

where method is { lf | d0 | d1 | d1debug | d2 | d2debug }

and eq is the equation to be estimated, enclosed in parentheses, and optionally with a name to be given to the equation, preceded by a colon:

        ( [eqname:] [varnames =] [varnames] [, eq_options] )

or eq is the name of a parameter, such as sigma, with a slash in front:

        /eqname        which is equivalent to        (eqname:)

and eq_options are

        noconstant
        offset(varname)
        exposure(varname)

fweights, pweights, aweights, and iweights are allowed; see [U] 14.1.6 weight. With all but method lf, you must write your likelihood-evaluation program a certain way if pweights are to be specified, and pweights may not be specified with method d0.

ml shares features of all estimation commands; see [U] 23 Estimation and post-estimation commands. To redisplay results, type ml display.
Syntax of ml model in noninteractive mode

ml model method progname eq [eq ...] [weight] [if exp] [in range], maximize [ robust cluster(varname) title(string) nopreserve collinear missing lf0(#k #ll) continue waldtest(#) constraints(numlist) obs(#) noscvars init(ml_init_args) search({on|quietly|off}) repeat(#) bounds(ml_search_bounds) difficult nolog trace gradient hessian showstep iterate(#) ltolerance(#) tolerance(#) nowarning novce score(newvarlist) ]

Noninteractive mode is invoked by specifying option maximize. Use maximize when ml is to be used as a subroutine of another ado-file or program and you want to carry forth the problem, from definition to posting of final results, in one command.
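For instance, a program might issue the entire estimation in one command. A sketch, where myprog and the equation are placeholders for a real evaluator and model:

        . ml model lf myprog (foreign = mpg weight), maximize search(quietly)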
Syntax of subroutines for use by method d0, d1, and d2 evaluators

mleval newvarname = vecname [, eq(#)]

mleval scalarname = vecname, scalar [eq(#)]

mlsum scalarname_lnf = exp [if exp] [, noweight]

mlvecsum scalarname_lnf rowvecname = exp [if exp] [, eq(#)]

mlmatsum scalarname_lnf matrixname = exp [if exp] [, eq(#[,#])]
Syntax of user-written evaluator

Summary of notation

The log-likelihood function is ln L(θ1j, θ2j, ..., θEj), where θij = xij bi, j = 1, ..., N indexes observations, and i = 1, ..., E indexes the linear equations defined by ml model. If the likelihood satisfies the linear-form restrictions, it can be decomposed as ln L = Σ(j=1,...,N) ln ℓ(θ1j, θ2j, ..., θEj).

Method lf evaluators:

        program define progname
                version 7
                args lnf theta1 [theta2 ...]
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                quietly gen double `tmp1' = ...
                ...
                quietly replace `lnf' = ...
        end

where

        `lnf'       variable to be filled in with observation-by-observation values of ln ℓj
        `theta1'    variable containing evaluation of 1st equation θ1j = x1j b1
        `theta2'    variable containing evaluation of 2nd equation θ2j = x2j b2
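As a concrete illustration of the skeleton (our own example, not part of the original skeleton), here is a minimal method lf evaluator for a probit model; myprobit is a name we made up, and norm() is Stata's cumulative standard normal:

        program define myprobit
                version 7
                args lnf theta1
                * ln L_j = ln Phi(theta) if y==1, ln Phi(-theta) if y==0
                quietly replace `lnf' = ln(norm( `theta1')) if $ML_y1 == 1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1 == 0
        end

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize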
Method d0 evaluators:

        program define progname
                version 7
                args todo b lnf
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                ...
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
        end

where

        `todo'    always contains 0 (may be ignored)
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
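Continuing the probit illustration in method d0 style (again a sketch of our own; mlsum handles the weighting and the summation):

        program define myprobit_d0
                version 7
                args todo b lnf
                tempvar theta1
                mleval `theta1' = `b', eq(1)
                * overall ln L = sum of ln Phi(theta) or ln Phi(-theta)
                mlsum `lnf' = ln(norm(cond($ML_y1==1, `theta1', -`theta1')))
        end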
Method d1 evaluators:

        program define progname
                version 7
                args todo b lnf g negH [g1 [g2 ... ]]
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2 ...
                mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
                mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
                matrix `g' = (`d1',`d2', ... )
        end

where

        `todo'    contains 0 or 1; 0 means `lnf' is to be filled in; 1 means `lnf' and `g' are to be filled in
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
        `g'       row vector to be filled in with overall g=∂ln L/∂b
        `negH'    argument to be ignored
        `g1'      variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'      variable optionally to be filled in with ∂ln ℓj/∂b2
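The same probit model in method d1 style must also fill in the gradient. For probit, the equation-level score is φ(θ)/Φ(θ) for positive outcomes and -φ(θ)/Φ(-θ) otherwise. A sketch (normd() is the standard normal density):

        program define myprobit_d1
                version 7
                args todo b lnf g negH g1
                tempvar theta1
                mleval `theta1' = `b', eq(1)
                mlsum `lnf' = ln(norm(cond($ML_y1==1, `theta1', -`theta1')))
                if `todo'==0 | `lnf'==. { exit }
                tempname d1
                * equation-level score of the probit log likelihood
                mlvecsum `lnf' `d1' = cond($ML_y1==1, normd(`theta1')/norm(`theta1'), -normd(`theta1')/norm(-`theta1')), eq(1)
                matrix `g' = (`d1')
        end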
Method d2 evaluators:

        program define progname
                version 7
                args todo b lnf g negH [g1 [g2 ... ]]
                tempvar theta1 theta2 ...
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)        /* if there is a θ2 */
                /* if you need to create any intermediate results: */
                tempvar tmp1 tmp2 ...
                gen double `tmp1' = ...
                ...
                mlsum `lnf' = ...
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2 ...
                mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
                mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
                matrix `g' = (`d1',`d2', ... )
                if `todo'==1 | `lnf'==. { exit }
                tempname d11 d12 d22 ...
                mlmatsum `lnf' `d11' = formula for -∂²ln ℓj/∂θ1j², eq(1)
                mlmatsum `lnf' `d12' = formula for -∂²ln ℓj/∂θ1j∂θ2j, eq(1,2)
                mlmatsum `lnf' `d22' = formula for -∂²ln ℓj/∂θ2j², eq(2)
                matrix `negH' = (`d11',`d12', ... \ `d12'',`d22', ... )
        end

where

        `todo'    contains 0, 1, or 2; 0 means `lnf' is to be filled in; 1 means `lnf' and `g' are to be
                  filled in; 2 means `lnf', `g', and `negH' are to be filled in
        `b'       full parameter row vector b=(b1,b2,...,bE)
        `lnf'     scalar to be filled in with overall ln L
        `g'       row vector to be filled in with overall g=∂ln L/∂b
        `negH'    matrix to be filled in with overall negative Hessian -H=-∂²ln L/∂b∂b'
        `g1'      variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'      variable optionally to be filled in with ∂ln ℓj/∂b2
Global macros for use by all evaluators

        $ML_y1      name of first dependent variable
        $ML_y2      name of second dependent variable, if any
        $ML_samp    variable containing 1 if observation is to be used; 0 otherwise
        $ML_w       variable containing weight associated with observation or 1 if no weights specified

Method lf evaluators can ignore $ML_samp, but restricting calculations to the $ML_samp==1 subsample will speed execution. Method lf evaluators must ignore $ML_w; application of weights is handled by the method itself.

Method d0, d1, and d2 evaluators can ignore $ML_samp as long as ml model's nopreserve option is not specified. Method d0, d1, and d2 evaluators will run more quickly if nopreserve is specified. Method d0, d1, and d2 evaluators can ignore $ML_w only if they use mlsum, mlvecsum, and mlmatsum to produce final results.
Description

ml clear clears the current problem definition. This command is rarely, if ever, used because when you type ml model, any previous problem is automatically cleared.

ml model defines the current problem.

ml query displays a description of the current problem.

ml check verifies that the log-likelihood evaluator you have written seems to work. We strongly recommend using this command.

ml search searches for (better) initial values. We recommend using this command.

ml plot provides a graphical way of searching for (better) initial values.

ml init provides a way of setting initial values to user-specified values.

ml report reports the values of ln L, its gradient, and its negative Hessian at the initial values or current parameter estimates b0.

ml trace traces the execution of the user-defined log-likelihood evaluation program.

ml count counts the number of times the user-defined log-likelihood evaluation program is called. It was intended as a debugging tool for those developing ml, and it now serves little use besides entertainment. ml count clear clears the counter, ml count on turns on the counter, ml count without arguments reports the current values of the counters, and ml count off stops counting calls.

ml maximize maximizes the likelihood function and reports final results. Once ml maximize has successfully completed, the previously mentioned ml commands may no longer be used; ml graph and ml display may be used.

ml graph graphs the log-likelihood values against the iteration number.

ml display redisplays final results.

progname is the name of a program you write to evaluate the log-likelihood function. In this documentation, it is referred to as the user-written evaluator or sometimes simply as the evaluator. The program you write is written in the style required by the method you choose. The methods are lf, d0, d1, and d2. Thus, if you choose to use method lf, your program is called a method lf evaluator.

Method lf evaluators are required to evaluate the observation-by-observation log likelihood ln ℓj, j = 1, ..., N.

Method d0 evaluators are required to evaluate the overall log likelihood ln L.

Method d1 evaluators are required to evaluate the overall log likelihood and its gradient vector g = ∂ln L/∂b.

Method d2 evaluators are required to evaluate the overall log likelihood, its gradient, and its negative Hessian matrix -H = -∂²ln L/∂b∂b'.

mleval is a subroutine used by method d0, d1, and d2 evaluators to evaluate the coefficient vector b that they are passed.

mlsum is a subroutine used by method d0, d1, and d2 evaluators to define the value ln L that is to be returned.

mlvecsum is a subroutine used by method d1 and d2 evaluators to define the gradient vector g that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.

mlmatsum is a subroutine used by method d2 evaluators to define the negative Hessian matrix -H that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.
Options for use with ml model in interactive or noninteractive mode

robust and cluster(varname) specify the robust variance estimator, as does specifying pweights.

If you have written a method lf evaluator, robust, cluster(), and pweights will work. There is nothing to do except specify the options.

If you have written a method d0 evaluator, robust, cluster(), and pweights will not work. Specifying these options will result in an error message.

If you have written a method d1 or d2 evaluator and the likelihood function satisfies the linear-form restrictions, robust, cluster(), and pweights will work only if you fill in the equation scores; otherwise, specifying these options will result in an error message.

title(string) specifies the title to be placed on the estimation output when results are complete.
in the case of method If because the user-written values and it is ml itself that sums the components.
ml goes through these machinations if and only if the estimation sample is a subsample of the data in memory. If the estimation sample includes every observation in memory, ml does not preserve the original dataset. Thus. programmers must not damage the original dataset unless they preserve the data themselves. We recommend interactive users of ml not specify nopreserve; chances of incorrect results.
the speed gain is not worth the
We recommend programmers do specify nopreserve, but only after verifying that their evaluator really does restrict its attentions solely to the $ML_samp==l subsample. collinear bpecifies that ml is not to remove the collinear variables within equations. There is no reason one would want to leave coltinear variables in place, but this option is of interest to programmers who. in their code. have already removed collinear variables and thus do not want ml to waste computer time checking again.
+ {
I ,
_
ml -- Maximum likelihoodestimation
345
missings])ecifies that observations containing variables with missing values are not to be eliminated from th_ estimation somple. There are two reasons one might want to specify missing:
! _
Prograrr _ers may wi_h to specify" missingbecause, in other parts of their code, they have already eliminat_ observatio+hs with missing values and thus do not want ml to waste computer time
i ! + •
looking again. All user_ may wish tO specify missingif their model explicitly deals with missing values. Stata's heckmaa command i_ a good example of this. In such cases, there will be observations where
_
missing values are allbwed and other observations where they are not--where their presence should cause the observationl to be eliminated. If you specify missing,it is your responsibility to specify an if e rp that elimidates the irrelevant obserVations.
_; !
If0(#k #u_ is typically _sed by programmers. It Specifies the number of parameters and log-likelihood value o_the constant-only model so that ml++ can report a likelihood-ratio test rather than a Wald
i
i } + + i
+_
test. Th_se values w_re, or they may have been determined by t + perhaps, analytically_idetermined, • • a previous estimation! of the constant-only m6del on the estimation sample. "
1
Also so the continueoption directly below: If you specify IfO(),it must be safe for you to specify the missing option, too, else how did you cal
even if specified, is ignored if robus,t,
cluster(),
or pweights
is specified because in
that casf a likelihood-ratio test would be inappropriate. continue is typically specified by programmers. It does two things:
!
First, it specifies tha_ a model has just bee_ estimated, by either ml or some other estimation
i !
commmtd such as l_git,and that the likelihood value stored in e(ll) and the number of parameters stored in ]e(b) as of this instant are the relevant values of the constant-only model.
i
The cmrent value of!the log likelihood is used to present a likelihood-ratio test unless robust. cluster(), or pwe_ghts is specified because, in that case. a likelihood-ratio test would be inappro _riate. +
+ + ;
_
Second. continue e(b)
sets the initial values b0 for the model about to be estimated according to the
c_trrently sto
re+ ,
The cot relents madel about specifying missing
with lIO () apply equally well in this case.
wa!dtest #) is typically specified by programmers. By default, ml presents a Wald test. but that is overridclen if option_ lf0() or continue are specified, and that is overridden again _so we are back tolthe Wald testi if robust, watdt
cluster(),
or pweights
are specified.
st (0) prevet_ts even the Wald test frOm being reported. !
waldte_st (-1)
"
!
"
"
"
"
"
t
+stht_ default. It specifies that a Wald test _s to be performed, ff tt _s performed, b}
! ;
constra@ng all coef_cients except for the intercept to 0 in the first equation. Remaining equations are to b_ unconstrai@d. The logic as to whet_er a Wald test is performed is the standard: perform
! +
the Wald test if nei!_er lf0() nor continue cluster, or pweig_ts were specified.
+ i
were specified, but force a Wald test if robust.
watdtest(k) for < --I specifies that a Wald test is to be performed, if it is performed, by constraining all coefficients except for intercepts to 0 in the first hi equations: remaining equations are to e unconstrair_ed The logic as to whether a Wald tes_ is performed is the standard.
+
_ l:J
o,m ml m Maximum iiKellrlOOdestimation waldtest (k) for k > 1 works like the above except that it forces a Wald test to be reported even if the infommtion to perform the likelihood-ratio test is available and even if none of robust, cluster, or pweights were specified, waldtest(k), k > 1, may not be specified with lf0(). constraints (numlist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined using the constraint command and are numbered: see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts. obs (#) is used mostly by pro_mers. It specifies that ultimately stored in e (N), is to be #. Ordinarily, ml Programmers may want to specify th_s option when, in for N observations, they first had to modify the dataset observations.
the number of observations reported, and works that out for itself, and correctly. order for the likelihood-evaluator to work so that it contained a different number of
noscvars is used mostly by programmers. It specifies that method dO, dl, or d2 is being used but that the likelihood evaluation program does not calculate nor use arguments "gl ", "g2", etc., which are the score vectors. Thus, m! can save a little time by not generating and passing those arguments.
Options for use with ml model in noninteractive mode

In addition to the above options, the following options are for use with ml model in noninteractive mode. Noninteractive mode is for programmers who use ml as a subroutine and want to issue a single command that will carry forth the estimation from start to finish.

maximize is not optional. It specifies noninteractive mode.

init(ml_init_args) sets the initial values b0. ml_init_args are whatever you would type after the ml init command.

search({on|quietly|off}) specifies whether ml search is to be used to improve the initial values. search(on) is the default and is equivalent to running separately ml search, repeat(0). search(quietly) is the same as search(on) except that it suppresses ml search's output. search(off) prevents the calling of ml search altogether.

repeat(#) is ml search's repeat() option and is relevant only if search(off) is not specified. repeat(0) is the default.

bounds(ml_search_bounds) is relevant only if search(off) is not specified. The command ml model issues is 'ml search ml_search_bounds, repeat(#)'. Specifying search bounds is optional.

difficult, nolog, trace, gradient, hessian, showstep, iterate(), ltolerance(), tolerance(), nowarning, novce, and score() are ml maximize's equivalent options.

Options for use when specifying equations

noconstant specifies that the equation is not to include an intercept.

offset(varname) specifies that the equation is to be xb + varname; that is, the equation is to include varname with coefficient constrained to be 1.

exposure(varname) is an alternative to offset(varname); it specifies that the equation is to be xb + ln(varname). The equation is to include ln(varname) with coefficient constrained to be 1.
Options for use with ml search

repeat(#) specifies the number of random attempts that are to be made to find a better initial-value vector. The default is repeat(10).

repeat(0) specifies that no random attempts are to be made. More correctly, repeat(0) specifies that no random attempts are to be made if the initial initial-value vector is a feasible starting point. If it is not, ml search will make random attempts even if you specify repeat(0), because it has no alternative. The repeat() option refers to the number of random attempts to be made to improve the initial values. When the initial starting-value vector is not feasible, ml search will make up to 1,000 random attempts to find starting values. It stops the instant it finds one set of values that works and then moves into its improve-initial-values logic.

repeat(k), k > 0, specifies the number of random attempts to be made to improve the initial values.

nolog specifies that no output is to appear while ml search looks for better starting values. If you specify nolog and the initial starting-value vector is not feasible, ml search will ignore the fact that you specified the nolog option. If ml search must take drastic action to find starting values, it feels you should know about this even if you attempted to suppress its usual output.

trace specifies that you want more detailed output about ml search's actions than it would usually provide. This is more entertaining than useful. ml search prints a period each time it evaluates the likelihood function without obtaining a better result and a plus when it does.

restart specifies that random actions are to be taken to obtain starting values and that the resulting starting values are not to be a deterministic function of the current values. Users should not specify this option, mainly because, with restart, ml search intentionally does not produce as good a set of starting values as it could. restart is included for use by the optimizer when it gets into serious trouble. The random actions are to ensure that the actions of the optimizer and ml search, working together, do not result in a long, endless loop.

restart implies norescale, which is why we recommend you do not specify restart. In testing, cases were discovered where rescale worked so well that, even after randomization, the rescaler would bring the starting values right back to where they had been the first time and so defeated the intended randomization.

norescale specifies that ml search is not to engage in its rescaling actions to improve the parameter vector. We do not recommend specifying this option because rescaling tends to work so well.
Options for use with ml plot

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml init

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
348
ml -- Maximum likelihood estimation
Options for use with mi maximize difficult specifies that the likelihood function is likely to be difficult to maximize. In particular, difficult states that there may be regions where -H is not invertible and that, in those regions, ml's standard fixup may not work well. difficult specifies that a different fixup requiring substantially more computer time is to be used. For the majority of likelihood functions, difficult is likely to increase execution times unnecessarily. For other likelihood functions, specifying difficult is of great importance. nolog,
trace,
gradient,
hessian,
and showstep
control the display of the iteration log.
nolog
suppresses reporting
trace
adds to the iteration log a report on the current parameter
gradient
vector,
adds to the iteration log a report on the current gradient vector.
hessian
adds to the iteration log a report on the current negative Hessian matrix.
showstep iterate
of the iteration log.
adds to the iteration log a report on the steps within iteration.
(#), ltolerance
iterate(16000) Convergence
(#), and tolerance
tolerance(le-6)
(#) specify the definition of convergence.
ltolerance(le-7)
is the default.
is declared when mreldif(bi+l,bi) _< tolerance () or
reldif{lnL(bi+l),InL(bi)}< Itolerance()
In addition, iteration stops when i -- iterate(); in that case. results along with the message "convergence not achieved" are presented. The return code is still set to 0. nowarning is allowed only with iterate (0). nowarning suppresses the "convergence not achieved" message. Programmers might specify iterate (0) nowarning when they have a vector b already containing what are the final estimate,; and want ml to calculate the variance matrix and post final estimation results. In that case, specify 'init(b) search(off) iterate(0) nowarning notog'. novce is allowed only with iterate (0). novce substitutes the zero matrix for the variance matrix which in effect posts estimation results as fixed constants. score (newvarlist) specifies that the equation scores are to be stored in the specified new variables. Either specify one new variable name per equation or specify a shorl name suffixed with a *. E.g., score(sc*) would be taken as specifying scl if there were one equation and scl and sc2 if there were two equations. In order to specify score(), either you must be using method If, or the estimation subsample must be the entire dataset in memory, or you must have specified the nopreserve option. nooutput quietly
suppresses display of the final results. This is different from prefixing ral maximize in that the iteration log is still displayed (assuming nolog is not specified).
with
level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is leveI(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals, eform(string)
is ml display's
eform()
option.
noclear specifies that after the model has converged, the ml problem definition is not to be cleared. Perhaps you are having convergence problems and intend to run the model to convergence. If so, use ml search to see if those values can be improved, and then start the estimation again.
ml -- Maximumlikelihood estimation
349
!
i
__
! ! l ,
Options for use with ml graph

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.
Options for use with ml display

noheader suppresses the display of the header above the coefficient table that displays the final log-likelihood value, the number of observations, and the model significance test.

eform(string) displays the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. Display of the intercept, if any, is suppressed. string is the table header that will be displayed above the transformed coefficients and must be 11 characters or fewer in length, for example, eform("Odds ratio").

first displays a coefficient table reporting results for the first equation only, and the report makes it appear that the first equation is the only equation. This is used by programmers who estimate ancillary parameters in the second and subsequent equations and will report the values of such parameters themselves.

neq(#) is an alternative to first. neq(#) displays a coefficient table reporting results for the first # equations. This is used by programmers who estimate ancillary parameters in the #+1st and subsequent equations and will report the values of such parameters themselves.

plus displays the coefficient table just as it would be ordinarily, but then, rather than ending the table in a line of dashes, ends it in dashes-plus-sign-dashes. This is so that programmers can write additional display code to add more results to the table and make it appear as if the combined result is one table. Programmers typically specify plus with options first or neq().

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
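For instance, after estimating a model whose coefficients are log odds-ratios, one might redisplay the table in exponentiated form (a sketch using the eform() string suggested above):

        . ml display, eform("Odds ratio")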
Options for use with mleval

eq(#) specifies the equation number i for which θij = xij bi is to be evaluated. eq(1) is assumed if eq() is not specified.

scalar asserts that the ith equation is known to evaluate to a constant; the equation was specified as (), (name:), or /name on the ml model statement. If you specify this option, the new variable created is created as a scalar. If the ith equation does not evaluate to a scalar, an error message is issued.

Options for use with mlsum

noweight specifies that weights ($ML_w) are to be ignored when summing the likelihood function.

Options for use with mlvecsum

eq(#) specifies the equation for which a gradient vector ∂ln L/∂bi is to be constructed. The default is eq(1).

Options for use with mlmatsum

eq(#[,#]) specifies the equations for which the negative Hessian matrix is to be constructed. The default is eq(1), which means the same as eq(1,1), which means -∂²ln L/∂b1∂b1'. Specifying eq(i,j) results in -∂²ln L/∂bi∂bj'.
Remarks

For a thorough discussion of ml, see Maximum Likelihood Estimation with Stata (Gould and Sribney 1999). The book provides a tutorial introduction to ml, notes on advanced programming issues, and a discourse on maximum likelihood estimation from both theoretical and practical standpoints.

ml requires that you write a program that evaluates the log-likelihood function and, possibly, its first and second derivatives. The style of the program you write depends upon the method chosen: methods lf and d0 require that your program evaluate the log likelihood only; method d1 requires that your program evaluate the log likelihood and gradient; method d2 requires that your program evaluate the log likelihood, gradient, and negative Hessian. Methods lf and d0 differ from each other in that, with method lf, your program is required to produce observation-by-observation log-likelihood values ln ℓ_j, and it is assumed that ln L = Σ_j ln ℓ_j; with method d0, your program is required to produce the overall value ln L.

Once you have written the program--called an evaluator--you define a model to be estimated using ml model and obtain estimates using ml maximize. You might type

        . ml model ...
        . ml maximize
but we recommend that you type

        . ml model ...
        . ml check
        . ml search
        . ml maximize
ml check will verify that your evaluator has no obvious errors, and ml search will find better initial values.

You fill in the ml model statement with (1) the method you are using, (2) the name of your program, and (3) the "equations". You write your evaluator in terms of θ_1, θ_2, ..., each of which has a linear equation associated with it. That linear equation might be as simple as θ_i = b_0, or it might be θ_i = b_1 mpg + b_2 weight + b_3, or it might omit the intercept b_3. The equations are specified in parentheses on the ml model line.

Suppose you are using method lf and the name of your evaluator program is myprog. The following statement

        . ml model lf myprog (mpg weight)

would specify a single equation with θ_1 = b_1 mpg + b_2 weight + b_3. If you wanted to omit b_3, you would type

        . ml model lf myprog (mpg weight, nocons)
and if all you wanted was θ_1 = b_0, you would type

        . ml model lf myprog ()

With multiple equations, you list the equations one after the other; so if you typed

        . ml model lf myprog (mpg weight) ()
you would be specifying θ_1 = b_1 mpg + b_2 weight + b_3 and θ_2 = b_4. You would write your likelihood in terms of θ_1 and θ_2. If the model were linear regression, θ_1 might be the xb part and θ_2 the variance of the residuals.

When you specify the equations, you also specify any dependent variables. If you type

        . ml model lf myprog (price = mpg weight) ()

price would be the one and only dependent variable, and that would be passed to your program in $ML_y1. If your model had two dependent variables, you could type

        . ml model lf myprog (price displ = mpg weight) ()

and then $ML_y1 would be price and $ML_y2 would be displ. You can specify however many dependent variables are necessary and specify them on any equation. It does not matter on which equation you specify them; the first one specified is placed in $ML_y1, the second in $ML_y2, and so on.
Example

Using method lf, we are to produce observation-by-observation values of the log likelihood. The probit log-likelihood function is

        ln ℓ_j = ln Φ(θ_1j)         if y_j = 1
               = ln Φ(-θ_1j)        if y_j = 0

        θ_1j = x_j b
The following is the method lf evaluator for this likelihood function:

        program define myprobit
                version 7.0
                args lnf theta1
                quietly replace `lnf' = ln(norm(`theta1'))  if $ML_y1==1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1==0
        end
If we wanted to estimate a model of foreign on mpg and weight, we would type

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

The 'foreign =' part specifies that y is foreign. The 'mpg weight' part specifies that θ_1j = b_1 mpg_j + b_2 weight_j + b_3. The result of running this is
        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize
        initial:       log likelihood =  -51.29289
        alternative:   log likelihood = -45.055271
        rescale:       log likelihood = -45.055271
        Iteration 0:   log likelihood = -45.055271
        Iteration 1:   log likelihood = -27.904114
        Iteration 2:   log likelihood = -26.857800
        Iteration 3:   log likelihood = -26.844191
        Iteration 4:   log likelihood = -26.844189
        Iteration 5:   log likelihood = -26.844189

                                                        Number of obs   =         74
                                                        Wald chi2(2)    =      20.75
        Log likelihood = -26.844189                     Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
             foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
              weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
               _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
        ------------------------------------------------------------------------------
Example

A two-equation, two-dependent-variable model is little different. Rather than receiving one theta, our program will receive two. Rather than there being one dependent variable in $ML_y1, there will be dependent variables in $ML_y1 and $ML_y2. For instance, the Weibull regression log-likelihood function is

        ln ℓ_j = -(t_j e^(-θ_1j))^exp(θ_2j) + d_j{θ_2j - θ_1j + (e^(θ_2j) - 1)(ln t_j - θ_1j)}

        θ_1j = x_j b_1
        θ_2j = s

where t_j is the time of failure or censoring and d_j = 1 if failure and 0 if censored. We can make the log likelihood a little easier to program by introducing some extra variables:

        p_j = exp(θ_2j)
        M_j = {t_j exp(-θ_1j)}^(p_j)
        R_j = ln t_j - θ_1j

        ln ℓ_j = -M_j + d_j{θ_2j - θ_1j + (p_j - 1)R_j}
The method lf evaluator for this is

        program define myweib
                version 7.0
                args lnf theta1 theta2
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = ($ML_y1*exp(-`theta1'))^`p'
                quietly gen double `R' = ln($ML_y1) - `theta1'
                quietly replace `lnf' = -`M' + $ML_y2*(`theta2'-`theta1' + (`p'-1)*`R')
        end
We can estimate a model by typing

        . ml model lf myweib (studytime died = drug2 drug3 age) ()
        . ml maximize
Note that we specified '()' for the second equation. The second equation corresponds to the Weibull shape parameter s, and the linear combination we want for s contains just an intercept. Alternatively, we could type

        . ml model lf myweib (studytime died = drug2 drug3 age) /s

Typing /s means the same thing as typing (s:), and both really mean the same thing as (). The s, either after a slash or in parentheses before a colon, labels the equation. It makes the output look prettier, and that is all:
        . ml model lf myweib (studytime died = drug2 drug3 age) /s
        . ml maximize
        initial:       log likelihood =       -744
        alternative:   log likelihood = -356.14276
        rescale:       log likelihood = -200.80201
        rescale eq:    log likelihood = -136.69234
        Iteration 0:   log likelihood = -136.69234
        Iteration 1:   log likelihood = -124.11726  (not concave)
        Iteration 2:   log likelihood = -113.88918
        Iteration 3:   log likelihood = -110.30372
        Iteration 4:   log likelihood = -110.26747
        Iteration 5:   log likelihood = -110.26736
        Iteration 6:   log likelihood = -110.26736

                                                        Number of obs   =         48
                                                        Wald chi2(3)    =      35.25
        Log likelihood = -110.26736                     Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        eq1          |
               drug2 |   1.012966   .2903917    3.488   0.000     .4438086    1.582123
               drug3 |    1.45917   .2821195    5.172   0.000     .9062261    2.012114
                 age |  -.0671728   .0205688   -3.266   0.001    -.1074868   -.0268587
               _cons |   6.060723   1.152845    5.257   0.000     3.801188    8.320269
        -------------+----------------------------------------------------------------
        s            |
               _cons |   .5573333   .1402154    3.975   0.000     .2825162    .8321504
        ------------------------------------------------------------------------------
Example

Method d0 evaluators receive b = (b_1, b_2, ..., b_E), the coefficient vector, rather than the already evaluated θ_1, θ_2, ..., θ_E, and they are required to evaluate the overall log likelihood ln L rather than ln ℓ_j.

Use mleval to produce the thetas from the coefficient vector. Use mlsum to sum the components that enter into ln L. In the case of Weibull, ln L = Σ_j ln ℓ_j, and our method d0 evaluator is

        program define weib0
                version 7.0
                args todo b lnf
                tempvar theta1 theta2
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)
                local t "$ML_y1"        /* this is just for readability */
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = (`t'*exp(-`theta1'))^`p'
                quietly gen double `R' = ln(`t') - `theta1'
                mlsum `lnf' = -`M' + `d'*(`theta2'-`theta1' + (`p'-1)*`R')
        end

To estimate our model using this evaluator, we would type

        . ml model d0 weib0 (studytime died = drug2 drug3 age) /s
        . ml maximize
Technical Note

Method d0 does not require ln L = Σ_j ln ℓ_j, j = 1, ..., N, as method lf does. Your likelihood function might have independent components only for groups of observations. Panel-data estimators have a log-likelihood value ln L = Σ_i ln L_i, where i indexes the panels, each of which contains multiple observations. Conditional logistic regression has ln L = Σ_k ln L_k, where k indexes the risk pools. Cox regression has ln L = Σ_(t) ln L_(t), where (t) denotes the ordered failure times.
To evaluate such likelihood functions, first calculate the within-group log-likelihood contributions. This usually involves generate and replace statements prefixed with by, as in

        tempvar sumd
        by group: gen double `sumd' = sum($ML_y1)
Structure your code so that the log-likelihood contributions are recorded in the last observation of each group. Let's pretend that variable is named `cont'. To sum the contributions, code

        tempvar last
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'
It is of great importance that you inform mlsum as to the observations that contain log-likelihood values to be summed. First, you do not want to include intermediate results in the sum. Second, mlsum does not skip missing values. Rather, if mlsum sees a missing value among the contributions, it sets the overall result `lnf' to missing. That is how ml maximize is informed that the likelihood function could not be evaluated at the particular value of b. ml maximize will then take action to escape from what it thinks is an infeasible area of the likelihood function.

When the likelihood function violates the linear-form restriction ln L = Σ_j ln ℓ_j, j = 1, ..., N, with ln ℓ_j being a function solely of values within the jth observation, use method d0. In the following examples we will demonstrate methods d1 and d2 with likelihood functions that meet this linear-form restriction. The d1 and d2 methods themselves do not require the linear-form restriction, but the utility routines mlvecsum and mlmatsum do. Using method d1 or d2 when the restriction is violated is a difficult programming exercise.
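As an illustration of the grouped-likelihood mechanics just described, here is a minimal sketch of a method d0 evaluator for a Poisson likelihood summed within groups. The variable group, which identifies the panels, is an assumption of this sketch, and because Poisson contributions are in fact separable across observations, the grouping here serves only to demonstrate the by-group bookkeeping:

        program define mypois0
                version 7.0
                args todo b lnf
                tempvar theta cont last
                mleval `theta' = `b', eq(1)
                sort group              /* group identifies the panels (assumed) */
                /* running within-group sum of observation-level terms */
                quietly by group: gen double `cont' = sum($ML_y1*`theta' - exp(`theta') - lngamma($ML_y1+1))
                /* the group total sits in the last observation of each group */
                quietly by group: gen byte `last' = (_n==_N)
                mlsum `lnf' = `cont' if `last'
        end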
Example

Method d1 evaluators are required to produce the gradient vector g = ∂ln L/∂b as well as the overall log-likelihood value. Using mlvecsum, we can obtain ∂ln L/∂b from ∂ln L/∂θ_i, i = 1, ..., E. The derivatives of the Weibull log-likelihood function are

        ∂ln ℓ_j/∂θ_1j = p_j(M_j - d_j)

        ∂ln ℓ_j/∂θ_2j = d_j - R_j p_j (M_j - d_j)

The method d1 evaluator for this is
        program define weib1
                version 7.0
                args todo b lnf g                                       /* g is new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }                        /* <-- new */
                tempname d1 d2                                          /* <-- new */
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)              /* <-- new */
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)    /* <-- new */
                matrix `g' = (`d1',`d2')                                /* <-- new */
        end
We obtained this code by starting with our method d0 evaluator and then adding the extra lines that method d1 requires. To estimate our model using this evaluator, we could type

        . ml model d1 weib1 (studytime died = drug2 drug3 age) /s
        . ml maximize

but we recommend that you first substitute method d1debug for method d1 and type

        . ml model d1debug weib1 (studytime died = drug2 drug3 age) /s
        . ml maximize

Method d1debug will compare the derivatives we calculate with numerical derivatives and thus verify that our program is correct.

Example

Method d2 evaluators are required to produce negH = -∂²ln L/∂b∂b', the negative Hessian matrix, as well as the gradient and log-likelihood value. mlmatsum will help calculate ∂²ln L/∂b∂b' from the negative second derivatives with respect to theta. For the Weibull model, these negative second derivatives are

        -∂²ln ℓ_j/∂θ_1j²     = p_j² M_j

        -∂²ln ℓ_j/∂θ_1j∂θ_2j = -p_j(M_j - d_j + R_j p_j M_j)

        -∂²ln ℓ_j/∂θ_2j²     = p_j R_j (R_j p_j M_j + M_j - d_j)
The method d2 evaluator is

        program define weib2
                version 7.0
                args todo b lnf g negH                  /* negH added */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)
                matrix `g' = (`d1',`d2')
                if `todo'==1 | `lnf'==. { exit }        /* new from here down */
                tempname d11 d12 d22
                mlmatsum `lnf' `d11' = `p'^2*`M', eq(1)
                mlmatsum `lnf' `d12' = -`p'*(`M'-`d' + `R'*`p'*`M'), eq(1,2)
                mlmatsum `lnf' `d22' = `p'*`R'*(`R'*`p'*`M' + `M' - `d'), eq(2)
                matrix `negH' = (`d11',`d12' \ `d12'',`d22')
        end
We started with our previous method d1 evaluator and added the lines that method d2 requires. We could now estimate a model by typing

        . ml model d2 weib2 (studytime died = drug2 drug3 age) /s
        . ml maximize

but we would recommend that you first substitute method d2debug for method d2 and type

        . ml model d2debug weib2 (studytime died = drug2 drug3 age) /s
        . ml maximize
Method d2debug will compare the first and second derivatives we calculate with numerical derivatives and thus verify that our program is correct.

As we stated earlier, to produce the robust variance estimator with method lf, there is nothing to do except specify robust, cluster(), and/or pweights. For method d0, these options do not work. For methods d1 and d2, these options will work if your likelihood function meets the linear-form restrictions and you fill in the equation scores. The equation scores are defined as

        ∂ln ℓ_j/∂θ_1j ,   ∂ln ℓ_j/∂θ_2j ,   ...
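With method lf, for instance, nothing about the evaluator changes at all; using the myprobit program from the first example, a robust-variance fit is simply (a minimal sketch)

        . ml model lf myprobit (foreign = mpg weight), robust
        . ml maximize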
Your evaluator will be passed variables, one for each equation, which you fill in with the equation scores. For both methods d1 and d2, these variables are passed in the sixth and subsequent positions of the argument list. That is, you must process the arguments as

        args todo b lnf g negH g1 g2 ...

Note that for method d1, the `negH' argument is not used; it is merely a placeholder.
Example

If you have used mlvecsum in your method d1 or d2 evaluator, it is easy to turn it into a program that allows the computation of the robust variance estimator. The expression that you specified on the right-hand side of mlvecsum is the equation score. Here we turn the program that we gave earlier in the method d1 example into one that allows robust, cluster(), and/or pweights.
        program define weib1
                version 7.0
                args todo b lnf g negH g1 g2            /* negH, g1, and g2 are new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                quietly replace `g1' = `p'*(`M'-`d')                /* <-- new */
                quietly replace `g2' = `d' - `R'*`p'*(`M'-`d')      /* <-- new */
                mlvecsum `lnf' `d1' = `g1', eq(1)                   /* <-- changed */
                mlvecsum `lnf' `d2' = `g2', eq(2)                   /* <-- changed */
                matrix `g' = (`d1',`d2')
        end
To estimate our model and get the robust variance estimates, we type

        . ml model d1 weib1 (studytime died = drug2 drug3 age) /s, robust
        . ml maximize

Saved Results

ml saves in e():

Scalars
    e(N)           number of observations
    e(k)           number of parameters
    e(k_eq)        number of equations
    e(k_dv)        number of dependent variables
    e(df_m)        model degrees of freedom
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model (if LR saved in e(chi2type))
    e(N_clust)     number of clusters
    e(rc)          return code
    e(chi2)        chi-squared
    e(ic)          number of iterations
    e(rank)        rank of e(V)
    e(rank0)       rank of e(V) for constant-only model

Macros
    e(cmd)         ml
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(user)        name of likelihood-evaluator program
    e(opt)         type of optimization
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(cnslist)     constraint numbers

Matrices
    e(b)           coefficient vector
    e(ilog)        iteration log (up to 20 iterations)
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample

References

Gould, W. and W. Sribney. 1999. Maximum Likelihood Estimation with Stata. College Station, TX: Stata Press.

Also See

Complementary:    [R] maximize, [R] nl, [P] estimates, [P] matrix
Title

mlogit -- Maximum-likelihood multinomial (polytomous) logistic regression

Syntax

        mlogit depvar [indepvars] [weight] [if exp] [in range] [, basecategory(#)
                constraints(clist) rrr noconstant robust cluster(varname)
                score(newvarlist) level(#) maximize_options ]

where clist is of the form #[-#][, #[-#] ...]

by ... : may be used with mlogit; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
mlogit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp | stddp }
                outcome(outcome) ]

Note that you specify one new variable with xb, stdp, and stddp and specify either one or k new variables with p.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

mlogit estimates maximum-likelihood multinomial logit models, also known as polytomous logistic regression. Constraints may be defined to perform constrained estimation. Some people refer to conditional logistic regression as multinomial logit. If you are one of them, see [R] clogit.

See [R] logistic for a list of related estimation commands.

A maximum of 50 categories can be estimated with Intercooled Stata; 20 categories with Small Stata.

Options

basecategory(#) specifies the value of depvar that is to be treated as the base category. The default is to choose the most frequent category.

constraints(clist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined with the constraint command; see [R] constraint. constraints(1) specifies that the model is to be constrained according to constraint 1; constraints(1-4) specifies constraints 1 through 4; constraints(1-4,8) specifies constraints 1 through 4 and 8. It is not considered an error to specify nonexistent constraints so long as some of the constraints exist. Thus, constraints(1-999) would specify that all defined constraints be applied.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svymlogit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates k-1 new variables, where k is the number of observed outcomes. The first variable contains ∂ln L_j/∂(x_j b_1); the second variable contains ∂ln L_j/∂(x_j b_2); and so on. Note that if you were to specify the option score(sc*), Stata would create the appropriate number of new variables, and they would be named sc1, sc2, ..., sc(k-1).

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

rrr reports the estimated coefficients transformed to relative risk ratios, i.e., e^b rather than b; see Description of the model below for an explanation of this concept. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. rrr may be specified at estimation or when replaying previously estimated results.

noconstant suppresses the constant term in the model.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for predict

p, the default, calculates the predicted probability of each outcome. If you do not also specify the outcome(outcome) option, you must specify k new variables. For instance, say you estimated your model by typing mlogit insure age male and that insure takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. It does not matter which category mlogit chose as the base category; predict will calculate all three probabilities correctly.

If you also specify the outcome(outcome) option, then you specify one new variable. Say that insure took on values 1, 2, and 3. Typing predict p1, outcome(1) would produce the same p1 as above, predict p2, outcome(2) the same p2 as above, etc. If insure took on values 7, 22, and 93, you would specify outcome(7), outcome(22), and outcome(93).

xb calculates the linear prediction. You must also specify the outcome(outcome) option.

stdp calculates the standard error of the linear prediction. You must also specify the outcome(outcome) option.

stddp calculates the standard error of the difference in two linear predictions. You must specify the outcome(outcome) option, and in this case you specify the two particular outcomes of interest inside the parentheses; for example, predict sed, stddp outcome(1,3).

outcome(outcome) specifies the outcome for which the statistic is to be calculated. equation() is a synonym for outcome(); it does not matter which you use, and the standard rules for specifying an equation() apply.
Remarks

Remarks are presented under the headings

        Description of the model
        Estimating unconstrained models
        Obtaining predicted values
        Testing hypotheses about coefficients
        Estimating constrained models
mlogit performs maximum likelihood estimation of models with discrete dependent (left-hand-side) variables. It is intended for use when the dependent variable takes on more than two outcomes and the outcomes have no natural ordering. If the dependent variable takes on only two outcomes, estimates are identical to those produced by logistic or logit; see [R] logistic and [R] logit. If the outcomes are ordered, see [R] ologit.

Description of the model

For an introduction to multinomial logit models, see, for instance, Aldrich and Nelson (1984, 73-77), Greene (2000, chapter 19), Hosmer and Lemeshow (1989, 216-238), and Long (1997, chapter 6). For a description with an emphasis on the difference in assumptions and data requirements for conditional and multinomial logit, see Judge et al. (1985, 768-772).

Consider the outcomes 1, 2, 3, ..., m recorded in y, and the explanatory variables X. For expositional purposes, assume there are m = 3 outcomes. Think of these three outcomes as "buy an American car", "buy a Japanese car", and "buy a European car". The values of y are then said to be "unordered". Even though the outcomes are coded 1, 2, and 3, the numerical values are arbitrary in the sense that 1 < 2 < 3 does not imply that outcome 1 (buy American) is less than outcome 2 (buy Japanese) is less than outcome 3 (buy European). It is this unordered categorical property of y that distinguishes the use of mlogit from regress (which is appropriate for a continuous dependent variable), from ologit (which is appropriate for ordered categorical data), and from logit (which is appropriate for two outcomes and which can therefore be thought of as ordered).

In the multinomial logit model, we estimate a set of coefficients β(1), β(2), and β(3) corresponding to each outcome category:

        Pr(y = 1) = e^(Xβ(1)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 2) = e^(Xβ(2)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 3) = e^(Xβ(3)) / (e^(Xβ(1)) + e^(Xβ(2)) + e^(Xβ(3)))
The model, however, is unidentified in the sense that there is more than one solution to β(1), β(2), and β(3) that leads to the same probabilities for y = 1, y = 2, and y = 3. To identify the model, one of β(1), β(2), or β(3) is arbitrarily set to 0; it does not matter which. That is, if we arbitrarily set β(1) = 0, the remaining coefficients β(2) and β(3) would measure the change relative to the y = 1 group. If we instead set β(2) = 0, the remaining coefficients β(1) and β(3) would measure the change relative to the y = 2 group. The coefficients would differ because they have different interpretations, but the predicted probabilities for y = 1, 2, and 3 would still be the same. Thus, either parameterization would be a solution to the same underlying model.

Setting β(1) = 0, the equations become

        Pr(y = 1) = 1 / (1 + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 2) = e^(Xβ(2)) / (1 + e^(Xβ(2)) + e^(Xβ(3)))

        Pr(y = 3) = e^(Xβ(3)) / (1 + e^(Xβ(2)) + e^(Xβ(3)))
The relative probability of y = 2 to the base category is

        Pr(y = 2) / Pr(y = 1) = e^(Xβ(2))

Let us call this ratio the relative risk, and let us further assume that X and β(2) are vectors equal to (x_1, x_2, ..., x_k) and (β_1(2), β_2(2), ..., β_k(2)), respectively. The ratio of the relative risk for a one-unit change in x_i is then

        e^(β_1(2)x_1 + ... + β_i(2)(x_i+1) + ... + β_k(2)x_k) / e^(β_1(2)x_1 + ... + β_i(2)x_i + ... + β_k(2)x_k)  =  e^(β_i(2))

Thus, the exponentiated value of a coefficient is the relative risk ratio for a one-unit change in the corresponding variable, it being understood that risk is measured as the risk of the category relative to the base category.

Estimating unconstrained models

Example

You have data on the type of health insurance available to 616 psychologically depressed subjects in the U.S. (Tarlov et al. 1989; Wells et al. 1989). The insurance is categorized as being either an indemnity plan (i.e., regular fee-for-service insurance, which may have a deductible or coinsurance rate) or a prepaid plan (a fixed up-front payment allowing subsequent unlimited use as provided, for instance, by an HMO). The third possibility is that the subject has no insurance whatsoever. You wish to explore the demographic factors associated with each subject's insurance choice. As an introduction to the data, one of the demographic factors is the race of the participant, coded as white or nonwhite:

        . tabulate insure nonwhite, chi2 col

                   |      nonwhite
            insure |         0          1 |     Total
        -----------+----------------------+----------
         Indemnity |       251         43 |       294
                   |     50.71      35.54 |     47.73
        -----------+----------------------+----------
           Prepaid |       208         69 |       277
                   |     42.02      57.02 |     44.97
        -----------+----------------------+----------
          Uninsure |        36          9 |        45
                   |      7.27       7.44 |      7.31
        -----------+----------------------+----------
             Total |       495        121 |       616
                   |    100.00     100.00 |    100.00

                 Pearson chi2(2) =   9.5599   Pr = 0.008
Although insure appears to take on the values Indemnity, Prepaid, and Uninsure, it actually takes on the values 1, 2, and 3. The words appear because a value label has been associated with the numeric variable insure; see [U] 15.6.3 Value labels.

When you estimate a multinomial logit model, you can tell mlogit which group to use as the base category, or you can let mlogit choose. To estimate a model of insure on nonwhite, letting mlogit choose the base category, we type

        . mlogit insure nonwhite
        Iteration 0:   log likelihood = -556.59502
        Iteration 1:   log likelihood = -551.78935
        Iteration 2:   log likelihood = -551.78348
        Iteration 3:   log likelihood = -551.78348

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
            nonwhite |   .6608212   .2157321     3.06   0.002     .2379942    1.083648
               _cons |  -.1879149   .0937644    -2.00   0.045    -.3716896   -.0041401
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |   .3779585    .407589     0.93   0.354    -.4209012    1.176818
               _cons |  -1.941934   .1782185   -10.90   0.000    -2.291236   -1.592632
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
mlogit chose the indemnity group as the base or comparison group and presented coefficients for the outcomes prepaid and uninsured. According to the model, the probability of prepaid for whites (nonwhite = 0) is

        Pr(insure = Prepaid) = e^(-.188) / (1 + e^(-.188) + e^(-1.942)) = 0.420

Similarly, for nonwhites, the probability of prepaid is

        Pr(insure = Prepaid) = e^(-.188+.661) / (1 + e^(-.188+.661) + e^(-1.942+.378)) = 0.570
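These hand calculations can be reproduced with Stata's display command; the following lines, which use the constants and coefficients from the output above, return approximately .420 and .570:

        . display exp(-.1879149)/(1 + exp(-.1879149) + exp(-1.941934))
        . display exp(-.1879149+.6608212)/(1 + exp(-.1879149+.6608212) + exp(-1.941934+.3779585))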
These results agree with the column percentages presented by tabulate since the mlogit model is fully saturated. That is, there are enough terms in the model to fully explain the column percentage in each cell. Note that the model chi-squared and the tabulate chi-squared are in almost perfect agreement; both are testing that the column percentages of insure are the same for both values of nonwhite.
Example

By specifying the basecategory() option, you can control which category of the outcome variable is treated as the comparison group. Left to its own, mlogit chose to make category 1, indemnity, the base category. If we wanted to make category 2, prepaid, the base, we would type

        . mlogit insure nonwhite, base(2)
        Iteration 0:   log likelihood = -556.59502
        Iteration 1:   log likelihood = -551.78936
        Iteration 2:   log likelihood = -551.78348
        Iteration 3:   log likelihood = -551.78348

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Indemnity    |
            nonwhite |  -.6608212   .2157321    -3.06   0.002    -1.083648   -.2379942
               _cons |   .1879149   .0937644     2.00   0.045     .0041401    .3716896
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |  -.2828628   .3877302    -0.71   0.477      -1.0624    .4966741
               _cons |  -1.754019   .1805145    -9.72   0.000    -2.107821   -1.400217
        ------------------------------------------------------------------------------
        (Outcome insure==Prepaid is the comparison group)
The basecategory() option requires that we specify the numeric value of the category, so we could not type base(Prepaid).

Although the coefficients now appear to be different, note that the summary statistics reported at the top are identical. With this parameterization, the probability of prepaid insurance for whites is

        Pr(insure = Prepaid) = 1 / (1 + e^(.188) + e^(-1.754)) = 0.420

This is the same answer we obtained previously.
Example

By specifying rrr, which we can do at estimation time or when we redisplay results, we see the model in terms of relative risk ratios:

        . mlogit, rrr

        Multinomial regression                          Number of obs   =        616
                                                        LR chi2(2)      =       9.62
                                                        Prob > chi2     =     0.0081
        Log likelihood = -551.78348                     Pseudo R2       =     0.0086

        ------------------------------------------------------------------------------
              insure |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Indemnity    |
            nonwhite |    .516427   .1114099    -3.06   0.002     .3383588    .7882073
        -------------+----------------------------------------------------------------
        Uninsure     |
            nonwhite |   .7536232   .2997387    -0.71   0.477     .3456254    1.643247
        ------------------------------------------------------------------------------
        (Outcome insure==Prepaid is the comparison group)

Looked at this way, the relative risk of choosing an indemnity over a prepaid plan is 0.52 for nonwhites relative to whites.
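The displayed ratio is just the exponentiated coefficient from the earlier output; one can confirm this by typing

        . display exp(-.6608212)

which reproduces (up to rounding) the .516427 shown for nonwhite in the Indemnity equation.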
Example

One of the advantages of mlogit over tabulate is that continuous variables can be included in the model, and you can include multiple categorical variables. In examining the data on insurance choice, you decide you want to control for age, gender, and site of study (the study was conducted in three sites):

        . mlogit insure age male nonwhite site2 site3
        Iteration 0:   log likelihood = -555.85446
        Iteration 1:   log likelihood = -534.72983
        Iteration 2:   log likelihood = -534.36536
        Iteration 3:   log likelihood = -534.36165
        Iteration 4:   log likelihood = -534.36165

        Multinomial regression                          Number of obs   =        615
                                                        LR chi2(10)     =      42.99
                                                        Prob > chi2     =     0.0000
        Log likelihood = -534.36165                     Pseudo R2       =     0.0387

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
                 age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
                male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
            nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
               site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
               site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
               _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
        -------------+----------------------------------------------------------------
        Uninsure     |
                 age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
                male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
            nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
               site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
               site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
               _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
These results suggest that the inclination of nonwhites to choose prepaid care is even stronger than it was without controlling. We also see that subjects in site 2 are less likely to be uninsured.

Obtaining predicted values
Example

After estimation, predict can be used to obtain predicted probabilities, index values, and standard errors of the index, or differences in the index. For instance, in the preceding example we estimated a model of insurance choice on various characteristics. We can obtain the predicted probabilities for outcome 1 by typing

        . predict p1 if e(sample), outcome(1)
        (option p assumed; predicted probability)
        (29 missing values generated)

        . summarize p1

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                  p1 |     615    .4764228   .1082279   .1698142     .71939

Note that we included if e(sample) to restrict the calculation to the estimation sample. If you look back at the previous example, the multinomial logit model was estimated on 615 observations; there must be missing values in our dataset.
Although we typed outcome(1), specifying 1 for the indemnity category, we could have typed outcome(Indemnity). For instance, to obtain the probabilities for prepaid, we could type

        . predict p2 if e(sample), outcome(prepaid)
        (option p assumed; predicted probability)
        equation prepaid not found
        r(303);

        . predict p2 if e(sample), outcome(Prepaid)
        (option p assumed; predicted probability)
        (29 missing values generated)

        . summarize p2

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                  p2 |     615    .4504065   .1125962   .1964103   .7885724

When specifying the label, it must be specified exactly as it appears in the underlying value label (or as it appears in the mlogit output), and that includes capitalization.

Here, we have used predict to obtain probabilities for the same sample on which we estimated. That is not necessary. We could use another dataset that had the independent variables defined (in our example, age, male, nonwhite, site2, and site3) and use predict to obtain predicted probabilities; in this case, we would not specify if e(sample).
predict can also be used to obtain the "index" values--the x_i β̂(j)--as well as the probabilities:

        . predict idx1, outcome(Indemnity) xb
        (1 missing value generated)

        . summarize idx1

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                idx1 |     643           0          0          0          0
The indemnity category was our base category--the category for which all the coefficients were set to 0--and so the index is always 0. For the prepaid and uninsured categories:

        . predict idx2, outcome(Prepaid) xb
        (1 missing value generated)

        . predict idx3, outcome(Uninsure) xb
        (1 missing value generated)

        . summarize idx2 idx3

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
                idx2 |     643   -.0566113   .4962973  -1.298198   1.700719
                idx3 |     643   -1.980747   .6018139  -3.112741  -.8258458
We can obtain the standard error of the index by specifying the stdp option:

        . predict se2, outcome(Prepaid) stdp
        (1 missing value generated)

        . list p2 idx2 se2 in 1/5

                   p2        idx2        se2
         1.  .3709022   -.4831167   .2437772
         2.  .4977667     .055111   .1694686
         3.  .4113073   -.1712106   .1793498
         4.  .5424927    .3788345   .2513701
         5.         .   -.0925817   .1452616
stddp
list idx2 idx3 se_2_3 in 1/5 idx2 I. -.4831167 2. .055111 3. -.1712106 4. .3788345 5. -.0925817
idx3 -3.073253 -2.715986 -1.579621 -1.462007 -2.814022
se_2_3 .5469354 .4331917 .3053815 .4492552 .4024784
In the first observation, the difference in the indexes is -.483 - (-3.073) of that difference is .547.
= 2.59. The standard error
Example

It is more difficult to interpret the results from mlogit than those from clogit or logit because there are multiple equations. For example, suppose one of the independent variables in your model takes on the values 0 and 1, and you are attempting to understand the effect of this variable. Assume the coefficient on this variable for the second outcome, β(2), is positive. You might then be tempted to reason that the probability of the second outcome is higher if the variable is 1 rather than 0. Most of the time, that will be true, but occasionally you will be surprised. It could be that the probability of some other category increases even more (say β(3) > β(2)), and thus the probability of outcome 2 actually falls relative to that outcome. Prediction can be used to aid interpretation.
Continuing with our previously estimated insurance-choice model, we wish to describe the model's predictions by race. For this purpose, we can use the method of recycled predictions, in which we vary characteristics of interest across the whole dataset and average the predictions. That is, we have data on both whites and nonwhites, and our individuals have other characteristics as well. We will first pretend that all the people in our data are white but hold their other characteristics constant. We then calculate the probabilities of each outcome. Next we will pretend that all the people in our data are nonwhite, still holding their other characteristics constant. Again we calculate the probabilities of each outcome. The difference in those two sets of calculated probabilities, then, is the difference due to race, holding other characteristics constant.
        . gen byte nonwhold = nonwhite          /* save real race */

        . replace nonwhite = 0                  /* make everyone white */
        (126 real changes made)

        . predict wpind, outcome(Indemnity)     /* predict probabilities */
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict wpp, outcome(Prepaid)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict wpnoi, outcome(Uninsure)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . replace nonwhite = 1                  /* make everyone nonwhite */
        (518 real changes made)

        . predict nwpind, outcome(Indemnity)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict nwpp, outcome(Prepaid)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . predict nwpnoi, outcome(Uninsure)
        (option p assumed; predicted probability)
        (1 missing value generated)

        . replace nonwhite = nonwhold           /* restore real race */
        (518 real changes made)

        . summarize wp* nwp*

            Variable |     Obs        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------
               wpind |     643    .5141673   .0872679   .3092903     .71939
                 wpp |     643    .4082052   .0993286   .1964103   .6502247
               wpnoi |     643    .0776275   .0360283   .0273596   .1302816
              nwpind |     643    .3112809   .0817693   .1511329    .535021
                nwpp |     643     .630078   .0959976   .3871782   .8278881
              nwpnoi |     643    .0586411   .0287185   .0209648   .0933874
Earlier in this entry we presented a cross-tabulation of insurance type and race. Those values were unadjusted. The means reported above are the values adjusted for age, sex, and site. Combining the results gives

                        Unadjusted            Adjusted
                     white   nonwhite     white   nonwhite
        Indemnity     .51       .36        .52       .31
        Prepaid       .42       .57        .41       .63
        Uninsured     .07       .07        .08       .06

We find, for instance, that while 57% of nonwhites in our data had prepaid plans, after adjusting for age, sex, and site, 63% of nonwhites choose prepaid plans.
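The adjusted difference itself can be computed directly; a small sketch (the variable name pdiff is our own) in which the mean of pdiff--about .22, per the means above--is the adjusted nonwhite-white difference in the probability of choosing a prepaid plan:

        . gen double pdiff = nwpp - wpp
        . summarize pdiff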
Technical Note

Classification of predicted values followed by comparison of the classifications with the observed outcomes is a second way predicted values can help interpret a multinomial logit model. This is a variation on the notions of sensitivity and specificity for logistic regression. Here, we will adopt a three-part classification with respect to indemnity and prepaid: definitely predicting indemnity, definitely predicting prepaid, and ambiguous.
        . predict indem, outcome(Indemnity) xb                  /* obtain indexes */
        (1 missing value generated)

        . predict prepaid, outcome(Prepaid) xb
        (1 missing value generated)

        . gen diff = prepaid - indem                            /* obtain difference */
        (1 missing value generated)

        . predict sediff, outcome(Indemnity,Prepaid) stddp      /* & its standard error */
        (1 missing value generated)

        . gen type = 1 if diff/sediff < -1.96                   /* definitely indemnity */
        (504 missing values generated)

        . replace type = 3 if diff/sediff > 1.96 & diff/sediff!=.   /* definitely prepaid */
        (100 real changes made)

        . replace type = 2 if type==. & diff/sediff!=.          /* ambiguous */
        (404 real changes made)

        . label define type 1 "Def Ind" 2 "Ambiguous" 3 "Def Prep"  /* label results */
        . label values type type

        . tabulate insure type
                   |                type
            insure |   Def Ind  Ambiguous   Def Prep |     Total
        -----------+---------------------------------+----------
         Indemnity |        78        183         33 |       294
           Prepaid |        44        177         56 |       277
          Uninsure |        12         28          5 |        45
        -----------+---------------------------------+----------
             Total |       134        388         94 |       616

One substantive point learned by this exercise is that the predictive power of this model is modest. There are a substantial number of misclassifications in both directions, though there are more correctly classified observations than misclassified observations. A second interesting point is that the uninsured look overwhelmingly as though they might have come from the indemnity system rather than the prepaid system.
Testing hypotheses about coefficients

Example

Hypotheses about the coefficients are tested with test just as they are after any estimation command; see [R] test. The only important point to note is test's syntax for dealing with multiple-equation models. You are warned that test bases its results on the estimated covariance matrix, and a likelihood-ratio test may be preferred; see Estimating constrained models below for an example.

If one simply lists variables after the test command, one is testing that the corresponding coefficients are zero across all equations:

        . test site2 site3

         ( 1)  [Prepaid]site2 = 0.0
         ( 2)  [Uninsure]site2 = 0.0
         ( 3)  [Prepaid]site3 = 0.0
         ( 4)  [Uninsure]site3 = 0.0

                   chi2(  4) =   19.74
                 Prob > chi2 =    0.0006
One can test that all the coefficients (except the constant) in a single equation are zero by simply typing the outcome in square brackets:

        . test [Uninsure]

         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

                   chi2(  5) =    9.31
                 Prob > chi2 =    0.0973
Specification of the outcome is just as with predict; you can specify the label if the outcome variable is labeled, or you can specify the numeric value of the outcome. We would have obtained the same test as above had we typed test [3], since 3 is the value of insure for the outcome uninsured.

The two syntaxes can be combined. To test that the coefficients on the site variables are 0 in the equation corresponding to the outcome prepaid, we can type

        . test [Prepaid]: site2 site3

         ( 1)  [Prepaid]site2 = 0.0
         ( 2)  [Prepaid]site3 = 0.0

                   chi2(  2) =   10.78
                 Prob > chi2 =    0.0046
We specified the outcome and then followed that with a colon and the variables we wanted to test.

We can also test that coefficients are equal across equations. To test that all coefficients except the constant are equal for the prepaid and uninsured outcomes:

        . test [Prepaid=Uninsure]

         ( 1)  [Prepaid]age - [Uninsure]age = 0.0
         ( 2)  [Prepaid]male - [Uninsure]male = 0.0
         ( 3)  [Prepaid]nonwhite - [Uninsure]nonwhite = 0.0
         ( 4)  [Prepaid]site2 - [Uninsure]site2 = 0.0
         ( 5)  [Prepaid]site3 - [Uninsure]site3 = 0.0

                   chi2(  5) =   13.80
                 Prob > chi2 =    0.0169
To test that only the site variables are equal:

        . test [Prepaid=Uninsure]: site2 site3

         ( 1)  [Prepaid]site2 - [Uninsure]site2 = 0.0
         ( 2)  [Prepaid]site3 - [Uninsure]site3 = 0.0

                   chi2(  2) =   12.68
                 Prob > chi2 =    0.0018
Finally, we can test any arbitrary constraint by simply entering the equation, specifying the coefficients as described in [U] 16.5 Accessing coefficients and standard errors. The following hypothesis is senseless but illustrates the point:

        . test ([Prepaid]age+[Uninsure]site2)/2 = 2-[Uninsure]nonwhite

         ( 1)  .5 [Prepaid]age + [Uninsure]nonwhite + .5 [Uninsure]site2 = 2.0

                   chi2(  1) =   22.45
                 Prob > chi2 =    0.0000
Please see [R] test for more information on test. All that is said there about combining hypotheses across test commands (the accum option) is relevant after mlogit.
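For instance, a joint hypothesis can be accumulated one piece at a time; a minimal sketch:

        . test [Prepaid]: site2
        . test [Prepaid]: site3, accum

The second test reports the joint chi-squared for both restrictions, matching the two-degree-of-freedom test shown above.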
Estimating constrained models

mlogit can estimate models with subsets of coefficients constrained to be zero, with subsets of coefficients constrained to be equal both within and across equations, and with subsets of coefficients arbitrarily constrained to equal linear combinations of other estimated coefficients.

Before estimating a constrained model, you define the constraints using the constraint command; see [R] constraint. Constraints are numbered, and the syntax for specifying a constraint is exactly the same as the syntax for testing constraints; see Testing hypotheses about coefficients above. Once the constraints are defined, you estimate using mlogit, specifying the constraints() option. Typing constraints(4) would use the constraint you previously saved as 4. Typing constraints(1,4,6) would use the previously stored constraints 1, 4, and 6. Typing constraints(1-4,6) would use the previously stored constraints 1, 2, 3, 4, and 6.

Sometimes, you will not be able to specify the constraints without knowledge of the omitted group. In such cases, assume the omitted group is whatever group is convenient for you and include the basecategory() option when you type the mlogit command.
Example

Among other things, constraints can be used as a means of hypothesis testing. In our insurance-choice model, we tested the hypothesis that there is no distinction between having indemnity insurance and being uninsured. We did this with the test command. Indemnity-style insurance was the omitted group, so we typed

        . test [Uninsure]

         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

                   chi2(  5) =    9.31
                 Prob > chi2 =    0.0973
(Had indemnity not been the omitted group, we would have typed test [Uninsure=Indemnity].)

The results produced by test are based on the estimated covariance matrix of the coefficients and are an approximation. Since the probability of being uninsured is quite low, the log likelihood may be nonlinear for the uninsured. Conventional statistical wisdom is not to trust the asymptotic answer under these circumstances, but to perform a likelihood-ratio test instead.

Stata has a likelihood-ratio test command, lrtest; to use it, we must estimate both the unconstrained and the constrained models. The unconstrained model is what we have previously estimated. Following the instructions in [R] lrtest, we first save the unconstrained model results:

        . lrtest, saving(0)

To estimate the constrained model, we must re-estimate our model with all the coefficients except the constant set to 0 in the Uninsure equation. We define the constraint and then re-estimate:
        . constraint define 1 [Uninsure]

        . mlogit insure age male nonwhite site2 site3, constr(1)
         ( 1)  [Uninsure]age = 0.0
         ( 2)  [Uninsure]male = 0.0
         ( 3)  [Uninsure]nonwhite = 0.0
         ( 4)  [Uninsure]site2 = 0.0
         ( 5)  [Uninsure]site3 = 0.0

        Iteration 0:   log likelihood = -555.85446
        Iteration 1:   log likelihood = -539.80523
        Iteration 2:   log likelihood = -539.75644
        Iteration 3:   log likelihood = -539.75643

        Multinomial regression                          Number of obs   =        615
                                                        LR chi2(5)      =      32.20
                                                        Prob > chi2     =     0.0000
        Log likelihood = -539.75643                     Pseudo R2       =     0.0290

        ------------------------------------------------------------------------------
              insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        Prepaid      |
                 age |  -.0107025   .0060039    -1.78   0.075    -.0224699    .0010649
                male |   .4963616   .1939683     2.56   0.010     .1161908    .8765324
            nonwhite |    .942137   .2252094     4.18   0.000     .5007347    1.383539
               site2 |   .2530912   .2029465     1.25   0.212    -.1446767    .6508591
               site3 |  -.5521774   .2187237    -2.52   0.012    -.9808678   -.1234869
               _cons |   .1792752   .3171372     0.57   0.572    -.4423023    .8008527
        -------------+----------------------------------------------------------------
        Uninsure     |
                 age |  (dropped)
                male |  (dropped)
            nonwhite |  (dropped)
               site2 |  (dropped)
               site3 |  (dropped)
               _cons |   -1.87351   .1601099   -11.70   0.000     -2.18732     -1.5597
        ------------------------------------------------------------------------------
        (Outcome insure==Indemnity is the comparison group)
We can now perform the likelihood-ratio test:

        . lrtest
        Mlogit:  likelihood-ratio test              chi2(5)     =      10.79
                                                    Prob > chi2 =     0.0557

The likelihood-ratio chi-squared is 10.79 with 5 degrees of freedom; the significance level, .0557, is just slightly greater than the magic .05 level, so we should not call this difference significant.
Technical Note

In certain circumstances, a multinomial logit model should be estimated with conditional logit; see [R] clogit. With substantial data manipulation, clogit is capable of handling the same class of models with some interesting additions. For example, if we had available the price and deductible of the most competitive insurance plan of each type, this information could not be used by mlogit but could be incorporated by clogit.
Saved Results

mlogit saves in e():

Scalars
    e(N)           number of observations
    e(k_cat)       number of categories
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(ibasecat)    base category number
    e(basecat)     the value of depvar treated as the base category

Macros
    e(cmd)         mlogit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(cat)         category values
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

The model for multinomial logit is

        Pr(Y_i = k) = e^(X_i β(k)) / Σ_{j=1}^{m} e^(X_i β(j))

This model is described in Greene (2000, chapter 19).

Newton-Raphson maximum likelihood is used; see [R] maximize.

In the case of constrained equations, the set of constraints is orthogonalized, and a subset of maximizable parameters is selected. For example, a parameter that is constrained to zero is not a maximizable parameter. If two parameters are constrained to be equal to each other, only one is a maximizable parameter.

Let r be the vector of maximizable parameters. Note that r is physically a subset of the solution parameters b. A matrix T and a vector m are defined as

        b = Tr + m

with the consequence that

        ∂f/∂r = (∂f/∂b) T        ∂²f/∂r² = T' (∂²f/∂b²) T

T consists of a block form in which one part is a permutation of the identity matrix and the other part describes how to calculate the constrained parameters from the maximizable parameters.
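For instance (a small illustration of the idea, not the general algorithm), if b = (b_1, b_2)' and the single constraint is b_1 = b_2, then r is the scalar common value, T = (1, 1)', and m = (0, 0)'; a parameter constrained to a nonzero constant would instead enter through m.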
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Hamilton, L. C. 1993. sqv8: Interpreting multinomial logistic regression. Stata Technical Bulletin 13: 24-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 176-181.
Hendrickx, J. 2000. sbe37: Special restrictions in multinomial logistic regression. Stata Technical Bulletin 56: 18-26.
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.
Wells, K. E., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See

Complementary:    [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict, [R] test, [R] testnl, [R] xi

Related:          [R] clogit, [R] logistic, [R] nlogit, [R] ologit, [R] svy estimators

Background:       [U] 16.5 Accessing coefficients and standard errors,
                  [U] 23 Estimation and post-estimation commands,
                  [U] 23.11 Obtaining robust variance estimates,
                  [U] 23.12 Obtaining scores,
                  [R] maximize
Title

more -- The --more-- message

Syntax

        set more { on | off }

        set pagesize #

Description

set more on, which is the default, tells Stata to wait until a key is pressed before continuing when a --more-- message is displayed.

set more off tells Stata not to pause or display the --more-- message.

set pagesize # sets the number of lines between --more-- messages.

Remarks

When you see --more-- at the bottom of the screen,

        Press ...                       and Stata ...
        letter l or Enter               displays the next line
        letter q                        acts as if you pressed Break
        space bar or any other key      displays the next screen

In addition, you can press the More button, or click on --more--, to display the next screen.

--more-- is Stata's way of telling you it has something more to show you, but that showing you that something more will cause the information on the screen to scroll off.

If you type set more off, --more-- conditions will never arise and Stata's output will scroll by at full speed. If you type set more on, --more-- conditions will be restored at the appropriate places.

Programmers should see [P] more for information on the more programming command.
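For instance, a do-file that produces a long listing might temporarily disable the pause; a minimal sketch, assuming the automobile dataset is in memory:

        . set more off
        . list make mpg weight
        . set more on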
Also See

Complementary:    [R] query, [P] more

Background:       [U] 10 --more-- conditions
Title

mvencode -- Change missing to coded missing value and vice versa

Syntax

        mvencode varlist [if exp] [in range], mv(#) [override]

        mvdecode varlist [if exp] [in range], mv(#)

Description

mvencode changes all occurrences of missing to # in the specified varlist.

mvdecode changes all occurrences of # to missing in the specified varlist.

Options

mv(#) specifies the numeric value to which, or from which, missing is to be changed and is not optional.

override specifies that the protection provided by mvencode is to be overridden. Without this option, mvencode refuses to make the requested change if # is already used in the data.
Remarks

One occasionally reads data where missing (e.g., failed to answer a survey question, or the data were not collected, or whatever) is coded with a special numeric value. Popular codings are 9, 99, 999, -99, and the like. If missing were encoded as -99, then

        . mvdecode _all, mv(-99)

would translate the special code to the Stata missing value '.'. Use this command cautiously since, even if -99 were not a special code, all -99's in the data would be changed to missing.

Conversely, one occasionally needs to export data to software that does not understand that '.' is Stata's missing value, so one codes missing with a special numeric value. To change all missings to -99, type

        . mvencode _all, mv(-99)

mvencode is smart: it will automatically recast variables upward if necessary, so even if a variable is stored as a byte, its missing values can be recoded to, say, 999. In addition, mvencode refuses to make the change if # (-99 in this case) is already used in the data, so you can be certain that your coding is unique. You can override this feature by including the override option.

Example

Our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets) contains 74 observations and 12 variables. Let us first attempt to translate the missing values in the data to 1:
. mvencode _all, mv(1)
      make:  string variable ignored
     rep78:  already 1 in 2 observations
   foreign:  already 1 in 22 observations
no action taken
r(9);

Our attempt failed. mvencode first informed us that make is a string variable--this is not a problem but is reported merely for our information. String variables are ignored by mvencode. It next informed us that rep78 already was coded 1 in 2 observations and that foreign was already coded 1 in 22 observations. Thus, 1 would be a poor choice for encoding missing values because, after encoding, you could not tell a real 1 from a coded missing value. We could force mvencode to encode the data with 1 anyway by typing mvencode _all, mv(1) override, and that would be appropriate if the 1s in our data already represented missing data. They do not, however, and we will code missing as 999:

. mvencode _all, mv(999)
      make:  string variable ignored
     rep78:  5 missing values

This worked, and we are informed that the only changes necessary were to 5 observations of rep78.
> Example
Let us now pretend that we just read in the automobile data from some raw dataset where all the missing values were coded 999. We can convert the 999's to real missings by typing

. mvdecode _all, mv(999)
      make:  string variable ignored
     rep78:  5 missing values

We are informed that make is a string variable and so was ignored, and that rep78 contained 5 observations with 999. Those observations have now been changed to contain missing.
Methods and Formulas
mvencode and mvdecode are implemented as ado-files.

Also See
Related:  [R] generate, [R] recode
Title
mvreg -- Multivariate regression
Syntax
    mvreg depvarlist = varlist [weight] [if exp] [in range] [, noconstant corr noheader notable level(#) ]

by ... : may be used with mvreg; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
mvreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict
    predict [type] newvarname [if exp] [in range] [, { xb | stdp | residuals | difference | stddp } equation(eqno[,eqno]) ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
mvreg estimates multivariate regression models.
Options
noconstant omits the constant term from the estimation.

corr displays the correlation matrix of the residuals between the equations.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict
equation(eqno[,eqno]) specifies to which equation(s) you are referring. equation() is filled in with one eqno for options xb, stdp, and residuals. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names; equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).
xb, the default, calculates the fitted values--the prediction of x_j b for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

residuals calculates the residuals.

difference calculates the difference between the linear predictions of two equations in the system.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_1j b - x_2j b) between equations 1 and 2 is calculated.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
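As a brief sketch of these options--the model shown is the one estimated in the example below, and the new variable names are arbitrary--you might type

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight
. predict hr, equation(headroom)                     /* xb, the default */
. predict dht, equation(headroom, trunk) difference  /* xb difference   */
. predict sdht, equation(headroom, trunk) stddp      /* its std. error  */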
Remarks
Multivariate regression differs from multiple regression in that several dependent variables are jointly regressed on the same independent variables. Multivariate regression is related to Zellner's seemingly unrelated regression (see [R] sureg) but, since the same set of independent variables is used for each dependent variable, the syntax is simpler and the calculations faster.

The individual coefficients and standard errors produced by mvreg are identical to those that would be produced by regress estimating each equation separately. The difference is that mvreg, being a joint estimator, also estimates the between-equation covariances, so you can test coefficients across equations and, in fact, the test syntax makes such tests more convenient.
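The equivalence claimed above is easy to verify; a minimal check using the automobile data (any pair of dependent variables would do):

. use auto, clear
. mvreg headroom trunk = mpg weight
. regress headroom mpg weight
* the coefficients and standard errors reported by regress match those of
* the headroom equation reported by mvreg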
> Example
Using the automobile data, we estimate a multivariate regression for "space" variables (headroom, trunk, and turn) in terms of a set of other variables including three "performance" variables (displacement, gear_ratio, and mpg):

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
headroom           74      7    .7390205    0.2996   4.777213   0.0004
trunk              74      7    3.052314    0.5328    12.7265   0.0000
turn               74      7    2.132377    0.7844   40.62042   0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
headroom     |
       price |  -.0000528    .000038    -1.39   0.168    -.0001286    .0000229
         mpg |  -.0093774   .0260463    -0.36   0.720     -.061366    .0426112
displacement |   .0031025   .0024999     1.24   0.219    -.0018873    .0080922
  gear_ratio |   .2108071   .3539588     0.60   0.553    -.4956976    .9173118
      length |    .015886    .012944     1.23   0.224    -.0099504    .0417223
      weight |  -.0000868   .0004724    -0.18   0.855    -.0010296    .0008561
       _cons |  -.4525117   2.170073    -0.21   0.835    -4.783995    3.878972
-------------+----------------------------------------------------------------
trunk        |
       price |   .0000445   .0001567     0.28   0.778    -.0002684    .0003573
         mpg |  -.0220919   .1075767    -0.21   0.838    -.2368159    .1926322
displacement |   .0032118   .0103251     0.31   0.757    -.0173971    .0238207
  gear_ratio |  -.2271321   1.461926    -0.16   0.877    -3.145149    2.690885
      length |    .170811   .0534615     3.20   0.002     .0641014    .2775206
      weight |  -.0015944    .001951    -0.82   0.417    -.0054885    .0022997
       _cons |  -13.28253   8.962868    -1.48   0.143    -31.17249    4.607429
-------------+----------------------------------------------------------------
turn         |
       price |  -.0002647   .0001095    -2.42   0.018    -.0004833   -.0000462
         mpg |  -.0492948   .0751542    -0.66   0.514    -.1993031    .1007136
displacement |   .0036977   .0072132     0.51   0.610    -.0106999    .0180953
  gear_ratio |  -.1048432   1.021316    -0.10   0.919    -2.143399    1.933712
      length |    .072128   .0373487     1.93   0.058    -.0024204    .1466764
       _cons |   20.19157   6.261549     3.22   0.002     7.693467    32.68967
------------------------------------------------------------------------------
We should have specified the corr option so that we would also see the correlations between the residuals of the equations. We can correct our omission because mvreg--like all estimation commands--typed without arguments redisplays results. The noheader and notable (read no-table) options suppress redisplaying the output we have already seen:

. mvreg, notable noheader corr

Correlation matrix of residuals:

            headroom    trunk     turn
headroom      1.0000
trunk         0.4986   1.0000
turn         -0.1090  -0.0628   1.0000

Breusch-Pagan test of independence: chi2(3) = 19.566, Pr = 0.0002

The Breusch-Pagan test is significant, so the residuals of these three space variables are not independent of each other.
The three performance variables among our independent variables are mpg, displacement, and gear_ratio. We can jointly test the significance of these three variables, in all the equations, by typing

. test mpg displacement gear_ratio

 ( 1)  [headroom]mpg = 0.0
 ( 2)  [trunk]mpg = 0.0
 ( 3)  [turn]mpg = 0.0
 ( 4)  [headroom]displacement = 0.0
 ( 5)  [trunk]displacement = 0.0
 ( 6)  [turn]displacement = 0.0
 ( 7)  [headroom]gear_ratio = 0.0
 ( 8)  [trunk]gear_ratio = 0.0
 ( 9)  [turn]gear_ratio = 0.0

       F(  9,    67) =    0.33
            Prob > F =    0.9622
These three variables are not, as a group, significant. We might have suspected this from their individual significance in the individual regressions, but this multivariate test provides an overall assessment with a single p-value.

We can also perform a test for the joint significance of all three equations:

. test [headroom]
 (output omitted)
. test [trunk], accum
 (output omitted)
. test [turn], accum

 ( 1)  [headroom]price = 0.0
 ( 2)  [headroom]mpg = 0.0
 ( 3)  [headroom]displacement = 0.0
 ( 4)  [headroom]gear_ratio = 0.0
 ( 5)  [headroom]length = 0.0
 ( 6)  [headroom]weight = 0.0
 ( 7)  [trunk]price = 0.0
 ( 8)  [trunk]mpg = 0.0
 ( 9)  [trunk]displacement = 0.0
 (10)  [trunk]gear_ratio = 0.0
 (11)  [trunk]length = 0.0
 (12)  [trunk]weight = 0.0
 (13)  [turn]price = 0.0
 (14)  [turn]mpg = 0.0
 (15)  [turn]displacement = 0.0
 (16)  [turn]gear_ratio = 0.0
 (17)  [turn]length = 0.0
 (18)  [turn]weight = 0.0

       F( 18,    67) =   19.34
            Prob > F =    0.0000

The set of variables as a whole is strongly significant. We might have suspected this, too, from the individual equations.
Technical Note
The mvreg command provides a good way to deal with multiple comparisons. If we wanted to assess the effect of length, we might be dissuaded from interpreting any of its coefficients except that in the trunk equation. [trunk]length--the coefficient on length in the trunk equation--has a p-value of .002, but in the remaining two equations it has p-values of only .224 and .058.

A conservative statistician might argue that there are 18 tests of significance in mvreg's output (not counting those for the intercepts), so p-values above .05/18 = .0028 should be declared insignificant at the 5% level. A more aggressive but, in our opinion, reasonable approach would be to first note that the three equations are jointly significant, so we are justified in making some interpretation. Then we would work through the individual variables using test, possibly using .05/6 = .0083 (6 because there are 6 independent variables) for the 5% significance level. For instance, examining length:

. test length

 ( 1)  [headroom]length = 0.0
 ( 2)  [trunk]length = 0.0
 ( 3)  [turn]length = 0.0

       F(  3,    67) =    4.94
            Prob > F =    0.0037

The reported significance level of .0037 is less than .0083, so we will declare this variable significant. [trunk]length is certainly significant with its p-value of .002, but what about the remaining two equations with p-values .224 and .058? Performing a joint test:

. test [headroom]length [turn]length

 ( 1)  [headroom]length = 0.0
 ( 2)  [turn]length = 0.0

       F(  2,    67) =    2.91
            Prob > F =    0.0613

At this point, reasonable statisticians could disagree. The .06 significance value suggests no interpretation, but these were the two least-significant values out of three, so one would expect the p-value to be a little high. Perhaps an equivocal statement is warranted: there seems to be an effect, but chance cannot be excluded.
Saved Results
mvreg saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of parameters (including constant)
    e(k_eq)       number of equations
    e(df_r)       residual degrees of freedom
    e(chi2)       Breusch-Pagan chi-squared (corr only)
    e(df_chi2)    degrees of freedom for Breusch-Pagan chi-squared (corr only)

Macros
    e(cmd)        mvreg
    e(eqnames)    names of equations
    e(r2)         R-squared for each equation
    e(rmse)       RMSE for each equation
    e(F)          F statistic for each equation
    e(p_F)        significance of F for each equation
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(Sigma)      Sigma matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas
mvreg is implemented as an ado-file.

Given q equations and p independent variables (including the constant), the parameter estimates are given by the p x q matrix

    B = (X'WX)^(-1) X'WY

where Y is an n x q matrix of dependent variables and X is an n x p matrix of independent variables. W is a weighting matrix equal to I if no weights are specified. If weights are specified, let v: 1 x n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, W = diag{v(1'1)/(1'v)}, which is to say, the weights are normalized to sum to the number of observations.

The residual covariance matrix is

    R = {Y'WY - B'(X'WX)B} / (n - p)

The estimated covariance matrix of the estimates is R ⊗ (X'WX)^(-1). These results are identical to those produced by sureg when the same list of independent variables is specified repeatedly; see [R] sureg.

The Breusch and Pagan (1980) chi-squared statistic--a Lagrange multiplier statistic--is given by

    lambda = n * sum_{i=1}^{q} sum_{j=1}^{i-1} r_ij^2

where r_ij is the estimated correlation between the residuals of the ith and jth equations and n is the number of observations. It is distributed as chi-squared with q(q-1)/2 degrees of freedom.
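The formula for B is easy to check by hand with Stata's matrix commands; the following minimal sketch assumes the automobile data, no weights (so W = I), and an arbitrary choice of two dependent variables and two regressors:

. use auto, clear
. matrix accum A = headroom trunk mpg weight
                  /* A = cross products of (headroom trunk mpg weight _cons) */
. matrix XX = A[3..5, 3..5]     /* X'X, where X = (mpg weight _cons)         */
. matrix XY = A[3..5, 1..2]     /* X'Y, one column per dependent variable    */
. matrix B  = syminv(XX) * XY   /* B = (X'X)^(-1)X'Y                         */
. matrix list B                 /* matches mvreg headroom trunk = mpg weight */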
References
Breusch, T. and A. Pagan. 1980. The LM test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-254.
Also See
Complementary:  [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce
Related:        [R] reg3, [R] regress, [R] regression diagnostics, [R] sureg
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands
Title
nbreg -- Negative binomial regression
Syntax
    nbreg depvar [indepvars] [weight] [if exp] [in range] [, dispersion({mean|constant}) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvars) noconstant constraints(numlist) nolrtest nolog maximize_options ]

    gnbreg depvar [indepvars] [weight] [if exp] [in range] [, lnalpha(varlist) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvars) noconstant constraints(numlist) nolog maximize_options ]

by ... : may be used with nbreg and gnbreg; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
nbreg may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict
    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    n          predicted number of events (the default)
    ir         incidence rate (equivalent to predict ..., n nooffset)
    xb         linear prediction
    stdp       standard error of the prediction

In addition, relevant only after gnbreg, are

    alpha      predicted values of alpha
    lnalpha    predicted values of ln(alpha)
    stdplna    standard error of predicted ln(alpha)

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
nbreg estimates a negative binomial maximum-likelihood regression of depvar on indepvars, where depvar is a nonnegative count variable. In this model, the count variable is believed to be generated by a Poisson-like process, except that the variation is greater than that of a true Poisson. This extra variation is referred to as overdispersion. See [R] poisson before reading this entry.

gnbreg estimates a generalized negative binomial regression; the shape parameter alpha may also be parameterized. Persons who have panel data should see [R] xtnbreg.
Options
dispersion({mean|constant}) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + alpha*exp(x_i b + offset_i); that is, the dispersion is a function of the expected mean: exp(x_i b + offset_i). dispersion(constant) has dispersion equal to 1 + delta; that is, it is a constant for all observations.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.

exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1, so exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvars) creates newvar containing u_j = dlnL_j/d(x_j b) for each observation j in the sample. The score vector is dlnL/db = sum_j u_j x_j; i.e., the product of newvar with each covariate summed over observations. If two newvars are specified, then the score from the ancillary parameter equation is also saved. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

nolrtest suppresses fitting the Poisson model. Without this option, the Poisson model is fit and its likelihood is used in a likelihood-ratio test of the alpha parameter. This option is only valid for nbreg; gnbreg has no likelihood-ratio comparison test (see the Technical Note in the section on gnbreg within this entry).

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

lnalpha(varlist) is allowed only with gnbreg. If this option is not specified, gnbreg and nbreg will produce the same results because the shape parameter will be parameterized as a constant. lnalpha() allows specifying a linear equation for ln(alpha). Specifying lnalpha(male old) means ln(alpha) = a0 + a1*male + a2*old, where a0, a1, and a2 are parameters to be fitted along with the other model coefficients.
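As a minimal sketch of the irr and exposure() options, assuming the deaths data used in the Remarks below (where exposure records exposure time):

. nbreg deaths coh2 coh3, exposure(exposure) irr
* exposure(exposure) enters ln(exposure) with its coefficient constrained
* to 1, just like offset(logexp) after gen logexp = ln(exposure); irr
* displays exp(b) with correspondingly transformed confidence intervals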
Options for predict
n, the default, calculates the predicted number of events, which is exp(x_j b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j b + offset_j) if offset(varname) was specified; or exp(x_j b)*exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j b), which is the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

alpha, lnalpha, and stdplna are relevant after gnbreg estimation only; they produce the predicted values of alpha, ln(alpha), and the standard error of the predicted ln(alpha), respectively.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable: the linear prediction is treated as x_j b rather than x_j b + offset_j, and specifying predict ..., n nooffset is equivalent to specifying predict ..., ir.
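The relationship between n and xb is easy to verify by hand; a minimal check, assuming the deaths model estimated in the Remarks below:

. nbreg deaths coh2 coh3, offset(logexp)
. predict nhat                    /* option n assumed */
. predict xbhat, xb               /* linear prediction, offset included */
. gen double check = exp(xbhat)
. summarize nhat check            /* the two columns should agree */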
Remarks
See Long (1997, chapter 8) for an introduction to the negative binomial regression model and for a discussion of other regression models for count data.

Negative binomial regression is used to estimate models of the number of occurrences (counts) of an event when the event has extra-Poisson variation; that is, it has overdispersion. The Poisson regression model is

    y_i ~ Poisson(mu_i)    where    mu_i = exp(x_i b + offset_i)

for observed counts y_i with covariates x_i for the ith observation. One derivation of the negative binomial is that individual units follow a Poisson regression model, but there is an omitted variable u_i, such that e^{u_i} follows a gamma distribution with mean 1 and variance alpha:

    y_i ~ Poisson(mu_i*)    where    mu_i* = exp(x_i b + offset_i + u_i)

and e^{u_i} ~ gamma(1/alpha, 1/alpha). (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; see the Methods and Formulas section for the explicit definition of the distribution.)

We refer to alpha as the overdispersion parameter. The larger alpha is, the greater the overdispersion. The Poisson model corresponds to alpha = 0. nbreg parameterizes alpha as ln(alpha). gnbreg allows ln(alpha) to be modeled as ln(alpha_i) = z_i*gamma, a linear combination of covariates z_i.

nbreg will estimate two different parameterizations of the negative binomial model. The default, described above and also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + alpha*exp(x_i b + offset_i). The alternative parameterization, given by the option dispersion(constant), has dispersion equal to 1 + delta; i.e., it is constant for all observations. The Poisson model corresponds to delta = 0.
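The gamma-mixture derivation above can be illustrated by simulation. This minimal sketch assumes the rgamma() and rpoisson() random-number functions, which belong to a later release of Stata than the one documented here; the parameter values are arbitrary:

. clear
. set obs 10000
. set seed 12345
. gen x = uniform()
. scalar a = 0.5                  /* true overdispersion parameter */
. gen nu = rgamma(1/a, a)         /* gamma with mean 1, variance a */
. gen mu = exp(0.5 + x)           /* Poisson mean given x          */
. gen y = rpoisson(mu*nu)         /* gamma-mixed Poisson counts    */
. nbreg y x                       /* estimated alpha should be near 0.5 */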
It is not uncommon to pose a Poisson regression model and observe a lack of model fit. The following data appeared in Rodriguez (1993):

. list

       cohort   age_mos   deaths   exposure
  1.        1       0.5      168      278.4
  2.        1       2.0       48      538.8
  3.        1       4.5       63      794.4
  4.        1       9.0       89    1,550.8
  5.        1      18.0      102    3,006.0
  6.        1      42.0       81    8,743.5
  7.        1      90.0       40   14,270.0
  8.        2       0.5      197      403.2
  9.        2       2.0       48      786.0
 10.        2       4.5       62    1,165.3
 11.        2       9.0       81    2,294.8
 12.        2      18.0       97    4,500.5
 13.        2      42.0      103   13,201.5
 14.        2      90.0       39   19,525.0
 15.        3       0.5      195      495.3
 16.        3       2.0       55      956.7
 17.        3       4.5       58    1,381.4
 18.        3       9.0       85    2,604.5
 19.        3      18.0       87    4,618.5
 20.        3      42.0       70    9,814.5
 21.        3      90.0       10    5,802.5
. gen logexp = ln(exposure)
. quietly tab cohort, gen(coh)
. poisson deaths coh2 coh3, offset(logexp)

Iteration 0:   log likelihood = -2160.0544
Iteration 1:   log likelihood = -2159.5182
Iteration 2:   log likelihood = -2159.5159
Iteration 3:   log likelihood = -2159.5159

Poisson regression                            Number of obs   =         21
                                              LR chi2(2)      =      49.16
                                              Prob > chi2     =     0.0000
Log likelihood = -2159.5159                   Pseudo R2       =     0.0113

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |  -.3020405   .0573319    -5.27   0.000    -.4144089   -.1896721
        coh3 |   .0742143   .0589726     1.26   0.208    -.0413698    .1897983
       _cons |  -3.899488   .0411345   -94.80   0.000     -3.98011   -3.818866
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

Goodness-of-fit chi2  =  4190.689
Prob > chi2(18)       =    0.0000
The extreme significance of the goodness-of-fit chi-squared indicates that the Poisson regression model is inappropriate, suggesting to us that we should try a negative binomial model:

. nbreg deaths coh2 coh3, offset(logexp) nolog

Negative binomial regression                  Number of obs   =         21
                                              LR chi2(2)      =       0.40
                                              Prob > chi2     =     0.8171
Log likelihood = -131.3799                    Pseudo R2       =     0.0015

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |  -.2676187   .7237203    -0.37   0.712    -1.686084    1.150847
        coh3 |  -.4575957   .7236651    -0.63   0.527    -1.875753    .9609618
       _cons |  -2.086731    .511856    -4.08   0.000     -3.08995   -1.083511
      logexp |   (offset)
-------------+----------------------------------------------------------------
    /lnalpha |   .5939963   .2583615                       .0876171    1.100376
-------------+----------------------------------------------------------------
       alpha |   1.811212   .4679475                        1.09157    3.005295
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 4056.27  Prob>=chibar2 = 0.000

Our original Poisson model is a special case of the negative binomial--it corresponds to alpha = 0. nbreg, however, estimates alpha indirectly, estimating instead ln(alpha). In our model, ln(alpha) = 0.594, meaning that alpha = 1.81 (nbreg undoes the transformation for us at the bottom of the output).

In order to test alpha = 0 (equivalent to ln(alpha) = -infinity), nbreg performs a likelihood-ratio test. The staggering chi-squared value of 4056.27 asserts that the probability that we would observe these data conditional on alpha = 0 is virtually zero, i.e., conditional on the process being Poisson. The data are not Poisson. It is not accidental that this chi-squared value is quite close to the goodness-of-fit statistic from the Poisson regression itself.
Technical Note
The usual Gaussian test of alpha = 0 is omitted since this test occurs on the boundary, invalidating the usual theory associated with such tests. However, the likelihood-ratio test of alpha = 0 has been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual chi2(1) but rather a 50:50 mixture of a chi2(0) (point mass at zero) and a chi2(1), denoted as chibar2(01). See Gutierrez et al. (2001) for more details.
Technical Note
The negative binomial model deals with cases where there is more variation than would be expected were the process Poisson. The negative binomial model is not helpful if there is less than Poisson variation--if the variance of the count variable is less than its mean. But underdispersion is uncommon. Poisson models arise because of independently generated events. Overdispersion comes about if some of the parameters (causes) of the Poisson processes are unknown. To obtain underdispersion, the sequence of events would have to somehow be regulated; that is, events would not be independent, but controlled based on past occurrences.
gnbreg
gnbreg is a generalization of nbreg. Whereas in nbreg a single ln(alpha) is estimated, gnbreg allows ln(alpha) to vary observation by observation as a linear combination of another set of covariates: ln(alpha_i) = z_i*gamma.

We will assume that the number of deaths is a function of age, whereas the ln(alpha) parameter is a function of cohort. To estimate the model, we type

. gnbreg deaths age_mos, lnalpha(coh2 coh3) offset(logexp)

Fitting constant-only model:
Iteration 0:   log likelihood =    -187.067
Iteration 1:   log likelihood = -148.64462
Iteration 2:   log likelihood = -132.49595
Iteration 3:   log likelihood = -131.59338
Iteration 4:   log likelihood = -131.57949
Iteration 5:   log likelihood = -131.57948

Fitting full model:
Iteration 0:   log likelihood = -124.34327
Iteration 1:   log likelihood = -117.72418
Iteration 2:   log likelihood = -117.56349
Iteration 3:   log likelihood = -117.56164
Iteration 4:   log likelihood = -117.56164

Generalized negative binomial regression      Number of obs   =         21
                                              LR chi2(1)      =      28.04
                                              Prob > chi2     =     0.0000
Log likelihood = -117.56164                   Pseudo R2       =     0.1065

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
deaths       |
     age_mos |  -.0516657   .0051747    -9.98   0.000     -.061808   -.0415233
       _cons |  -1.867225   .2227944    -8.38   0.000    -2.303894   -1.430556
      logexp |   (offset)
-------------+----------------------------------------------------------------
lnalpha      |
        coh2 |   .0939546   .7187747     0.13   0.896    -1.314818    1.502727
        coh3 |   .0815279   .7365477     0.11   0.912    -1.362079    1.525135
       _cons |  -.4759581   .5156502    -0.92   0.356    -1.486614    .5346978
------------------------------------------------------------------------------

We find that age is a significant determinant of the number of deaths. The standard errors for the variables in the ln(alpha) equation suggest that the overdispersion parameter does not vary across cohorts. We can test this by typing
. test coh2 coh3

 ( 1)  [lnalpha]coh2 = 0.0
 ( 2)  [lnalpha]coh3 = 0.0

           chi2(  2) =    0.02
         Prob > chi2 =    0.9904

There is no evidence of variation by cohort in these data.
Technical Note
Note the intentional absence of a likelihood-ratio test for alpha = 0 in gnbreg. The test is affected by the same boundary condition that affects the comparison test in nbreg; however, when alpha is parameterized by more than a constant term, the null distribution becomes intractable. For this reason we recommend using nbreg to test for overdispersion and, if overdispersion exists, only then modeling the overdispersion using gnbreg.
Predicted values
After nbreg and gnbreg, predict returns the predicted number of events:

. nbreg deaths coh2 coh3, nolog

Negative binomial regression                  Number of obs   =         21
                                              LR chi2(2)      =       0.14
                                              Prob > chi2     =     0.9307
Log likelihood = -108.48841                   Pseudo R2       =     0.0007

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        coh2 |   .0591305   .2978419     0.20   0.843    -.5246289     .64289
        coh3 |  -.0538792   .2981621    -0.18   0.857    -.6382662    .5305077
       _cons |   4.435906   .2107213    21.05   0.000       4.0229    4.848912
-------------+----------------------------------------------------------------
    /lnalpha |  -1.207379   .3108622                      -1.816657   -.5980999
-------------+----------------------------------------------------------------
       alpha |     .29898   .0929416                       .1625683    .5498555
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 434.62  Prob>=chibar2 = 0.000

. predict count
(option n assumed; predicted number of events)
. summarize deaths count

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
      deaths |      21    84.66667   48.84192         10        197
       count |      21    84.66667    4.00773         80   89.57143
Saved Results
nbreg and gnbreg save in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, comparison model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          significance
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(alpha)      the value of alpha

Macros
    e(cmd)        nbreg or gnbreg
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(offset#)    offset for equation #
    e(dispers)    mean or constant
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(opt)        type of optimization
    e(user)       name of likelihood-evaluator program
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas
nbreg and gnbreg are implemented as ado-files.

See [R] poisson and Feller (1968, 156-164) for an introduction to the Poisson distribution.

A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times something occurs, y_i, is distributed as Poisson(nu_i mu_i). That is, its conditional likelihood is

    f(y_i | nu_i) = (nu_i mu_i)^{y_i} e^{-nu_i mu_i} / Gamma(y_i + 1)

where mu_i = exp(x_i b + offset_i) and nu_i is an unobserved parameter with a gamma(1/alpha, 1/alpha) density:

    g(nu) = nu^{(1-alpha)/alpha} e^{-nu/alpha} / { alpha^{1/alpha} Gamma(1/alpha) }

This gamma distribution has mean 1 and variance alpha, where alpha is our ancillary parameter. (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; the above density defines how it has been parameterized here.)

The unconditional likelihood for the ith observation is therefore

    f(y_i) = integral from 0 to infinity of f(y_i | nu) g(nu) dnu
           = { Gamma(m + y_i) / [ Gamma(y_i + 1) Gamma(m) ] } p_i^m (1 - p_i)^{y_i}

where p_i = 1/(1 + alpha mu_i) and m = 1/alpha. Solutions for alpha are handled by searching for ln(alpha) since alpha is required to be greater than zero.

The scores and log likelihood (with weights w_i and offsets) are given by

    psi(z)  = digamma function evaluated at z
    psi'(z) = trigamma function evaluated at z
    alpha = exp(tau)        m = 1/alpha
    p_i = 1/(1 + alpha mu_i)        mu_i = exp(x_i b + offset_i)

    lnL = sum_i w_i [ ln Gamma(m + y_i) - ln Gamma(y_i + 1) - ln Gamma(m)
                      + m ln(p_i) + y_i ln(1 - p_i) ]

    score(b)_i   = p_i (y_i - mu_i)
    score(tau)_i = -m [ alpha(mu_i - y_i)/(1 + alpha mu_i)
                        - ln(1 + alpha mu_i) + psi(y_i + m) - psi(m) ]

In the case of gnbreg, alpha is allowed to vary across the observations according to the parameterization ln(alpha_i) = z_i*gamma.

Maximization for gnbreg is via the lf linear-form method and for nbreg is via the d2 method described in [R] ml.
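The unconditional likelihood can be evaluated by hand with lngamma(), which provides a useful check on the formulas; a minimal sketch, assuming the deaths model from the Remarks above and no weights:

. nbreg deaths coh2 coh3, offset(logexp)
. predict double xbh, xb                 /* x_j b + offset_j */
. scalar a = e(alpha)
. gen double mu = exp(xbh)
. gen double p  = 1/(1 + a*mu)
. gen double ll = lngamma(1/a + deaths) - lngamma(deaths + 1) - lngamma(1/a) + (1/a)*ln(p) + deaths*ln(1 - p)
. quietly summarize ll
. display r(sum) "  compare with  " e(ll)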
References
Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.
Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. On boundary-value likelihood-ratio tests. Stata Technical Bulletin, forthcoming.
Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.
------. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.
Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.
------. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82-84.
Also See
Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] glm, [R] poisson, [R] xtnbreg, [R] zip
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
Title
net -- Install and manage user-written additions from the net
Syntax
    net from directory_or_url
    net cd path_or_url
    net link linkname
    net search keywords            (see [R] net search)
    net describe pkgname
    net set ado dirname
    net set other dirname
    net query
    net install pkgname [, all replace]
    net get pkgname [, all replace]

    ado [, find(string) from(dirname)]
    ado dir [pkgid] [, find(string) from(dirname)]
    ado describe [pkgid] [, find(string) from(dirname)]
    ado uninstall pkgid [, from(dirname)]

where
    pkgname is the name of a package
    pkgid   is the name of a package
            or a number in square brackets: [#]
    dirname is a directory name
            or PERSONAL
            or STBPLUS (default)
            or SITE
Description
net fetches and installs additions to Stata. The additions can be obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. For instance, the package named zz49 might add the xyz command to Stata. At a minimum, such a package would contain xyz.ado, the code to implement the new command, and xyz.hlp, the on-line help to describe it. That the package contains two files is a detail: you use net to fetch the package zz49 regardless of how many files there are.

ado manages the packages you have installed using net. The ado command allows you to list packages you have previously installed and to uninstall them.

Users can also access the net and ado features by pulling down Help and selecting STB and User-written Programs.
Options
all is used with net install and net get. Typing it with either one makes the command equivalent to typing net install followed by net get.

replace is for use with net install and net get. It specifies that the fetched files are to replace existing files if any of the files already exist.

find(string) is for use with ado, ado dir, and ado describe. It specifies that the descriptions of the packages installed on your computer are to be searched and that the package descriptions containing string are to be listed.

from(dirname) is for use with ado. It specifies where the packages are installed. The default is from(STBPLUS). STBPLUS is a codeword that Stata understands to correspond to a particular directory on your computer that was set at installation time. On Windows computers, STBPLUS probably means the directory c:\ado\stbplus, but it might mean something else. You can find out what it means by typing sysdir, but this is irrelevant if you use the defaults.
Remarks
For an introduction to using net and ado, see [U] 32 Using the Internet to keep up to date. The purpose of this documentation is

1. To briefly but accurately describe net and ado and all their features.
2. To provide documentation to those who wish to set up their own sites to distribute additions to Stata.

Remarks are presented under the headings

    Definition of a package
    The purpose of the net and ado commands
    Content pages
    Package-description pages
    Where packages are installed
    A summary of the net command
    A summary of the ado command
    Relationship of net and ado to the point-and-click interface
    Creating your own site
    Format of content and package-description files
    Example 1
    Example 2
    Metacharacters in content and package-description files
    Error-free file delivery

Definition of a package
A package is a collection of files--typically .ado and .hlp files--that provides a new feature in Stata. Packages contain additions that you wish were part of Stata at the outset. We write such additions and so do other users.

One source of these additions is the Stata Technical Bulletin (STB). The STB is a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for free from our web site. If you do not have Internet access, you may purchase the STB media from StataCorp.
The purpose of the net and ado commands
The purpose of the net command is to make distribution and installation of packages easy. The goal is to get you quickly to a package-description page that summarizes the addition:

. net describe rte_stat

package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer/
---------------------------------------------------------------------
TITLE
      rte_stat.  The robust-to-everything statistic; update.

DESCRIPTION/AUTHOR(S)
      S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
      Aleph-0 100% confidence intervals proved too conservative for some
      applications; Aleph-1 confidence intervals have been substituted.
      The new robust-to-everything supplants the previous robust-to-
      everything-conceivable statistic.  See "Inference in the absence
      of data" (forthcoming).  After installation, see help rte.

INSTALLATION FILES                       (type net install rte_stat)
      rte.ado
      rte.hlp
      nullset.ado
      random.ado
---------------------------------------------------------------------

Should you decide the addition might prove useful, net makes the installation easy:

. net install rte_stat
checking rte_stat consistency and verifying not already installed...
installing into c:\ado\stbplus\...
installation complete.
The purpose of the ado command is to help you manage packages installed with net. Perhaps you remember that you installed a package that calculates the robust-to-everything statistic but cannot remember the command's name. You could use ado to search what you have previously installed for the rte command:

. ado

[1]  package sg146 from http://www.stata.com/stb/stb56
     STB-56 sg146. Scalar measures of fit for regression models.
 (output omitted)
[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.
 (output omitted)
[21] package sg121 from http://www.stata.com/stb/stb52
     STB-52 sg121: Seemingly unrelated est. & cluster-adjusted sandwich est.

or you might type

. ado, find("robust-to-everything")

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.

Perhaps you decide that rte, despite the author's claims, is not worth the disk space it occupies. You can use ado to erase it:

. ado uninstall rte_stat

package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
    rte_stat. The robust-to-everything statistic; update.
(package uninstalled)

ado uninstall is easier than erasing the files by hand because ado uninstall will erase every file associated with the package and, moreover, ado knows where on your computer rte_stat is installed; you would have to hunt for these files.
Content pages
There are two types of pages displayed by net: content pages and package-description pages. When you type net from, net cd, net link, or net without arguments, Stata goes to the specified place and displays the content page:

. net from http://www.stata.com

http://www.stata.com/
STB and other user-written additions for use with Stata

Welcome to Stata Corporation. Below we provide user-written additions to Stata that were published in the STB or mentioned on Statalist. These are NOT THE OFFICIAL UPDATES; you fetch the official updates by typing -update-.

DIRECTORIES you could -net cd- to:
    stb        materials published in the Stata Technical Bulletin
    users      materials by various people including StataCorp employees
    meetings   Stata user group meetings
    quest      StataQuest additions
    links      other locations providing additions to Stata

A content page tells you about other content pages and/or package-description pages. The above example lists other content pages only. Below we follow one of the links:

. net cd stb

http://www.stata.com/stb/
The Stata Technical Bulletin

PLACES you could -net link- to:
    stata      StataCorp web site
    portugal   STB mirror site

DIRECTORIES you could -net cd- to:
 (output omitted)
    stb57      STB-57, September 2000
    stb56      STB-56, July 2000
    stb55      STB-55, May 2000
    stb54      STB-54, March 2000
    stb53      STB-53, January 2000
    stb52      STB-52, November 1999
 (output omitted)

. net cd stb54

http://www.stata.com/stb/stb54/
STB-54  March 2000

DIRECTORIES you could -net cd- to:
    ..         Other STBs
I { I
i i
_, i
;
• ;
i _
397
-mot dlscribe-:
dm73_I
Contrasts
dm76 dm77
ICD-9 diagnostic and procedure code utility Kemoving duplicate observations in a dataset
for categorical
to drawing
variables:
update
gr34_3
An update
gr43 ip29_I
Overlaying graphs Metadata for usercwritten
she32
Automated
sg120_2 sgll6_l
Correction to roccomp command Update to hotdeck imputation
sg132 sgl30 sg133
Analysis of variance from summary statistics Box-Cox regressidn models Sequential and drop one term likelihood-ratio
sg134 sg84_2 sxdl_2
Model selection using the Akaike information criterion Concordance corr61ation coefficient: update for Stata 6 Random allocatio_ of treatments bal. in blocks: update
outbreak
Venn diagrams
det.
contributions for pub.
health
to Stata surveillance
data
tests
dm73_1, dm76, ..., sxd1_2 are links to package-description pages.

1. When you type net from, you follow that with a location to display the location's content page.
   a. The location could be a URL such as http://www.stata.com. The content page at that location would then be listed.
   b. The location could be a: on a Windows computer or :diskette: on a Macintosh computer. The content page on that source would be listed. That would work if you had special media obtained from StataCorp (STB media) or special media prepared by another user.
   c. The location could even be a directory on your computer, but that would work only if that directory contained the right kind of files.

2. Once you have specified a location, typing net cd will take you into subdirectories of that location, if there are any. Typing

   . net cd stb

   is equivalent to typing

   . net from http://www.stata.com/stb

   Typing net cd displays the content page from that location.

3. Typing net without arguments redisplays the current content page, which is the content page last displayed.

4. net link is similar in effect to net cd in that the result is to change the location, but rather than changing to subdirectories of the current location, net link jumps to another location:
. net from http://www.xk8.net

http://www.xk8.net/
Welcome to www.xk8.net.

No, we don't use statistical software, but one of the employees at StataCorp asked us to put four files on our web site so you could see how a user site might look. We rather like the one Stata page and see that it does not interfere with the other HTML files.

PLACES you could -net link- to:
    stata      StataCorp's main web site

PACKAGES you could -net describe-:
    xsample    A sample package

Typing net link stata would jump to http://www.stata.com:

. net link stata

http://www.stata.com/
STB and other user-written additions for use with Stata

Welcome to Stata Corporation.
 (output omitted)
Package-description pages
Package-description pages describe what would be installed:

. net from http://www.stata.com/stb/stb54

http://www.stata.com/stb/stb54/
 (output omitted)

. net describe sg132

package sg132 from http://www.stata.com/stb/stb54
--------------------------------------------------
TITLE
      STB-54 sg132.pkg.  Analysis of variance from summary statistics.

DESCRIPTION/AUTHOR(S)
      STB insert by John R. Gleason, Syracuse University.
      Support:  loesljrg@accucom.net
      After installation, see help aovsum.

INSTALLATION FILES                       (type net install sg132)
      sg132/aovsum.ado
      sg132/aovsum.hlp

ANCILLARY FILES                          (type net get sg132)
      sg132/absences.dta
      sg132/demo.do
      sg132/ldose.dta
--------------------------------------------------

A package-description page describes the package and tells you how to install the component files. Package-description pages potentially describe two types of files:

1. Installation files: files you type net install to install and which are required to make the addition work.

2. Ancillary files: additional files you might want to install--you type net get to install them--but which you can ignore. Ancillary files are typically datasets that are useful for demonstration purposes. Ancillary files are not really installed in the sense of being copied to an official place for use by Stata itself. They are merely copied into the current directory so that you may use them if you wish.

You install the official files by typing net install followed by the package name:

. net install sg132
checking sg132 consistency and verifying not already installed...
installing into c:\ado\stbplus\...
installation complete.

You get the ancillary files--if there are any and if you want them--by typing net get followed by the package name:

. net get sg132
checking sg132 consistency and verifying not already installed...
copying into current directory...
      copying absences.dta
      copying demo.do
      copying ldose.dta
ancillary files successfully copied.
i] package
net
install
use ado to redisplay the package-
sg132
sg132
from http://www.stata.com/stb/stb54
Tn_ :
STB-54
sg132.pkg.
Analysis
of variance
DSS(_IPTION/AUT_It($) : STB insert by John R. Gleason, Syracuse i Support : loesljrg©accucom, net After _NS_ALLATION i a\ao_s_, a\aovs_,
installation,
from summary
statistics
University
see help aovsum
FILES ado hip
iNStALLED ON ! 25 Jul 2000
• I
Note that he package-description page shown _ ado includes where we got the package and when we install_d it. Also note that it does not mention the ancillary files that were originally par_ of thi_ package b_cause they are not tracked by ado.
Where packages are installed
Packages should be installed in STBPLUS or SITE. STBPLUS and SITE are codewords that Stata understands and that correspond to some real directory on your computer. Typing sysdir will tell you where these are, if you care:

. sysdir
   STATA:  C:\STATA\
 UPDATES:  C:\STATA\ado\updates\
    BASE:  C:\STATA\ado\base\
    SITE:  C:\STATA\ado\site\
 STBPLUS:  c:\ado\stbplus\
PERSONAL:  c:\ado\personal\
OLDPLACE:  c:\ado\

If you type sysdir, you may obtain different results.

By default, net installs in the STBPLUS directory, and ado tells you about what is installed there. If you are on a multiple-user system, you may wish to install some packages in the SITE directory, where they will be available to other Stata users. To do that, before using net install, type

. net set ado SITE

and when reviewing what is installed or removing packages, redirect ado to that directory:

. ado ..., from(SITE)

In both cases, you literally type "SITE" because Stata will understand that SITE means the site ado-directory as defined by sysdir. To install into SITE, you must have write access to that directory.

If you reset where net installs and then, in the same session, wish to install into your private ado-directory, type

. net set ado STBPLUS

That is how things were originally. If you are confused as to where you are, type net query.
A summary of the net command
The purpose of the net command is to display content pages and package-description pages. Such pages are provided over the Internet, and most users get them there. We recommend that you start at http://www.stata.com and work out from there. We also recommend using net search to find packages of interest to you; see [R] net search.

You do not need Internet access to use net. The additions published in the STB are also available on media and can be obtained from StataCorp; see [R] stb. There is a charge for the media.

net from is how you jump to a location. The location's content page is displayed.

net cd and net link change from there to other locations. net cd enters subdirectories of the original location; net link jumps from one place to another, the destination being determined by what the content provider coded on their content page.

net describe pkgname lists a package-description page. Packages are named, and you type net describe pkgname.

net install installs a package into your copy of Stata. net get copies any additional files (ancillary files) to your current directory.
A summary of the ado command
The purpose of the ado command is to list the package descriptions of previously installed packages.

Typing ado without arguments is the same as typing ado dir. Both list the names and titles of the packages you have installed.

ado describe lists full package-description pages.

ado uninstall removes packages from your computer.

Since you can install packages from a variety of sources, there is no guarantee that the package names are unique. Thus, the packages installed on your computer are numbered sequentially, and you may refer to them by name or by number. For instance, say you wanted to get rid of the robust-to-everything statistic command you installed:

. ado, find("robust-to-everything")

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.

You could type

. ado uninstall rte_stat

or

. ado uninstall [15]

Typing ado uninstall rte_stat would work only if the name rte_stat were unique; otherwise, ado would refuse, and you would have to type the number.

The find() option is allowed with ado dir and ado describe. It searches the package descriptions for the word or phrase you specify, ignoring case (alpha matches Alpha). The complete package description is searched, including the author's name and the names of the files. Thus, if rte was the name of a command you wanted to eliminate but you could not remember the name of the package, you could type

. ado, find(rte)

[15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
     rte_stat. The robust-to-everything statistic; update.
Relationship of net and ado to the point-and-click interface
Users may instead pull down Help and select STB and User-written Programs. There are advantages and disadvantages:

1. Flipping through content and package-description pages is easier; it is much like a browser. See Chapter 20 in the Getting Started manual.

2. When browsing a package-description page, note that the .hlp files are highlighted. You may click on .hlp files to review them before installing the package.

3. You may not redirect from where ado searches for files.
Creating your own site
The rest of this entry concerns how to create your own site to distribute additions to Stata. The idea is that you have written additions for use with Stata--say xyz.ado and xyz.hlp--and you wish to put them out so that coworkers or researchers at other institutions can easily install them. Or, perhaps you just have a dataset that you and others want to share.

In any case, all you need is a homepage. You place the files you want to distribute on your homepage (or in a subdirectory), you add two more files--a content file and a package-description file--and you are done.

Format of content and package-description files
The content file describes the content page. It must be named stata.toc:

    ---------------------------------------- top of stata.toc -----
    OFF                    (to make site unavailable temporarily)
    * lines starting with * are comments; they are ignored
    * blank lines are ignored, too
    * v indicates version--specify v 2; old-style .toc files
    * do not have this
    v 2
    * d lines display description text
    * the first d line is the title and the remaining ones are text.
    * blank d lines display a blank line
    d title
    d text
    d text
    * l lines display links
    l word-to-show path-or-url [description]
    l word-to-show path-or-url [description]
    * t lines display other directories within the site
    t path [description]
    t path [description]
    * p lines display packages
    p pkgname [description]
    p pkgname [description]
    ---------------------------------------- end of stata.toc -----

Package files describe packages and are named pkgname.pkg:

    ---------------------------------------- top of pkgname.pkg -----
    * lines starting with * are comments; they are ignored
    * blank lines are ignored, too
    * v indicates version--specify v 2; old-style pkg files
    * do not have this
    v 2
    * d lines display package description text
    * the first d line is the title and the remaining ones are text.
    * blank d lines display a blank line
    d title
    d text
    d text
    * f identifies the component files
    f [path/]filename [description]
    f [path/]filename [description]
    * e line is optional; it means stop reading
    e
    ---------------------------------------- end of pkgname.pkg -----
Example 1
Say we want the user to see the following:

. net from http://www.university.edu/~me

http://www.university.edu/~me
Chris Farrar, Uni University

PACKAGES you could -net describe-:
    xyz        interval-truncated survival

. net describe xyz

package xyz from http://www.university.edu/~me
-----------------------------------------------
TITLE
      xyz.  interval-truncated survival.

DESCRIPTION/AUTHOR(S)
      C. Farrar, Uni University.

INSTALLATION FILES                       (type net install xyz)
      xyz.ado
      xyz.hlp

ANCILLARY FILES                          (type net get xyz)
      sample.dta
-----------------------------------------------

The files to do this would be

    ------------------------------ top of stata.toc -----
    v 2
    d Chris Farrar, Uni University
    p xyz interval-truncated survival
    ------------------------------ end of stata.toc -----

    ------------------------------ top of xyz.pkg -----
    v 2
    d xyz.  interval-truncated survival.
    d C. Farrar, Uni University.
    f xyz.ado
    f xyz.hlp
    f sample.dta
    ------------------------------ end of xyz.pkg -----

On his homepage, Chris would place the following files:

    stata.toc      (shown above)
    xyz.pkg        (shown above)
    xyz.ado        file to be delivered (for use by net install)
    xyz.hlp        file to be delivered (for use by net install)
    sample.dta     file to be delivered (for use by net get)

Note that Chris does nothing to distinguish ancillary files from installation files.
Example

S. Gazer wants to create a more complex site:

        . net from http://www.wemakeitupaswego.edu/faculty/sgazer
        http://www.wemakeitupaswego.edu/faculty/sgazer
        Data-free inference materials
        S. Gazer, Department of Applied Theoretical Mathematics
        Also see my homepage for the preprint of "Irrefutable inference".

        PLACES you could -net link- to:
            stata            StataCorp

        DIRECTORIES you could -net cd- to:
            ir               irrefutable inference programs (work in progress)

        PACKAGES you could -net describe-:
            rtec             Robust-to-everything-conceivable statistic
            rte              Robust-to-everything statistic

        . net describe rte
        package rte from http://www.wemakeitupaswego.edu/faculty/sgazer/

        TITLE
            rte.  The robust-to-everything statistic; update.

        DESCRIPTION/AUTHOR(S)
            S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
            Aleph-0 100% confidence intervals proved too conservative for some
            applications; Aleph-1 confidence intervals have been substituted.
            The new robust-to-everything supplants the previous robust-to-
            everything-conceivable statistic.  See "Inference in the absence
            of data" (forthcoming).  After installation, see help rte.
            Support:  email [email protected]

        INSTALLATION FILES                  (type net install rte)
            rte.ado
            rte.hlp
            nullset.ado
            random.ado
        ANCILLARY FILES                     (type net get rte)
            empty.dta

The files to do this would be

        ------------------------- top of stata.toc -------------------------
        v 2
        d Data-free inference materials
        d S. Gazer, Department of Applied Theoretical Mathematics
        d Also see my homepage for the preprint of "Irrefutable inference".
        l stata http://www.stata.com
        t ir irrefutable inference programs (work in progress)
        p rtec Robust-to-everything-conceivable statistic
        p rte Robust-to-everything statistic
        ------------------------- end of stata.toc -------------------------
"
net -- Install and mdnage user-written additions from the net
405
        -------------------------- top of rte.pkg --------------------------
        v 2
        d rte.  The robust-to-everything statistic; update.
        d {bf:S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.}
        d Aleph-0 100% confidence intervals proved too conservative for some
        d applications; Aleph-1 confidence intervals have been substituted.
        d The new robust-to-everything supplants the previous robust-to-
        d everything-conceivable statistic.  See "Inference in the absence
        d of data" (forthcoming).  After installation, see help {bf:rte}.
        d Support:  email [email protected]
        f rte.ado
        f rte.hlp
        f nullset.ado
        f random.ado
        f empty.dta
        -------------------------- end of rte.pkg --------------------------

On his homepage, Mr. Gazer would place the following files:

        stata.toc        (shown above)
        rte.pkg          (shown above)
        rte.ado          (file to be delivered)
        rte.hlp          (file to be delivered)
        nullset.ado      (file to be delivered)
        random.ado       (file to be delivered)
        empty.dta        (file to be delivered)
        rtec.pkg         the other package referred to in stata.toc
        rtec.ado         the corresponding files to be delivered
        rtec.hlp
        ir/stata.toc     the contents file for when the user types net cd ir
        ir/*.pkg         whatever other .pkg files are referred to
        ir/*             whatever other files are to be delivered

For complex sites, a different structure may prove more convenient:

        stata.toc        (shown above)
        rte.pkg          (shown above)
        rtec.pkg         the other package referred to in stata.toc
        rte/             directory containing rte files to be delivered:
        rte/rte.ado      (file to be delivered)
        rte/rte.hlp      (file to be delivered)
        rte/nullset.ado  (file to be delivered)
        rte/random.ado   (file to be delivered)
        rte/empty.dta    (file to be delivered)
        rtec/            directory containing rtec files to be delivered:
        rtec/...         (files to be delivered)
        ir/stata.toc     the contents file for when the user types net cd ir
        ir/*.pkg         whatever other package files are referred to
        ir/*             whatever other files are to be delivered
If you prefer this structure, it is simply a matter of changing the bottom of the rte.pkg file from

        f rte.ado
        f rte.hlp
        f nullset.ado
        f random.ado
        f empty.dta

to

        f rte/rte.ado
        f rte/rte.hlp
        f rte/nullset.ado
        f rte/random.ado
        f rte/empty.dta
Note that in writing paths and files, the directory separator forward slash (/) is used regardless of operating system, because this is what the Internet uses.

Also note that it does not matter whether the files you put out are in DOS/Windows, Macintosh, or Unix format (how lines end is recorded differently). When Stata reads the files over the Internet, it will figure out the file format on its own and automatically translate the files to what is appropriate for the receiver.
SMCL in content and package-description files

The text listed on the second and subsequent d lines in both stata.toc and pkgname.pkg may contain SMCL as long as you include v 2; see [P] smcl. Thus, in rte.pkg, note that S. Gazer coded the third line as

        d {bf:S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.}
Error-free file delivery

Most people transport files over the Internet and never worry about the files being corrupted in the process because corruption rarely occurs. If, however, it is of great importance to you that the files be delivered perfectly or not at all, you can include checksum files in the directory.

For instance, say that included in your package is big.dta and that it is of great importance that big.dta be sent perfectly. First, use Stata to make the checksum file for big.dta:

        . checksum big.dta, save

That creates a small file called big.sum; see [R] checksum. Then copy both big.dta and big.sum to your homepage.

Whenever Stata reads filename.whatever over the net, it also looks for filename.sum. If it finds such a file, it uses the information recorded in it to verify that what was copied was error free.

If you do this, be cautious. If you put big.dta and big.sum on your homepage and then later change big.dta without changing big.sum, people will think there are transmission errors when they try to download big.dta.
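A minimal maintainer-side sketch of the whole sequence, assuming the files live in the current directory:

        . checksum big.dta, save
        (big.sum is created; now copy big.dta and big.sum together to the homepage)

The essential discipline is that the pair must always be updated together: whenever big.dta changes, rerun checksum with the save option and re-copy both files.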
References

Baum, C. F. and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121-124.

Cox, N. J. and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language: extensions. Stata Technical Bulletin 54: 21-22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124-126.
Also See

Complementary:   [R] checksum, [R] net search, [R] stb

Related:         [R] update, [P] smcl

Background:      [GSM] 20 Using the Internet,
                 [GSU] 20 Using the Internet,
                 [GSW] 20 Using the Internet,
                 [U] 32 Using the Internet to keep up to date
Title

net search -- Search Internet for installable packages
Syntax

        net search keywords [, or nostb tocpkg toc pkg everywhere filenames errnone ]
Description

net search searches the Internet for user-written additions to Stata, including but not limited to user-written additions published in the STB. net search lists the available additions that contain the specified keywords.

The user-written materials found are available for immediate download by using the net command or by clicking on the link.

In addition to typing net search, users may pull down Help and select STB and User-written Programs.
Options

or is relevant only when multiple keywords are specified. By default, only packages that include all the keywords are listed. or changes this to list packages that contain any of the keywords.

nostb restricts the search to non-STB sources or, said differently, causes net search not to list matches that were published in the STB.

tocpkg, toc, and pkg determine what is searched. tocpkg is the default, meaning that both tables of contents (tocs) and packages (pkgs) are searched. toc restricts the search to tables of contents only. pkg restricts the search to packages only.

everywhere and filenames determine where in packages net search looks for keywords. The default is everywhere. filenames restricts net search to search for matches only in the filenames associated with a package. Specifying everywhere implies pkg.

errnone is a programmer's option. It causes the return code to be 111 instead of 0 when no matches are found.
Remarks

net search searches the Internet for user-written additions to Stata. If you want to search the Stata documentation for a particular topic, command, or author, see [R] search.
Topic searches

Example: find what is available about "random effects"

        . net search random effect

Comments:

1. It is best to search for the singular. net search random effect will find both "random effect" and "random effects".

2. net search random effect will also find "random-effect" because net search performs a string search and not a word search.

3. net search random effect lists all packages containing the words "random" and "effect", not necessarily used together.

4. If you wanted all packages containing the word "random" or the word "effect", you would type net search random effect, or.

Author searches

Example: find what is available by author Jeroen Weesie

        . net search weesie

Comments:

1. You could type net search jeroen weesie, but that might list less, because sometimes the last name is used without the first.

2. You could type net search Weesie, but it would not matter. Capitalization is ignored in the search.

Example: find what is available by Jeroen Weesie excluding STB materials

        . net search weesie, nostb

Comments:

1. The STB tends to dominate search results because so much has been published in the STB. If you know what you are looking for is not in the STB, specifying the nostb option will narrow the search.

2. net search weesie lists everything net search weesie, nostb lists, and more. If you just type net search weesie, STB materials are listed first and non-STB materials are listed last.

Command searches

Example: find the user-written command kursus

        . net search kursus, file

Comments:

1. You could just type net search kursus, and that will list everything net search kursus, file lists, and more. Since you know kursus is a command, however, there must be a kursus.ado file associated with the package. Typing net search kursus, file narrows the search.

2. You could also type net search kursus.ado, file to narrow the search even more.
Where does net search look?

net search looks everywhere, not just at www.stata.com.

net search begins by looking at www.stata.com, but then follows every link, which takes it to other places, and then follows every link again, which takes it to even more places, and so on.

Authors: please let us know if you have a site that we should include in our search by sending an email to [email protected]. We will then link to your site from ours to ensure that net search finds your materials. That is not strictly necessary, however, as long as your site is directly or indirectly linked from some site that is linked to ours.
How does net search really work?

[diagram omitted: your computer contacts www.stata.com, which in turn is fed by the crawler]

www.stata.com maintains a database of Stata resources. When you use net search, it contacts www.stata.com with your request, www.stata.com searches its database, and Stata returns the result(s) to you.

Another part of the system is called the crawler: it searches the web for new Stata resources to add to the net search database, and it verifies that the resources already found are still available. Given how the crawler works, when a new resource becomes available, the crawler takes about two days to notice it and, similarly, if a resource disappears, the crawler takes roughly two days to remove it from the database.
References

Baum, C. F. and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121-124.

Cox, N. J. and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language: extensions. Stata Technical Bulletin 54: 21-22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124-126.

Gould, W. and A. Riley. 2000. stata55: Search web for installable packages. Stata Technical Bulletin 54: 4-6. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 10-13.
Also See

Complementary:   [R] net, [R] stb

Related:         [R] search, [R] update
Title

newey -- Regression with Newey-West standard errors
Syntax

        newey depvar [varlist] [weight] [if exp] [in range], lag(#)
                [ t(varnamet) force noconstant level(#) ]

aweights are allowed; see [U] 14.1.6 weight.

newey shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

        predict [type] newvarname [if exp] [in range] [, { xb | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

newey produces Newey-West standard errors for coefficients estimated by OLS regression. The error structure is assumed to be heteroskedastic and possibly autocorrelated up to some lag.

Note that if lag(0) is specified, the variance estimates produced by newey are simply the Huber/White/sandwich robust variance estimates calculated by regress, robust; see [R] regress.
Options

lag(#) is not optional; it specifies the maximum lag to consider in the autocorrelation structure. If you specify # > 0, then you must also specify option t(), described below. If you specify lag(0), the output is exactly the same as regress, robust.
t(varnamet) specifies the variable recording the time of each observation. You must specify t() if lag() > 0. varnamet must record values indicating that the observations are equally spaced in time, or newey will refuse to estimate the model. If observations are not equally spaced but you wish to treat them as if they were, you must specify the force option. You need only specify t() the first time you estimate a model with a particular dataset. After that, it need not be specified again except to change the variable's identity; newey remembers your previous t() option.

force specifies that estimation is to be forced even though t() shows the data not to be equally spaced. newey requires observations to be equally spaced so that calculations based on lags correspond to a constant time change. If you specify a t() variable indicating that observations are not equally spaced, newey will refuse to estimate the model. If you also specify force, newey will estimate the model and assume that the lags based on the data ordered by t() are appropriate.

noconstant specifies that the estimated regression should not include an intercept term.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for predict
xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.
Remarks

The Huber/White/sandwich robust variance estimator (see, for example, White 1980) produces consistent standard errors for OLS regression coefficient estimates in the presence of heteroskedasticity. The Newey-West (1987) variance estimator is an extension that produces consistent estimates when there is autocorrelation in addition to possible heteroskedasticity.

The Newey-West variance estimator handles autocorrelation up to and including a lag of m, where m is specified by stipulating a lag(m) option. Thus, it assumes that any autocorrelation at lags greater than m can be ignored.
Example

newey, lag(0) is equivalent to regress, robust:

        . regress price weight displ, robust

        Regression with robust standard errors          Number of obs =      74
                                                        F(  2,    71) =   14.44
                                                        Prob > F      =  0.0000
                                                        R-squared     =  0.2909
                                                        Root MSE      =  2518.4

        ------------------------------------------------------------------------------
                     |               Robust
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
        displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
               _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
        ------------------------------------------------------------------------------

        . newey price weight displ, lag(0)

        Regression with Newey-West standard errors      Number of obs =      74
        maximum lag: 0                                  F(  2,    71) =   14.44
                                                        Prob > F      =  0.0000

        ------------------------------------------------------------------------------
                     |             Newey-West
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
        displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
               _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
        ------------------------------------------------------------------------------
Example

We have time-series measurements on variables usr and idle and now wish to estimate an OLS model but obtain Newey-West standard errors allowing for a lag of up to 3:

        . newey usr idle, lag(3) t(time)

        Regression with Newey-West standard errors      Number of obs =      30
        maximum lag: 3                                  F(  1,    28) =   10.90
                                                        Prob > F      =  0.0026

        ------------------------------------------------------------------------------
                     |             Newey-West
                 usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                idle |  -.2281501   .0690927    -3.30   0.003    -.3696801      -.08662
               _cons |   23.13483   6.327031     3.66   0.001     10.17449    36.09516
        ------------------------------------------------------------------------------
q
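Had the measurements in this example been unequally spaced in time, newey would have refused to estimate the model. A sketch of how one might proceed anyway, treating the observations as if they were equally spaced (see the force option under Options above; interpret such results cautiously):

        . newey usr idle, lag(3) t(time) force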
Saved Results

newey saves in e():

Scalars
        e(N)         number of observations        e(F)         F statistic
        e(df_m)      model degrees of freedom      e(lag)       maximum lag
        e(df_r)      residual degrees of freedom

Macros
        e(cmd)       newey                         e(wexp)      weight expression
        e(depvar)    name of dependent variable    e(vcetype)   covariance estimation method
        e(wtype)     weight type                   e(predict)   program used to implement
                                                                predict

Matrices
        e(b)         coefficient vector            e(V)         variance-covariance matrix of
                                                                the estimators

Functions
        e(sample)    marks estimation sample
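As a quick sketch of how these saved results can be inspected after estimation (the model is the one from the first example above; display and matrix list are standard Stata commands):

        . newey price weight displ, lag(0)
        . display e(lag)
        . matrix list e(b)

display e(lag) shows the maximum lag used (here 0), and matrix list e(b) lists the coefficient vector.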
Methods and Formulas

newey is implemented as an ado-file.

newey calculates the estimates

$$\hat{\boldsymbol\beta}_{\rm OLS} = (X'X)^{-1}X'y$$

$$\widehat{\rm Var}(\hat{\boldsymbol\beta}) = (X'X)^{-1}\, X'\widehat\Omega X\, (X'X)^{-1}$$

That is, the coefficient estimates are simply those of OLS linear regression.

For the case of lag(0) (no autocorrelation), the variance estimates are calculated using the White formulation:

$$X'\widehat\Omega X = X'\widehat\Omega_0 X = \frac{n}{n-k} \sum_i \hat e_i^{\,2}\, x_i'x_i$$

Here $\hat e_i = y_i - x_i\hat{\boldsymbol\beta}_{\rm OLS}$, where $x_i$ is the ith row of the $X$ matrix, $n$ is the number of observations, and $k$ is the number of predictors in the model, including the constant if there is one. Note that the above formula is exactly the same as that used by regress, robust with the regression-like formula (the default) for the multiplier $q_c$; see the Methods and Formulas section of [R] regress.
For the case of lag(m), m > 0, the variance estimates are calculated using the Newey-West (1987) formulation

$$X'\widehat\Omega X = X'\widehat\Omega_0 X + \frac{n}{n-k} \sum_{l=1}^{m} \left(1 - \frac{l}{m+1}\right) \sum_{t=l+1}^{n} \hat e_t\, \hat e_{t-l}\, (x_t' x_{t-l} + x_{t-l}' x_t)$$

where m is the maximum lag.
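The factor $1 - l/(m+1)$ is the Bartlett kernel, which downweights higher-lag terms linearly and guarantees a positive semidefinite variance estimate. For example, with lag(3) the cross products at lags 1, 2, and 3 receive weights 3/4, 1/2, and 1/4, respectively, and lags beyond 3 are ignored.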
References

Hardin, J. W. 1997. sg72: Newey-West standard errors for probit, logit, and poisson models. Stata Technical Bulletin 39: 32-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 182-186.

Newey, W. K. and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703-708.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:   [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] test, [R] testnl, [R] vce

Related:         [R] regress, [R] svy estimators, [R] xtgls, [R] xtpcse

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title

news -- Report Stata news
Syntax

        news
Description

news displays a brief listing of recent news and information of interest to Stata users. It obtains this information directly from Stata's web site.

The news command requires your computer to be connected to the Internet. An error message will be displayed if the connection to Stata's web site is unsuccessful.

You may also execute the news command by selecting News from the Help menu.
Remarks

news provides an easy way of displaying a brief list of the latest Stata news. More details and other items of interest are available at Stata's web site; point your browser to http://www.stata.com. Here is an example of what news produces:

        . news

        --- StataCorp News ---

        * Intercooled Stata for Windows 2001 is now available
        * STB-68 (July 2002) is now available; use the net command to download
        * NetCourse 151: "Introduction to Stata Programming" begins July 23, 2002
        * Proceedings of the 8th London Stata User Group Meeting will be
          available next month (projected to be available Aug 1, 2002)

        For information on these and additional topics, point your web browser to:
        http://www.stata.com
In this case, news indicates that there is a new STB available. Users can click on STB and User-written Programs from the Help menu to download STB files. Alternatively, the net command (see [R] net) can be used.
Also See

Related:         [R] net

Background:      [U] 32 Using the Internet to keep up to date
Title

nl -- Nonlinear least squares

Syntax

        nl fcn depvar [varlist] [weight] [if exp] [in range] [, level(#)
                init(...) lnlsq(#) leave eps(#) nolog trace iterate(#)
                delta(#) fcn_options ]

        nlinit # parameter_list

by ...: may be used with nl; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

nl shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname [if exp] [in range] [, { yhat | residuals } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. You provide the function itself in a separate program with a name of your choosing, except that the first two letters of the name must be nl. fcn refers to the name of the function without the first two letters. For example, you type nl nexpgr ... to estimate with the function defined in the program nlnexpgr.

nlinit is useful when writing nlfcns.
Options

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

init(...) specifies initial values for parameters that are to be used to override the default initial values. Examples are provided below.

lnlsq(#) fits the model defined by nlfcn using "log least squares", defined as least squares with shifted lognormal errors. In other words, ln(depvar - #) is assumed normally distributed. Sums of squares and deviance are adjusted to the same scale as depvar.

leave leaves behind after estimation a set of new variables with the same names as the estimated parameters, containing the derivative of E(y) with respect to the parameter.

eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares. eps(1e-5) is the default.
nolog suppresses the iteration log.

trace expands the iteration log to provide more details, including values of the parameters at each step of the process.

iterate(#) specifies the maximum number of iterations before giving up and defaults to 100.

delta(#) specifies the relative change in a parameter to be used in computing the numeric derivatives. The derivative for parameter $\beta_i$ is computed as $\{f(X, \beta_1, \beta_2, \ldots, \beta_i + d, \beta_{i+1}, \ldots) - f(X, \beta_1, \beta_2, \ldots, \beta_i, \beta_{i+1}, \ldots)\}/d$, where $d = \delta(|\beta_i| + \delta)$. The default $\delta$ is 4e-7.

fcn_options refer to any options allowed by nlfcn.
Options for predict

yhat, the default, calculates the predicted value of depvar.

residuals calculates the residuals.
Remarks

Remarks are presented under the headings

        nlfcns
        Some common nlfcns
        Log-normal errors
        Weights
        Errors
        General comments on fitting nonlinear models
        More on nlfcns

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. The specific function is specified by writing an nlfcn, described below. The values to be fitted in the function are called the parameters.

The fitting process is iterative (modified Gauss-Newton). It starts with a set of initial values for the parameters (guesses as to what the values will be and which you also supply) and finds another set of values that fit the function even better. Those are then used as a starting point and another improvement is found, and the process continues until no further improvement is possible.
nlfcns

nl uses the function defined by nlfcn. nlfcn has two purposes: to identify the parameters of the problem and set default initial values, and to evaluate the function for a given set of parameter estimates.
Example

You have variables y and x in your data and wish to fit a negative-exponential growth curve with parameters B0 and B1:

$$Y = B_0\,(1 - e^{-B_1 x})$$

First, you write a program to calculate the predicted values:
        program define nlnexpgr
                version 7.0
                if "`1'" == "?" {                   /* if query call          */
                        global S_1 "B0 B1"          /* declare parameters     */
                        global B0=1                 /* and initialize them    */
                        global B1=.1
                        exit
                }
                replace `1'=$B0*(1-exp(-$B1*x))     /* otherwise, calculate
                                                       function               */
        end
To estimate the model, you type nl nexpgr y. nl's first argument specifies the name of the function, although you do not type the nl prefix. You type nexpgr, meaning the function is nlnexpgr. nl's second argument specifies the name of the dependent variable. Replicating the example in the SAS manual (1985, 588-590):

        . use sasxmpl1

        . nl nexpgr y
        (obs = 20)

        Iteration 0:  residual SS =  .1999027
        Iteration 1:  residual SS =  .0026064
        Iteration 2:  residual SS =  .0005769
        Iteration 3:  residual SS =  .0005768

              Source |       SS       df       MS        Number of obs =        20
        -------------+------------------------------    F(  2,    18) = 275732.74
               Model |  17.6717234     2  8.83586172    Prob > F      =    0.0000
            Residual |  .000576831    18  .000032045    R-squared     =    1.0000
        -------------+------------------------------    Adj R-squared =    1.0000
               Total |  17.6723003    20  .883615013    Root MSE      =  .0056608
                                                        Res. dev.     =  -152.317
        (nexpgr)
        ------------------------------------------------------------------------------
                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                  B0 |   .9961886   .0016138   617.29   0.000     .9927981    .9995791
                  B1 |   .0419539   .0003982   105.35   0.000     .0411172    .0427905
        ------------------------------------------------------------------------------
        (SEs, P values, CIs, and correlations are asymptotic approximations)
Notice that the initial values of the parameters were provided in the nlnexpgr program. You can, however, override these initial values on the nl command line. To estimate the model using .5 for the initial value of B0 rather than 1, you can type nl nexpgr y, init(B0=.5). To also change the initial value of B1 from .1 to .2, you type nl nexpgr y, init(B0=.5 B1=.2).

The outline of all nlfcns is the same:

        program define nlfcn
                version 7.0
                if "`1'" == "?" {
                        global S_1 "parameter names"
                        (initialize parameters)
                        exit
                }
                replace `1' = ...
        end
!
!_ .!
420
nl -- Nonlinear least squares
that the parameters are A, B, and C (via global S_l "A B C"), it must then place initial values in the corresponding parameter macros A, B, and C (via global A=O, global B=I, etc.). After initializing the parameter macros, it is done. On a calculation call, "1" does not contain "?"; it instead contains the name of a variable that is to be filled in with the predicted values. The current values of the parameters are stored in the macros previously declared on the query call (e.g., $A, SB, and $C).
1>Example You wish to fit the CES production
functions defined by
lnq = Bo + Aln{Dl
R + (1 - D)k 2}
where the parameters to be estimated are B0, A, D, and R. q, l, and k refer to total output and labor and capital inputs. In your data, you have the variables lnq, labor, and capital. The ntfcn is program
define nlees version 7.0 "'1""
if
==
"7"
{
global
8_i
global
BO = 1
"BOA
global
A = -1
global
D =
global exit
R = -1
D R"
.5
} " I'=$BO
replace
+ SA*in($D*labor'$R
* (l-$D)*eapitai'$R)
end
Again using data from the SAS manual (1985, 591-592):

        . use sasxmpl2

        . nl ces lnq
        (obs = 30)

        Iteration 0:   residual SS =  37.09651
        Iteration 1:   residual SS =  35.48655
        Iteration 2:   residual SS =  22.69058
        Iteration 3:   residual SS =  1.845468
         (output omitted)
        Iteration 20:  residual SS =  1.761039
        Iteration 21:  residual SS =  1.761039

              Source |       SS       df       MS        Number of obs =        30
        -------------+------------------------------    F(  3,    26) =    292.96
               Model |  59.5286148     3  19.8428718    Prob > F      =    0.0000
            Residual |  1.76103929    26   .06773228    R-squared     =    0.9713
        -------------+------------------------------    Adj R-squared =    0.9680
               Total |  61.2896541    29  2.11343635    Root MSE      =  .2602543
                                                        Res. dev.     =  .0775147
        (ces)
        ------------------------------------------------------------------------------
                 lnq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 B0* |   .1244892   .0783443     1.59   0.124    -.0365497    .2855282
                   A |  -.3362823   .2721671    -1.24   0.228    -.8957298    .2231652
                   D |   .3366722   .1361172     2.47   0.020     .0568793    .6164652
                   R |  -3.011121   2.323575    -1.30   0.206    -7.787297    1.765055
        ------------------------------------------------------------------------------
        * Parameter B0 taken as constant term in model & ANOVA table
        (SEs, P values, CIs, and correlations are asymptotic approximations)
If the nonlinear model contains a constant term, nl will find it and indicate its presence by placing an asterisk next to the parameter name when displaying results. In the output above, B0 is a constant. (nl determines that a parameter B0 is a constant term because the partial derivative f = dE(y)/dB0 has a coefficient of variation (s.d./mean) less than eps(). Usually, f = 1 for a constant, as it does in this case.)

The output of nl closely mimics that of regress; see [R] regress. The model F test, R-squared, sums of squares, etc., are calculated as regress calculates them, which means that they are corrected for the mean. If no "constant" is present, as was the case in the negative-exponential growth example previously, the usual caveats apply to the interpretation of the F and R-squared statistics; see comments and references in Goldstein (1992).

When making its calculations, nl creates the partial derivative variables for all the parameters, giving each the same name as the corresponding parameter. Unless you specify leave, these are discarded when nl completes the estimation. Therefore, your data must not have variables that have the same names as parameters. We recommend using uppercased names for parameters and lowercased names (as is common) for variables.

After estimating with nl, typing nl by itself will redisplay previous estimates. Typing correlate, _coef will show the asymptotic correlation matrix of the parameters, and typing predict myvar will create new variable myvar containing the predicted values. Typing predict res, resid will create res containing the residuals.

nlfcns have a number of additional features that are described in More on nlfcns below.
Some common nlfcns

An important feature of nl, in addition to estimating arbitrary nonlinear regressions, is the facility for adding prewritten common fcns.

Three fcns are provided for exponential regression with one asymptote:

        exp3     Y = b0 + b1*b2^X
        exp2     Y = b1*b2^X
        exp2a    Y = b1*(1 - b2^X)

For instance, typing nl exp3 ras dvl estimates the three-parameter exponential model (parameters b0, b1, and b2) using Y = ras and X = dvl.

Two fcns are provided for the logistic function (symmetric sigmoid shape; not to be confused with logistic regression):

        log4     Y = b0 + b1/[1 + exp{-b2(X - b3)}]
        log3     Y = b1/[1 + exp{-b2(X - b3)}]

Finally, two fcns are provided for the Gompertz function (asymmetric sigmoid shape):

        gom4     Y = b0 + b1*exp[-exp{-b2(X - b3)}]
        gom3     Y = b1*exp[-exp{-b2(X - b3)}]
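As a sketch of how these prewritten fcns are invoked (the variable names here are hypothetical), fitting the four-parameter logistic to a response y and regressor x would be

        . nl log4 y x

Like nlexp2 in the technical note below, the prewritten fcns supply their own starting values, so init() is usually unnecessary, although it may still be specified to override them.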
Technical Note

You may find the functions above useful, but the important thing to note is that, if there is a nonlinear function you use often, you can package the function once and for all. Consider the function we packaged called exp2, which estimates the model Y = b1*b2^X. The code for the function is

        program define nlexp2
                version 7.0
                if "`1'"=="?" {
                        global S_2 "2-param. exp. growth curve, `e(depvar)'=b1*b2^`2'"
                        global S_1 "b1 b2"
                        * Approximate initial values by regression of log Y on X
                        local exp "[`e(wtype)'`e(wexp)']"
                        tempvar Y
                        quietly {
                                gen `Y' = log(`e(depvar)') if e(sample)
                                regress `Y' `2' `exp' if e(sample)
                                global b1 = exp(_b[_cons])
                                global b2 = exp(_b[`2'])
                        }
                        exit
                }
                replace `1'=$b1*$b2^`2'
        end
Because we were packaging this function for repeated use, we went to the trouble of obtaining good initial values, which in this case we could obtain by taking the log of both sides,

$$Y = b_1 b_2^X$$

$$\ln(Y) = \ln(b_1 b_2^X) = \ln(b_1) + \ln(b_2)\, X$$
and then using linear regression to estimate ln(b1) and ln(b2). If this had been a quick-and-dirty implementation, we probably would not have bothered (initializing b1 and b2 to 1, say) and so forced ourselves to specify better initial values with nl's init() option when they were not good enough.

The only other thing we did to complete the packaging was store nlexp2 as an ado-file called nlexp2.ado. The alternatives would have been to type the code into Stata interactively or to place the code in a do-file. Those approaches are adequate for occasional use, but we wanted to be able to type nl exp2 without having to worry whether the program nlexp2 was defined. When nl attempts to execute nlexp2, if the program is not in Stata's memory, Stata will search the disk(s) for an ado-file of the same name and, if found, automatically load it. All we had to do was name the file with the .ado suffix and then place it in a directory where Stata could find it. In our case, we put nlexp2.ado in Stata's system directory for StataCorp-written ado-files. In your case, you should put the file in the directory Stata reserves for user-written ado-files, which is to say, c:\ado\personal (Windows), ~/ado/personal (Unix), or ~:ado:personal (Macintosh). See [U] 20 Ado-files.
q
Log-normal errors

A nonlinear model with identically normally distributed errors may be written

$$y_i = f(\mathbf{x}_i, \boldsymbol\beta) + u_i, \qquad u_i \sim N(0, \sigma^2) \qquad (1)$$

for i = 1, ..., n. If the $y_i$ are thought to have a k-shifted lognormal instead of a normal distribution, that is, $\ln(y_i - k) \sim N(\zeta_i, \tau^2)$, and the systematic part $f(\mathbf{x}_i, \boldsymbol\beta)$ of the original model is still thought appropriate, the model becomes

$$\ln(y_i - k) = \zeta_i + v_i = \ln\{f(\mathbf{x}_i, \boldsymbol\beta) - k\} + v_i, \qquad v_i \sim N(0, \tau^2) \qquad (2)$$

This model is estimated if lnlsq(k) is specified.
is specifidd.
If ntod,_l (2)is correct, the variance of (Yi- _)is proportional to {f(xi,/3)k} 2. Probably the most corn non case is k = 0, sometimes called :"proportional errors" since the standard error of Yi is proport anal to its expectation, f(xi,/3). Assuming the value of k is known. (2) is just another nonlinear nodel in/3 and it may be fitted as usual. However, we may wish to compare the fit of (1)
i i
with that ( f (2) using the residual sum of square i or the deviance D, D = -2 x log-likelihood, from each mo& I. To do so, we must alloy, for the ctjange in scale introduced by the log transformation. Assuming, then, the y, to be normally distributed, Atkinson (1985, 85-87, 184), by considering the Jacobi_n IJ ]0 ln(yi - k)/Oy{I, showed that multiplying both sides of (2) by the geometric mean :,:
of Yi - k.1!1, gives residuals on the same scale as those of Yi- The geometric mean is given by
which is aiconstant for a given dataset. The residual deviance for (1)imd for (2) may be expressed as ,
':
D(_)
=
l+ln(2rr_
2) n
(3)
i
where _ i the maximum likelihood estimate (MLE) of/3 for each model and n_ 2 is the RSS from
i
(1), or tha1 from (2) mtfltiplied by b2. i Since (_) and (2) are models with different !error structures but the same functional form, the
! {
_ _
arithmetic _lifference in their RSS or deviances is _ot easily tested for statistical significance. However, if the devtance difference is large" (> 4, say), one would naturally prefer the model with the smaller de_,iance. Of course, the residuals for e_ch model should be examined for departures from
i_ '_
assumptiots (nonconstant variance, nonnormalit3_, serial correlations, etc.) in the usual way. Consider alternatively modeling E(yi) = I_(C + Ae Bx') E(1/yi) = E(y_) = C + Ae Bx'
i ,
(4)
(5)
where C.._, and 13 are parameters to be estimated. We will use the data (y, x) = (.04, 5). (.06, 12), (.08.25). (I.1.35), (1_ 42). (.2,48), (.25,60), (,3,75), and (.5,120)(Danuso 1991). Model C IA B RSS Deviance I
(4) (4) with l_lsq(0)
1.781 1.799
25.74 2545
-.03926-.001640 -.04051 -.001431
t!
(5) (5) with lnlsq(0)
1.781 1.799
25)4 2745
-.03926 -.04051
! i
,
! I
24.70 17.42
There is lit@ to choose between the two versions ;f the logistic model (4), whereas for the exponential model (5) _the fit using inlsq(O) is much betier (a deviance difference of 7.28). The reciprocal •
i
8.197 3.65t
-51.95 -53.18
I
¢
.
transformation has introduced heteroskedasticity into '_liwhich is countered by the propomonal errors property o_ the lognorrfial distribution implicit :in lnlsq(0). The deviances are not comparable between th_ logistic and}exponential models because the change of scale has not been allowed for. althcmgh inl principle it d°uld be"
•_ 'i
,
Weights

Weights are specified the usual way--analytic and frequency weights are supported; see [U] 23.13 Weighted estimation. Use of analytic weights implies that the $y_i$ have different variances. Therefore, model (1) may be rewritten as

$$y_i = f(\mathbf{x}_i, \boldsymbol\beta) + u_i, \qquad u_i \sim N(0, \sigma^2/w_i) \qquad (1a)$$

where the $w_i$ are (positive) weights, assumed known and normalized such that their sum equals the number of observations. The residual deviance for (1a) is

$$D(\hat{\boldsymbol\beta}) = \{1 + \ln(2\pi\hat\sigma^2)\}\, n - \sum_i \ln(w_i) \qquad (3a)$$

(compare with equation 3), where $n\hat\sigma^2$ is now the weighted residual sum of squares.

Defining and fitting a model equivalent to (2) when weights have been specified as in (1a) is not straightforward and has not been attempted. Thus, deviances using and not using the lnlsq() option may not be strictly comparable when analytic weights (other than 0 and 1) are used.
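As a sketch of weighted estimation with the same model as before (assuming an analytic-weight variable w exists in the data):

        . nl nexpgr y [aweight=w]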
Errors

nl will stop with error 196 if an error occurs in your nlfcn program, and it will report the error code raised by nlfcn.

nl is reasonably robust to the inability of nlfcn to calculate predicted values for certain parameter values. nl assumes that predicted values can be calculated at the initial values of the parameters. If this is not so, an error message is issued with return code 480. Thereafter, as nl changes the parameter values, it monitors nlfcn's returned predictions for unexpected missing values. If detected, nl backs up. That is, nl finds a linear combination of the previous, known-to-be-good parameter vector and the new, known-to-be-bad vector, a combination where the function can be evaluated, and continues its iterations from that point.

nl does require, however, that once a parameter vector is found where the predictions can be calculated, small changes to the parameter vector can be made in order to calculate numeric derivatives. If a boundary is encountered at this point, an error message is issued with return code 481.

When specifying lnlsq(), an attempt to take logarithms of $y_i - k$ when $y_i \le k$ results in an error message with return code 482.

If iterate() iterations are performed and estimates still have not converged, results are presented with a warning and the return code is set to 430.
General comments on fitting nonlinear models

In many cases, achieving convergence is problematic. For example, a unique maximum likelihood (minimum-RSS) solution may not exist. A large literature exists on different algorithms that have been used, on strategies for obtaining good initial parameter values, and on tricks for parameterizing the model to make its behavior as "linear-like" as possible. Selected references are Kennedy and Gentle (1980, ch. 10) for computational matters, and Ross (1990) and Ratkowsky (1983) for all three aspects. Much of Ross's considerable experience is enshrined in the computer package MLP (Ross 1987), an invaluable resource. Ratkowsky's book is particularly clear and approachable, with useful discussion on the meaning and practical implications of "intrinsic" and "parameter-effects" nonlinearity. An excellent general text, though (in places) not for the mathematically faint-hearted, is Gallant (1987). Also see Davidson and MacKinnon (1993, Chapters 2, 3, and 5).
"_
The success of nl will be enhanced if care is paid to the form of the model fitted, along the lines of Ratkowsky and Ross. For example, Ratkowsky (1983, 49-59) analyzes three possible 3-parameter "yield-density" models for plant growth:

$$E(y_i) = (\alpha + \beta x_i)^{-1/\theta}$$

$$E(y_i) = (\alpha + \beta x_i + \gamma x_i^2)^{-1}$$

$$E(y_i) = (\alpha + \beta x_i^\theta)^{-1}$$

All three models give similar fits. However, he shows that the second formulation is dramatically more linear-like than the other two and therefore has better convergence properties. In addition, the parameter estimates are virtually unbiased and normally distributed, and the asymptotic approximation to the standard errors, correlations, and confidence intervals is much more accurate than for the other models. Even within a given model, the way the parameters are expressed (e.g., $\theta$ or $e^\theta$) affects the degree of linear-like behavior.

Our advice is that even if you cannot get a particular model to converge, don't give up. Experiment with different ways of writing it or with slightly different alternative models that also fit well.
More on nlfcns

Note that the syntax for nl is

        nl fcn depvar [varlist] [...] [, ... fcn_options ]

The syntax for an nlfcn is

        nlfcn { varname | ? } [varlist] [, fcn_options ]

The varlist, if specified with nl, will be passed to nlfcn along with any options not intended for nl. Thus, it is possible to write nlfcns that are quite general.

When nlfcn is called with a ?, the varlist and fcn_options, if any, are still passed. In addition, e(depvar) contains the identity of the dependent variable; e(sample) contains the estimation sample according to the if exp and in range specified on the nl command line; and e(wtype) and e(wexp) contain the weight type and weight expression.

nlfcn is required to post the names of the parameters to $S_1 and to provide default initial values for all the parameters. In addition, it may post up to two titles in $S_2 and $S_3 that will be subsequently used to title the output. The e() returned results provide useful information for filling in titles and generating initial parameter estimates.

When nlfcn is called without a ?, it is required to calculate the predicted values conditional on the current values of the parameters. Note that nlfcn is not required to process if exp or in range. Restriction to the estimation sample will be handled by nl.
:
version
;
if
"'1""
Y. 0 == "?"
{
global global
S_l "BO BI" BO=I
global
BI=.
1
global
S_2
"negative-ex_.
_lobal exit
S_3
"'e(depvar)"
growth" = BO*(1-exp(-Bl*'2"))"
} replace end
"_1"=$BO* (l-exp(-$Bl*" 2" ))
•-,_u _ii
_- --
r_unnnear least squares
This versionline:would command
title the output
and
allow
the independent
variable
to be specified
on the nl
nl nexpgr y xval
1t
An even more sophisticated version of nlnexpgr might use e (depvar), to generate more reasonable starting values of BO and B1. nlinit
is intended nlinit
nlinit
for use by nlfcns.
Its syntax
"2 ". and if
e (sample)
is
# parameterJist
initializes each parameter nlinit 0 A B C nlini't: 1 D E
in parameter_list
to contain
#. For example,
Saved Results

nl saves in e():

Scalars
        e(N)          number of observations       e(r2_a)      adjusted R-squared
        e(k)          number of parameters         e(F)         F statistic
        e(mss)        model sum of squares         e(rmse)      root mean square error
        e(tss)        total sum of squares         e(converge)  0 if convergence failed;
        e(df_m)       model degrees of freedom                  otherwise 1
        e(rss)        residual sum of squares      e(df_t)      total degrees of freedom
        e(df_r)       residual degrees of freedom  e(dev)       residual deviance
        e(msm)        model mean square            e(lnlsq)     1 if specified; otherwise 0
        e(msr)        residual mean square         e(gm_2)      geometric mean (y-k)^2 if
        e(r2)         R-squared                                 lnlsq(); otherwise 1

Macros
        e(cmd)        nl                           e(function)  name of function
        e(depvar)     name of dependent variable   e(params)    names of parameters
        e(title)      title in estimation output   e(predict)   program used to implement
        e(title2)     secondary title in                        predict
                      estimation output

Matrices
        e(b)          coefficient vector           e(V)         variance-covariance matrix of
                                                                the estimators

Functions
        e(sample)     marks estimation sample

The final parameter estimates are available in the parameter macros defined by nlfcn. The standard errors of the parameters are available through _se[parameter]; see [U] 16.5 Accessing coefficients and standard errors.
Methods and Formulas

nl is implemented as an ado-file.
Acknowledgments

nl was written by Patrick Royston of the MRC Clinical Trials Unit, London. The original version of this routine was published in Royston (1992). Francesco Danuso's menu-driven nonlinear regression program (1991) provided the inspiration.
References

Atkinson, A. C. 1985. Plots, Transformations and Regression. Oxford: Oxford University Press.

Danuso, F. 1991. sg1: Nonlinear regression command. Stata Technical Bulletin 1: 17-19. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 96-98.

Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Gallant, A. R. 1987. Nonlinear Statistical Models. New York: John Wiley & Sons.

Goldstein, R. 1992. srd7: Adjusted summary statistics for logarithmic regressions. Stata Technical Bulletin 5: 17-21. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 178-183.

Kennedy, W. J., Jr., and J. E. Gentle. 1980. Statistical Computing. New York: Marcel Dekker.

Ratkowsky, D. A. 1983. Nonlinear Regression Modeling. New York: Marcel Dekker.

Ross, G. J. S. 1987. MLP User Manual, release 3.08. Oxford: Numerical Algorithms Group.

------. 1990. Nonlinear Estimation. New York: Springer-Verlag.

Royston, P. 1992. sg1.2: Nonlinear regression command. Stata Technical Bulletin 7: 11-18. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 112-120.

------. 1993. sg1.4: Standard nonlinear curve fits. Stata Technical Bulletin 11: 17. Reprinted in Stata Technical Bulletin Reprints, vol. 2, p. 121.

SAS Institute Inc. 1985. SAS User's Guide: Statistics, Version 5 Edition. Cary, NC.
Also See

Complementary:   [R] ml, [R] predict, [R] test, [R] xi

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title "_
3't ".1
nlogit
--
Maximum-likelihood
nested
legit estimation I
I
I
Syntax

        nlogit depvar (altsetvar1 = indepvars1) [ (altsetvar2 = indepvars2) ... ]
                (altsetvarB = indepvarsB) [weight] [if exp] [in range],
                group(varname) [ notree nolabel clogit level(#) nolog robust
                ivconstraints(string) constraints(numlist) dl maximize_options ]
where

depvar        is a dichotomous variable coded as 0 for not-selected alternatives and 1 for the selected alternative.

altsetvar1    is a categorical variable that identifies the top- or first-level set of alternatives--these alternatives must be mutually exclusive groups of the second-level alternatives.

indepvars1    are the attributes of the first-level alternatives--either of an alternative alone (absolute) or as the alternative is perceived by the chooser (perceived)--and possibly interactions of individual attributes with the first-level alternatives.

altsetvar2    is a categorical variable that identifies the second-level set of alternatives--these must be mutually exclusive groups of the third-level alternatives.

indepvars2    are the attributes of the second-level alternatives (absolute or perceived) and possibly interactions of individual attributes with the second-level alternatives.

altsetvarB    is a categorical variable that identifies the bottom, or final, set of alternatives.

indepvarsB    are the attributes of the bottom-level alternatives (absolute or perceived) and possibly interactions of individual attributes with the bottom-level alternatives.
        nlogitgen newvar = varname (branchlist) [, nolog ]

where

branchlist    is   branch [, branch [, branch ...]]
branch        is   [label:] outcome [| outcome [| outcome ...]]
outcome       is   value or value label

        nlogittree varlist [, nolabel ]
by ...: may be used with nlogit; see [R] by.

fweights and iweights are allowed, but are interpreted to apply to groups as a whole and not to individual observations; see [U] 14.1.6 weight.

nlogit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

        predict [type] newvarname [if exp] [in range] [, statistic ]

where statistic is

        pb        predicted probability of choosing bottom-level, or choice-set,
                  alternatives; each alternative identified by altsetvarB; the default
        p1        predicted probability of choosing first-level alternatives;
                  each alternative identified by altsetvar1
        p2        predicted probability of choosing second-level alternatives;
                  each alternative identified by altsetvar2
        ...
        p#        predicted probability of choosing #-level alternatives;
                  each alternative identified by altsetvar#
        xbb       linear prediction for the bottom-level alternatives
        xb1       linear prediction for the first-level alternatives
        xb2       linear prediction for the second-level alternatives
        ...
        xb#       linear prediction for the #-level alternatives
        condpb    Pr(each bottom alternative | alternative is available after all
                  earlier choices)
        condp1    Pr(each level 1 alternative) = p1
        condp2    Pr(each level 2 alternative | alternative is available after
                  level 1 decision)
        ...
        condp#    Pr(each level # alternative | alternative is available after all
                  previous stage decisions)
        ivb       inclusive value for the bottom-level alternatives
        iv2       inclusive value for the second-level alternatives
        ...
        iv#       inclusive value for the #-level alternatives

The inclusive value for the first-level alternatives is not used in the estimation of the model; therefore, it is not calculated.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Descripti6n nlogi_ estimates a nested logit model using full maximum likelihood. The model may contain one or m_re levels. Fdr a single-level model, nlogit estimates the same model as c!ogit: see IN] ciogit.i n!ogi_gen
generates a new categorical variable based on the specification of the branches. For
Instance,!
: I
. nlogitgen
is equi alent to
type
= restaurant(fast:
I I 2, family:
3 [ 4 I 5, fancy:
6 [ 71
_rr-
4_u
nJoglz_ Maximum-liKelihood nested Iogit estimation gen
i
type
= I
if
restaurant
== 1
] restaurant
replace
type
= 2 if restaurant
== 3
• replace
type
= 3 if restaurant
==
label • label
define value
Ib_type type
1 fast
== 2
I restaurant
== 4
6 I restaurant
== 7
2 family
I restaurant
== 5
3 fancy
ib_type
nlogittree displays the tree structure based on the varlist. Note that the bottom level should be specified first. For instance, • nlogittree
restaurant
type
Options group(varname) notree
is not optional; it specifies the identifier variable for the groups.
specifies that the tree structure of the nested logit model is not to be displayed.
nolabeI causes the numeric codes rather than the label values to be displayed in the tree structure of the nested logit model, clogit
specifies that the initial values obtained from clogit
are to be displayed.
leve 1 (#) specifies the confidence level, in percent, for confidence intervals. The default is level or as set by set level; see [U] 23.5 Specifying the width of confidence intervals. nolog
suppresses
(95)
the iteration log.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [u] 23.11 Obtaining robust variance estimates. ivconstraints(string) specifies the linear constraints of the inclusive value parameters. One can constrain inclusive value parameters to be equal to each other, equal to fixed values, etc. Inclusive value parameters are referred to by the corresponding level labels; for instance, ivconstraints (fast = family) or ivconstraints(fast=l). constraints (numtist) specifies the linear constraints to be applied during estimation. Constraints are defined using the constraint command and are numbered: see [R] constraint. The default is to perform unconstrained estimation. dl specifies that method dl is to be used in estimating the mt model instead of the default method rdul. rdul is faster than dl; however, in some cases rdul is not as stable as dl. If the model has four or higher levels, method dO is used instead. maximize_options control the maximization process; to specify any of the maximize options except for iteration log shows many "not concave" messages may want to use the difficult option to help it
see [R] maximize. iterate (0). and and is taking many converge in fewer
You will likely never need possibly difficult. If the iterations to converge, you steps.
Optionsfor predict Consider a nested logit model with 3 levels: Pr(ijk) pb, the default, calculates the probability pl, calculate the probability p2, calculates the probability
= Pr(k
of choosing bottom-level
of choosing first-level alternatives, of choosing second-level
xbb. calculates the linear prediction
) ij)Pr(j
for the bottom-level
[ i)Pr(i)
alternatives,
pb -- Pr(ijk).
pl = Pr(i).
alternatives, alternatives.
p2 -- Pr(ij)
= Pr(j
I i)Pr(i).
nlogit -- Maximum-likelihoodnested legit estimation
i !
xbI, c .lculates the gnear prediction for the first-level alternatives. xb2, c lculates the linear prediction for the second-level alternatives.
i
condpl condpb= Pr(klij). cor_Ip:condpl=
431
Pr(i).
¢ondp!,condp2= Pr(j li). ivb, c_dcutates the i_clusive, value for the bottom-level alternatives: ivb = in {_k where xbb is the linear prediction for the bottom-level alternatives. I
,,j exp(xbb)},
iv2, c_lculates the inclusive value for the second-level alternatives:
_ i
iv2 = ln{y_j _ exp(zb2 + rjivb)},, where zb2 is the linear prediction for the second-le_,el altmnatives, ivb is the inclusive value for the bottom-level alternatives, and r_ are the parameters for he inclusive Value.
Remarks i
nlo :it performs 'full maximum likelihood esumation of nested legit models. These are models of a decis on process th_atis made in levels and where the decisions in later levels are limited by those
!
made i_ earlier levelS. In particular, the decision in each level partitions the choice set into more and
i
more sl_ecific alternative sets or groupings of choices. Let"t look at an _xample and clarify somd terminology. The tree structure of a family's decision about Where to eat r/light look something like this:
i
1
I
Christophers
First the family decides whether to eat fast food. eat at a family restaurant, or eat at a fancy i
i
restaura _t. This first-level decision limits their second-level decision to the alternatives available within hos_n fast food. their second-level decision is between the _he seled_tedrestaurant type. If they have c Mamas tPizza and Freei_irds; if they haxe chosen a family restaurant, the second-level decision is betweenl Care Eccell,: Los Nortenos. and Wings 'N More. If they decide on a fancy restaurant, then toe second-level decision is between Mad Cows and Christopher's. To b_ clear, we will use the following terms to use to describe these models.
i i }
i
level orldecision lev61 is the level or stage at which a decision is made. First-level decisions are made! first, foltowed by second-level decisions, and so on. In the example above there are on_y lwo l_vels. In the first level a type of restaurant is chosen, fast food. family', or fancy', and in the seco d level a specific restaurant is chosen.
b°tt°mllvelistheleqelwherethefinaldecisi°nismade'spec c restaurant.: In our example, thisis when wechoose a a/ternat_'e xet is the ._et of all possible alternatives at any given decision level. {
._,:qT
.......
_" _ ,-ax--um-.Kellnooa nested Iogit estimation
,,F_:_; _:.
bottom alternative set is the set of all possible alternatives at the bottom level. This is often referred to as the choice set in the economics choice literature. In our example, the bottom alternative set is all seven of the specific restaurants. alternative is a specific alternative within an alternative set. In the first level of our example, "fast food" is an alternative. In the second or bottom level, "MadCows" is an alternative. Not all ahematives within an alternative set are available to someone making a choice at a specific stage, only those that are nested within all prior decisions. chosen alternative is the alternative from an alternative set that we observe someone having chosen. A one-level nested logit model is the same as a conditional logit model. The conditional logit models assume the independence of irrelevant alternatives (tlA). that the relative probabilities of various alternatives remain constant regardless are included in the model. When this assumption is violated, the idea behind a to _oup the alternatives into subgroups such that the (IIA) assumption is valid
multinomial logit and Basically this means of which alternatives nested logit model is within each group.
McFadden (1977, 1981) showed how this model can be derived from a rational choice framework. Amemiya (1985, Chapter 9) contains a very nice discussion of how this model can be derived under the assumption of utility maximization. For a two-level nested logit model, we index the first-level alternative as i and the bottom-level ahemative as j.respectively. Let X# and refer to the vectors of explanatory variables specific to categories (i,j) and (i), WeY/write
Prij = Prjl i Pri The conditional probability Prjl i will involve only the parameters/3: e _ 'x _ PrJri = _n e_'Xi_ We define the inclusive values for category (i) as Ii = ln
(X e_ _") _"
TL
then
Pr,: =
ea' Yi+ril, _m ea'Y'_ +r'*I"
Remarks are presented under the headings Datasetupand the m_estructure Testof the independenceof irrelevantalternatives(IIA) Mode1estimation Inclusivevalueparameters Obtainingpredictedvalues
) -
l
'i
nlogit --Maximum-likelihood nestedIogitestimation 433 ............................................................................
I
i
Data setU}pand the tree structure
nlog_tgenand n!ogittree are designedto the nested logit model.
help users specify and display the tree structure of
l> Exampte ! ;
Usin_ fictional dat_, we have data on 300 families and their choice of 7 local restaurants. The restaurar(ts are Freebi_ds, MamasPizza, CafeEccell, LosNortenos, WingsNmore, Cluistophers and
)
MadCm_Is._We want t_ explore the relationship of the decision about where to eat to the household income variable income in t000s of dollars),!the number of kids in the household (variable kids), _he ratin of the restaurant according to the local restaurant guide (variable rating 0 to 5), the average meal co per person '(variable cost), and the distance between the household and the restaurant (variable distance iri miles), incomeand kids are attributes of the family, rating is an attribute of the al ernative (the _restaurant) alone, and cost and distance are attributes of the alternative as perceivec bv the families--that is, each family has its own cost and distance for each restaurant.
[ i } I ) ! i
_se restaurant
Co_ains
clear
data from
restaurant.din 8
v_rs: 75,600
si e:
i_s : variable
i
i
[
i0 Sop (99.0_
of memory
2000
00:41
free)
storage
display
value
2,100 type %
format
label
variable
names
family id choices of restaurants
name
label
fam_ly_id restaurant
float float
_,9.0g _,12.0g
cos£ income
float
7,9.0g _/,9. Og
average meal cost per perso_ hollsehold income
kid_ ratlng
'float float
Y,9.0g Y,9.0g
number of kids in the household ratings in local restaurant
distance
float
_,9.0g
distance between restaurant
chosen
ifloat
_9.0g
0 no 1 yes
guide
Sor#ed
i i
i
I
by :
home
and
fami_y_id
l_st family_i4 family_id
restaurant restaurant
IJ
1
Freebirds
3 2 j_
1 I
CafeEccell MamasPizza
4J 5
11
6J 7
i
chosen
kids rating
chosen
distance kids
in 1/21 rating
distance
1
l
0
I.245553
[) 0
1 I
2 1
4.21293 2.82493
Los.ortenos Wing_Wmore
0_
11
23
_hrisZophers
0
1
4
10. 19829
4.167634 6.330531
1
MadCows
0
1
5
5. 601388
8 9 ,; ,! 1O. i 11 .i
2 2 2
Freebirds MamasPizza CafeEcceil
0 0
3 3
0 1
4.162657 2. 865081
2
LosNortenos
0 1
3 3
2 3
5. 337799 4. 282864
t2 "i 13.,
2 2
WingsNmore ¢hristophers
O, 0
3 3
2 4
8.133914 8.664631
.,
14._
2
0
3
5
9.119597
!
15. I
3
Freebirds
MadCows
1
3
0
2.112586
i
17.1 16.' 18._ )
3 3 3
Cafegccell MamasPizza LosNortenos
0 0 0
3 33
2 t3
6. 978715 2.215329 5. 117877
434 i:t
nlogit -- Maximum-likelihood nested Iogit estimation 19. 20.
3 3
21.
3
WingsNmore Christophers MadCows
0 0
3 3
2 4
5.312941 9. 551273
0
3
5
5. 539806
Suppose that for each family, the decision about where to eat is a decision of two steps. First, i
the family decides whether to eat fast food, eat at a family restaurant, or eat at a fancy restaurant. This first-level decision limits their second-level decision to the alternatives available within the selected restaurant type. If they have chosen fast food, their second-level decision is between the MamasPizza and Freebirds" if they have chosen a family restaurant then the second-level decision is between CafeEccell, LosNortenos, and WingsNmore; if they have chosen a fancy restaurant then the second-level decision is between Christophers and MadCows. To run nlogit, we need to generate a categorical variable that identifies the first-level set of alternatives, fast food, family restaurants, or fancy restaurants. This can be easily accomplished by using nlogitgen. . nlogitgen type = restaurant(fast: ] WingsNmore, fmlcy: Ckristophers new
variable
Ib_type
type
is generated
with
Freebirds I MadCows)
I MamasPizza,
family:
CafeEccell
I LosNortenos
3 groups
: 1 fast 2 family 3 fancy
nlogittree tree
structure
restaurant specified
type for
the nested
logit
model
top-->bottom type fast
restaurant Freebirds MamasPizza
family
CafeEccell LosNorte-s WingsNmore
fancy
Christop~s MadCows
The new categorical variable is type, which takes value l (fast) if restaurant is Freebirds or MamasPizza, value 2 (family) if restaurant is CafeEccell, LosNortenos or WingsNmore, and value 3 (fancy) otherwise,
nlogittree displays the tree structure.
<1
•3 Technical Note We could also use values instead of value labels of restaurant in nlogitgen. The value labels for the newvar, type are optional, and the default value labels for type are typel, type2, and type3. The vertical bar is also optional.
(Continued
on next page)
nlogit-- Maximum-likelihoodnestedlogitestimation r_
.
435
...................................................
:logitgen type = restauraut(1 2, 3 4 5, 6 7) nm' variable type is generated with 3 groups lb.type: i typel
2 type2 3 types
, i
[
)
) !
[ } ! [
tr _e structure Specified for the nested :logittree restaurant type
logit
model
top--> ottoz
type
restaurant
type I
Preebirds NamasPizza
type2
C_feEccell Lk)sNorte~s WtingsNmore
type3
Christ op-s MadCows
_]
Test,oftNe indeperidenceof irrelevant alternatives (IIA) i The I:roperty of th_ multinomial logit model and conditional ]ogit model where odds ratios are independent of the other alternatives is referred to as the independence of irrelevant alternatives (IIA). Hausraan and McPadden (1984) suggest that if a subset of the choice set truly is irrelevant with respect t the other alternatives, omitting it frbm the model will not lead to inconsistent estimate_. Ttierefor Hausman's:_1978) specification test can be used to test for IIA.
'3 ExampleI Supp(,se we want to run ctogit on our choice of restaurants dataset. We also want to test IIA between the alternatives of family restaurants and the alternatives of fast food places and fancy restaurants. To do so, we need to use Stata's hausman command: see [R] hausman. We fi "st run the e_timation on the full bottom alternative set: save the results using hausman, save; ard then run th_ estimation on the bottom alternative set, excluding the alternatives of family restaurarts. We then mn the hausman test. w_th the less option indicating the order in which our models ,_ere fit. 1
en incFast _en
incFancy
en kidFast
(type
== I) *
income
_ (type == 3) * income _ (type
== I) * kids
en kidFancy = (type == 3) * kids logit chose_ cost rating distance Iteration
O:
log
It(ration It_ration
2: I:
_og likelihood likelihood _og
It( ration It_ ration
3: 4:
_og likelihood i _og likelihood
It_ ration
5:
Col,ditional t Lo
_og likelihood
(fiied-effects) _
likelihood
i
likelihood
_ -488.90834
i incFast "_
incFancy
kidFast
kidFancy,
group(family_id)
= -564._7856 = -489.$5097 -496.41546 = -488. _1205 -488.90854 -488.g0834 logistic
regression
Number of obs LR chi2(7)
: =
2100 189.73
Pseudo R2 Prob > chi2
= =
0.1625 0.0000
_
....................
,,,vuu
chosen
Coef.
cost rating
IOgl!
Std. Err.
z
estimation
P>IzI
[95X Conf. Interval]
-.1367799 .3066622
.0358479 .1418291
-3.82 2.16
0.000 0.031
-.2070404 .0286823
-.0665193 .584642
t
-.1977505
.0471653
-4.19
0.000
-.2901927
-.1053082
incFancy incFast I kid_Past kidFancy__[
.0407053 -.0390183 -.2398757 -.3893862
.0080405 .0094018 .1063674 .1143797
5.06 -4.15 -2.26 -3.40
0.000 0.000 0.024 0.001
.0249462 -.0574455 -.448352 -.6135662
.0564644 -.0205911 -.0313994 -.1652061
distance i
-u3zeO
• hausman, save clogit chosen cost ratine distance incFast incFancy kidFast kidFancy if type group(family_id)
l= 2,
note: 222 groups (888 obs) dropped due to all positive or all negative outcomes. Iteration Iteration Iteration Iteration Iteration Iteration
O: 1: 2: 3: 4: 5:
Conditional
log log log log log log
likelihood likelihood likelihood likelihood likelihood likelihood
= = = = = =
-104.85538 -88.077817 -86.094611 -85.956423 -85.955324 -85.955324
(fixed-effects) logistic regression
Log likelihood = -85.955324
chosen
Coef.
cost rating distance
-.0616621 .1659001 -.244396 -.0737506 .4105386
incFastI kidFast __
Std. Err.
Number of obs LR chi2(7) Prob > chi2 Pseudo R2
= = = =
312 44,35 0.0000 0.2051
z
P>JzJ
[95X Conf. Interval]
.067852 .2832041 .0995056
-0.91 0.59 -2.46
0.363 0.558 0,014
-.1946496 -.3891698 -.4394234
.0713254 .72097 -.0493687
.0177444 .2137051
-4.16 1.92
0.000 0.055
-.108529 -.0083157
-.0389721.8293928
• hausman, less Coefficients--j'
cost d kidFast_
Test:
Ho:
i
(b) Current
(B) Prior
-.0616621
-.1367799
-.244396 -.0737506 .4105386
-.1977505 -.0390183 -.2398757
(b-B) sqrt (diag(V_b-V B)) Difference S.E. .0751178 -.0466456 -.0347323 .6504143
.0576092 .0876173 .015049 .1853533
b = less efficient estimates obtained from clogit B = fully efficient estimates obtained previously from clogit difference in coefficients not systematic chi2(5)
= (b-B)'[(V_b-V_B)-(_I)](b_B) = 10.70 Prob>chi2
=
0.0577
The small p-value indicates that the IIA assumption between the alternatives of family restaurants and the bealternatives should utilized. of other restaurants is weak, hinting that the more complex nested logit model
t
• _
_
...................................................................
_
;
nlogit --
/laximum-likelihoodnested Iogit estimation
437
Model! ea timation Exampt¢ In tl_is example, _e want to examine how alternative-specific attributes apply to the bottom altemati,i_.eset (all se_.,en of the specific restaurants), and how family-specific attributes apply to the altema@e set at the Ifirstdecision level (all ttiree types of restaurants). Inlogitchoseh (restaurant = cost ra_ing distance ) (type = incFast incFancy > kidFast kidF_ncy), group(family_id)Inolog tzee structure specified for the nestbd logit model top--_bottom type fast
_estaurant !Freebirds
_asPizza family
fancy
_afeR.ccell _osNort;e-s WingsNmore _ristop~s MadCows
Ne _ted logit Le'rels = De')endentvariable = Lo likelihood =
2 chosen -483,9584
Number of obs LR chi2(10) Prob > chi2
= = =
2100 199.6293 0,0000
i' Coef.
z
P>Jz[
[95X Conf. Interval]
re:_taurant cost
-,0944352
-2.78
O.006
-.1611131
-.0277572
rating distance
.1793759 -.1745797
.126895 1.41 .0433352 , -4,03
O,157 0.000
-,0693338 -,2595152
,4280855 -.0896443
.0116242
-2.47
0.013
I incFancy
-.0287502 . 0458373
5.14
O. 000
-.0515332 0283722 .
-, 0059672 0633024 .
I kidFancy , kidFast
-.0704164 -,3626381
.1394359 ' .1171277
-0.51 -3.10
0.614 O.002
-.3437058 -.5922041
-.1530721 .2028729
2.45 1.49 3.52
0.014 0 135 0.000
1,143415 -.5366608 .6494642
10.2881 3.979105 2.283711
t_e i l
Std. Err.
incFast
.03402
.0089109
(Ii params) /fast /family i /fancy
5,715758 1.721222 1.466588
2,332871 1 152002 .4169075
I
i
LR _est of homo$cedastlclty (iv = 1): 1 •
I
In thi_ model.
"
[
' Ji
chi2(3)=
9.90
Prob
> chi2 = 0.0194
:
Pr(restdurant I tyPe)= !
_
I
Pr(tvpe)!-
Pr(_cost cost + 3rati_ rating + 3dist distance) i
Pr(a, iva_ incFast +
_ Tfast
IVfast
+ 7family
aiFancy
ineFancy +
!V 'I family
+ Tfancy
CtkFast
"
kidFast +
O kFast
kidFast
IVfancy)
T he [J_ test against t_e ' constant-only model iMicates_ that the model is significant (p-value = 0.000). and t.466588. The inclul}ive value, part,meters for Iast, famil 'iy,and import are 5.......... 715758.1 -7o_o-_o
-..... _
,,,..,_,.-- m.x,,,,u.,-..e.noo,
nesteaIOglt estimation
respectively. The LRtest reported at the bottom of the table is a test for the nesting (heteroskedasticity) against the null hypothesis of homoskedasticity. Computationally, it is the comparison of the likelihood of a nonnested clogit model against the nested legit model likelihood. The X2 value of 9.90 clearly supports the use of the nested legit model with these data, <1
Inclusivevalue parameters nlogit allows the user to apply linear constraints of the inclusive value parameters. One can constrain inclusive value parameters to, say, equal to each other, or specify fixed values rather than allowing these parameters to be freely estimated. > Example Continuing with the above example, we fix all the three inclusive value parameters to be 1 to recover the model estimated by clogit. . nlogit
chosen
> kidFast User
defined
I000:
(restaurant
kidFancy),
[fast]_cons
distance
nolog
) (type
ivc(fast
=I,
= incFast
family=l,
incFancy fancy=l)
notree
= 1
[family]_eons
998:
[fancy]_cons
= 1 = 1
legit =
Dependent Log
rating
constraint(s):
999:
Nested Levels
= cost
group(family_id)
variable
2
=
likelihood
Number
chosen
LR
= -488.90834
Coef.
Prob
Std.
Err.
z
of obs
=
chi2(7) > chi2
P>lzl
2100
=
189.T294
=
0.0000
[95_ Conf.
Interval]
restaurant -.1367799
.0358479
-3.82
0.000
-,2070404
-.0665193
rating distance
.3066622 -.1977505
.1418291 .047i653
2.16 -4.19
0.031 0,000
.0286823 -.2901927
.584642 -.I053082
incFast
cost
type -.0390183
.0094018
-4.15
0.000
-.0574455
-.0205911
incFancy kidFast
.0407053 -.2398757
.0080405 ,1063674
5.06 -2.26
0.000 0.024
.0249462 -.448352
.0564644 -.0313994
kidFancy
-.3893862
.I143797
-3.40
0.001
-.6135662
-.t652061
(IV params) type 1
/fast /family
1
/fancy
I
LR test
clogit
of homoscedasticity
chosen
cost
rating
(iv = I):
distance
chi2(O)=
incFast
> group(family_id) Iteration Iteration
O: 1:
log likelihood io g likelihood
= -564.57856 = -496.41546
Iteration
2:
log
likelihood
= -489.35097
Iteration
B:
log
likelihood
= -488.91205
0.00
incFancy
Prob
kidFast
> chi2
kidFancy,
=
i
•
! I
_
_ Itezation
ii
4:
nlogit-- Mlaximum-likelihood Iogitestimation 439 _ nested .......... i....
l_g
likelihood
Number of obs LR chi2(7)
= =
2100 189,73
Prob
=
0.0000
Log likelihood
Pseudo
=
0.1625
[95Y, Conf.
Interval]
= -488.90834
,
l
i
Coef.
5
Std.
Err.
z
> chi2
P>Izl
1{2
cost
, .1367799
,0358479
-3.82
O.000
-. 2070404
-.0665193
r rating
i "3066622
1418291
2.16
0.031
.0286823
.584642
distance , incFast
_- 1977505 _ .0390183
.0471653 .0094018
-4.19 -4,15
0.000 0.000
-.2901927 -.0574455
-. 1053082 - :0205911
5.06
O. 000
.0249462
.0564644
-2,26 -3.40
O.024 O.001
-.448352 -• 6135662
-. 0313994 -. 1652061
lincFancy
. 0407053
i kidFast !kidFancy
.0080405
_. 2398757 _. 3893862
i
= -488.90834
Itez ation 5: (fixed-effects) lag. likelihoodlogistic = -488. re_ression 90834 Concitional ! } i _
chosen
i zl
.1063674 •1143797
'
i i
i
Obtainingredicted!values predictmay be use_lafter nlogitto obtain the predicted values of the probabilities, the conditional
!
probabili@s, the linear predictions, and the inclusive values for each level of the nested logit model Predicted _robabilities _or nlogitmust be inte_reted carefully. Probabilities are estimated for each group as _whole and dot for indi'_idual observations. ?
Example i 1
Contim _ingwith our Nodel with no constraintsl we can predict pb = Pr(restaurant); pl = Pr(type); condpb = Pr(restaura_t I type); xbb, the line_ prediction for the bottom-level altemativesi xbl, the linear ?rediction fo_ the first-level alternatives; and the inclusive values ivb for the bottom,level alternative _. • q_i nlogit
i
i
l
i
chosen
(restaurant
k±dFancy), group [family_id) . pzedict pb (opt ion
pb
assum,,,d;
distance
) (type
nolog
= incFast
incFancy
kidFast
i
Pr (mode))
. pxedict
p1,
• pzodict
condpb
• predict
xbb,
x!>b
. predict
xbl,
xl_)l !
pli condpb
• list predict id chosenlpb ivb, i'rb
i
pl condpb
in 1/14
pb .0831245
pl ; 1534534
condpb .5416919
.070329 ,2763391 .284375
11534534 ', 7266538 _,7266538
.4583081 .3802899 .3913486
0
.1659397
! 7266538
.2283615
0
.03992 t5
11198928
.3329766
I 2
0 0
.0799713 . Ol i76
_ 1198928 10286579
.6670234 •4103599
2 2
0 0
• 0168978 .2942401
i0286579 _7521651
. 5896401 .3911909
t ._
id 1
2. ! 3.1 4.1
1 1 1
_
0 0 0
5.:
1
i
6. i
1
7, i 8. 9 105
= cost _ating
chosen 1
i
j7
11. 12. 13. 14.
iF
,
2 2 2 2
.........
1 0 0 0
.2975767 .1603483 .1277234 ..vv_vv .0914536
.7521651 .3956268 -7521651 .2i31824 Iw_mt .219177 _OtllllQ||_ll .582741 .219177 .417259
. list id chosen xbb xbl ivb in 1/14 id chosen xbb xbl 1. 1 1 -.731619 -1.191674 2. 1 0 -.8987747 -1.191674 3. 1 0 -1.149417 0 4. 1 0 -1.120752 0 5. 1 0 -1.659421 0 6. 1 0 -3.514237 1.425016 7. 1 0 -2.819484 1.425016 8. 2 0 -1.22427 -1.878761 9. 2 0 -.8617923 -1.878761 10. 2 0 -1.239346 0 11. 2 1 -1.22807 0 12, 2 0 -1.846394 0 13. 2 0 -2.804756 1.570648 14. 2 0 -3.138791 1.570648
i
ivb -.1185611 -.1185611 -.1825957 -.1825957 -.1825957 -2.414554 -2.414554 -.3335493 -.3335493 -.3007865 -.3007865 -,3007865 -2.264743 -2.264743
Saved Results nlogitsaves in e O: Scalars e(N) e (k_eq)
number number
of observations of equations
e(tevels) e (re)
depth of the model return code
e(N_g)
number
of groups
e(chi2)
x2
e(df._m)
model
degrees
of freedom
e(df...me)
model
degrees
of freedom
e(ll) e(ll_O)
log likelihood log likelihood,
constant-only
log likelihood,
clogit
e(ll_c) Macros
for clogit model
model
e(chi2_c)
X 2 for comparison
e(p_c)
p-value
for comparison
test
e(p) e(ic)
p-value numoer
for X 2 test of iterations
e(rank)
rank of e(V)
e (cmd)
nlogit
e (vcetype)
covariance
e (level#)
altsetvar#
e (user)
]ikelihood-evaluator
e(depvar)
name of dependent
e(opt)
type of optimization
e(title)
title in estimation
output
e(chi2type)
LK: type of model
e(group) e(wtype)
name of group() weight type
variable
e(predict) e(cnslist)
program used to implement constraint numbers
e(iv._names)
parameters
e(V)
variance-covariance estimators
variable
e (wexp) Matrices
weight
e (b) e(ilog) Matrices
coefficient vector iteration log (up to 20 iterations)
e (sample)
marks
expression
estimation
sample
estimation
test
method program
X 2 test
for inclusive
predict
values
matrix
of the
nlogit -- Maximum-likelihoodnested togit estimation
441
Methods and For.mlas
!
provide ltroductions t the nested logit model nlogit is implem,,'nted as an ado-file. Greene (2000, 865-871) and Maddala (1983, 67-70) We _ 11present the! methods and formulas for a three-level nested logit model. The extension of this mo& to cases m_olvmg more levels of a tree is apparent, but is more complicated.
!
Using !he same not_tion as Greene (2000), we index the first-level alternative as i, the second-level
! !
ahemativ_ as j, and tte bottom-level alternative as k. Let Xijk, }_j and Zi refer to the vectors ot explanato_; variables _ecific to categories (i,j, k), (i,j), and (i), respectively• We write
i
:
--Prkli j Prjl i Pr_
The cond fional probability Prkto will involve only the parameters/3: eff Xij_ Prklij = Y'_ne'°'X_'_ We define the inclusiw values for category (i,d) as
1"1
and
easyij +ri5Iij PrJli = }-_,mea'V"_+ri'_h'_ Define in(lusive values! for category (i) as
m
/
then
e'f'
Zi-b_i
Ji
Pri = -= El eYrZt+rlJl If we r_strict all the
I
form:
l
where
i
Prijk
i
_ij
and 6i to be 1, we then recover the conditional logit model of the following
eVijk
= fl'X, k + c,'Y j+
,_ il i_ '
,,,,,_ mogrzm Maxlmum-iiKellllOOOnested logit estimation There are two ways to estimate the nested logit model: sequential estimation and the full information maximum likelihood estimation, nlogit estimates the model using the full information maximum likelihood method. If g = 1, 2,..., G denotes the groups, and Pr_j k is the probability of category (i, j, k) being a positive outcome in group 9, the log likelihood of the nested logit model is In L = E
ln(Pr;jk)
9
=
In Pr_lij + InPr_i
+ tnPrf
/
)
References Amemiya.
T. 1985. Advanced
Econometrics.
Greene. W. H. 2000. Econometric
Analysis.
Cambridge,
Hausman.
J. 1978. Specification
tests in econometrics.
Hausman,
J. and D. McFadden.
1984. Specification
Maddala. G. S. 1983. Limited-dependent Press.
McFadden, D. EconomeNc
1981. Econometric models Applications, pp, 198-272.
Saddle
University
46: 125t-I27t.
tests in econometrics.
for analyzing
Press°
River, NJ: Prentice-HalL
Economerrica
and Qualitative
McFadden, D. 1977. Quantitative methods Foundation Discussion Paper no. 474.
MA: Harvard
4th ed. Upper
Econometrica
Variables
in Econometrics.
behavior
of individuals:
of probabilistic choice. In Smacturat Cambridge, MA: MIT Press.
52: 1219-1240.
Cambridge: some recent Analysis
Cambridge developments. of
Also See Complementary:
[R] lincom, [R] lrtest, [R] predict, [R] test, [R] testnl, [R] xi
Related:
[R] elogit, [R] logistic, [R] logit, JR] mlogit
Background:
[u] 16.5 Accessing coefficients and standard errors. [U] 23 Estimation and post-estimation commands, [U] 23.11 Obtaining robust variance estimates, [R] maximize
Discrete
University CoMes Data
with
[ ie !
notes
i
i
Place
in data
Syntax vama,ne] notes
t_xt : !
_otes
notes drop evarlisf [in #[/#]] where eva list is a varl:_'t but may also contain _theword _dta and # is a number or the letter 1. If text incl ides the letters TS surrounded by blanks, the TS is removed and a time stamp is substituted in its p ace.
Descripti(,n notes
attaches note: to the dataset in memory. These notes become a part of the dataset and are
attached generically to :he dataset or specifically to a variable within the dataset.
i
Remarksi saved when
the dataset is saved and retrieved When the dataset is used: see [R] save, notes can be
j [
A not_ is nothing formal; it is merely a string of text--probably words in your native language Treminding you to do something, cautioning you against something, or anything else you might] feel like jotti lg down. People who work with real data invariably end up with paper notes plastered ground their tlerminal saying things like "Send the new sales data to Bob" or "Check the
I
income salary95; I don't believe if' or "The gender was significant!" would be betterv_iable jf theseinnotesi were attached to the dataset. Attached to dummy the terminal, they tend toItfall off
l
and get lost. Addin_ a note to y ur dataset requires typing note or notes (they are synonyms), a colon (:L and whatever _ou wan_ to remember. The note is added to the dataset currently in memory.
4
. n_te:
I
i
Send co_y to Bob once verified.
nite s
i
You can +play your n_tes by typing II
notes
(or note)
by itself.
!
Send copy ,_oBob once verified.
!
Once youi resave your _ata, you can replay the note in the future, too. You add more notes just as
i
you did tl_e first:
[
. nSte:
Ii 2i
i
Mary war_ts a copy, tOO.
Send copy to Bob once verified. Mary ,,rants a copy, too.
443
You can place time stamps on your notes by placing the word TS (in capitals) in the text of your note: • note: TS merged • notes
updates
from
JJ_F
_dta: I. 2. 3.
Send copy to Bob once verified. Mary wants a copy, too. 19 Jul 1000 15:38 merged updates
from JJ&F
The notes we have added so far are attached to the dataset generically, which is why Stats prefixes them with _dta when it lists them. You can attach notes to variables: • note mpg: is the 44 a mistake?
Ask Bob.
note mpg: what about the two missing values7 • notes _dta: i. 2. 3. mpg: i. 2.
Send copy to Bob once verified. Mary wants a copy, too. 19 Jul 2000 15:38 merged updates from JJ_F is the 44 a mistake? Ask Bob. what about the two missing values?
Up to 9,999 generic notes can be attached to _dta and another 9,999 notes can be attached to each variable.
Selectively listing notes notes by itself lists all the notes. In full syntax, notes is equivalent to typing notes _all in 1/1. Here are some variations: notes notes notes notes notes notes notes
_dta mpg _dta mpg _dta in 3 _dta in 3/5 mpg in 3/5 _dta in 3/1
list list list list list list list
all generic notes all notes for variable mpg all generic notes and mpg notes generic note 3 generic notes 3 through 5 mpg notes 3 through 5 generic notes 3 through last
Deletingnotes notes drop works much like listing notes except that typing notes all notes; type notes drop _a11. Some variations: notes notes notes notes notes
drop drop drop drop drop
_dta _dta in 3 ._dta in 3/5 _dta in 3/i mpg in 4
delete delete delete delete delete
drop by itself does not delete
all generic notes generic note 3 generic notes 3 through generic notes 3 through mpg note a
5 last
"
_
T
.......
!
............................
._ .......................................................
_
-_ .i ¸
i
_:
notes -- Place notes in data
" 445
Warningsi 1. Notes _re stored wit_ the data and, as with _her updates you make to the data, the _additions and deletions are not pei_nanent until you save the data; see JR] save, i i
I
2. The m_ximum lengt_ of a single note is 1,000 characters with Small Stata and 67,784 characters
+
with I ercooled Stala.
Methods it nd Forrrtulas ! '
noteaiis
implemen_d
as an ado-file.
1
i
References Gleason,
J, R. I998.
in Stata
Technical
i dm571
A notes
Butlqtin
editor
Reprints,
vol.
for Window 8, pp.
i
Also See
i
i
Complenenta_v:
[_] describe, [R] save
!
Related:
_] codebook
i
Backgrou nd:
L_]15,8 Characteristics
i
and Macintosh.
10_13.
1
i
J
Stata
Technical
Bulletin
43: 6-9.
Reprinted
f"f .."
! !
Title I nptrend, , -- Testfor trend across,°rderedgroups ,,,
I
Syntax nptrend
varname [if exp] [in range], by(groupvar) [ nodetail s_core(scorevar)]
Description nptrend
performs a nonparametric test for trend across ordered groups.
Options by(groupvar) is not optional; it specifies the group on which the data is to be ordered. nodetail
suppresses the listing of group rank sums.
score (scorevar) defines scores for groups. When not specified, the values of groupvar are used for the scores.
Remarks nptrend performs the nonparametric test for trend across ordered groups developed by Cuzick (1985), which is an extension of the Wilcoxon rank-sum test (rar,_ksum:see [R]signrank). A correction for ties is incorporated into the test. nptrend is a useful adjunct to the Kruskal-Wallis test; see [R] kwallis.
In addition to nptrend, for nongrouped data the signtest and spearman commands can be useful: see [R] signrank and [R] spearman. The Cox and Stuart test, for instance, applies the sign test to differences between equally spaced observations of varname. The Daniels test calculates Spearman's rank correlation of varname with a time index• Under appropriate conditions, the Daniels test is more powerful than the Cox and Stuart test. See Conover (1999) for a discussion of these tests and their asymptotic relative efficiency. > Example The following data (Altman 1991, 217) show ocular exposure to u]traviolet radiation for 32 pairs of sunglasses classified into 3 groups according to the amount of visible light transmitted. Group
Transmission of visible light
I 2
< 25% 25 to 35%
3
> 35%
Ocular exposure to ultraviolet radiation 1.4 0.9 2.6 0.8
1.4 1.0 2.8 1.7
1.4 1.1 2.8 1.7
Entering these data into Stata, we have 446
1.6 1.1 3.2 1.7
2.3 t.2 3.5 3.4
2.3 1.2 1.5 1.9 2.2 2.6 2.6 4.3 5.t 7.1 8.9 13.5
I
|
i _
V
......................................
_
............... i¸ 4
nptrend!--Test for trend across ordered groups
|
447
, li,t _xposmte 1.4
2.
1
1.4
3._
i
1.4
1
2.3
1
2.3
(o 7; ut omitted) 2 31 "i 3
s2.1
i
.9 8,9
s
ls.s
]
We use nt_trend to tes for a trend of (increasing) exposure across the 3 groups by typing . nl_rend exposure, by(group)
l
group 1
2z
=
sum of ranks 76
score t
obs 6
3
8
162
..522
18
290
3 i
!
i > Izl = i,13 Pr?b When the l_rou ps are g{iven any equally saced scores (such as -1, O, 1), we will obtain the same p , answer as !above. To illustrate the effect of changing scores, an analysis of these data with scores 1,
i
2, and 5 (_dmittedh' no! very sensible in this c_se) produces
ii
[
geb mysc = con_(group==3,5,group) nl_rend exposul_e,by(group) score(mys_)
I
group
s4ore
1 2 3 z
i _i
2 5 1
=
.46
Pr_b> Izl :
_.14
obs
sum of ranks
18 8 6
290 i62 76
"
This example suggests ihat the analysis is not all that sensitive to the scores chosen.
q
! i 3 Technical _lote
_
! I
The grc_uping variabt_ may be either a string v_able or a numeric variable. If it is a string variable and no scdre variable id specified, the natural nfimbers 1, 2, 3, .. are assigned to the goups in the son order !of the string!variable. This may not always be what you expect. For example, the sort raer olttle strings one, two, three _s one, three, two.
l
a
]
i
4 group 1
1.
i
SavedReSults nptrer@ii saves in r ): _
i
_calars r(N) r(p)
nuNber of observitions
r(z)
z statistic
two,sided p-value
r(T)
test statistic
i
-
- _
[,Ii!
..-
,,_.._,,,.
--
,_ot ,u, .u.u
across oruere(! groups
Methods and Formulas nptrend
is implemented as an ado-file.
nptrend is based on a method in Cuzick (1985). The following description of the statistic is from Altman (1991, 215-217). We have k groups of sample sizes ni (i = 1,..., k). The groups are given scores, li, which reflect their ordering, such as 1, 2, and 3. The scores do not have to be equally spaced, but they usually are. The total set of N = _ n_ observations are ranked from 1 to N and the sums of the ranks in each group, R/, are obtained. L, the weighted sum of all the group scores, is k L = E lini i=1
The statistic T is calculated as k T = E liRi i-=1
Under the null hypothesis, the expected value of T is E(T) = .5(N + l)L. and its standard error is
se--'(T)
=
I
(
n + 1 --_ N
k i=I
li2ni -- L 2
)
\ /
so that the test statistic, z, is given by z = { T - E(T) }/se(T), which has an approximately standard Normal distribution when the null hypothesis of no trend is true. The correction for ties affects the standard error of T. Let 2_"be the number of unique values of the variable being tested (N _
a--
The corrected standard error of T is _(T)
N(X - 1) = v/1 - a se(T).
Acknowledgments nptrendwas written by K_ A. Stepniewska and D. G. Altman (t992) of the Imperial Cancer Research Fund, London.
References Altman, D. G. 1991. Practical Statistics for MedicaJ Research. London: Chapman & Hall, Conover, W. 3. 1999. Practical Nonparametric Statistics_ 3d ed. New York: John Wiley & Sons. Cuzick, J. 1985. A Wilcoxon-type test for trend. Statistics in Medicine 4: 87-90. Sasieni, E 1996. snpl2: Stratified test for trend across ordered groups. Stata Technical Bulletin 33: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 196-201). 7"
i
i i
i
1
i Ii I
I +
i_
i
i
_ nptrend:+- Test for trend across o_ groups 449 Sasieni,P., L A. Stepniew_&a,and D. G. Altman+1996_.snpll: Test for trend across orderedgroups revisited. Stata Technic_ Bulletin 32: 7-29. Reprintedin Stata TechnicalBulletin Reprints,vol. 6, pp. 193-196. Stepniewsk K.A. and D. 3. Altman.1992.snp4:Nonparametric test for trend across orderedgroupS.Stata Technical i Bulletinl9 21-22. Reitr_ntedin StataTechnicalBultetinReprints,vol. 2. p. 169.
!
i
AlsoSee Related:
[R] epitah
[R] kwallis, [R] signrahk, [R] spearman, IN] symmetry
+ i i
i
I i
i
+ i t i
I I i
J
i
i
i
!
l
+
I
i i
i+ +
i i
_
' !',
I ILI_
I
obs
--
Increase
_e I I Inu]_ber
Ul
°f
°bse]_ati°ns
in
dataset
I
I
I
I
i
Syntax set obs #
Description sel; obs changes the number of observations in the current dataset. # must be at least as large as the current number of observations. If there are variables in memory, the values of all new observations are set to missing.
Remarks Example set obs can be useful for concocting artificial datasets. For instance, if you wanted to graph the function 9 = z e over the range 1 to 100, you could .
drop
_all
• set
obs
obs was
I00
O, now
100
generate
x = _n
generate
y = x-2
graph
y x
(graph notshown)
q
> Example If in a program you want to add an ex_a data point, local set
npl obs
= _N + 1 "npl"
Also See Related:
[R] describe
450
ologit
Maximum-likelihood
_
ordered logit stimation
i
il,
i
li
. i
i
iii
i1!ii I
I , ill
1
l !
Syntwx ologi_
depvar
[varlist]
[weight] [if
exp] [in range][
table robust
cl___uSter (varnam.z) s_£ore (newvarlist) !eve1 (#) offset (varname)
i I
i by .,.
i
maximize_options
]
_
i
: m_y be used with ologit; see [R] by,
I I
fweights, exghts, and l_we:tghts are allowed: see [U} /4,1.6 weight ologit _ha+s the features bf all estimation commands: see [U] 23 Estimation
i
ologit ma3i be used with i_ to perform stepwise estimation: see [R] sw.
i
and post-estimation commands,
i
l l
Syntax for tpredict outCome(outcome) '_ i
nooffset
]J
"
I !
Note that wilh the p option,!vou specify ei{her one or k new _afiables depending upon wbether tbe outcome() option is also s_cified (where/t" is the number of categories of dem'ar). With xb and strip, one new variable is specified.
i
These statistics are availabk the estimation sample.
i
t
both in and out of sample! type predict
...
if
e(sample)
...
if wanted only for
Descliptin
I
ologi_ estimates ortdered ]ogit models of ordinal variable depvar on the independent variables
! !
va,llst. Th_ actual valuds taken on by the dependent variable are irrelevant except that larger values are assumdd to correspohd to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata: 20 outcombs in Small _tata.
-_
!
See [R] logistic for _ list of related estimation commands.
Options
i
!
tablereqhests a table howing how the probabilities for the categories are computed from the fitted ; equatio_a.
I
robust s[ ecifies that tl_e Huber/White/sandwich estimator of variance is to be used in place of the traditio_}aI calculatioh: see [U] 23.11 Obtaiding robust variance estimates, robust combined with cluster () allc_,s observations which a_'enot independent within cluster (although they must be inde )endent between clusters)i
l
i
If you _?ecify pue5._ his, robust
is impliedi see [U] 23.13 Weighted estimation.
!
I
451
_:
:+ i
,; "
';
eluszer(varname) specifies that ......... the observations are independent across groups (clusters) but "+'== 'uSlIL estimation not necessarily within groups, varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individua/s, cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates, cluster() can be used with pweighes to produce estimates for unstratified cluster-sampled data, but see the svyologit command in [R] svy estimators for a command designed especially for survey data. byClUster()itself, implies robust;
specifying
robust
score (newvartist) creates k new variables, where first variable contains OlnLi/O(xib); the second contmns OlnLj/O(_cut2j); and so on Not ,ho, . e ,,,+ ..Stata. sc(kW°Uld1).create the appropriate number of new
cluster()
is equivalent
to typing cluster()
k is the number of observed outcomes. The variable co ' . . . . ;¢ ...... ntmns O_Lg/O(_cutl,), the third ,, yuu were to speclry the option score (sc,), variables and they would be named me0, scl,
level (#) specifies the confidence level, in percent, for confidence intervals+ The default is Zleve] (95) or as set by set level;see [uJ 23.5 Specifying the width of confidence intervals. offsetto be(Varname)l, specifies that varname is to be included in the mode/with maximize_optiOnSspecify them. control the maximization
process: see [R] maximize.
coefficient constrained
You should never have to
Optionsfor predict p, the default, calculates the predicted probabilities. If you do not also specify the outcome () option, you must specify k new variables, where k is the number of categories of the dependent variable. Say you estimated a model by typing ologit result xl x2, and result takes on three values. Then you could type predict pl p2 p3 to obtain all three predicted probabilities. If you specify the outcome () option, then you specify one new variable. Say that result takes on the values I, 2, and 3. Then typing predict p1, outcome(I) would produce the same pl. xb calculates the linear prediction. You specify one new variable; for example, predict linear, xb. The linear prediction is defined ignoring the contribution of the estimated cut points. stdppredictCalculateSse,thestdp.Standard error of the linear prediction.
You specify one new variable; for example,
outcorae(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome () should contain either a single value of the dependent variable, or one of #1, #2 ..... with #1 meaning the first category of the dependent variable, #2 the second category, etc. nooffse_z
is relevant only if you specified offset
(varname)
for ologit.
I! modifies the calculations
made by rather thanpredict x 3b + offsetj. so that they ignore the offset variable; the linear prediction
is treated as xjb
Remarks Ordered Iogit models are used to estimate relationships between an ordinal dependent variable and a set of independent instance, "poor", "good", variables. and " An ordinal ,, variable is a variable that is categorical and ordered, for excellent , which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] legit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be
ologit-- MaXimum-likelihood orderediogitestimation
453
: !
ordered (e. _,.,residency in the north, east, south and west), see [R] mlogit. This entry is concerned only with nodels in whi ch the outcomes can be ordered.
! !
In order ed logit, an u _dedying score is estimated as a linear function of the independent variables and a set ,f cut points, rhe probability of observing outcome i corresponds to the probability that the estimad'd linear fum tion. plus random error, is within the range of the cut points estimated for
i
the outcomi.':
i
_ r(outcomej = i) = Pr(_,-1 < /31xlj +-" _ 3kZkj + uj _< t
i i t
t
Example
!
You wisi_ to analyze [the 1977 repair records of 66 foreign and domestic cars. The data are a variation o( the automobipe dataset described in [U] 9 Stata's on-line tutorials and sample datasets,
Io
The n:pair records, 't,like those in 1978. take on values poor, fair. average, good, and excellent. Here 1977 is a c_oss-tabulation of the data:
i
tab
rep77
forei+,
¢hi2
R_pair R_cord !
_1977
! Foreign Dome_tlc
iFair iPoor Average !Good
t
_o al ! Pearson
Total
! I
tO 2 20
1 l 7
3 27
t
13
7
20
,I 0
Excellent,
i
Foreign
5
_ 45 i 1 _hi2(4)
Although it lappears that
ll
oreign
5 66
21 =
13.8619
• Pr = 0.008
takes on the _alues "Domestic"
and "Foreigff',
it is actually
a numeric vhriable takin_on ..--.the, values 0 and I.: Similarly, rep77 takes on the values 1, 2, 3, 4, and 5, correiponding to 'l_oor , Fair , and so o0. The more meaningful words appear because we attached val_e labels to t_e data" see [U] 15.6.3 Value labels. ° t •
! I
i
Since theI chl-squaredb, alue is significant, you could claim that there is a relationship between :foreign an_t rep77. Lite_rally,however, you can only claim that the distributions are different; the ch_-squared lest _s not dt t_onal. One way to model these data is to model the categorization that took place Qhen the data _ere created, Cars havea true frequency-of-repair, which we will assume is given by Nj = 3 :fOr_ig'Xlj + ttj and a car is categorized as "poor" if Sj < _o, as "'fair" if _o < Sj < 4i, and so onI • olog!t
1
i
foreign,
table
Iterat
on O:
log !Likelihood
= -89,895098
Iterat
on t:
log
Likelihood
= -85.95176S
]terat
on 2:
log
Likelihood
= -85.908227
!terat
on 3:
log
likelihood
= -85.908161
Ordere
log_t 1
i_ Log
I
rep77
II elihood i
estimates " = -85.908161
Number
of obs
66
=
7.97
iR chi2(1)
=
PSeudo
R2
=
0.0444
Prob
_h_9-
=
0.0047
>
......
[,
u'"
.,.-^,,.u.m-.n¢..ooa
rep77
Coef.
foreign
oroerea
SCd. Err.
1.455878
.5308946
j
_cut1 _cut2
-2. 765562 -. 9963603
.5988207 .3217704
I
_cut3 _cut4
3.123351 .9426153
.3136396 .5423237
rep77
ioglt estimation
z 2.74
[95% Conf.
O. 006
.4153436
Interval] 2.496412
(Ancillary parameters)
Probability
Poor Fair Average Good Excellent
P>[zl
Observed
Pr( xb+u<_cutl) Pr(_cutl<xb+u<_cut2) Pr(_cut2<xb+u<_cut3) Pr(_cut3<xb+u<_cut4) Pr(_cut4<xb+u)
0.0455 0.1667 0.409i 0.3030 0.0758
Our model is Sj = 1.46 foreignj -+-uj; the expected value for foreign cars is 1.46 and, for domestic cars, 0; foreign cars have better repair records. The "ancillary parameters" _cut 1, _cut2, _cut3, and _cut4 correspond to the t_'s in our previous notation--they model the categorization. For instance, the probability that a foreign car is categorized as having a poor repair record is given by the probability that 1.46 i < -2.77 or, equivalently, uj _< -4.23. The estimated cut points tell us how to interpret the score and the estimates--produced because we specified the option table--reminds us of A car is estimated as having a poor repair record if the score is less than the (Actually, the table could say less than or equal, but since the logistic distribution probability of any particular value is zero so it does not matter.)
table below the the interpretation. estimated _cut1. is continuous_ the
For a foreign car, the probability of a poor record is the probability that 1.46 + uj < -2.77 or, equivalently, uj < -4.23. Making this calculation requires familiarity with the logistic distribution: the probability is 1/(1 4- e 42z) = .014. On the other hand, for domestic cars, the probability of a poor record is the probability _Lj _< -2.77, which is .059. This, it seems to us, is a far more reasonable prediction than we would have made based on the table alone. The table showed that 2 out of 45 domestic cars had poor records while 1 out of 21 foreign cars had poor records--corresponding to probabilities 2/45 -- .044 and 1/2t = .048. The predictions from our model imposed a smoothness assumption_foreign cars should not, overall, have better repair records without the difference revealing itseIf in each category. The fact that, in our data. the fractions of foreign and domestic cars in the poor category are virtually identical is due only to the randomness associated with small samples. Thus, if we were asked to predict the true fractions of foreign and domestic cars that would be classified in the various categories, we would choose the numbers implied by the ordered legit model:
tabulate Domestic Foreign
legit Domestic
Foreign
Poor Fair
.044 .222
.048 .048
.059 .210
.014 .065
Average Good Excellen|
.444 .289 .000
.333 .333 .238
.450 .238 .043
.295 .467 .159
ologit -- Makimum-likelihoodordered Iogit estimation
I
455
See H_pothesis test'._and predictions below for a more complete explanation of how to generate prediction_ from an ore ered logit model,
E3TechnicalNote ! i !
an arbitra_-y dichotomi/_ation, which might otherwise have been tempting, We could, for instance, have sum_narized thesd data by converting the '5-outcome rep77 variable to a 2-outcome variable. combinin_ cars in the _verage, fair. and poor categories to make one outcome and combinin_ cars in
;
the good _nd excellent stIcategorie to make the second. l t , Anoth!r even less _lpp,ealing,atternati, e would have been to use ordinau' regression, arbitrarily labeling _xcellent" as! good as 4, and so on. The problem is that with different but equally valid labelings (say I0 for _xcellent ), we would obtain different esnmates. We would have no way of choosin_ bne metric over another. That is not, however, true of ologit. The actual values used to label the i:ategones _ " make no difference other tl_an through the order they imply.
i
In facti our labeling was 5 for "excellent". 4 for "good". and so on. The words "excellent" and
! _
"good'-'at_pear in our o Jtput because we attached a value label to the variables' see [U] 15.6.3 Value labels. If!we were to n3w go back and type replace rep77=10 if rep77==5, changing all the 5s to 10s. w_ would still _btain exactly the same results when we re-estimated our model.
I
! i
i
i
i
{
i
Example
; I
In the !example abo_e, we used ordered lo_t as a way to model a table. We are not. however. limited to!including only' a single explanatory ._-,_ _fiable nor to including only categorical variables, We can explo_e the relatio_ship of rep77 with any of the variables in our data. We might, for instance. model re'_77 not only tn terms of the origin of manufacture, but also including length (a proxy for size) and_pg: . ologit rep77 f reign length mpg Ite_'ationO: Ite_'ationi:
log likelihood = -89.895098 log likelihood = -78.775147
Ite_'ation2: Ite_'ation3:
log log
Iteration 4:
log likelihood = -78.250719
likelihood = -78.25_299 likelihood = -78.25_722
i i i Ord+red logit estimates Logilikelihood _ -78.2507_9
LR chi2(3) Prob > chi2 Number of obs Pseudo R2
|
rep77
Cool.
P>lz[
[95% Conf. Interval]
! }
_ foreign _ length f i mpg
2.896807 .0828275
.7906411 .02272
3.66 3.65
0.000 0.000
1.347179 .0382972
4.446435 .1273579
.2307677
.0704548
3.28
0.001
.0926788
.3688566
i :
curl _cut2
17.92748 19.86506
5.551191 5.59648
(Ancillary
parameters)
i
_cut3 _cut4
22.10331 24.69213
5.708935 5.890754
i
z
23.29 0.0000 66 0.1295
!
!
Std. Err.
= = = =
......
_--.
v_.
iw_ii
liilUlllil|llill
foreign ptays a role, asand even larger role than previously. We find that larger cars tend to have betterstill repair records, doan cars with better mileage ratings.
,1::
I :i_; :! ;,
Hypothesis tests and predictions See [u] 23 Estimation and post-estimation commands for instructions on obtaining the variancecovariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
• i
_>Example , In a previous example, we estimated the model ologit rep77 predict command can be used to obtain the predicted probabilities.
foreign
length
mpg. The
You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is "poor" and the highest "excellent". We have five categories and so must type five names following predict; the choice of name is up to us: predict poor fair avg good exc (option p assumed; predicted probabilities) • list exc good make model rep78 if repT7 ==. 3. I0. 32.
exc .0033341 .0098392 .0023406
good .0393056 .1070041 .0279487
44. 53. 56. 57. 63.
.015697 .065272 .005187 .0281461 .0294961
.1594413 .4165188 .059727 .2371826 .2585825
make AMC Buick Ford Merc. Peugeot Plym. Plym. Pont.
model Spirit Opel Fiesta
rep78
Monarch 604 Horizon Sapporo Phoenix
Average
Good
Average
The eight cars listed were introduced after 1977 and so do not have 1977 repair records in our data. We predicted what their 1977 repair records might have been using the fitted equation. We see that, based on its characteristics, the Peugeot 604 had about a 41.65 + 6.53 _ 48.2 percent chance of a good or excellent repair record. The Ford Fiesta, which had only a 3 percent chance of a good or excellent repair record, in fact had a good record when it was introduced in the following year. <1
Q TechnicalNote For ordered legit, predict, xb produces Sj - Xlj;31 + X2j/_2 4-- -- q- Zkj/Pk. The ordered-legit predictions are then the probability that Sj + uj lies between a pair of cut points _,i-1 and i,;/. Some handy formulas are Pr(Sj + ,,j < n):
1/(1 + e s'-*)
Pr(Sj 4-uj > n) = 1 Pr(nl
1/(1 + e s'-*)
< Sj + uj < _;2) = 1/ (1.4 eS'-*2 ) - 1/(1+
e
")
i
_
_
!
;
ologit -- Maximum-likelihoodordered Ioglt estimation
! Rather t_an using pr+±ct
457
--i ............... directly, we coul_ cak'ulate the predicted probabilities by hand. tf we
wighed tc obtain the predicted probability that the repair record is excellent and the probability that it is good, ,_e look back fat ologit s output to obtain the c,ut points. We find that "good" corresponds to the int _rvat _cut3 _ Sj + u < _cut4 and i excellent to the interval Sj + u > _cut4:
i I
" predict score! xb ;n probgood _ l/(l+exp(score-_b[_cu,4]))
!
,
1
i
i
- 1/(I+exp(score-_b[_cut3]))
• g_n probexc = _I - £/(i+exp(score- b[,cut4]))
1
The resul s of our cal ulation will be exactly ihe same as that produced in the previous example.
!_
Note that}we refer to the estimated cut points just as we would any coefficient, so _b[,_cut3] refers to the valhe of the _c@3 coefficient; see [U] i6.5 Accessing coefficients and standant errors,
SavedR,ults ologil , Scalars
r
i
saves in e(I:
e(N_ e(k__cat)
nut bet of observations nun ber of categories
e(ll) e(ll_O)
log likelihood Iog likelihood,
e(d__.m)
moc el degrees of freedom
e(ch±2)
X2
e (r__p)
pse_tdo R-squared
e (N_clust)
number of clusters
constant-only
model
i
Macros i
! ! i
e(c+d) e(d@var) e(w_ype)
o].0 21; narr of dependent variable weJ _t type
e(vcet_e) e(chi2type) e(offset)
covariance estimation method Wald or Lit: type of model x 2 test offset
e(w_xp) e(c]ustvar)
weight expression nam of cluster variable
e(predict)
program used to implement predict
i
coef ]cient vector care _ory values
e (V)
vafiance-covariance estimators
i
Matrices '
l
e (b) e (c_ t)
matrix of the
Functions i
e(sa_ple)
marl:s estimation sample
ethods,nd For M ' i _ I
i i {
las
A straightforward textbook description of the model fit by ologit, as well as the models tit by oprobiit, clogit, mlogit, can be found in Greene (2000, chapter 19). When you have a qualitati_ie dependentk,ariable, several estimation procedures are available. A popular choice is muttinomia_ logistic reglession (see JR] miogit)_ but if you use this procedure when the response V ' • i .... , .... anable ts ,ordmal. youiare d|scardmg mformatmn because multmomml logtt ignores the ordered aspect of t_e outcome Ordered logit and probii models provide a means to exploit the ordering information[ '
4nO
There isimore than ol e "ordered logic model, The model fit by ologit,which we will call the ordered lo@ model, is a] ;o known as the proportional odds model. Another popular choice, not fitted by ologitlis known as he _tereotype model. AIi ordered logit models have been derived by s_arting with a bindS, logit/probi model and generalizing I it to allow for more than two outcomes.
;_"_r_
'
_
oioglt -- Maximum-likelihood ordered Iogit estimation
The proportional odds ordered logit model is so called because, if one considers the odds odds(k) = P(Y < k)/P(Y > k), then odds(k1) and odds(k2) have the same ratio for all independent variable combinations. The model is based on the principle that the only effect of combining adjoining categories in ordered categorical regression problems should be a loss of efficiency in the estimation of the regression parameters (McCullagh 1980). This model was also described by Zavoina and McKelvey (1975), and previously by Aitchison and Silvey (1957) in a different algebraic form. Brant (1990) offers a set of diagnostics for the model. Peterson and Harretl (1990) suggest a model that allows nonproportional explanatory variables, Fu (1998).
odds for a subset of the
ologit does not allow this, but a model similar to this was implemented
by
The stereotype model rejects the principle on which the ordered logit model is based. Anderson (1984) argues that there are two distinct types of ordered categorical variables: "grouped continuous", like income, where the "type a" model applies; and "'assessed", like extent of pain relief, where the stereotype model applies. Greenland (1985) independently developed the same model. The stereotype mode/starts with a multinomial logistic regression model and imposes constraints on this model. Goodness of fit for ologi'l;
can be evaluated by comparing
the likelihood value with that obtained
by estimating the model with mlogit. Let Lj. be the log-likelihood value reported by ologit and let L0 be the log-likelihood value reported by mlogit, if there are p independent variables (excluding the constant) and c categories, mlogit will fit p(c - 1) additional parameters. One can then perform a "likelihood-ratio test", i.e., calculate -2(L1 - L0), and compare it to )C2{p(c2)}. This test is only suggestive because the ordered logit model is not nested within the multinomial logit model. A large value of -2(L1 - L0) should, however, be taken as evidence of poorness of fit. Marginally large values, on the other hand, should not be taken too seriously. The coefficients and cut points are estimated using maximum-likelihood as described in [R] maximize. In our parameterization, no constant appears as the effect is absorbed into the cut points. ologit and oprobit begin by tabulating the dependent variable. Category i = 1 is defined as •"the minimum value of the variable, i = 2 as the next ordered value, and so on, for the empirically determined [ categories. The probability
of observing an observation
Pr(outcome
= i) = Pr
t_i-1
<
in the case of ordered logit is
_jxj
J
n-
U <
t_i,
1
1
l +exp(-t_i+ Note that _;0 is defined as -vo
_-_./3jzj)
+ _ /3jzj)
and t_I as +co.
In the case of ordered probit_ the probability
Pr(outcome
l +exp(-__l
= i) -- Pr(_i_l
/ "
where @() is the standard normal cumulative
of observing an observation
< Z/3jxj J
3
;i]jl
distribution
+ u < _)
_
¢_(I_-"
function.
1
_32j_j) j
is
-
otogit -- MaXimum-likelihood ordered Iogit estimation
459
References
Aitchison, J. and S. D. Sih)y. 1957. The generalization Of probit anah'sis to the case of multiple response,_. Biometrika
:
1
Anderson. i" A. 1984. Re ession and ordered categorical variables (with discussion)_ Journal of the Royal Statistical
44:131 140,
! l
Societyj Series B 46: 1_30. Brant, R. 1_990.Assessing,proportionality in the propot'tional odds model for ordinal lo_isticLregession. Biometrics 46: I17_-t178.
}
Fu, V.K.
_-_ ,!
Stata T_chnicat Bulletir Reprints, vol. 8. pp. 160-164. Goldstein, R. 1997. sg59: Index of ordinal variation arid Norman-Barton GOF. Stata Technical Bulletin 33: 10-!2.
i _ I
Reprini__ in Stata Teclnical Bulletin Reprints. vol. 6. pp. 145-147. Greene, _'_'iH. 2000. Econgmetric Anah,sis, 4th ed. Upper Saddle River. NJ: Prentice-Halt. Greenland,!S. 1985, An a_plication of 10_istic models to the analysis of ordinal response, giometrical Journal 27: 189-19_. i
!
Lor_g,J. SI' 1997. Regresston Models for Categorical and l,imited Dependent Variables Thousand Oaks, CA: Sa_e i Publications. i
[ _i
McCulla_h! R 1977. A logistic model for paired comp_'isons with ordered categorical data. Biometrika 64: 440-4'_ . _98d, Regression m_l)delsfor ordinal data (with discussion). Journal of the Royal Statistical Society, Series B
98. sg88: Estimating generalized ordered t0git models. Stata Technical Bulletin 44: 27-30. Reprinted tn
McCullagh,!R and J.A.
elder, t989. Generalized Linear Models, 2d ed, London: Chapman & Hall.
Peterson. Ii. and E E H_rrelt, Jr, 1990, Partial proportional odds models for ordinal response variables. Applied I
StatistiCs39:205-217., Woife, R. 1_98. sg86: Continuation-ratio models for ordil_alresponse data, Stata Technical 13ultetin44: 18-2t in Stat_ Technical Bull'.tin Reprints. vol, 8, pp, 149-153
!
Wolfe, R. iand W. W. GoJld. 1998. sg76: An approxlmate likelihood-ratio test for ordinal restxmse models. 5tata Technic_l Bulletin 42: ?4- 27. Reprinted in Stata Technical Bulletin Reprints. vol. 7. pp. t99-204.
_'
Zavoina, W. and R. D. _ cKetvey. 1975, A statistical model for the anah'sis of ordinal level dependent variables Journaliof Mathematic_ Sociology 4: 103-120.
!!
_:
AisoSee =
Complen_entary: Related: BackgroUnd:
iR]R] w,adjust'[R] test,[R] [R]linktest'vce lrtest, [R] [R] svy mfx.estimators [R] predict. R] slogistic, [P,]lincom.[R] logit,testril.[R] [R]: mlogit, [R][R] oprobit, u] 16.5 Accessing u] 23 Estimation
coefficients
and standard
and _st-estimation
U] 23.11 Oblaining
robust variance
U] 23.12 Obtaining
scores,
R] maximize
errors.
commands, estimates.
Reprinted
_lt_
i ILli_
oneway
-- One-way analysis of variance
Syntax {i,
on_eway response_varfactor_var
[weight]
[if exp] [in range]
[, noa-nova nolabel
missing wrap tabulate [no]means trio]standard[no]freq Ino]obs bonferroni s__ccheffe sidak I
by ...
: may be used with oneway;
aweights
and freights
see JR] by.
are allowed;
see [U] 14.1.6
weight.
Description The oneway comparison tests.command reports one-way analysis-of-variance If you wish to estimate more complicated (ANOCOVA)models, see [R] anova.
ANOVAlayouts or wish to estimate analysis-of-covariance
See [R] encode for examples of estimating See [R] loneway
(ANOVA)models and performs multiple-
ANOVAmodels on string variables.
for an alternative oneway command with slightly different features.
Options noanova
suppresses
the display of the ANOVAtable.
nolabel causes the numeric codes to be displayed rather than the value labels in the ANOVA and multiple-comparison test tables. rni s sing requests that missing values offactor_var to be omitted from the analysis.
be treated as a category rather than as observations
wrap requests that Stata take no action on wide tables to make them readable. Unless wrap specified, wide tables are broken into pieces to enhance readability.
is
tabulate produces a table of summary statistics of the response_var by levels of the factor_var. The table includes the mean, standard deviation, frequency, and. if ihe data are weighted, the number of observations. Individual elements of the table may be included or suppressed by the ino]means,[no]standard,[no]freq,and [no]obsoptions. Forexample,typing ,
oneway
response
factor,
tabulate
means
standard
produces a summary table that contains only the means and standard deviations. the same result by typing oneway
response
factor,
tabulate
You could achieve
nofreq
[no]means includesabove. or suppresses only the means from the table produced by the tabulate See tabulate 460
option.
)
I [' , !
,,,,e I .... Ii
i
Elapse --
nerate pharmacokinetic measurement dataset
Syntax
!
t stlat (measure) I keep(varlist)
t
I
force
_odots
]
where treasure is any of
_ ! !
au¢ aucli to aucex auclo
area area area area
i
half ke cmax
half life _f the drug elimination rate maximun_ concentration
tmax
time concentration time at of l_st haximum concentration
I
und und und und
,r the ',rthe ',r the ,.rthe
concentration-timel curve (AUCo,_) concentration-time curve from 0 to vc using a linear extension concentration-time curve from 0 to vc using an exponential extension log concentration-time curve extended with a linear fit
t omc
i
Description t
•
pkco_lapse
is on_ of the pk commands. If you have not read [R] pk, please do so before reading
pkeo:_lapse
generhtes new variables with the pharmacokinetic summary measures of interest.
Options
' . I
!
id(id_v(r), which _sinot optional, specifies the variable that contains the subject id over which pk¢o_ lapse is to +perate.
! !
fit(#) .__ecifies the r_umber of points to use ita estimating the AUC0.oo.The default is the last three points fit(3), which should be viewed as a minimum. The appropriate number of points will
I !
depen_ on your dat_. trapezoid tells Statal to use the trapezoidal rule when calculating the AUC. The default is cubic splinel, which give_ better results for most ftinctions. When the curve is very irregular, trapezoid may g_ve better res!alts. stat(melm, re) specifies which measures should be generated, The default _s to generate all the measures, i
1 I
i i
keep(va_tist) specifie_ which variables should be kept during the collapse, Variables not specified with t_e keep () op!ion will be dropped. When keep ()is specified, the keep variables are checked _o ensure that all v_lues of the variables are the same within id_var. force
folces, the collapse, even in cases where ihe values of the keep()
nodots s_ppresses the dots during calculation. 514
variables are different within
pk --
-_-
2 2 2 2 2 2 2 2 2 2 (output omitted ) This
format
for two
1 1 1 1 1 1 1 1 1 1
is similar
drugs
at each
to the second time
id i 2 3 4 5 7 8 9 10 12 13 !4 15 18 19 20 For that
this
we
the dataset administered. The
data
observation
produced
was the
applied study.
is only the
subject
during We
the
need
WE see
that
in the
Similarly,
and
id I0 i0
study
first
expansion
1 was
and
treatment
into need
the
two
two
period
in sequence and
had
10 (the
calls
we have data
measurements
to the
first
of
following:
study
Note
to the
AUCs,
each
treatment
was
format.
That and
is.
one,
when
treatment
A
period
of
second
the
outcome
the
the
subject
need
[R] pkshape.
means
indicating
is one we
in the
so that one
see
which
applied
there
pkequiv,
pkshape:
variables,
subject
measure.
addition
when
using
B was
the
In
pkcross
observations new
pharmacokinetic
wide
is in sequence
measure treatment
received
that
to be
period A B
auc 137.0627" 139.7389
these
indicating
To use
treat
study,
of subject
subject
Stata
subject
of the
auc 150.9643 218.5551
of the
sequence 2 2
recording
the
number
expansion
we
now
pkexamine.
be accomplished
observation
In addition, another
that
as our
by
outcomes.
can
of the
expect
period
This
period the
except
drugs
each
in what
This
1 1
first
are
511
/
two
for
or more
format.
goal
computed
dataset.
sequence
subject
is to transform
in the
variable.
might
the
first
to split
received One
id 1 1
applied
subject
The
variable
pkcollapse
to long
above,
AUC for
data
8.710549 10. 90552 8. 429898 5. 573152 6. 32341 .5251224 7.415988 6.323938 1,133553 5. 759489
described
measures
two
(biopharmaceutical)
auc_concB 218. 5551 133. 3201 126. 0635 96.17461 188.9038 223.6922 104.0139 237.8962 139.7382 202.3942 136.7848 104.5191 165.8654 139. 235 166. 2391 158. 5146
the
the
containing
data
in a single
treatment.
by
the first
to use of
a sequence
subject
these
subject.
auc_concA 150. 9643 146. 7606 160.6548 157. 8622 133.6957 160.639 131.2604 168.5186 137.0627 153.4038 163.4593 146.0462 158.1457 147. 1977 164. 9988 145. 3823
any
contains
per
to transform Consider
used
7.253442 5. 849345 6. 761085 4. 33839 5. 04199 4. 25128 6. 205004 5. 566165 3. 689007 3.644063
format
each
we chose
have
also
for
seq 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
example,
could
1.5 2 3 4 6 8 12 16 24 32
Pharmacokinetic
l, had
1 2 an AUC of
an AUC of 218.5551 first
treat B A
subject
in the period I 2
150.9643 when
sequence
when
treatment
treatment
B was
2) would
be
A was applied.
[i i I_ _I i
pk_ P k,o.c
in the first dataset, vdu will need to use reshape to change it to the second form; see [R] reshal_. Because the data in th_ second (or long) format containsinformation for one drug on severalsubjects, pksummzan be used t)pproduce summary statistics of the pharmacokinetic measurements. The output .
ksunm id t
S_ary i
e concA
statistics
for
star,
i
_
Mean
the
pharmacoktnetic
Median
measures N_mber of observations =
_ariance
Kurtosis
p-value
127,58
-0.34
2,07
O.55
auc
, L51.63
152.18,,
aucline
397.09
219,83
1_8276.59
2.69
9.61
O,O0
aucexp auclog half
_68.60 _65,95 90.68
302,96 298.03 29,12
720356.98 752573.34 17750.70
2.67 2.71 2,36
9.54 9.70 7.92
0.00 0,00 O,O0
! 0.02 i 7,37
0.02 7.42
0.00 0.40
0.88 -0.64
3.87 2,75
0.08 0,68
3,38 32.00 !
3.00 32,00
7.25 0.00
2,27
7,70
0,00
ke cmax tomc tmax
I
T
16
Skewness
Until aow, we hav_ been concerned with the profile of only one drug. We have characterized the profil '. of that dru_by individual subjects using pkexamine and by a group of subjects using pkm_mm.['he goal of hn equivalence trial, however, is to compare two drugs, which we will do in the remai_der of this e_ample. In the case of equi,_alencetrials, the study design most often used is the crossover design. For a complete discussion of!crossover designs, see Ratkowsky,Evans, and Attdredge (1993).
!
Briefly_crossover d_signs require that each sr_bjectbe given both treatments at two different times. The orde_in which thd treatments are applied ihanges between groups. For example, if we had 20 subjects n_mbered I through 20, the first I0 would receive treatment A during the first period of the study, ther_they would l_egiven treatment B. Thesecond I0 subjectswould be given treatment B during the first p_riod of the study, then they would be given treatment A. Each subject in the study will have four vari@les that describe the observation: a _ubject identifier, a sequence identifier that indicates the order bf treatment, and two outcome variables, one outcome for each treatment. The outcome vari_ables_or each sub _:tare the pharmacokinetic measures. The data must be transformed from a series of rbeasurements on individual subjects to data containing the pharmacokinetic measures for each subj@t. In Stata _lance. this is referred to as a collapse, which can be done with pkcollapse:
I
see [R] pl_coilapse. To illustrate pkcoll:_pse,
!
i i
i i
I
1
assume that we have the following dataset:
1 id1 1 I t I I 1 i 1 1 1 1 2
1 Iseq1 1 1 I 1 1 I 1 1 1 i 1 1
0 ttme.5 1 1.5 2 3 4 6 8 12 16 24 32 0
2
1
•5
2
1
1
0 3.073_03 concA 5.188444 5. 898577 5.096378 6. 0941_85 5. 158172 5.7065 5. 272_67 4. 4576 5. 146423 4.947427 1,920421 0
0 3.712592 concB 6. 230602 7. 885944 9.241735 13.10507 .169429 8.759894 7.985409 7. 740126 7.607208 7,588428 2.791115 0
2.48_62
. 9209593
4,883_9
5.925818
pk -- Pharmacokinetic (biopharmaceutical) data
509
. pkexamine time conc :
Maximum concentration Time of maximum concentration Tmax Elimination rate Half life
= = = = =
4.7 3 32 0.0279 24.8503
Area under the curve r
AUC [0, Tmax]
.....
AUC [0, inf.) Linear of log conc.
85.24
AUC [0, inf.) i AUC [0, inf.) Linear fit i ExponenZial fit
142.603
107.759
142.603
Fit based on last 3 points.
Clinical trials, however, require that data be collected on more than one subject. There are several ways to enter raw measured data collected on several subjects. It would be reasonable to enter for each subject the drug concentration value at specific points in time. Such data could be id 1 2 3
concl 0 0 0
conc2 1 2 1
conc3 4 6 2
conc4 7 5 3
conc5 5 4 5
conc6 3 3 4
conc7 1 2 1
where concl is the concentration at the first measured time, conc2 is the concentration at the second measured time, etc. This format requires that each drug concentration measurement be made at the same time on each subject. Another more flexible way to enter the "data is to have an observation with three variables for each time measurement on a subject. Each observation would have a subject id, the time at which the measure.merit was made, and the corresponding drug concentration at that time. The data would be id 1 1 1 1 t 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
concA 0 3.073403 5.188444 5.898577 5.096378 6.094085 5.158772 5.7065 5.272467 4.4576 5.148423 4.947427 1.920421 0 2.48462 4.883569 7.253442 5.849345 6,761085 4.33839 5.04199 4.25128 6.205004 5.566165 3.689007 3.644063
time 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 0 .5 1 1.5 2 3 4 6 8 12 16 24 32
(ou_utomitted)
Stata expects the data to be organized in the second form. If your data are organized as described
...................
!
I t
[
' "_ !
....................................
pkex mine will co pute and report all the t_harmacokineticmeasures that Stata produces including PkuPhB_fl? of tl_ area Okir_ticunder_iOpharmB_O_icam) dSta fc/urcak ulations the time-vdrsus-concentrationcurve. The standard area under the curve from 0 to the _aximum observed time (AUCo,tm.) is computed using cubic splines or the area curve tO pkexamine compute trapezok:aI rule. Additionally, will also the under the from 0 infinity t,v extending t_e standard time-versusqzoncentrationcurve from the maximum observed time _
l_i
i !
t
usin_ thr_ different rrWethods.The first method%implyextends the standard curve using a least squares linear fit through the iast few data points. The second method extends the standard curve by fitting '
a decreadn_ exponenlial curve through the l_tstfew data points. Lastly, the third method extends the curv,_by fitting a least squares linear regrdssion line Onthe log concentration. The mathematical details o_these extensions are described in th_ Methods and Formulas section of [R] pkexamine. Data [rom an equikalence trial may also bd analyzed using methods appropriate to the particular study detign. When ybu have a crossover design, pkcross can be used to fit an appropriate ANOV_. model. As an aside, _ crossover design is simply a restricted Latin square; therefore, pkeross can also be lsed to analyie any Latin square design. Therelare some pNctical concerns when de_ling with data from equivalence trials. Primarily, the da_anee_ to be organiied in a manner that Stat_ can use. The pk commands include pkeollapse and pkshap_, which are +signed to help transform data from a common format to one that is suitable for analysis with Stat_. In thd following example, we illustrate severWdifferent data formats that are frequently encountered in pharr_aceutical research and describe how ihese formats can be transformed to formats that can
bea. y ed S,.,+ 1
)
[ [
i
,>Example
i !
Assu__ethat we ha,_eone subject and are interested in determining the drug profile for that subject. A reasonable, experiment would be to give, thei subject the drug and then measure the concentration • _ .
}
of the d4g m the subject s blood over a t,me period. For example, here is a dataset from Chow and --
time
1
o
.g
[
[ l
i'on
o 0
1.5 2 3 1 4
4.4 4.4 4,7 2.8 4.1
8 12
3.6 3
24 32 16
1.62 2.5
°
1
concentrat
,
)
Examining these d ta, we notice that the concentration quickly increases, plateaus for a short period, a_d then slowh' decreases over time. pkexamine is used to calculate the pharmacokinetic
i
measuresi°f interest" li_examine is explained !n detail in [R] pkexamine The °utpul is
le I pk-
Pharmacokinetic
(biopharmaceutical)
data
[
I
I
i
Description The term pk refers to pharmacokinetic
data and the commands,
all of which begin with the letters
pk, designed to do some of the analyses commonly performed in the pharmaceutical industry. The system is intended for the analysis of pharmacokinetic data, although some of the commands are of general use. The pk commands pkexamino pkst__mm pkshape pkcross pkequiv pkcollapse
are [R] [R] [R] [R] [R] [R]
pkexamine pksumm pkshape pkeross pkequiv pkeollapse
Calculate pharmacokinetic measures Summarize pharrnacokinetic data Reshape (pharmacokinetic) Latin square data Analyze crossover experiments Perform bioequivalence tests Generate pharmacokinetm measurement dataset
Remarks Several types of clinical trials are commonly performed in the pharmaceutical industry. Examples include combination trials, multicenter trials, equivalence trials, and active control trials. For each type of trial, there is an optimal study design for estimating the effects of interest. Currently, the pk system can be used to analyze equivalence trials. These trials are usually conducted using a crossover design; however, it is possible to use a parallel design and still draw conclusions about equivalence. The goal of an equivalence trial is the assessment of bioequivalence between two drugs. While it is impossible to prove two drugs behave exactly the same, the United States Food and Drug Administration believes that if the absorption properties of two drugs are similar, then the two drugs will produce similar effects and have similar safety profiles. Generally, the goal of an equivalence trial is to assess the equivalence of a generic drug with an existing drug. This is commonly accomplished by comparing a confidence interval about the difference between a pharrnacokinetic measurement of two drugs with a confidence limit constructed from U.S. federal regulations. If the confidence interval is entirely within the confidence limit, the drugs are declared bioequivalent. An alternative approach to the assessment of bioequivalence is to use the method of interval hypotheses testing, pkequiv is used to conduct these tests of bioequivalence. There are several pharmacokinetic measures that can be used to ascertain how available a drug is for cellular absorption. The most corn mort measure is the area under the time-versus-concentration curve (AUG). Another common measure of drug availability is the maximum concentration (Cmax) achieved by the drug during the follow-up period. Stata reports these and other less common measures of drug availability, including the time at which the maximum drug concentration was observed and the duration of the period during which the subject was being measured. Stata also reports the elimination rate, that is, the rate at which the drug is metabolized, and the drug's half-life, that is. the time it takes for thc drug concentration to fall to one-half of its maximum concentration. 507
1 l
_: ...........
.
i
506
.............
.........
pergram-- IPeriodogram
Also See C0mple[ _enta_:
IR] tsset
Related:
IR] corrgram, JR] cumsp, JR]wntestb
Baekgro_rod:
_tata Graphics Manual
pergram-- Periodogram
505
Methodsand Formulas <_.'_
"-_=
pergramis implemented as an ado-file.
_-._
We use the notation of Newton (1988) in the following.
= _:
A time series of interest is decomposed into a unique set of sinusoids of various frequencies and amplitudes.
_
A plot of the sinusoidal amplitudes (ordinates) versus the frequencies for the sinusoidal decomposition of a time series gives us the spectral density of the time series. If we calculate the sinusoidal amplitudes for a discrete set of "natural" frequencies (I/n, 2/n .... , q/n), then we obtain the periodogram. Let x(1),..., k = I,...,(n/2)
z(n) be a time series and let wk -- (k - 1)/n denote the natural frequencies for + 1. Define
=
t=l
A plot of n C k2 versus ,_k is then called the periodogram. The sample spectral density is defined for a continuous frequency w as
1
n x(t) e2_i(t-l)_
r
f(1 - co)
ifwe[0,.5]
if coC [.5,1]
Note that the periodogram (and sample spectral density) is symmetric about w -- .5. Further standardize the periodogram such that
n
k=2
82
= 1
(where 82 is the sample variance of the time series) so that the average value of the ordinate is one. Once the amplitudes are standardized, we may then take the natural log of the values and produce the log-periodogram. In doing so, we truncate the graph at 4-6. Note that one frequently drops the prefix "log-" and simply refers to the "log-periodogram" as the "periodogram" in text.
References Box. G. E. P. and G. M. Jenkins.
1976. Time Series Analysis:
Box, G E. R, Jenkins. G. M. and G, (2. Reinsel. Englewood Cliffs, N J: Prentice-Hall. Chatfield. Hamilton, Newton.
C. 1996. The Analysis J. 1994. Time Series H. J. 1988,
TL¥1ESLAB:
of Time Series: Analysis,
Forecasting
1994, Time
An introduction.
Princeton:
Princeton
A Time gerie._ Analysi_
Series
and Control. Analysis:
5th ed, London:
University
Laboratory,
Oakland,
Forecasting Chapman
CA: Holden-Day.
and
Control.
3d ed.
& Hall.
Press.
Pacific Grove,
CA: Wadsworth
& Brooks/Cole.
r'
graphsumac ime, xlab ylab s(o) c(1)
=
20"
l
i
i
t
-11]
o V
]i
t
•
i
-2o 5
.
rgra_
I
100
1 0
sturfc, gen(ordinate) Sample spectrsl denstt_ functmn evaluated af the nalura_ frequencies l t 'i| |
• •
rl
I
6.00
" 6.00
4.00
- 4.00
0.00
" 0.00
I
_. •
-2.00
i
-4.00
- -4.00
-6.00 1
_ :
0.00
The periodigram
_'---: I 0,10
..... _------: ............ --:-::--_: : ..... l!! _ ' | 4 ' ' f 0.2_ 0,30 _requency
clearly shows the four contributions
can see tha l the periods
?f the summands
0.40
to the original
-6,00
O.SO
time series. From the plot, we
were 3, 6, 12, and 36, although
you can confirmthis
by
, ge doubleomen.= (.n-1)/144 using • genldouble ni peric d = i/omega
i
(1 missingvalue __nerated) . lie_ period [ 5. 13, 25.
!_ I_ I 1
49.
ome_
period 36 12 6
3
if ordinate>O
k omega<=.5
omega ,02777778 .08333333 I'16666667
i 33333333 <1
l
1
pergram -- Periodogram graph
lynx
time,
.... :
--_
xlab
8000
-
6000
"
4000
-
ylab
s(o)
503
c(1)
z
2000 0 4
_.
j]
V I
Time
• pergram
lynx Sample spectral density function evaluated at tMe natural frequencies i r ,,l ....
I
= - 6,00
6,00 "
4,00 -
_E
2.00 -
>.o
0.00
mc_
.¢:t o E .J
j
4.00
2.00
-
0,00
?
-2.00
-2.00
z
=
¢,
T"
-4.00
-6.00
-4,00
T-0,00
t O. 10
l 0.20
_-- -6.00 0.30
0.40
0,50
Frequency
The periodogram indicates that there may be a periodicity at 15 years for these data, but is otherwise random in nature. In [R] eorrgrarn, we see evidence of the ARMA (autoregressive moving average) nature of this time series,
q
_' Example In order to more clearly highlight what the periodogram depicts, we present the result of analyzing a time series of the sum of four sinusoids (of different periods). The periodogram should be able to decompose the time series into four different sinusoids whose periods may be determined from the plot.
' r
graph spot
_
ime,
xlab
ylab
s(o)
cCt)
200 "
-ca° 15o-
t
¢
too E Z
1
I
1700
i • Ipergra_
1800
°
i
1900
.i
Year
2000
,
,
spo_ ev_fualed at 1he nat_ra frequencies Sample spectral de_sily i _ i functlo_ 1,
I
t
......
6,00
"_ E
" 6.00
4,00
" 4,00
2.00
" 2.00
-2.00
-2.00
-400
-4.00
I=o
-6.00
I 0.00
i 0.,0
0 20
0.30
0.40
-6 O0
0 50
Frequency
The eriodogram _dicatesa peak frequency between 10 and 12 years. ,1
i
- Example Here _veexamine tl:e number of trapped Cat_adianlynx (Newton t988. 587). The raw series and
the' log-plriodogram ar given as
pergram -- Periodogram
501
graph air time, xlab ylab s(o) c(1)
600
. o i
ii
i
1
400
g ,al ea 200
"
0 t t950 Time
1 1955 (in months)
t9
0
pergram air Sampie spectral density evaluated at the natural t
tunction lreq uencies I I
I
,
6.09
600
4.00
I
- 4 O0
E
2.00
2.00
¢' o
O.OO
- C 00
cr_
tOl_
°°
I!
.,:::
0ooo
-6,00
0.00
" -6.00 O, 10
0.20 0.30 Frequency
0,40
0.50
The periodogram clearly indicates the annual cycle together with the harmonics. The similarity in shape of each group of twelve observations reveals the annual cycle. The magnitude of the cycle is increasing, resulting in the peaks in the periodogram at the harmonics of the principal annual cycle.
<1
_' Example In this example, the data consist of 215 observations on the annual number of sunspots from 1749 to 1963 (Box and Jenkins 1976, Series E). The graph of the raw series and the log-periodogram for these data are given as
i
I
lm--
Title Syntax
i
pergra_ is for use with e-series data: see [R] tsset You must tsset your data before using pergram. In addition. the tilae series must _ dense (nonmissing and nd gaps in the time variable) in the specified sample,
!
varname may contain ti_e-series operators: see [U] N.4.3 Time-series varlists.
{
DesCdplion
, i
per ram plots the. tog standardized periodogram for a (dense) time series.
Options ! !
genera':e (newvarna_ne) specifies a new variable to contain the raw periodogram values. Note that the g_merated grapl_log-transforms and scales the values by the sample variance and then truncates
i
them to the [-6, I] interval prior to grapliing them. graph_c,r_tionsare an!, of the options allowed with graph, t_ovay; see [G] graph options.
Ji Remark., A go )d discussior] of
.
the periodogram is provided in Chatfield (1996), Hamilton (1994), and
classic .terence is Btx. Jenkins. and Reinsel (1994). Teehnid Note Newton !1988). Chatt_eldis also a very good introductoryreference for time-series analysis. Another
i I
perg _m produces a scatterplot where the points of the scatterplot are connected. The points themseh es represent the log-standardized periodogram, and the connections between points represent the (conlinuous) log-slandardized sample spectral density. Although the periodogram is asymptotically unbiased for the specttat density, it is not cons!stent, and many analysts witl obtain the raw ordinates from thi command _ith the gen() option and smooth them prior to plotting.
¢ I !
1 1
main features of the lots.
Exampl_ In t following e_amples, we present the periodograms together with an interpretation of the We h{avetime-serie_data consis+tingof 144 observations on the monthly number (in thousands) of intemati_malairline p_,ssengersbetween 1949,and 1960 (Box. Jenkins. and Reinsel 1994. Series G_. We can Faph the ra_ series and the ]og-periodogram for these data by typing
i i !. !
I
+
pctile m Create variable containing percentiles
499
numbered, respectively, 1,2,..., m, based on the m quantiles given by the p_:-th percentiles, Pk = 100k/m for k = 1,2,...,mI. Note that if x[pk_i] - x[pd, then the kth category is empty. All elements are put in the (k - 1)th category:: (xb_k_2],x[pk_ll ]. If xt:i.le
is used with the eutpoints
(-cc,
(varname)
(YO),Y(2)],
and they are numbered, respectively, Y(1), Y(2), .... Y(m).
1, 2,...,
"--,
option, then the categories (Y(m-1),Y(m)],
x -- x__l
where
] = x[pk]
are
(Y(m), +_)
m 4- 1, based on the m nonmissing
values of varname:
Acknowledgment xtile is based on a command originally posted on Statalist (see [u] 2.3 The Stata listserver) Philip Ryan of the University of Adelaide, Australia.
AlsoSee
r
Related:
[R] centile, [R] summarize
Background:
[U] 21.8 Accessing results calculated by other programs
bv
"i
[3 Tech ical Note s,!mmarize, d,:tail
,,
will compute the 1st, 5th, lOth, 25th, 50th (median), 75th, 90th, 95th, and
99th 3ercentiles.q_hereis no real advantage in using _pcZile to compute these percentiles. Both sumlI _rize, detail and __pet ile use the same internal code. _petite is slightly faster because summarize, detail computes a few additional things. The value of _pctile is its ability to completepercentilei other than these standard ones.
_,
Saved esults pc
/
Ale and _p_tile
save in r() Scalars r(r#)
value_f #-requested percentile
Metho¢ ; andFo mulas pet
le and xti:
e are
implemented as ago-files.
We irst give the _efault formula for percentiles. Let (j)refer to t_e x in ascending order [or j = 1,2 .... ,n_nLet w(j) refer to the corresponding weightstof x(j): if t_re are no weights, wo. i = 1. Let N j=l 'w(j). To o_tain the pth 9ercentile,which we will denote as zip], let P = Np/lO0 and let i
W(O = E wO) j=l
Fihd th_ first index i ,uch that DV(_)> P. The pth percentile is then
x[pl = !
x(_-l) +ix(i) 2 x (_)
if 1,9}i_1)= P otherwise
Whenlthe option a Ltde_ is specified, the _:followingalternative definition is used. In this case, weights _e not allowe:l. Lel i e integer flo,_rof (n _ l)p/lO0: i.e., i is largest integer i _ _ a. t)p/lO0. Let h be the remain& h = (n + llp/lO0 - i. The pth percentile is then |
where x j
x[p] = (1 - h)xii ) + hz(i+1)
is taken to _e x(i) and _(n+l) is taken to be x(n). /
xtile)roduces thelcategories
-i
pctile -- Create variable containing percentiles
497
_pctile _pctile is a programmer's command. It computes percentiles [U] 21.8 Accessing results calculated by other programs, You can use _pctile . _pctile . ret
and stores them in r();
see
to compute quantiles just as you can with pctile:
weight,
nq(lO)
list
scalars :
_pctile results.
r(rl)
=
2020
r(r2)
=
2160
r(r3)
=
2520
r (r4)
=
2730
r(r5)
=
3190
r(r6)
=
3310
r(rT)
=
3420
r (r8)
=
3700
r(r9)
=
4060
is, however,
The percentiles wish: _pctile ret
weight,
limited to computing () option (abbreviation p(10,
33.333,
45,
21 quantiles since there are only 20 r()s p()) 50,
55,
to hold the
can be used to compute any odd percentile 66.667,
you
90)
list
scalars : r(rl)
=
2020
r(r2)
=
2640
r(r3)
=
2830
r(r4)
=
3190
r(r5)
=
3250
r(r6)
=
3400
r(r7)
=
4060
_pctile, pctile, and xtile each have an option that uses an alternative definition of percentiles, based on an interpolation scheme; see Methods and Fom_ulas below. _pctile • ret
weight,
p(10,
33.333,
45,
50,
55,
66.667,
903
altdef
list
scalars : r(rl)
=
2005
r(r2)
=
2639. 985
r(r3)
=
2830
r(r4) r(rS)
= =
3190 3252.5
r(r6)
=
3400. 005
r(r7)
=
4060
The default formula inverts the empirical distribution function. We believe that the default formula is more commonly used. although some consider the "alternative" formula to be the standard definition. One drawback of the alternative formula is that it does not have an obvious generalization to noninteger weights.
_ i
Ii• i
rp !
496
jl pctile - create vadablecontaini_ percentiles i 120
1
3
18. 19.
17.
120 125
12o
1 1
1
3 4
:[I.
132
1
4
to,
13o
i2,
1
93
l
94 131 94 (o_qmtornitted)
lo
1_o. i
o
3
4 i
1 1
o
4
136
00
0 TechnicalNote
I_
In th!. iztite' last examplb.catego_y if=webp i:E°nlycase==l ,wanted cut!(pct)t° categorize cases, we could have issued the command
i * _ ! i
Mos_ Stata commahds follow the logic that Using an if exp is equivalent to dropping observations not satisfyi_2 the expressi on and running the command. This is not true of xtile when the cutpoints () option i_Jsed. (_qaer_ the eutpoints () option' is not used, the standard logic is true.) This is because xtile _ill use all no,missing values of the cutpoint () variable regardless of whether these values belon_ io observation that satisfy the if expression,
!
If yoh do not wan: to use all the values i. the cutpoint () variable as cutpoints, simply set the ones that you do not _eed to missing, xtile does not care about the order of the values or whether
I
they are separated by! missing values.
!
I
i i
_ TechnicalNote
,
Note!that quantile_are not always unique. If we categorize our blood pressure data on the basis of quinttles rather tha_ quartiles, we get t _ctile pet = 4 bp, _tile quint bp, nq(5) nq(5) genp(percent) _ist percent
I
bp
quint
pct
!
98
1
104
20
100
1
120
40
lo4
1
_25
80
1
_i
_.
!
_. 5. _.
! I
9. _. 1_. 1t._
i
i
_.
I_o 120 12o t2o
12o 13o la2 125
2 2 2
12o
60
2
2 s s 4
The 40t_ and 60th percentile are the same; t_ey are both 120. When two (or more) percentiles are the samd, they are gixen the lower category nhmber.
i i i i
pctile -- Create variable containing percentiles
495
• xtile category = bp, cut(class) list bp class category 1, 2. 3. 4. 5. 6. 7, 8, 9. 10. 11.
bp 98 100 104 110 120 120 120 120 125 130 132
class 100 110 120 130
category 1 1 2 2 3 3 3 3 4 4 5
The cutpoints can, of course, come from anywhere. They can be the quantiles of another variable or the quantiles of a subgroup of the variable. For example, suppose we had a variable case that indicated whether an observation represented a case (case = 1) or control (case -- 0). . list bp 98 IO0 104 ii0 120 120 120 120 125 130 132 116 93 115
case 1 1 1 1 1 1 1 1 1 1 1 0 0 0
(outputomi_ed) 110. 113
0
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
We can categorize the cases based on the quanfiles of the controls. To do this. we first generate a variable pet containing the percentiles of the consols' blood pressure data pctile pet = bp if case==O, nq(4) list pet in i/4 I. 2. 3. 4.
pet 104 117 124
and then use these percentiles
as outpoints to classify bp for all subjects.
xtile category = bp, cut(pet) gsort -case bp • list bp case category 1. 2. 3. 4. 5.
bp 98 lOO 104 110 120
case 1 1 1 1 1
category 1 1 1 2 3
494
pctile -- Cr,_atevariable containingpercentiles
xtil_ can be used to create a variable quart
i
• tile
quart
•
= bp,
98
i
nq(4)
I
I
i
1oo
i
_. ,
Ii0 t20 104
2 2 1
_.
12o
2
b_ i q"_I
10. :
I
i
11.
i
that indicates the quartiles of bp.
130 125
! !
132 1
4 3 4
The categories created i_are
I
(+_,x[2_l] ' (xi2_,xis_] ' (Xi_oi, X[7_l],(x[75_,+oo) where z_5, Ziso] an ZiTsi are, respectwely, the 25th, 50th (me&an), and 75th percentiles of bp We coul use the pc le command to genera[e these percentiles:
!
I i
-_
1
ictile pet = _p, nq(4) genp(percent) _ist bp quart ipercent pet i bp quart percent
pet
_.
98
I
25
104
_. _.
104 100
1 I
75 50
125 120
4. d._
llo 12o
2 2
_
_.
120
2
i
_.
12o
2
I_.
t20
2
1I!.
130 132
4 4
!
i
xtil(_ can categori_e a variable based on _y set of cutpoints, not just percentiles. Suppose that we wish iocreate the _ollowing categories for blood pressure:
i
(-_.,!_oo],(too, ! t_ot (U0.120] (i2o,_3o].(_3o,+o0)
To do thi_, we simph', ,create a variable contairiing the cu'lpoints
i:
i
class ihput class I!. I00
23i.i. t20 :io 5i. end
and then iuse xtile with the cutpoints()o_tion. |
{
: ]
i
i
pctite -- Create variable containing percentiles
493
Note that summarize, detail calculates standard percentiles. • summarize mpg, detail Mileage (mpg) Percentiles
Smallest
1_ 5_ 10Z 25_
12 14 14 18
12 12 14 14
50_
20
75X 90X 95X 99_
25 29 34 41
Larges¢ 34 35 35 41
0bs Sum of Wgt.
74 74
Mean Std. Dev.
21.2973 5.785503
Variance Skewness Kurtosis
33.47205 .9487176 3.975005
can onlycalculate thesepa_icular percentiles. The commands let you compute any percentile. But s_Immarize,
detail
Weights can be used with petile,
_pctile,
pctile
and
_pctile
and xtile:
. pctile pet = mpg [w=weight], nq(10) genp(percent) (analytic weights assumed) . list percent pet in I/I0 I. 2. 3. 4. 5. 6. 7. 8. 9. 10.
percent i0 20 30 40 50 60 70 80 90
pet 14 16 17 18 19 20 22 24 28
/
The result is the same no matter which weight type you specify--aweighz,
fweight,or pweight.
xtile xtile will create a categorical variable that contains categories corresponding to quantiles. We illustrate this with a simple example. Suppose we have a variable bp containing blood pressure measurements: list
i. 2. 3. 4. 5. 6. 7, 8. 9. 10. 11.
bp 98 I00 104 II0 120 120 120 120 125 130 t32
+
!_'+
I I ,i !
492
pctile -- Ci .=atevariable containin+ percentiles
cutpoi+tts(vamame, requests that xtile ise the values of varname, rather than quantiles, as cut ints for the c legories. Note that all v_lues of vamame are used, regardless of any if or in restri,:tion; see the technical note in the xt_le section below. percentiles(m+mtist) requests percentiles Corresponding to the specified percentages. Percentiles
+ i
are p!aced in r(r]t), r(r2) ..... etc. For example, percentiles(10(20)90) 10th.130th, 50th. 7Dth, and 90th percentilei be computed and placed into r(rl),
}
r(r4_,
i
detail_ on ho,a to _pecify a numIist.
and r!r5)I
Up to 20 (inclusive)p_rcentiles
requests that the r(r2), r(r3),
can be requested. See [u] 14.1.8 numlist for
Remark.. pctile pctil,ecreates a _ew. variable containing percentiles. You specify the number of quantiles that you wan(. and pctil_ computes the corresponding percentiles. Here we use Stata's auto dataset and
!
compute the deciles of mpg: t se auto
+
+
• _ctile pct= _ist pet
i
i
•
pet
14
_, _, _.
20 22 24
_. _.
25 29
!
'_
i
earner to oistinguish be tween the percentiles.
! !
I
'V_
apg, nq(lO)
in I tl0
_illthe
.en.
_ _
. p_tile pet
option
= _pg, t
1_st
percent
!
percent
_ct
.enerate
nq(10) in 1/10 pet
2_ 11. 31
20 10 30
17 14 18
4! si
40 so
19 20
+o .o 80
+
;+
I
:oi°°
.
to
/
22 2+ 25
anot]
genp(percent)
,e_
v_d_ab'e
_vJth
the
co_[espondi_.
_erce_]ta.e..
,, ,.
tie [ petile -- Create variable contlfining percentiles
]
i
Syntax pctile
genp(newvarp) T
xtile newvar
'
altdef = exp
{ nquantiles(#) _pctile
= exp [weight]
[type] newvar
varname
[if
exp]
[in
range]
[, _nquantiles(#)
]
[weight 1 [if
exp]
[in range]
I c_utpoints(varname) [weight]
[if
exp]
[,
} a!tdef
[in range]
t
[,
{ nquantiles(#) I p_ercentiles(numlist) } altdef ] fweights, and pweights are allowed (see [U] 14.1.6 weight) except when the altdef in which case no weights are allowed.
aweights,
option is specified,
Description pctile creates a new variable containing typically just another variable. xtile
the percentiles
creates a new variable that categorizes
of exp. where the expression
exp by its quantiles. If the cutpoints
option is specified, it categorizes exp using the values of vamame as category cutpoints. varname might contain percentiles, generated by pctile, of another variable.
exp is
(varname) For example,
_pct ile is a programmer's command. It computes up to 20 percentiles and places the results tn r(); see [U] 21.8 Accessing results calculated by other programs. Note that summarize, detail will compute some percentiles (1, 5, 10, 25, 50, 75, 90, 95, and 99th); see [R] summarize.
Options nquantiles (#) specifies the number of quantiles. The command computes percentiles corresponding to percentages 100k/m for k = t,2,..., m- 1, where m = #. For example, nquantiles(lO) requests that the 10th. 20th, ..., 90th percentiles be computed. The default is nquantiles(2): i.e., the median is computed. genp(newvarp) specifies a new variable to be generated containing to the percentiles. altdef uses an alternative
formula for calculating percentiles.
the percentages
corresponding
The default method is to invert the
empirical distribution function using averages ((zi + z_+x)/2) where the function is flat (the default is the same method used by summarize; see [R] summarize). The alternative formula uses an interpolation method. See Methods and Formulas at the end of this entry. Weights cannot be used when altdef is specified. 491
1
I"
• i !
_
4_J i I pc°rr _3Technic Note -- Pztrial c°rrelati°n c°efficients Some caution is in order when interpreting the above results. As we said at the omset, the partial corretati )n coefficient is an attempt to estimate the correlation that would be observed if the other variable., were held cc_nstant. The emphasis is on attempt, pcorr makes it too easy to ignore the fact that you are fitting a aodel. In the above example, the model is _price= fl0+ fllmpg+ _2wei_t + _3foreign+ e which is_ in all honestk a rather silly model. Even if we accept the implied economic assumptions of the moddl--that consumers value mpg, weight, and foreign--do we really believe that consumers
i i _. i i !
place equal value on _very extra l,O00 pounds of weight? That is, have we correctly parameterized the mod_l? If we hav_ not, then the estimated :partial correlation coefficients may not represent what the3' clai_ to represen I. Partial correlation coet_icients area reasonable way to summarize data after one is cobvinced that the underlying model is reasonable. One should not, however, pretend that there is no underlying model and that the partial correlation coefficients are unaffected by the assumptions and parai aeterization.
Methodsand Fornlulas pcor_ is implemen :ed as an ado-file. Result t are obtaine, by estimating a linear regression of varnamel on varlist; see [R] regress. The partial correlation coefficient between varnamel and each variable in varlist is then defined as
!
(Theil 19)1, 174). wh_re t is the t statistic, n the number of observations, and k the number of mdependdnt variables i_cludmg the constant but excluding any dropped variables, The significance .
is _iven _y 2, trail
_n - k, abs (t))
References Thei.I.H. 1_7!. Principles_)[Econometrics.New York John Witey& Sons.
i
AlsoSee
i
Related: !
} I
JR] eorrel te, [R] spearman
pcorr -- Partial correlation l coefficients I I
i I
Syntax ;
pcorr varnamel
vartist
[weight]
[if exp] [inrange]
by ... : may be used with pcorr; see [R] by. aweights and fweights are allowed: see [U] 14.1.6 weight.
'
Description pcorr displays the partial correlation coefficient of varnamel the other variables in varlist constant.
with each variable in varlist, holding
z
Remarks Assume that y is determined by xl, x2, ..., xk. The partial correlation between 5' and xl is an attempt to estimate the correlation that would be observed by y and xt if the other x s did not vary.
> Example Using our automobile dataset (described in [U] 9 Stata's on-line'tutorials and sample datasek_), the simple correlations between price, mpg, weight, and foreign can be obtained from correlate (see [R] correlate): • corr price (obs=74)
mpg
weight
foreign
price
mpg
weight
price
i. 0000
mpg
-0. 4686
weight
0.5386
-0.8072
1.0000
foreign
0.0487
0.3934
-0.5928
foreign
I.0000 1.0000
Although correlate gave us the full correlation matrix, our interest is in just the first column. We find, for instance, that the higher the mpg, the lower the price. We obtain the partial correlation coefficients using pcorr: pcorr price (obs=74) Partial
mpg
weight
correlation
Variable mpg
foreign
of price
with
Corr.
Sig.
O. 0352
O. 769
weight
O. 5488
O. 000
foreign
O, 5402
O. 000
We now find that. holding weight and foreign constant, the partial correlation of price with mpg is virtuallv zero. Similarly, in the simple correlations, we found that price and foreign were virtually uncorrelated. In the partial correlations holding mpg and weight constant we find that price and foreign are positively correlated. q
489
!
...............
o!_,_u_,,uet-_le
aataset
)
!
t
I[ i
i
.
1
our:sheet copi_ the data currently loaded in memory into the specified file. About all that can go wrbng is the fil_ you specify already extsts: outsheet
u_ing
[ile tosend._ut
r(602) ;
tosenfl already
exists
)
In thai case, you ca_l erase the file (see [R_ erase), specify outsheet's differe at filenarne, _aen all goes well, out,beet is silent:
replace
option, or use a
i I outsheet
us .ng tosend,
replace
-
tf you are copying tl e data to a program othtr than a spreadsheet, remember to sl_ify option: ) •i outsheet
us_ ng feral,
nenames
"!-
q
i
[
Also See Compl_ mentary:
[R] insheet
Related
[R] outffle
Backgr
[U] 24 Commands to i_put data
[ t
i
the nonces
"
)
)
)
:le I outsheet-
II
Write spreadsheet-style
dataset
l
i
Syntax outsheet [varlist] using filename[if exp] [in range] [, nonames nolabel noquote comma replace ] ?
If filename is specified without a suffix, .out is assumed.
Description outsheet writes data in tab- or comma-separated most spreadsheet programs prefer.
ASCII format into a file. This is the format that
Options nonames specifies that variable names are not to be written in the first line of the file; the file is to contain data values only. nolabel specifies that the numeric values of labeled variables are to be written into the file rather than the label associated with each value. noquote
specifies that string variables are not to be enclosed in double quotes.
comma specifies comma-separated replace
format rather than the default tab-separated
specifies that it is okay to overwrite filename
format.
if it already exists.
Remarks If you wish to move ),our data into another program, you have the following 1. The use of an external data-transfer
program; see [r.j] 24.4 Transfer
alternatives:
programs.
2. Cutting and pasting from Stata's data editor; see Getting Started, chapter 6. 3. Using outsheet. 4. Using outfite;
see [R] outfile.
Concerning alternatives 3 and 4. outsheet is typically preferred for moving the data to a spreadsheet and outfile is probably better for moving data to another statistical program. If your goal is to send data to another Stata user, you could use outsheet or outfile, but it is easiest to send the .dta dataset. This will work even if you use Stata for Windows and your cohort uses Stata for Macintosh. All Statas can read each other's .dta files.
487
2 t
,,
S ,meprogram,, _referdata that are separatedby commas rather than by blanks. Stata wilt produce such a dataset if you specify the commao_tion:
0
. outfite I
u_ ing employee,
1
comma
. _ype empl,tee.raw Carl Marks , 57213,24000, male "Irene Adlez" ,47229,27000, "female" I"Adam Smith", 57323,24000, "reale" "David Walli
,57401,24590,"male"
i
!"MaryRogers',57802,27000, "female"
_.
i"Carolyn Fr_ k", 57805,24000, "remade" "Robert Laws in",57824,22500,"male"
Example t
Fin lly, outfil_can create data dictionaries that infilecan read. Dictionaries ate perhaps the best w_y to organb : your raw data. A dicl!onary describes your data so that yoa do not have to remem_er the order _f the variables, the number of v_ables, the variable names, or anything eisel The fill in which y( _store your data becorrles self-doCumentingso that when you come back to it at somi future date, 'ou can understand it. See JR]infile (fixed format) for a full description of data When you speci . :ioutfile
usi
the dictionary, employee,
dict
Stata writes a dot file: [
. itype employeb.dct
i
i s_rl5 float
i
iname _empno
"Employee "Employee
float salary I
d_ictionary { float
isex
"carl
name" number"
"Arm_aisalary" :sexlbl
"Sex _
572 3
24000 " ,ale"
"Irene Adler" "Adam Smit t"
47229 5732.3
270(_0 240_0
"female" "male"
1
"David Walli ;" I "Mary Rogerl;"
5740_1 57802
2450D 270011
"male" "female"
i
" obert Lawsoii" '_arolyn Fran] :"
57824 57805
225ob 24000
"male" "female"
i
!
q
<
AlsoSee i
I
Complementary:
[ ] infile
Related: !
[ ] outshee_
Backgrou/_d:
[_]24 Commands to input data
I i
1
!
_
outfile -- Write ASCII-format dataset
485
[3 Technical Note The nolabel option prevents Stata from substituting value label strings for the underlying numeric value; see [U] 15.6.3 Value labels. As we just said, the last variable in our data is really a numeric variable: • outfile
using
employ2,
nolabel
• type employ2.raw "Carl Marks"
57213
24000
"Irene
Adler"
47229
27000
0 1
"Adam
Smith"
57323
24000
0
"David
Wallis"
57401
24500
0
"Mary
Rogers"
57802
27000
1
"Carolyn Frank" "Robert Lawson"
57805 57824
24000 22500
1 0
0
[3 Technical Note If you do not want Stata to place double quotes around the contents of string variables, the noquote option: . outfile • type
using
employ3,
employ3.raw Carl Marks
Irene Adam
specify
noquote
57213
24000
male
Adler
47229
27000
female male
Smith
57323
24000
David
Wallis
57401
24500
male
Mary
Rogers
57802
27000
female
Carolyn Frank Robert Lawson
57805 57824
24000 22500
female male
0
I> Example Stata never writes over an ex:{sting file unless explicitly told to do so. For instance, if the file employee, raw already exists and. you attempt to overwrite it by typing outfile using employee, here is what would happen: • outfile
using
file employee.raw r(602) ;
employee already
exists
You can tell Stata that it is okay to overwrite a file by specifying using employee, replace.
(Continued
on next page)
the replace option:
outfile
> Exampl_ Youlhaveentered nto Statasome data on s ven employeesin your firm. The data contain employee r_ame.!mployee identification number, salar_i and sex: •!list !
i
,
i
name
empno
salary
sex
Ii. Carl Mark_
i
57213
24,,000
male
i2. Irene Adl_r i3. Adam Smit_
47229 57323
127,000 24,000
female male
!4. David
57401
24,500
male
i5. Mary Rogers
57802
27,000
female
!76:Carolyn F_ank , Robert La#son
57805 57824
24,000 22,500
female male
Wal_is
i
If yo_ now wish tc use a program other thin Stata with these data, you must somehow get the data over to l_at other prol "am. The standard Stata_format dataset created by save will not dothe job--it is writte_ in a special _ormat that only Stata uhderstands, Most programs, however, understand ASCII datasetsg-standard te datasets that are like ihose produced by a text editor. You can tell Stata to produceisuch a datas_ using outfile. Typi8g outfile using employee createsa dataset called employee,raw that c,)ntains all the data. We Can use the Stata typecommand to review the resulting
i
file:
_utfile
using
employee
i "Carl Marks" _ype employee.raw
57213
24000
i
i "Irene : "Adam
47229 57323
27000 24000
"female" "male"
I
!"David Walli i" "Mary Roger ;"
57401 57802
245a0 270d0
"male" "female"
[tCarolyn
578os
24o o "femaW'
IRobert Lavso_"
57824
22500
i
Adler" Smith"
"male'*
"male"
We se _ , that the fileicontainsi the four variables and that Stata has surrounded the string variables with double quotes. I
1 I
i 3 TechnicalNote
[
!
outfi_e is careful _o columnize the data in :case you want to read it using formatted input. [n the example a_bove,the firs_tstring has a }',-16s display format. Stata wrote two leading blanks and then placed th+ string in a |6-character field, outfile always right-justifies string variables even when
I
the displa__format requests left-justification.
!
The fi!st number h_s a Y,9.0g format. Th_ number is written as two blanks followed by the number. _ght-justified in a 9-character field. The second number has a Y,9.0gc format, outfile ignores tt'_ comma part of the format and also writes this number as two blanks followed bv_the number, right-justified in a 9-character field. :
!
,
The ]aatt entry is really a numeric _ariable,:: but it ha:s an associated value label. Its tbrmat is
} "
Y,-9.0g. 4o Stata wrot_ two blanks and the2 tight-justiSed the value label in a 9-character field Again. ou{fileright-jt_stifies value labels e_en:;when the display formal specifies left-justification.
i
•
I outfile -- Write ASCII-format
dataset
[
I
I
Syntax
outfile[var//s,] using te,,ameexp][inra,,e][,
dictio=y
no!abel noquote replace wide ]
Description outfile writes data to a disk :file in ASCII format, a format that can be read by other programs. The new file is not in Stata format; see [R] save for instructions on saving data for subsequent use in Stata. The data saved by outfile can be read back by infile; see [R] infile. Iffilename is specified without an extension. '.raw' is assumed unless the dictionary option is specified, in which case '.dct' is assumed.
Options comma causes Stata to write the file in comma-separated-value format. In this format, values are separated by commas rather than blanks. Missing values are written as two consecutive commas. dictionary writes the file in Stata's data dictionary format. See [R] infile (fixed format) description of dictionaries. Neither comma or wide may be specified with dictionary. nolabel causes Stata to write the; numeric values of labeled variables. labels enclosed in double quotes. noquote
for a
The default is to write the
prevents Stata from placing double quotes around the contents of string variables.
replace permits outfile to overwrite an existing dataset, replace mav not be abbreviated. wide causes Stata to write the data. with one observation into lines of 80 characters or fewer.
per line. The default is to split observations
Remarks outfile enables data to be sent to a disk file for processin_ by a non-Stata program. Each observation is written as one or more records records that will not exceed 80 characters unless you specify the wide option. The values of the variables are written using their current display formats, and unless the comma option is specified, each is prefixed with two blanks. If you specify the dictionary option, the data are written in the same way, but in front of the data outfile writes a data dictionary describing the contents of the file.
483
i
orth rely uses th( Christoffel-Darboux Both _rtlaog and ,rthpoly
recurrence formula (Abramowitz and Stegun 1968).
normalize thd orthogonal variables such that
Q_Q=MX !
where It _ -- diag(w_,w2,...,wN)
with wl:,w2,...,WN
the weights (all 1 if weights are not
I
specifiedi), and M is t_e sum of the weights (the number of observations if weights are not specified).
i i
Referenqes
I
Abramowiiz.M. and I. 4' Stegun, ed. 1968.Handbook of Mathemat/ca/Functions,7th printing.Washington.DC: Nation_dBureauof Standards.
!
Golub,G. !H.and C. F. Va_Loan. 1996,Matr/x CompUtations,3d ed. Baltimore:JohnsHopkinsUniversityPress,pp.
218-2_9. I
Sribney, _,_(Reprints,voI. M, 1995.sg37:5,Orthogonalpolynomials.S}aa TechnicalBulletin25: 17-18. Reprintedin Stata Technical Bultetii pp. 96-98.
i I
!_o }
i AlsO,See
Related: I
R] regress
Backgrot_nd:
_] 23 Estimation and I_gst-estimation, commands
:
Some of the correlations problems removed,
orthog -- O_hogonal variables and o_hogonal polynomials 481 among the powers of weight are very large, but this does not create any
for regress. Howevel; we may wish to look at the quadratic trend with the constant the cubic _end with the quadratic and constant removed, etc. orthpoly will generate
polynomial terms with this property: . orthpoly weight, generate(pw*)
dog(4) poly(P)
. regress mpg pwl-pw4 Source Model Residual Total
SS
df
MS
1652.73666
4
413.184164
790.722803
69
11.4597508
2443.45946
73
33.4720474
mpg
Coef.
pwl pw2 pw3 pw4 _cons
-4.638252 ,8263545 -.3068616 -.209457 21.2973
Std. Err. .3935245 .3935245 .3935245 .3935245 .3935245
t -11.79 2.10 -0.78 -0.53 54.12
Number of obs = F( 4, 69) = Prob > F =
74 36.06 0.0000
R-squared Adj R-squared Root MSE
0.6764 0.6576 3.3852
P>ItJ
= = =
[95_ Conf. Interval]
0.000 0.039 0.438 0.596 0.000
-5.423312 .0412947 -1.091921 -.9945168 20.51224
-3.853192 1.611414 .4781982 .5756028 22.08236
Compare the p-values of the terms in the natural-polynomial regression with those in the orthogonalpolynomial regression. With orthogonal polynomials, it is easy to see that the pure cubic and quartic trends are nonsignificant and that the constant, linear, and quadratic terms each have p < 0.05. The matrix P obtained with the poly () option can be used to transform coefficients for orthogonal polynomials to coefficients for natural polynomials: orthpoly weight, poly(P) deg(4) . matrix b = e(b)*P matrix list b b[1,5] yl
degl .02893016
deg2 -.00002291
deg3 5.745e-09
deg4 -4.862e-13
_cons 23.944212
<1
Methodsand Formulas orthog orthog's
and orthpoly are implemented orthogonalization
as ado-files.
can be written in matrix notation as
x=QR where X is the N × (d + l) matrix representation of varlist plus a column N x (d + I) matrix representation of newvarlist plus a column of ones (d in vaHist and N = number of observations). The (d + 1) × (d + I ) matrix triangular matrix: i.e.. R would be upper triangular if the constant were first, so the first row/column has been permuted with the last row/column.
of ones. and Q is the = number of variables R is a permuted upper but the constant is last,
Q and R are obtained using a modified Gram-Schmidt procedure; see Golub and Van Loan (1996) for details. Note that the traditional Gram-Schmidt procedure is notoriously unsound, but the modified procedure is quite good. orthog performs two passes of this procedure.
'
[ { t
480
orthog 0 _hogonalvariablesandiorthogonal polynomials ! _ompare '
rtrulk
trunk difference
I
, ,,i
i
r t_/_k>t runk
74
jo!ntly
74
count
defined
to_al
minimum:
average
maximum
8.88e-15
I.92e-14
3.55e-14
8,88e-15
1.92e-14
3.55e-14
74
I In [hii example,
the:recovered
variable rtrank
is almost exactly the same
as the original _runk.
Vehenodthoeonalizingman',, variables, this procedure can be performed as a check of the numerical soundnessof the orth_gonalization, Because of the ordering of the orthogonalization procedure, the last variable and the ariables near the end of the varlist are the most important ones to check.
-
The o_thpoly cot mand effectively does for polynomial terms what the orthog command does for an arbitrary set of variables.
°
i
I,
> Examplei i
A_aini considerthe auto.dta dataset.Supposewe wish to fit themode] mpg
=
+ _I weight
+/_2
we_g ht2 + _3 weigh
t3 + ;_4 weight4 + e
We will first compute he regression with natuial polynomials: !
!
i a
double
w2 = wl*wl
, g_n double
w3 = w2*wl
. g,_n double
w4 = w3*wl
. c_rrelate
i!
wl-_4
I
w2
wl
w3
wl
1 .(300
i
w2
0.(
i
w3
O.¢.565
O. 9916
I.0000
i
w4
O. t 279
O. 9679
O. 9922
!
915
1.0000 1. 0000
. r,_gress mpg wl-v4 88
I
w4
.i
df
_
Model !Residual
MS
Number
,,
_( 4,
652.73666 '90.722803
4 69
413._84164 11.4fl_97508
!443.45946
73
33.4_20474
!
i }
Adj Total
mpg
Coef.
Std.
Err.
_i
.0289302
_2 w3
.-. 0000229 5.74e-09
.0000566 _.19e-08
w4
' 4.86e-13
_cons
23.94421 I
Root
t
,1161939
69) =
Prob > F R-squared
. _ i
0.25
P>It I
74
of obs =
R-squared MSE
[95Y, Conf.
30.06
= =
0.0000 0.6764
=
0.6576
=
3,3852
Interval]
O. 804
-. 2028704
.2607307
i -0.40 0.48
O. 687 0.631
-. 0001359 -1.80e-08
.0000901 2,95e_08
9.14e-13
-0.53
0.596
-2.31e-12
1.34e_12
86.60667
'
0.783
-148.83t4
196.7198
i!
0,28
W
odhog -- Odhogonal variables and odhogonal polynomials
. regress
price
length
Source
weight
SS
Model Residual Total
price
weight headroom
trunk
df
MS
Number F( 4,
of obs 69)
74 10.20
4
59004145.0
Prob
=
0.0000
399048816
69
5783316.17
R-squared Adj R-squared
= =
0.3716 0.3352
635065396
73
8699525.97
Root
=
2404.9
Std.
Err.
t
> F
= =
236016580
Coef.
length
headroom
MSE
P>ltl
[95_
Conf.
-185.747
479
Interval]
-I01.7092
42.12534
-2.41
0.018
4.753066 -711.5679
1.120054 445.0204
4.24 -1.60
0.000 0.114
2.518619 -1599.359
-17.67148 6.987512 176.2236
trunk
114.0859
109.9488
1.04
0.303
-105.2559
333.4277
_cons
11488.47
4543.902
2.53
0.014
2423.638
20553.31
However, we may believe a priori that length is the most impo_ant predicton followed by weight, followed by headroom, followed by trunk. Hence, we would like to remove the "effect" of length from all the other predictors; remove weight from headroom and trunk; and remove headroom from trunk. We can do this by running orthog, and then we estimate the model again using the orthogonal variables: • orthog
length
• regress
price
Source Model
weight
headroom
olength
I
trunk,
oweight
SS
i
gen(olength
oheadroom df
oweight
oheadroom
otrunk)
matrix(R)
otrunk
MS
Number F( 4,
of obs 69)
74 10.20
236016580
4
59004145.0
Prob
=
0.0000
399048816
69
5783316.17
R-squared Adj R-squared
= =
0.3716 0.3352
635065396
73
8699525.97
Root
=
2404.9
price
Coef.
Std.
olength
1265.049
279.5584
4.53
oweight oheadroom
1175.765 -349.9916
279.5584 279.5584
4.21 -1.25
0.000 0.215
1.04 22.05
Residual Total
Err.
otrunk
290.0776
279.5584
_cons
6165.257
279.5584
Using the matrix R, we can transform the metric of original predictors: • matrix matrix
t
> F
= =
MSE
P>ltt
[95_
Conf.
Interval]
0.000
707.3454
1822.753
618.0617 -907.6955
1733.469 207.7122
0.303
-267.6262
847.7815
0.000
5607.553
the results obtained using the o_hogonal
6722.961
predictors back to
b = e(b)*inv(K)" list
b
b[1,5] length yl
-101.70924
weight 4.7530659
headroom -711.56789
trunk 114.08591
_cons 11488.475
Technical Note The matrix R obtained using the matrix() option with orthog can also be used to recover .¥ (the original vartist) from Q (the orthogonalized newvarlist) one variable at a time. Continuing with the previous example, we illustrate how to recover the trunk variable: .
matrix
• matrix
C = R[l...,"trunk"]" score
double
rtrunk
= C
!
478 orthog -- )rthogonal variables and orthogonalpolynomials Notei that the coef_cients corresponding tcr the constant term are placed in the last column of the matrix. The last r_bwof the matrix is all tero except for the last column, which corresponds to the _onstant term. 1
Remarks
Orthogonal variables are useful for two reasons. The first is numerical accuracy for highly collinear variables. Stata's regress and other estimation commands can face a large amount of collinearity and still produce accurate results. But, at some point, these commands will drop variables due to collinearity. If you know with certainty that the variables are not perfectly collinear, you may want to retain all of their effects in the model. By using orthog or orthpoly to produce a set of orthogonal variables, all variables will be present in the estimation results.

Users are more likely to find orthogonal variables useful for the second reason: ease of interpreting results. orthog and orthpoly create a set of variables such that the "effects" of all the preceding variables have been removed from each variable. For example, if one issues the command

. orthog x1 x2 x3, generate(q1 q2 q3)

the effect of the constant is removed from x1 to produce q1; the constant and x1 are removed from x2 to produce q2; and finally the constant, x1, and x2 are removed from x3 to produce q3. Hence,

    q_1 = r_{01} + r_{11} x_1
    q_2 = r_{02} + r_{12} x_1 + r_{22} x_2
    q_3 = r_{03} + r_{13} x_1 + r_{23} x_2 + r_{33} x_3

This can be generalized and written in matrix notation as

    X = Q R

where X is the N x (d+1) matrix representation of varlist plus a column of ones, and Q is the N x (d+1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations). The (d+1) x (d+1) matrix R is a permuted upper triangular matrix; i.e., R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Since Stata's estimation commands list the constant term last, this allows R, obtained via the matrix() option, to be used to transform estimation results.
> Example
Consider Stata's auto.dta dataset. Suppose we postulate a model in which price depends on the car's length, weight, headroom, and trunk size (trunk). These predictors are collinear, but not extremely so; the correlations are not that close to 1:

. correlate length weight headroom trunk
(obs=74)

             |   length   weight headroom    trunk
    ---------+-------------------------------------
      length |   1.0000
      weight |   0.9460   1.0000
    headroom |   0.5163   0.7266   1.0000
       trunk |   0.4835   0.6722   0.6620   1.0000

regress certainly has no trouble estimating this model:
"itle I °rth°g
-- Orth°g°nal ,
variables and °rth°g°nal
p°ly , n°mials
]
Syntax orthog
[varlis,]
tweightl
[matrix(matname)
orthpoly
varname
[if
expl
[in
range],
g_enerate(newvarlist)
]
[weight]
{ generate(newvarlist)
Iif
exp]
[in range],
[p_oly(matname)
} [ degree(#)
]
orthpoly requires that either generate() or poly(), or both. be specified, iweights, fweights, pweights, and aweights are allowed, see [U] 14.1.6 weight.
Description
orthog orthogonalizes a set of variables, creating a new set of orthogonal variables (all of type double), using a modified Gram-Schmidt procedure (Golub and Van Loan 1996). Note that the order of the variables determines the orthogonalization; hence, the "most important" variables should be listed first.

orthpoly computes orthogonal polynomials for a single variable.

Options
generate(newvarlist) is not optional; it creates new orthogonal variables of type double. For orthog, newvarlist will contain the orthogonalized varlist. If varlist contains d variables, then so will newvarlist. For orthpoly, newvarlist will contain orthogonal polynomials of degree 1, 2, ..., d evaluated at varname, where d is as specified by degree(d). newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar1-newvard or newvar*. For these two styles of abbreviation, new variables newvar1, newvar2, ..., newvard are generated.

matrix(matname) (orthog only) creates a (d+1) x (d+1) matrix containing the matrix R defined by X = QR, where X is the N x (d+1) matrix representation of varlist plus a column of ones, and Q is the N x (d+1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations).

degree(#) (orthpoly only) specifies the highest degree polynomial to include. Orthogonal polynomials of degree 1, 2, ..., d = # are computed. The default is d = 1.

poly(matname) (orthpoly only) creates a (d+1) x (d+1) matrix called matname containing the coefficients of the orthogonal polynomials. The orthogonal polynomial of degree i <= d is

    matname[i, d+1] + matname[i, 1]*varname + matname[i, 2]*varname^2 + ... + matname[i, i]*varname^i
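As a sketch of orthpoly in use (an illustration, not this entry's own example; it assumes the automobile dataset is in memory), one might generate a degree-4 orthogonal polynomial in weight and use it in a regression:

. orthpoly weight, generate(w1-w4) degree(4) poly(P)
. regress mpg w1-w4

Because w1-w4 are orthogonal, dropping w4 and refitting leaves the coefficients on w1-w3 unchanged, which is one of the conveniences discussed in the Remarks.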
order -- Reorder variables in dataset

Technical Note
If your data contain variables named year1, year2, ..., year19, year20, aorder will order them correctly even though, to most computer programs, year10 is alphabetically between year1 and year2.

Methods and Formulas
aorder is implemented as an ado-file.

References
Gleason, J. R. 1997. dm51: Defining and recording variable orderings. Stata Technical Bulletin 40: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 49-52.
Weesie, J. 1999. dm74: Changing the order of variables in a dataset. Stata Technical Bulletin 52: 8-9. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 61-62.

Also See
Complementary:  [R] describe
Related:        [R] edit, [R] rename
order -- Reorder variables in dataset

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978

Sorted by:
     Note:  dataset has changed since last saved
If we now wanted length to be the last variable in our dataset, we could type order make mpg price weight rep78 length, but it would be easier to use move:

. move length rep78
. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
rep78           int    %8.0g                  Repair Record 1978
length          int    %8.0g                  Length (in.)

Sorted by:
     Note:  dataset has changed since last saved
We now change our mind and decide that we would prefer that the variables be alphabetized:

. aorder
. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.4% of memory free)

              storage  display     value
variable name   type   format      label      variable label
length          int    %8.0g                  Length (in.)
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
rep78           int    %8.0g                  Repair Record 1978
weight          int    %8.0gc                 Weight (lbs.)

Sorted by:
     Note:  dataset has changed since last saved
Title
order -- Reorder variables in dataset

Syntax
order varlist

move varname1 varname2

aorder [varlist]

Description
order changes the order of the variables in the current dataset. The variables specified in varlist are moved, in order, to the front of the dataset.

move also reorders variables. move relocates varname1 to the position of varname2 and shifts the remaining variables, including varname2, to make room.

aorder alphabetizes the variables specified in varlist and moves them to the front of the dataset. If no varlist is specified, _all is assumed.
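As a concrete illustration of the three commands (hypothetical variables a, b, c, d stored in that order; this is a sketch, not an example from this entry):

. order c        /* new order: c a b d */
. move d a       /* d takes a's slot:  c d a b */
. aorder         /* alphabetical:      a b c d */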
Remarks
> Example
When using order, you must specify a varlist, but it is not necessary to specify all the variables in the dataset. For example,

. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)

              storage  display     value
variable name   type   format      label      variable label
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
mpg             int    %8.0g                  Mileage (mpg)
make            str18  %-18s                  Make and Model
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978

Sorted by:
     Note:  dataset has changed since last saved

. order make mpg
. describe
oprobit -- Maximum-likelihood ordered probit estimation
Saved Results
oprobit saves in e():

Scalars
  e(N)          number of observations
  e(k_cat)      number of categories
  e(df_m)       model degrees of freedom
  e(r2_p)       pseudo R-squared
  e(ll)         log likelihood
  e(ll_0)       log likelihood, constant-only model
  e(chi2)       chi-squared
  e(N_clust)    number of clusters

Macros
  e(cmd)        oprobit
  e(depvar)     name of dependent variable
  e(wtype)      weight type
  e(wexp)       weight expression
  e(clustvar)   name of cluster variable
  e(vcetype)    covariance estimation method
  e(chi2type)   Wald or LR; type of model chi-squared test
  e(offset)     offset
  e(predict)    program used to implement predict

Matrices
  e(b)          coefficient vector
  e(cat)        category values
  e(V)          variance-covariance matrix of the estimators

Functions
  e(sample)     marks estimation sample
Methods and Formulas
Please see the Methods and Formulas section of [R] ologit.
References
Aitchison, J. and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131-140.
Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman-Barton GOF. Stata Technical Bulletin 33: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145-147.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18-21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149-153.
Wolfe, R. and W. W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199-204.
Also See
Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] logistic, [R] mlogit, [R] ologit, [R] probit, [R] svy estimators
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize, [R] sw
oprobit -- Maximum-likelihood ordered probit estimation

Hypothesis tests and predictions
See [U] 23 Estimation and post-estimation commands for instructions on obtaining the variance-covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
> Example
In the above example, we estimated the model oprobit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is "poor" and the highest "excellent". We have five categories and so must type five names following predict; the choice of names is up to us:

. predict poor fair avg good exc
(option p assumed; predicted probabilities)

. list make model exc good if rep77==.

             make      model         exc        good
 13.          AMC     Spirit    .0006044    .0351813
             Ford     Fiesta    .0002927    .0222789
 44.        Buick       Opel    .0043803    .1133763
            Merc.    Monarch    .0093209    .1700846
          Peugeot        604    .0734199    .4202766
            Plym.    Horizon     .001413    .0590294
            Plym.    Sapporo    .0197543    .2466034
            Pont.    Phoenix    .0234156     .266771
For ordered probit, predict, xb produces S_j = x_{1j} b_1 + x_{2j} b_2 + ... + x_{kj} b_k. Ordered probit is identical to ordered logit, except that one uses a different distribution function for calculating probabilities. The ordered-probit predictions are then the probability that S_j + u_j lies between a pair of cut points \kappa_{i-1} and \kappa_i. The formulas in the case of ordered probit are

    \Pr(S_j + u < \kappa) = \Phi(\kappa - S_j)
    \Pr(S_j + u > \kappa) = 1 - \Phi(\kappa - S_j) = \Phi(S_j - \kappa)

Rather than using predict directly, we could calculate the predicted probabilities by hand:

. predict pscore, xb
. gen probexc = norm(pscore-_b[_cut4])
. gen probgood = norm(_b[_cut4]-pscore) - norm(_b[_cut3]-pscore)
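The probabilities for the remaining categories follow the same pattern; a sketch (these gen statements mirror the two above and are not part of the original text):

. gen probpoor = norm(_b[_cut1]-pscore)
. gen probfair = norm(_b[_cut2]-pscore) - norm(_b[_cut1]-pscore)
. gen probavg  = norm(_b[_cut3]-pscore) - norm(_b[_cut2]-pscore)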
oprobit -- Maximum-likelihood ordered probit estimation

Remarks
An ordered probit model is used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, "poor", "good", and "excellent", which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (e.g., residency in the north, east, south, and west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered.

In ordered probit, an underlying score is estimated as a linear function of the independent variables and a set of cut points. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cut points estimated for the outcome:

    \Pr(\text{outcome}_j = i) = \Pr(\kappa_{i-1} < \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + u_j \le \kappa_i)

u_j is assumed to be normally distributed. In either case, one estimates the coefficients \beta_1, \beta_2, ..., \beta_k together with the cut points \kappa_1, \kappa_2, ..., \kappa_{I-1}, where I is the number of possible outcomes. \kappa_0 is taken as minus infinity and \kappa_I is taken as plus infinity. All of this is a direct generalization of the ordinary two-outcome probit model.
> Example
In [R] ologit, we use a variation of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) to analyze the 1977 repair records of 66 foreign and domestic cars. We use ordered logit to explore the relationship of rep77 in terms of foreign (origin of manufacture), length (a proxy for size), and mpg. Here we estimate the same model using ordered probit rather than ordered logit:

. oprobit rep77 foreign length mpg
Iteration 0:  log likelihood = -89.895098
Iteration 1:  log likelihood = -78.141221
Iteration 2:  log likelihood = -78.020314
Iteration 3:  log likelihood = -78.020025

Ordered probit estimates                        Number of obs =         66
                                                LR chi2(3)    =      23.75
                                                Prob > chi2   =     0.0000
Log likelihood = -78.020025                     Pseudo R2     =     0.1321

------------------------------------------------------------------------------
       rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   1.704861   .4246786     4.01   0.000      .8725057    2.537215
      length |   .0468675    .012648     3.71   0.000       .022078    .0716571
         mpg |   .1304559   .0378627     3.45   0.001      .0562464    .2046654
-------------+----------------------------------------------------------------
       _cut1 |    10.1589   3.076749          (Ancillary parameters)
       _cut2 |   11.21003   3.107522
       _cut3 |   12.54561   3.155228
       _cut4 |   13.98059   3.218786
------------------------------------------------------------------------------

We find that foreign cars have better repair records, as do larger cars and cars with better mileage ratings.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyoprobit command in [R] svy estimators for a command designed especially for survey data.
cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates k new variables, where k is the number of observed outcomes. The first variable contains dlnL_j/d(x_j b); the second variable contains dlnL_j/d(_cut1_j); the third contains dlnL_j/d(_cut2_j); and so on. Note that if you were to specify the option score(sc*), Stata would create the appropriate number of new variables, and they would be named sc0, sc1, ....

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) specifies that varname is to be included in the model with coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.

Options for predict
p, the default, calculates the predicted probabilities. If you do not also specify the outcome() option, you must specify k new variables, where k is the number of categories of the dependent variable. Say you estimated a model by typing oprobit result x1 x2, and result takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. If you specify the outcome() option, then you specify one new variable. Say that result takes on values 1, 2, and 3. Then typing predict p1, outcome(1) would produce the same p1.

xb calculates the linear prediction. You specify one new variable; for example, predict linear, xb. The linear prediction is defined ignoring the contribution of the estimated cut points.

stdp calculates the standard error of the linear prediction. You specify one new variable; for example, predict se, stdp.

outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either a single value of the dependent variable, or one of #1, #2, ..., with #1 meaning the first category of the dependent variable, #2 the second category, etc.

nooffset is relevant only if you specified offset(varname) for oprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
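Putting the predict options together, a hedged sketch following the rep77 model used in the Remarks (the new variable names are arbitrary; output omitted):

. oprobit rep77 foreign length mpg
. predict p1 p2 p3 p4 p5         /* all five predicted probabilities */
. predict pgood, outcome(#4)     /* probability of the fourth category only */
. predict lp, xb                 /* linear prediction, ignoring the cut points */
. predict selp, stdp             /* standard error of the linear prediction */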
oprobit -- Maximum-likelihood ordered probit estimation

Syntax
oprobit depvar [varlist] [weight] [if exp] [in range] [, table robust cluster(varname) score(newvarlist) level(#) offset(varname) maximize_options ]

by ...: may be used with oprobit; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
oprobit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
oprobit may be used with sw to perform stepwise estimation; see [R] sw.

Syntax for predict
predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp } outcome(outcome) nooffset ]

Note that with the p option, you specify either one or k new variables depending upon whether the outcome() option is also specified (where k is the number of categories of depvar). With xb and stdp, one new variable is specified.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description
oprobit estimates ordered probit models of ordinal variable depvar on the independent variables varlist. The actual values taken on by the dependent variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata and up to 20 are allowed in Small Stata. See [R] logistic for a list of related estimation commands.
Options
table requests a table showing how the probabilities for the categories are computed from the fitted equation.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
oneway -- One-way analysis of variance
The Scheffé test (Scheffé 1953, 1959; also see Winer, Brown, and Michels 1991, 191-195) differs in derivation, but it attacks the same problem. Let there be k means for which we want to make all the pairwise tests. Two means are declared significantly different if

    t \ge \sqrt{(k-1)\, F(\alpha;\, k-1,\, \nu)}

where F(\alpha; k-1, \nu) is the \alpha-critical value of the F distribution with k-1 numerator and \nu denominator degrees of freedom. Scheffé's test has the nicety that it never declares a contrast significant if the overall F test is nonsignificant.

Turning the test around, Stata calculates a significance level

    \hat e = F\!\left(\frac{t^2}{k-1};\ k-1,\ \nu\right)

For instance, you have a calculated t statistic of 4.0 with 50 degrees of freedom. The simple t test says the significance level is .00021. The F test equivalent, 16 with 1 and 50 degrees of freedom, says the same. If you are doing three comparisons, however, you calculate an F test of 8.0 with 2 and 50 degrees of freedom, which says the significance level is .0010.
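These numbers can be verified directly with Stata's fprob() function, which returns the upper-tail probability of the F distribution; a sketch (not part of the original text):

. display fprob(1, 50, 16)    /* the single test: about .0002 */
. display fprob(2, 50, 8)     /* the three-comparison case: about .0010 */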
References
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.
Bartlett, M. S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society, Series A 160: 268-282.
Hochberg, Y. and A. C. Tamhane. 1987. Multiple Comparison Procedures. New York: John Wiley & Sons.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.
Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2d ed. New York: Springer-Verlag.
Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40: 87-104.
------. 1959. The Analysis of Variance. New York: John Wiley & Sons.
Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626-633.
Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.

Also See
Complementary:  [R] encode
Related:        [R] anova, [R] loneway, [R] table
Background:     [U] 21.8 Accessing results calculated by other programs
ultiple-comparison tests
oneway n one-way analysis of variance
Let's begin by reviewing the logic behind these adjustments. comparison of two means is
The "standard"
467
_ statistic for the
t=
/±
1
s_/ n_ + nj where s is the overall standard deviation, ffi is the measured average of ,Y in group i, and ni is the number of observations in the group. We perform hypothesis tests by calculating this t statistic. We simultaneously choose a critical level a and took up the t statistic corresponding to that level in a table. We reject the hypothesis if our calculated t exceeds the value we looked up. Alternatively, since we have a computer at our disposal, we calculate the significance-level e corresponding to our calculated t statistic and, if e < c_, we reject the hypothesis. This logic works well when we are performing a single test. Now consider what happens when we perform a number of separate tests, say n of them. Let's assume, just for discussion, that we set oLequal to 0.05 and that we will perform 6 tests. For each test we have a 0.05 probability of falsely rejecting the eq uality-of-means hypothesis. Overall, then, our chances of falsely rejecting at 1east one of the hypotheses is 1 - (1 - .05) 6 _ .26 if the tests are independent. The idea behind multiple-comparison tests is to control for the fact that we will perform multiple tests and to reduce our overall chances of falsely rejecting each hypothesis to c_ rather than letting it increase with each additional test. (See Miller 1981 and Hochberg and Tamhane 1987 for rather advanced texts on multiple-comparison procedures.) The Bonferroni adjustment (see Miller I981; also see Winer, Brown, and Michels 1991, 158-166) does this by (falsely but approximately) asserting that the critical level we should use. a, is the true critical level a divided by the number of tests n, that is, a = a'/n. For instance, if we are going to perform 6 tests, each at the .05 significance lev el, we want to adopt a critical level of .05/6 _ .00833. We can just as easily apply this logic to e, the significance level to our critical level a. If a comparison has a calculated significance adjusted for the fact of n comparisons, is n- e. If a comparison has and we perform 6 tests, then its "real" significance is .072. If we cannot reject the hypothesis. If we adopt a critical level of .10, we
associated with our t statistic, as of e, then its "real" significance, a significance level of, say, .012, adopt a crilical level of .05, we can reject it.
Of course, this calculation can go above 1, but that just means that there is no a < 1 for which we could reject the hypothesis. (This situation arises due to the crude nature of the Bonferroni adjustment.) Stata handles this case by simply calling the significance level t. Thus. the formula for the Bonferroni significance level is eb = min(1, en ) where n - k(k - 1)/2 is the number of comparisons. The Sidg_kadjustment {Si&ik 1967; also see Winer, Brown, and Michels 1991. 165-166) different and provides a tighter bound. It starts with the assertion that a=l-(1-a) Turning this formula around and substituting
1/n
calculated significance
e_=min{1,1-(1-e)
is slightly
levels, we obtain
n}
For example, if the calculated significance is 0.012 and we perform 6 tests, the "'real" significance is approximately 0.07.
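Both adjustments are easy to reproduce by hand for the example just given; a sketch (not part of the original text):

. display min(1, 6*.012)      /* Bonferroni: .072 */
. display 1 - (1-.012)^6      /* Sidak: approximately .0699 */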
Methods and Formulas
The model of one-way analysis of variance is

    y_{ij} = \mu + \alpha_i + \epsilon_{ij}

for levels i = 1, ..., k and observations j = 1, ..., n_i. Define \bar y_i as the (weighted) mean of y_{ij} over j and \bar y as the overall (weighted) mean of y_{ij}. Define w_{ij} as the weight associated with y_{ij}, which is 1 if the data are unweighted. w_{ij} is normalized to sum to n = \sum_i n_i if aweights are used and is otherwise unnormalized. w_i refers to \sum_j w_{ij} and w refers to \sum_i w_i.

The between-group sum of squares is then

    S_1 = \sum_i w_i (\bar y_i - \bar y)^2

The total sum of squares is

    S = \sum_i \sum_j w_{ij} (y_{ij} - \bar y)^2

The within-group sum of squares is given by S_e = S - S_1.

The between-group mean square is s_1^2 = S_1/(k-1) and the within-group mean square is s_e^2 = S_e/(w-k). The test statistic is F = s_1^2/s_e^2. See, for instance, Snedecor and Cochran (1989).
Bartlett's test
Bartlett's test assumes that you have m independent, normal random samples and tests the hypothesis \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2. The test statistic, M, is defined as

    M = \frac{(T-m)\ln\hat\sigma^2 - \sum_i (T_i-1)\ln\hat\sigma_i^2}
             {1 + \frac{1}{3(m-1)}\left(\sum_i \frac{1}{T_i-1} - \frac{1}{T-m}\right)}

where there are T overall observations, T_i observations in the ith group, and

    \hat\sigma_i^2 = \frac{1}{T_i-1} \sum_{j=1}^{T_i} (y_{ij} - \bar y_i)^2
    \hat\sigma^2   = \frac{1}{T-m} \sum_{i=1}^{m} (T_i-1)\hat\sigma_i^2

An approximate test of the homogeneity of variance is based on the statistic M with critical values obtained from the chi-squared distribution of m-1 degrees of freedom. See Bartlett (1937) or Judge et al. (1985, 447-449).
Weighted data
> Example
oneway can work with both weighted and unweighted data. Let's assume that you wish to perform a one-way layout of the death rate on the four Census regions of the United States using state data. Your data contain three variables, drate (the death rate), region (the region), and pop (the population of the state).

To estimate the model, you type oneway drate region [weight=pop], although one typically abbreviates weight as w. We will also add the tabulate option to demonstrate how the table of summary statistics differs for weighted data:

. oneway drate region [w=pop], tabulate
(analytic weights assumed)

   Census |            Summary of Death Rate
   region |     Mean   Std. Dev.        Freq.       Obs.
----------+----------------------------------------------
       NE |    97.15        5.82     49135283          9
  N Cntrl |    88.10        5.58     58865670         12
    South |    87.05       10.40     74734029         16
     West |    75.65        8.23     43172490         13
----------+----------------------------------------------
    Total |    87.34       10.43    2.259e+08         50

                     Analysis of Variance
    Source            SS       df      MS          F     Prob > F
------------------------------------------------------------------
Between groups    2360.92281     3   786.974272   12.17    0.0000
 Within groups    2974.09635    46    64.6542685
------------------------------------------------------------------
    Total         5335.01916    49   108.877942

Bartlett's test for equal variances:  chi2(3) = 5.4971  Prob>chi2 = 0.139

When the data are weighted, the summary table has four rather than three columns. The column labeled "Freq." reports the sum of the weights. The overall frequency is 2.259e+08, meaning that there are approximately 226 million people in the U.S. The ANOVA table is appropriately weighted. Also see [U] 14.1.6 weight.
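The overall Freq. can be checked by summing the four group frequencies; a sketch (arithmetic only, not part of the original output):

. display 49135283 + 58865670 + 74734029 + 43172490    /* 225907472, i.e., 2.259e+08 */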
Saved Results
oneway saves in r():

Scalars
  r(N)           number of observations
  r(F)           F statistic
  r(df_r)        within-group degrees of freedom
  r(mss)         between-group sum of squares
  r(df_m)        between-group degrees of freedom
  r(rss)         within-group sum of squares
  r(chi2bart)    Bartlett's chi-squared
  r(df_bart)     Bartlett's degrees of freedom
Underneath that number is reported "0.001". This is the Bonferroni-adjusted significance of the difference. The difference is significant at the 0.1% level. Looking down the column, we see that concentration 3 is also worse than concentration 1 (4.2% level), as is concentration 4 (3.6% level). Based on this evidence, we would use concentration 1 if we grew apple trees.
> Example
We can just as easily obtain the Scheffé-adjusted significance levels. Rather than specifying the bonferroni option, we specify the scheffe option. We will also add the noanova option to prevent Stata from redisplaying the ANOVA table:

. oneway weight treatment, noanova scheffe

       Comparison of Average weight in grams by Fertilizer
                           (Scheffe)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
         |
       3 |     -33.25    25.9167
         |      0.039      0.101
         |
       4 |      -34.4    24.7667      -1.15
         |      0.034      0.118      0.999

The differences are the same as we obtained in the Bonferroni output, but the significance levels are not. According to the Bonferroni-adjusted numbers, the significance of the difference between fertilizer concentrations 1 and 3 is 4.2%. The Scheffé-adjusted significance level is 3.9%.
We will leave it to you to decide which results are more accurate.
> Example
Let's conclude this example by obtaining the Šidák-adjusted multiple-comparison tests. We do this to illustrate Stata's capabilities to calculate these results. It is understood that searching across adjustment methods until you find the results you want is not a valid technique for obtaining significance levels.

. oneway weight treatment, noanova sidak

       Comparison of Average weight in grams by Fertilizer
                            (Sidak)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
         |
       3 |     -33.25    25.9167
         |      0.041      0.116
         |
       4 |      -34.4    24.7667      -1.15
         |      0.035      0.137      1.000

We find results that are similar to the Bonferroni-adjusted numbers.
[no]standard includes or suppresses only the standard deviations from the table produced by the tabulate option. See tabulate above.

[no]freq includes or suppresses only the frequencies from the table produced by the tabulate option. See tabulate above.

[no]obs includes or suppresses only the reported number of observations from the table produced by the tabulate option. If the data are not weighted, the number of observations is identical to the frequency, and by default only the frequency is reported. If the data are weighted, the frequency refers to the sum of the weights. See tabulate above.

bonferroni reports the results of a Bonferroni multiple-comparison test.

scheffe reports the results of a Scheffé multiple-comparison test.

sidak reports the results of a Šidák multiple-comparison test.
Remarks
The oneway command reports one-way analysis-of-variance (ANOVA) models. To perform a one-way layout of a variable called endog on exog, type oneway endog exog.

> Example
You run an experiment varying the amount of fertilizer used in growing apple trees. You test four concentrations, using each concentration in three groves of twelve trees each. Later in the year, you measure the average weight of the fruit. If all had gone well, you would have had three observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. You are left with the following dataset:

. use apple
(Apple trees)

. describe

Contains data from apple.dta
  obs:            10                          Apple trees
 vars:             2                          19 Jul 2000 16:04
 size:           140 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
treatment       int    %8.0g                  Fertilizer
weight          double %10.0g                 Average weight in grams

Sorted by:

. list
        treatm~t     weight
  1.           1      117.5
  2.           1      113.8
  3.           1      104.4
  4.           2       48.9
  5.           2       50.4
  6.           2       58.9
  7.           3       70.4
  8.           3       86.9
  9.           4       87.7
 10.           4       67.3
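The analysis then proceeds with oneway and one of the multiple-comparison options; the Bonferroni-adjusted output discussed elsewhere in this entry comes from a command of this form (sketch only; output omitted):

. oneway weight treatment, bonferroni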
pkcollapse -- Generate pharmacokinetic measurement dataset

Remarks
pkcollapse generates all the summary pharmacokinetic measures.
> Example
We demonstrate the use of pkcollapse with the data described at the end of [R] pk. We have drug concentration data on 15 subjects. Each subject is measured at 13 time points over a 32-hour period. Some of the records are

. list

        id   seq   time      concA      concB
  1.     1     1      0          0          0
  2.     1     1     .5   3.073403   3.712592
  3.     1     1      1   5.188444   6.230602
  4.     1     1    1.5   5.898577   7.885914
  5.     1     1      2   5.096378   9.241735
  6.     1     1      3   6.094085   13.10507
             (output omitted)

Although pksumm allows us to view all the pharmacokinetic measures, we can create a dataset containing the measures by using pkcollapse:

. pkcollapse time concA concB, id(id) stat(auc) keep(seq)
. list

        id   seq   auc_concA   auc_concB
  1.     1     1    150.9643    218.5551
  2.     2     1    146.7606    133.3201
  3.     3     1    160.6548    126.0635
  4.     4     1    157.8622    96.17461
  5.     5     1    133.6957    188.9038
  6.     7     1     160.639    223.6922
  7.     8     1    131.2604    104.0139
  8.     9     1    168.5186    237.8962
  9.    10     2    137.0627    139.7382
 10.    12     2    153.4038    202.3942
 11.    13     2    163.4593    136.7848
 12.    14     2    146.0462    104.5191
 13.    15     2    158.1457    165.8654
 14.    18     2    147.1977     139.235
 15.    19     2    164.9988    166.2391
 16.    20     2    145.3823    158.5146

The resulting dataset contains one observation per subject.
Methods and Formulas
pkcollapse is implemented as an ado-file.
The statistics generated by pkcollapse are described in [R] pkexamine.
Also See
Related:        [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm
Background:     [R] pk
pkcross -- Analyze crossover experiments

Syntax
pkcross outcome [if exp] [in range] [, param(#) treatment(varname) carryover(varname|none) sequence(varname) period(varname) id(varname) model(string) sequential ]

Description
pkcross is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkcross analyzes data from a crossover design experiment. When analyzing pharmaceutical trial data, if the treatment, carryover, and sequence variables are known, the omnibus test for separability of the treatment and carryover effects is calculated.

Options
param(#) specifies which of the 4 parameterizations to use for the analysis of a 2x2 crossover experiment. This option is ignored with higher-order crossover designs. The default is param(3). See the technical note on 2x2 crossover designs for more details.

parameterization 1 estimates the overall mean, the period effects, the treatment effects, and the carryover effects, assuming that no sequence effects exist.

parameterization 2 estimates the overall mean, the period effects, the treatment effects, and the period-by-treatment interaction, assuming that no sequence effects and no carryover effects exist.

parameterization 3 estimates the overall mean, the period effects, the treatment effects, and the sequence effects, assuming that no carryover effects exist. This is the default parameterization.

parameterization 4 estimates the overall mean, the sequence effects, the treatment effects, and the sequence-by-treatment interaction, assuming that no period or crossover effects exist. When the sequence-by-treatment interaction is equivalent to the period effect, this reduces to the third parameterization.

sequence(varname) specifies the variable that contains the sequence in which the treatment was administered. If this option is not specified, sequence(sequence) is assumed.

treatment(varname) specifies the variable that contains the treatment information. If this option is not specified, treatment(treat) is assumed.

carryover(varname|none) specifies the variable that contains the carryover information. If carry(none) is specified, the carryover effects are omitted from the model. If this option is not specified, carryover(carry) is assumed.

period(varname) specifies the variable that contains the period information. If this option is not specified, period(period) is assumed.

id(varname) specifies the variable that contains the subject identifiers. If this option is not specified, id(id) is assumed.

model(string) specifies the model to be fit. For higher-order crossover designs, this can be useful if you want to fit a model other than the default. However, anova (see [R] anova) can also be used to estimate a crossover model. The default model for higher-order crossover designs is outcome predicted by sequence, period, treatment, and carryover effects. By default, the model statement is model(sequence period treat carry).

sequential specifies that sequential sums of squares are to be estimated.
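For example, to fit the default effects without carryover terms and with sequential sums of squares, one might type (a hypothetical call using the pkshape default variable names):

. pkcross outcome, carryover(none) sequential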
Remarks
pkcross is designed to analyze crossover experiments. Use pkshape first to reshape your data; see [R] pkshape. pkcross assumes that the data were reshaped by pkshape or are organized in the same manner as produced with pkshape. Washout periods are indicated by the number 0. See the technical note in this entry for more information on analyzing 2x2 crossover experiments.

Technical Note
The 2x2 crossover design cannot be used to estimate more than four parameters because there are only four pieces of information (the four cell means) collected. pkcross uses ANOVA models to analyze the data, so one of the four parameters must be the overall mean of the model, leaving just three degrees of freedom to estimate the remaining effects (period, sequence, treatment, and carryover). Thus, the model is overparameterized. The estimation of treatment and carryover effects requires the assumption of either no period effects or no sequence effects. Some researchers maintain that it is a bad idea to estimate carryover effects at the expense of other effects. This is a limitation of this design. pkcross implements four parameterizations for this model. They are numbered sequentially from one to four and are described in the Options section of this entry.
> Example
Consider the example data published in Chow and Liu (2000) and described in [R] pkshape. We have entered and reshaped the data with pkshape, and have variables that identify the subjects, periods, treatments, sequence, and carryover treatment. To compute the ANOVA table, use pkcross:
. pkcross outcome

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            276.00     1      276.00     0.37    0.5468
  Residuals                16211.49    22      736.89     4.41    0.0005
Intrasubjects
  Treatment effect            62.79     1       62.79     0.38    0.5463
  Period effect               35.97     1       35.97     0.22    0.6474
  Residuals                 3679.43    22      167.25
-------------------------------------------------------------------------
Total                      20265.68    47

Omnibus measure of separability of treatment and carryover =  29.2893%

There is evidence of intersubject variability, but there are no other significant effects. The omnibus test for separability is a measure reflecting the degree to which the study design allows the treatment effects to be estimated independently of the carryover effects. The measure of separability of the treatment and carryover effects indicates approximately 29% separability, which can be interpreted as the degree to which the treatment and carryover effects are orthogonal; that is, the treatment and carryover effects are about 29% orthogonal. This is a characteristic of the design of the study. For a complete discussion, see Ratkowsky, Evans, and Alldredge (1993). Compared to the output in Chow and Liu (2000), the sequence effect is mislabeled as a carryover effect. See Ratkowsky, Evans, and Alldredge (1993), section 3.2, for a complete discussion of the mislabeling.
By specifying param(1), we obtain parameterization 1 for this model:

. pkcross outcome, param(1)

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Treatment effect             301.04     1      301.04     0.67    0.4189
Carryover effect             276.00     1      276.00     0.61    0.4388
Period effect                255.62     1      255.62     0.57    0.4561
Residuals                  19890.92    44      452.07
-------------------------------------------------------------------------
Total                      20265.68    47

Omnibus measure of separability of treatment and carryover =  29.2893%
i ;
_ Example Consider th ._case of tw0-treatment, four-sequence, two-period crossover design. This design is commonly referred to as Bal_am's design. Ratkowsky et al. (1993) published the following data from trial: an amantadine +
rl_]
520
pkcross-- Analyzecrossoverexperiments
. list

        id    seq   period1   period2   period3
  1.     1    -ab         9      8.75      8.75
  2.     2    -ab        12      10.5      9.75
  3.     3    -ab        17        15      18.5
  4.     4    -ab        21        21      21.5
  5.     1    -ba        23        22        18
  6.     2    -ba        15        15        13
  7.     3    -ba        13        14     13.75
  8.     4    -ba        24     22.75      21.5
  9.     5    -ba        18     17.78     16.75
 10.     1    -aa        14      12.5        13
 11.     2    -aa        27     24.25      22.5
 12.     3    -aa        19     17.25     16.25
 13.     4    -aa        30     28.25     29.75
 14.     1    -bb        21        20      19.5
 15.     2    -bb        11      10.5        10
 16.     3    -bb        20      19.5     20.75
 17.     4    -bb        25      22.5      23.5
The sequence identifier must be a string with zeros to indicate washout or baseline periods, or a number. If the sequence identifier is numeric, the order option must be specified with pkshape. If the sequence identifier is a string, pkshape will create sequence, period, and treatment identifiers without the order option. In this example, the dash is used to indicate a baseline period, which is an invalid code for this purpose. As a result, the data must be encoded; see [R] encode.

. encode seq, gen(num_seq)
. pkshape id num_seq period1 period2 period3, order(0aa 0ab 0ba 0bb)
. pkcross outcome, se

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            285.82     3       95.27     1.01    0.4180
  Residuals                 1221.49    13       93.98    59.96    0.0000
Intrasubjects
  Period effect               15.13     2        7.56     6.34    0.0048
  Treatment effect             8.48     1        8.48     8.86    0.0056
  Carryover effect             0.11     1        0.11     0.12    0.7366
  Residuals                   29.56    30        0.99
-------------------------------------------------------------------------
Total                       1560.59    50

Omnibus measure of separability of treatment and carryover =  64.6447%

In this example, the sequence specifier used dashes instead of zeros to indicate a baseline period during which no treatment was given. For pkcross to work, we need to encode the string sequence variable and then use the order option with pkshape. A word of caution: encode does not necessarily choose the first sequence to be sequence 1, as in this example. Always double-check the sequence numbering when using encode.
To finish the analysis that was started in [R] pk, little additional work is needed. The data were reshaped with pkshape and are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2
The model is fit using pkcross:

. pkcross outcome

sequence variable = sequence
  period variable = period
treatment variable = treat
carryover variable = carry
       id variable = id

        Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation            SS      df        MS        F    Prob > F
-------------------------------------------------------------------------
Intersubjects
  Sequence effect            378.04     1      378.04     0.29    0.5961
  Residuals                17991.26    14     1285.09     1.40    0.2691
Intrasubjects
  Treatment effect           455.04     1      455.04     0.50    0.4931
  Period effect              419.47     1      419.47     0.46    0.5102
  Residuals                12860.78    14      918.63
-------------------------------------------------------------------------
Total                      32104.59    31

Omnibus measure of separability of treatment and carryover =  29.2893%
> Example
Consider the case of a six-treatment crossover trial where the squares are not variance balanced. The following dataset is from a partially balanced crossover trial published by Ratkowsky et al. (1993):

. list

       cow    seq    period1   period2   period3   period4   block
  1.     1   adbe       38.7      37.4      34.3      31.3       1
  2.     2   baed       48.9      46.9        42      39.6       1
  3.     3   ebda       34.6      32.3      28.5      27.1       1
  4.     4   deab       35.2      33.5      28.4      25.1       1
  5.     1   dafc       32.9      33.1      27.5      25.1       2
  6.     2   fdca       30.4      29.4      26.7      23.1       2
  7.     3   cfda       30.8      29.3      26.4      23.2       2
  8.     4   acdf       25.7      26.1      23.4      18.7       2
  9.     1   efbc       25.4        26      23.9      19.9       3
 10.     2   becf       21.8      23.9      21.7      17.6       3
 11.     3   fceb       21.4        22      19.4      16.6       3
 12.     4   cbfe       22.8        21      18.6      16.1       3

In cases where there is no variance balance in the design, a square or blocking variable is needed to indicate in which treatment cell a sequence was observed, but the mechanical steps are the same.

. pkshape cow seq period1 period2 period3 period4
. pkcross outcome, model(block cow|block period|block treat carry) se

Number of obs =      48                      R-squared     =  0.9968
Root MSE      = .730751                      Adj R-squared =  0.9906

Source        |     Seq. SS     df         MS          F     Prob > F
--------------+--------------------------------------------------------
Model         |   2650.0419     30   88.3347302     165.42    0.0000
              |
block         |  1607.17045      2   803.585226    1504.85    0.0000
cow|block     |  628.621899      9   69.8468777     130.80    0.0000
period|block  |  407.531876      9   45.2813195      84.80    0.0000
treat         |  2.48979215      5   .497958429       0.93    0.4846
carry         |  4.22788534      5   .845577068       1.58    0.2179
              |
Residual      |  9.07794631     17   .533996842
--------------+--------------------------------------------------------
Total         |  2659.11985     47   56.5770181

When the model statement is used and the omnibus measure of separability is desired, specify the variables in the treatment(), carryover(), and sequence() options to pkcross.
Methods and Formulas
pkcross is implemented as an ado-file.
pkcross uses ANOVA to fit models for crossover experiments; see [R] anova.

The omnibus measure of separability is

    S = 100(1 - V)\%

where V is Cramér's V and is defined as

    V = \left\{ \frac{\chi^2 / N}{\min(r-1,\, c-1)} \right\}^{1/2}
The chi-squared statistic is calculated as

    \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

where O and E are the observed and expected counts in a table of the number of times each treatment is followed by the other treatments.
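Since V is Cramér's V for the treatment-by-carryover table, it can also be inspected directly with tabulate's V option; a sketch (not part of the original text):

. tabulate treat carry, V    /* reports Cramér's V for the two-way table */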
References
Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis and Application. New York: Marcel Dekker.

Also See
Related:        [R] pkcollapse, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
Title
pkequiv -- Perform bioequivalence tests

Syntax
pkequiv outcome treatment period sequence id [if exp] [in range] [, compare(string) limit(#) level(#) noboot fieller symmetric anderson tost ]

Description
pkequiv is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkequiv performs bioequivalence testing for two treatments. By default, pkequiv calculates a standard confidence interval symmetric about the difference between the two treatment means. pkequiv also calculates confidence intervals symmetric about zero and intervals based on Fieller's theorem. Additionally, pkequiv can perform interval hypothesis tests for bioequivalence.
Options
compare(string) specifies the two treatments to be tested for equivalence. In some cases, there may be more than two treatments, but the equivalence can only be determined between any two treatments.

limit(#) specifies the equivalence limit. The default is 20%. The equivalence limit can only be changed symmetrically; that is, it is not possible to have a 15% lower limit and a 20% upper limit in the same test.

level(#) specifies the confidence level, in percent, for confidence intervals. Note that this is not controlled by the set level command. The default is level(90).

noboot prevents the estimation of the probability that the confidence interval lies within the confidence limits. If this option is not specified, this probability is estimated by resampling the data.

fieller specifies that an equivalence interval based on Fieller's theorem is to be calculated.

symmetric specifies that a symmetric equivalence interval is to be calculated.

anderson specifies that the Anderson and Hauck hypothesis test for bioequivalence is to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.

tost specifies that the two one-sided hypothesis tests for bioequivalence are to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.
Remarks
pkequiv is designed to conduct tests for bioequivalence based on data from a crossover experiment. pkequiv requires that the user specify the outcome, treatment, period, sequence, and id variables. The data must be in the same format as produced by pkshape; see [R] pkshape.

> Example
We will conduct equivalence testing on the data introduced in [R] pk. After shaping the data with pkshape, the data are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2

We now can conduct a bioequivalence test between treat == A and treat == B.

. pkequiv outcome treat period seq id

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -30.296      30.296        -11.332       26.416
ratio:              80%        120%        92.519%     117.439%

probability test limits are within equivalence limits =  0.6350

The default output for pkequiv shows a confidence interval for the difference of the means (test limits), the ratio of the means, and the federal equivalence limits. The classic confidence interval can
be constructed around the difference between the average measure of effect for the two drugs or around the ratio of the average measure of effect for the two drugs. pkequiv reports both the difference measure and the ratio measure. For these data, U.S. federal government regulations state that the confidence interval for the difference must be entirely contained within the range [-30.296, 30.296] and, for the ratio, between 80% and 120%. In this case, the test limits are within the equivalence limits. Although the test limits are inside the equivalence limits, there is only a 63% assurance that the observed confidence interval will be within the equivalence limits in the long run. This is an interesting case because although this sample shows bioequivalence, the evaluation of the long-run performance indicates that there may be problems. These fictitious data were generated with high intersubject variability, which causes poor long-run performance.

If we conduct a bioequivalence test with the data published in Chow and Liu (2000), which we introduced in [R] pk and fully describe in [R] pkshape, we observe that the probability that the test limits are within the equivalence limits is very high. The data from Chow and Liu (2000) can be seen in expanded form in [R] pkshape. The equivalence test is

. pkequiv outcome treat period seq id

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -16.512      16.512         -8.698        4.123
ratio:              80%        120%        89.464%     104.994%

probability test limits are within equivalence limits =  0.9970

For these data, the test limits are well within the equivalence limits, and the probability that the test limits are within the equivalence limits is 99.7%.
> Example
Using the data published in Chow and Liu (2000), we compute a confidence interval that is symmetric about zero:

. pkequiv outcome treat period seq id, symmetric

    Westlake's symmetric confidence interval for bioequivalence

                      [Equivalence limits]        [ Test mean ]
Test formulation:       75.145      89.974            80.272

The reported equivalence limit is constructed symmetrically about the reference mean, which is equivalent to constructing a confidence interval symmetric about zero for the difference in the two drugs. In the above output, we see that the test formulation mean of 80.272 is within the equivalence limits, indicating that the test drug is bioequivalent to the reference drug.

pkequiv will display interval hypothesis tests of bioequivalence if you specify the tost and/or the anderson options. For example,
. pkequiv outcome treat period seq id, tost anderson

         Classic confidence interval for bioequivalence

             [equivalence limits]        [     test limits     ]
difference:     -16.512      16.512         -8.698        4.123
ratio:              80%        120%        89.464%     104.994%

probability test limits are within equivalence limits =  0.9980

Schuirmann's two one-sided tests
          upper test statistic =    -5.036     p-value =  0.000
          lower test statistic =     3.810     p-value =  0.001

Anderson and Hauck's test
       noncentrality parameter =     4.423
                test statistic =    -0.613     empirical p-value =  0.0005

Both of Schuirmann's one-sided tests are highly significant, suggesting that the two drugs are bioequivalent. A similar conclusion is drawn from the Anderson and Hauck test of bioequivalence.
Saved Results
pkequiv saves in r():

Scalars
  r(stddev)   pooled sample std. dev. of period differences from both sequences
  r(uci)      upper confidence interval for a classic interval
  r(lci)      lower confidence interval for a classic interval
  r(delta)    delta value used in calculating a symmetric confidence interval
  r(u3)       upper confidence interval for Fieller's confidence interval
  r(l3)       lower confidence interval for Fieller's confidence interval
Methods and Formulas
pkequiv is implemented as an ado-file.

The lower confidence interval for the difference in the two treatments in the classic shortest confidence interval is

    L_1 = (\bar Y_T - \bar Y_R) - t_{(\alpha,\, n_1+n_2-2)}\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

The upper limit is

    U_1 = (\bar Y_T - \bar Y_R) + t_{(\alpha,\, n_1+n_2-2)}\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

The limits for the ratio measure are
    L_2 = \left(\frac{L_1}{\bar Y_R} + 1\right) 100\%

and

    U_2 = \left(\frac{U_1}{\bar Y_R} + 1\right) 100\%

where \bar Y_T is the mean of the test formulation of the drug, \bar Y_R is the mean of the reference formulation of the drug, and t_{(\alpha, n_1+n_2-2)} is the t distribution with n_1 + n_2 - 2 degrees of freedom. \hat\sigma_d^2 is the pooled sample variance of the period differences from both sequences, defined as

    \hat\sigma_d^2 = \frac{1}{n_1+n_2-2} \sum_{k=1}^{2} \sum_{i=1}^{n_k} (d_{ik} - \bar d_{\cdot k})^2
The upper and lower limits for the symmetric confidence interval are \bar Y_R + \Delta and \bar Y_R - \Delta, where \Delta is defined by

    k_1\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = \Delta - (\bar Y_T - \bar Y_R)

and (simultaneously)

    k_2\, \hat\sigma_d \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = -\Delta - (\bar Y_T - \bar Y_R)

and k_1 and k_2 are computed iteratively to satisfy the above equalities and the condition

    \int_{k_1}^{k_2} f(t)\, dt = 1 - 2\alpha

where f(t) is the probability density function of the t distribution with n_1 + n_2 - 2 degrees of freedom.
See Chow and Liu (2000) for details about calculating the confidence interval based on Fieller's theorem.

The two test statistics for the two one-sided tests of equivalence are

    T_L = \frac{(\bar Y_T - \bar Y_R) - \theta_L}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

and

    T_U = \frac{(\bar Y_T - \bar Y_R) - \theta_U}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

where -\theta_L = \theta_U and are the regulated confidence limits.
The logic of the Anderson and Hauck test is tricky, and readers are encouraged to read Chow and Liu (2000) for a complete explanation. However, the test statistic is

    T_{AH} = \frac{(\bar Y_T - \bar Y_R) - (\theta_L + \theta_U)/2}{\hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

and the noncentrality parameter is estimated by

    \hat\delta = \frac{\theta_U - \theta_L}{2\, \hat\sigma_d \sqrt{1/n_1 + 1/n_2}}

The empirical p-value is calculated as

    p = F_t\!\left(|T_{AH}| - \hat\delta\right) - F_t\!\left(-|T_{AH}| - \hat\delta\right)

where F_t is the cumulative distribution function of the t distribution with n_1 + n_2 - 2 degrees of freedom.
;Reference _,
!
Cho_. S, C. dnd J, P. Liu. _)0. Design and Analysis df Bioavaitabili O, and BioequivalenceStudies. 2d ed. New York:Mar_t Dekker. i Neter, L. M, _t. Kutner. C, i. Nachtsheim,and W. Wa_serman.19%, Applied Linear StatisticalModels. 4th ed, i
t
Chicago:IrWin. ! Ratkowsky,D iA.. M. A. EvanS.and J. R. Alldredge,1993.'Cross-overExperiments:Design,AnalysisandApplic_tion_ Ne_ York:lMarcelDekker
Also See

Related:        [R] pkcollapse, [R] pkcross, [R] pkexamine, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
pkexamine -- Calculate pharmacokinetic measures

Syntax

pkexamine time concentration [if exp] [in range] [, fit(#) trapezoid graph { line | log | exp(#) } graph_options ]

by ... : may be used with pkexamine; see [R] by.
Description

pkexamine is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.
pkexamine calculates pharmacokinetic measures from time-and-concentration subject-level data. pkexamine computes and displays the maximum measured concentration, the time at the maximum measured concentration, the time of the last measurement, the elimination rate, the half-life, and the area under the concentration-time curve (AUC). Three estimates of the area under the concentration-time curve from 0 to infinity (AUC_{0,\infty}) are also calculated.
Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC_{0,\infty}. The default is the last 3 points, which should be viewed as a minimum; the appropriate number of points will depend on your data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most functions. In cases where the curve is very irregular, trapezoid may give better results.

line and log specify which of the estimates of the AUC_{0,\infty} to display when graphing the AUC_{0,\infty}. These options are ignored unless specified with the graph option.

exp(#) specifies that the exponential fit for the AUC_{0,\infty} be plotted. You must specify the maximum time value to which you want to plot the curve, and this time value must be greater than the maximum time measurement in the data. If you specify 0, the curve will be plotted to the point where the linear extension would cross the x-axis. This option is not valid with the line or log options and is ignored unless the graph option is also specified.

graph tells pkexamine to graph the concentration-time curve.

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
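For instance, in the example below the last measurement is at time 32, so to plot the exponential extension out to a hypothetical time of 50 (any value greater than 32 would do), one might type

. pkexamine time conc, graph exp(50)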
Remarks

pkexamine computes summary statistics for a given patient in a pharmacokinetic trial. If by idvar: is specified, statistics will be displayed for each subject in the data.
> Example

Chow and Liu (2000) present data on a study examining primidone concentrations versus time for a subject over a 32-hour period after dosing.

. list time conc

        time   conc
  1.       0      0
  2.      .5      0
  3.       1    2.8
  4.     1.5    4.4
  5.       2    4.4
  6.       3    4.7
  7.       4    4.1
  8.       6      4
  9.       8    3.6
 10.      12      3
 11.      16    2.5
 12.      24      2
 13.      32    1.6

We use pkexamine to produce the summary statistics.

. pkexamine time conc, graph
        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0279
                    Half life =  24.8503

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          107.759          142.603         142.603

  Fit based on last 3 points.

(graph: the concentration-time curve, plotted against analysis time)
The maximum concentration of 4.7 occurs at time 3, and the time of the last observation (Tmax) is 32. In addition to the AUC calculated from 0 to the maximum value of time, pkexamine also reports the area under the curve computed by extending the curve using each of three methods: a linear fit to the log of the concentration, a linear regression line, and a decreasing exponential regression line. See the Methods and Formulas section for details on these three methods.

By default, all extensions to the AUC are based on the last three points. Looking at the concentration-time graph for these data, it seems more appropriate to use the last seven points to estimate the AUC_{0,\infty}:

. pkexamine time conc, fit(7)
        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0349
                    Half life =  19.8354

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          131.027           96.805          129.181

  Fit based on last 7 points.

This decreased the estimate of the AUC_{0,\infty} for all extensions. To see a graph of the AUC_{0,\infty} using a linear extension, specify the graph and line options.

. pkexamine time conc, fit(7) graph line

        Maximum concentration =      4.7
Time of maximum concentration =        3
                         Tmax =       32
             Elimination rate =   0.0349
                    Half life =  19.8354

Area under the curve

   AUC [0, Tmax]   AUC [0, inf.)    AUC [0, inf.)   AUC [0, inf.)
                   Linear of log    Linear fit      Exponential fit
                   conc.
       85.24          131.027           96.805          129.181

  Fit based on last 7 points.
(graph: the concentration-time curve with its linear extension to the time axis; the extension crosses the axis at analysis time 46.4557)
Saved Results

pkexamine saves in r():

Scalars
    r(auc)       area under the concentration curve
    r(half)      half life of the drug
    r(ke)        elimination rate
    r(tmax)      time at last concentration measurement
    r(cmax)      maximum concentration
    r(tomc)      time of maximum concentration
    r(auc_line)  AUC_{0,\infty} estimated with a linear fit
    r(auc_exp)   AUC_{0,\infty} estimated with an exponential fit
    r(auc_ln)    AUC_{0,\infty} estimated with a linear fit of the natural log
Methods and Formulas

pkexamine is implemented as an ado-file.

The AUC_{0,t_max} is defined as

    AUC_{0,t_{max}} = \int_0^{t_{max}} C_t\, dt

where C_t is the concentration at time t. By default, the integral is calculated numerically using cubic splines. However, if the trapezoidal rule is used, the AUC_{0,t_max} is given as

    AUC_{0,t_{max}} = \sum_{i=2}^{k} \frac{C_{i-1} + C_i}{2}\, (t_i - t_{i-1})

The AUC_{0,\infty} is the AUC_{0,t_max} plus AUC_{t_max,\infty}, or

    AUC_{0,\infty} = \int_0^{t_{max}} C_t\, dt + \int_{t_{max}}^{\infty} C_t\, dt

When using the linear extension to the AUC_{0,t_max}, the integration is cut off when the line crosses the x-axis. The log extension is a linear extension on the log concentration scale. The area for the exponential extension is

    AUC_{t_{max},\infty} = \int_{t_{max}}^{\infty} e^{-(\beta_0 + \beta_1 t)}\, dt = \frac{e^{-(\beta_0 + \beta_1 t_{max})}}{\beta_1}

Finally, the elimination rate K_eq is the negative of the parameter estimate for a linear regression of log concentration on time and is given in the standard manner:

    K_{eq} = -\,\frac{\sum_{i=1}^{k} (t_i - \bar{t})(\ln C_i - \overline{\ln C})}{\sum_{i=1}^{k} (t_i - \bar{t})^2}

and

    t_{1/2} = \frac{\ln 2}{K_{eq}}
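As a quick check of the trapezoidal formula, the sum can be accumulated by hand. This is a minimal sketch, assuming variables named time and conc with one subject's data in memory:

. sort time
. generate double trap = (conc + conc[_n-1])/2 * (time - time[_n-1])
. summarize trap, meanonly
. display "trapezoidal AUC[0,tmax] = " r(sum)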
I_ _'_"
:3_q
pl(examine -- Calculate pharmacokinetic measures
If,
"
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.
Also See

Related:        [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkshape, [R] pksumm
Complementary:  [R] statsby
Background:     [R] pk
pkshape -- Reshape (pharmacokinetic) Latin square data

Syntax

pkshape id sequence period1 period2 [periodlist] [, order(string) outcome(newvar) treatment(newvar) carryover(newvar) sequence(newvar) period(newvar) ]

Description

pkshape is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pkshape reshapes the data for use with anova, pkcross, and pkequiv. Latin square and crossover data are often organized in a manner that cannot be easily analyzed with Stata. pkshape will reorganize the data in memory for use in Stata.
Options

order(string) specifies the order in which treatments were applied. If the sequence() specifier is a string variable that specifies the order, this option is not necessary. Otherwise, order() specifies how to generate the treatment and carryover variables. Any string variable can be used to specify the order. In the case of crossover designs, any washout periods can be indicated with the number 0.

outcome(newvar) specifies the name for the outcome variable in the reorganized data. By default, outcome(outcome) is used.

treatment(newvar) specifies the name for the treatment variable in the reorganized data. By default, treatment(treat) is used.

carryover(newvar) specifies the name for the carryover variable in the reorganized data. By default, carryover(carry) is used.

sequence(newvar) specifies the name for the sequence variable in the reorganized data. By default, sequence(sequence) is used.

period(newvar) specifies the name for the period variable in the reorganized data. By default, period(period) is used.
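For example, a 2 x 2 crossover dataset could be reshaped with custom names for all five generated variables; this sketch assumes hypothetical wide variables period1 and period2 (the option abbreviations follow the examples below):

. pkshape id seq period1 period2, order(ab ba) outcome(auc) treat(drug) carry(resid) period(prd) sequence(grp)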
Remarks

Often, data from a Latin square experiment are naturally organized in a manner that Stata cannot easily manage. pkshape will reorganize Latin square type data so that they can be used with anova (see [R] anova) or any pk command. This includes the classic 2 x 2 crossover design commonly used in pharmaceutical research, as well as many other Latin square designs.
> Example

Consider the example data published in Chow and Liu (2000). There are 24 patients, 12 in each sequence. Sequence 1 consists of the reference formulation followed by the test formulation; sequence 2 is the test formulation followed by the reference formulation. The measurements reported are the AUC_{0,t_max} for each patient and for each period.

. list, noobs

    ID   Sequence   Period1   Period2
     1          1    74.675    73.675
     4          1      96.4     93.25
     5          1    101.95   102.125
     6          1     79.05     69.45
    11          1     79.05    69.025
    12          1     85.95      68.7
    15          1    69.725    59.425
    16          1    86.275    76.125
    19          1   112.675   114.875
    20          1    99.525    116.25
    23          1    89.425    64.175
    24          1    55.175    74.575
     2          2    74.825     37.35
     3          2    86.875    51.925
     7          2    81.675    72.175
     8          2      92.7      77.5
     9          2     50.45    71.875
    10          2    66.125    94.025
    13          2    122.45   124.975
    14          2    99.075    85.225
    17          2     86.35    95.925
    18          2    49.925      67.1
    21          2      42.7    59.425
    22          2    91.725    114.05
The outcome for a single person is in two different variables, and the treatment that was applied to an individual is a function of the period and the sequence. To analyze these data using anova, all the outcomes need to be in one variable, and each covariate needs to be in its own variable. To reorganize these data, use pkshape:

. pkshape id seq period1 period2, order(ab ba)
. sort seq id treat
. list

        id   sequence   outcome   treat   carry   period
  1.     1          1    74.675       1       0        1
  2.     1          1    73.675       2       1        2
  3.     4          1      96.4       1       0        1
  4.     4          1     93.25       2       1        2
  5.     5          1    101.95       1       0        1
  6.     5          1   102.125       2       1        2
  7.     6          1     79.05       1       0        1
  8.     6          1     69.45       2       1        2
  9.    11          1     79.05       1       0        1
 10.    11          1    69.025       2       1        2
 11.    12          1     85.95       1       0        1
 12.    12          1      68.7       2       1        2
 13.    15          1    69.725       1       0        1
 14.    15          1    59.425       2       1        2
 15.    16          1    86.275       1       0        1
 16.    16          1    76.125       2       1        2
 17.    19          1   112.675       1       0        1
 18.    19          1   114.875       2       1        2
 19.    20          1    99.525       1       0        1
 20.    20          1    116.25       2       1        2
 21.    23          1    89.425       1       0        1
 22.    23          1    64.175       2       1        2
 23.    24          1    55.175       1       0        1
 24.    24          1    74.575       2       1        2
 25.     2          2     37.35       1       2        2
 26.     2          2    74.825       2       0        1
 27.     3          2    51.925       1       2        2
 28.     3          2    86.875       2       0        1
 29.     7          2    72.175       1       2        2
 30.     7          2    81.675       2       0        1
 31.     8          2      77.5       1       2        2
 32.     8          2      92.7       2       0        1
 33.     9          2    71.875       1       2        2
 34.     9          2     50.45       2       0        1
 35.    10          2    94.025       1       2        2
 36.    10          2    66.125       2       0        1
 37.    13          2   124.975       1       2        2
 38.    13          2    122.45       2       0        1
 39.    14          2    85.225       1       2        2
 40.    14          2    99.075       2       0        1
 41.    17          2    95.925       1       2        2
 42.    17          2     86.35       2       0        1
 43.    18          2      67.1       1       2        2
 44.    18          2    49.925       2       0        1
 45.    21          2    59.425       1       2        2
 46.    21          2      42.7       2       0        1
 47.    22          2    114.05       1       2        2
 48.    22          2    91.725       2       0        1
Now the data are organized into separate variables that indicate each factor level for each of the covariates, so the data may be used with anova or pkcross; see [R] anova and [R] pkcross.

> Example

Consider the study of background music on bank teller productivity published in Neter et al. (1996). The data are

   Week   Monday   Tuesday   Wednesday   Thursday   Friday
     1     18(D)     17(C)       14(A)      21(B)    17(E)
     2     13(C)     34(B)       21(E)      16(A)    15(D)
     3      7(A)     29(D)       32(B)      27(E)    13(C)
     4     17(E)     13(A)       24(C)      31(D)    25(B)
     5     21(B)     26(E)       26(D)      31(C)     7(A)
The numbers are the productivity scores, and the letters represent the treatment. We entered these data into Stata as

. list

       id     seq   day1   day2   day3   day4   day5
  1.    1   dcabe     18     17     14     21     17
  2.    2   cbead     13     34     21     16     15
  3.    3   adbec      7     29     32     27     13
  4.    4   eacdb     17     13     24     31     25
  5.    5   bedca     21     26     26     31      7

We reshape these data with pkshape:

. pkshape id seq day1 day2 day3 day4 day5
. list
       id   sequence   outcome   treat   carry   period
  1.    1          1        18       1       0        1
  2.    2          2        13       2       0        1
  3.    3          3         7       3       0        1
  4.    4          4        17       5       0        1
  5.    5          5        21       4       0        1
  6.    1          1        17       2       1        2
  7.    2          2        34       4       2        2
  8.    3          3        29       1       3        2
  9.    4          4        13       3       5        2
 10.    5          5        26       5       4        2
 11.    1          1        14       3       2        3
 12.    2          2        21       5       4        3
 13.    3          3        32       4       1        3
 14.    4          4        24       2       3        3
 15.    5          5        26       1       5        3
 16.    1          1        21       4       3        4
 17.    2          2        16       3       5        4
 18.    3          3        27       5       4        4
 19.    4          4        31       1       2        4
 20.    5          5        31       2       1        4
 21.    1          1        17       5       4        5
 22.    2          2        15       1       3        5
 23.    3          3        13       2       5        5
 24.    4          4        25       4       1        5
 25.    5          5         7       3       2        5
In this case, the sequence variable is a string variable that specifies how the treatments were applied, so the order option is not used. In cases where the sequence variable is a string and the order is specified, the arguments from the order option are used. We could now produce an ANOVA table:

. anova outcome seq period treat

               Number of obs =      25     R-squared     =  0.8666
               Root MSE      = 3.96232     Adj R-squared =  0.7331

      Source |  Partial SS    df       MS           F     Prob > F
       Model |     1223.60    12   101.966667      6.49     0.0014
    sequence |       82.00     4        20.50      1.31     0.3226
      period |      477.20     4       119.30      7.60     0.0027
       treat |      664.40     4       166.10     10.58     0.0007
    Residual |      188.40    12        15.70
       Total |     1412.00    24   58.8333333
> Example

Consider the Latin square crossover example published in Neter et al. (1996). The example is about apple sales given different methods for displaying apples.
   Pattern   Store   Week 1   Week 2   Week 3
      1        1      9(B)     12(C)    15(A)
               2      4(B)     12(C)     9(A)
      2        1     12(A)     14(B)     3(C)
               2     13(A)     14(B)     3(C)
      3        1      7(C)     18(A)     6(B)
               2      5(C)     20(A)     4(B)

If the data were entered into Stata as
. list

       id   seq   p1   p2   p3   square
  1.    1     1    9   12   15        1
  2.    2     1    4   12    9        2
  3.    3     2   12   14    3        1
  4.    4     2   13   14    3        2
  5.    5     3    7   18    6        1
  6.    6     3    5   20    4        2

then the data can be reorganized using descriptive names for the outcome variables.

. pkshape id seq p1 p2 p3, order(bca abc cab) seq(pattern) period(order) treat(displays)
. anova outcome pattern order displays id|pattern

               Number of obs =      18     R-squared     =  0.9562
               Root MSE      = 1.59426     Adj R-squared =  0.9069

      Source |  Partial SS    df       MS           F     Prob > F
       Model |  443.666667     9   49.2962963     19.40     0.0002
     pattern |  .333333333     2   .166666667      0.07     0.9370
       order |  233.333333     2   116.666667     45.90     0.0000
    displays |      189.00     2        94.50     37.18     0.0001
  id|pattern |       21.00     3         7.00      2.75     0.1120
    Residual |  20.3333333     8   2.54166667
       Total |      464.00    17   27.2941176
These are the same results reported by Neter et al. (1996).
> Example

Returning to the example from the [R] pk entry, the data are

. list

        id   seq   auc_concA   auc_concB
  1.     1     1    150.9643    218.5551
  2.     2     1    146.7606    133.3201
  3.     3     1    160.6548    126.0635
  4.     4     1    157.8622    96.17461
  5.     5     1    133.6957    188.9038
  6.     7     1     160.639    223.6922
  7.     8     1    131.2604    104.0139
  8.     9     1    168.5186    237.8962
  9.    10     2    137.0627    139.7382
 10.    12     2    153.4038    202.3942
 11.    13     2    163.4593    136.7848
 12.    14     2    146.0462    104.5191
 13.    15     2    158.1457    165.8654
 14.    18     2    147.1977     139.235
 15.    19     2    164.9988    166.2391
 16.    20     2    145.3823    158.5146

. pkshape id seq auc_concA auc_concB, order(ab ba)
. sort id
. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       1       0        1
  2.     1          1   218.5551       2       1        2
  3.     2          1   146.7606       1       0        1
  4.     2          1   133.3201       2       1        2
  5.     3          1   126.0635       2       1        2
  6.     3          1   160.6548       1       0        1
  7.     4          1   96.17461       2       1        2
  8.     4          1   157.8622       1       0        1
  9.     5          1   188.9038       2       1        2
 10.     5          1   133.6957       1       0        1
 11.     7          1    160.639       1       0        1
 12.     7          1   223.6922       2       1        2
 13.     8          1   131.2604       1       0        1
 14.     8          1   104.0139       2       1        2
 15.     9          1   237.8962       2       1        2
 16.     9          1   168.5186       1       0        1
 17.    10          2   137.0627       2       0        1
 18.    10          2   139.7382       1       2        2
 19.    12          2   202.3942       1       2        2
 20.    12          2   153.4038       2       0        1
 21.    13          2   163.4593       2       0        1
 22.    13          2   136.7848       1       2        2
 23.    14          2   104.5191       1       2        2
 24.    14          2   146.0462       2       0        1
 25.    15          2   165.8654       1       2        2
 26.    15          2   158.1457       2       0        1
 27.    18          2    139.235       1       2        2
 28.    18          2   147.1977       2       0        1
 29.    19          2   164.9988       2       0        1
 30.    19          2   166.2391       1       2        2
 31.    20          2   158.5146       1       2        2
 32.    20          2   145.3823       2       0        1

These data can be analyzed with pkcross or anova.
Methods and Formulas

pkshape is implemented as an ado-file.
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pksumm; [R] anova
Background:  [R] pk
pksumm -- Summarize pharmacokinetic data

Syntax

pksumm id time concentration [if exp] [in range] [, fit(#) trapezoid stat(measure) nodots notimechk graph graph_options ]

where measure is one of

    auc      area under the concentration-time curve (AUC_{0,t_max})
    aucline  area under the concentration-time curve from 0 to infinity using a linear extension
    aucexp   area under the concentration-time curve from 0 to infinity using an exponential extension
    auclog   area under the log concentration-time curve extended with a linear fit
    half     half life of the drug
    ke       elimination rate
    cmax     maximum concentration
    tomc     time of maximum concentration
    tmax     time at last concentration
Description

pksumm is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pksumm obtains the first four moments from the empirical distribution of each pharmacokinetic measurement and tests the null hypothesis that the distribution of that measurement is normally distributed.
Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC_{0,\infty}. The default is fit(3), the last 3 points. This should be viewed as a minimum; the appropriate number of points will depend on the data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most situations. In cases where the curve is very irregular, the trapezoidal rule may give better results.

stat(measure) specifies the statistic that pksumm should graph. The default is stat(auc). If the graph option is not specified, this option is ignored.

nodots suppresses the progress dots during calculation. By default, a period is displayed for every call to calculate the pharmacokinetic measures.

notimechk suppresses the check that the follow-up time for all subjects is the same. By default, pksumm expects the maximum follow-up time to be equal for all subjects.

graph requests a graph of the distribution of the statistic specified with stat().

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
_,,tt
o_z
pKsumm -- _ummanze pharmacokinetic data
-7
Remarks

pksumm produces summary statistics for the distribution of nine common pharmacokinetic measurements. If there are more than eight subjects, pksumm will also compute a test for normality on each measurement. The nine measurements summarized by pksumm are listed above and are described in the Methods and Formulas sections of [R] pkexamine and [R] pk.
> Example

We demonstrate the use of pksumm with the data described in [R] pk. We have drug concentration data on 15 subjects, each measured at 13 time points over a 32-hour period. A few of the records are

. list

        id   time       conc
  1.     1      0          0
  2.     1     .5   3.073403
  3.     1      1   5.188444
  4.     1    1.5   5.898577
  5.     1      2   5.096378
  6.     1      3   6.094085
(output omitted)
183.    15      0          0
184.    15     .5    3.86493
185.    15      1   6.432444
186.    15    1.5   6.969195
187.    15      2   6.307024
188.    15      3   6.509584
189.    15      4   6.555091
190.    15      6   7.318319
191.    15      8   5.329813
192.    15     12   5.411624
193.    15     16   3.891397
194.    15     24   5.167516
195.    15     32   2.649686
We can use pksumm to view the summary statistics for all the pharmacokinetic parameters.

. pksumm id time conc

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00
For the 15 subjects, the mean AUC_{0,t_max} is 150.74 and sigma^2 = 123.07. The skewness of -0.26 indicates that the distribution is slightly skewed left. The p-value of 0.69 for the chi-squared test of normality indicates that we cannot reject the null hypothesis that the distribution is normal.

If we were to consider any of the three variants of the AUC_{0,\infty}, we would see that there is huge variability and that the distribution is heavily skewed. A skewness different from 0 and a kurtosis different from 3 are expected because the distribution of the AUC_{0,\infty} is not normal. We now graph the distribution of the AUC_{0,t_max} and specify the graph option.

. pksumm id time conc, graph bin(20)

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00

(graph: histogram of the AUC_{0,t_max}, titled "Area Under Curve (AUC)")
By default, pksumm plots the distribution of the AUC_{0,t_max}. To plot one of the other pharmacokinetic measurements, we need to specify the stat() option. For example, we can ask Stata to produce a plot of the AUC_{0,\infty} using the log extension:

. pksumm id time conc, stat(auclog) graph bin(20)

Summary statistics for the pharmacokinetic measures
                                        Number of observations =   15

    stat.      Mean    Median     Variance   Skewness   Kurtosis   p-value
      auc    150.74    150.96       123.07      -0.26       2.10      0.69
  aucline    408.30    214.17    188856.87       2.57       8.93      0.00
   aucexp    691.68    297.08    762679.94       2.56       8.87      0.00
   auclog    688.98    297.67    797237.24       2.59       9.02      0.00
     half     94.84     29.39     18722.13       2.26       7.37      0.00
       ke      0.02      0.02         0.00       0.89       3.70      0.09
     cmax      7.36      7.42         0.42      -0.60       2.56      0.44
     tomc      3.47      3.00         7.62       2.17       7.18      0.00
     tmax     32.00     32.00         0.00

(graph: histogram of the AUC_{0,\infty}, labeled "Linear fit to log concentration")
Methods and Formulas

pksumm is implemented as an ado-file.

The chi-squared test for normality is conducted with sktest; see [R] sktest for more information on the test of normality.

The statistics reported by pksumm are identical to those reported by summarize and sktest; see [R] summarize and [R] sktest.
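As a sketch of the relationship, the normality test pksumm reports for the AUC could be reproduced by collecting pkexamine results for each subject and then running sktest; the statsby call and generated variable name below are illustrative, not output from the examples above:

. statsby "pkexamine time conc" auc=r(auc), by(id)
. sktest auc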
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape
Background:  [R] pk
plot -- Draw scatterplot using typewriter characters

Syntax

plot yvar1 [yvar2 [yvar3]] xvar [if exp] [in range] [, columns(#) encode hlines(#) lines(#) vlines(#) ]

by ... : may be used with plot; see [R] by.

Description

plot produces a two-way scatterplot of yvar against xvar using typewriter characters. If more than one yvar is specified, a single diagram is produced that overlays the plot of each yvari against xvar.

graph provides more sophisticated capabilities than does plot; see the Stata Graphics Manual.
Options colmms(#) ipecifies the dolumn width of the plot. The number of columns must lie between 30 and I33: the ddfault is 75. _ote that the plot occupids ten fewer columns than the number specific& The extra _n columns ire used to label the diagam. ! i i _ I
encode plots _oints that oct:ur more than once with _ symbol representing the number of occurrences. Points that ioccur once a!e plotted with an asterisk (*). twice with the numeral two (2), three times with the ntimeral three d3), and so on. Points tMt occur ten times are plotted with an 'A. eleven with a B ._and so on. u+td Z. The letter Z is used subsequently, encode may not be specified if there is _ore than on_ war.
! _
hlines(#) c_uses a horiz_)ntal line of dashes (-) io be drawn across the diagram every #-th line. where # re_resents a nulnber between 0 and the line height (lines) of the plot. Specifvin_ # as
i
0. which i_ the default. !esults in no horizontal lines.
i !
lines(#) spdcifies the lin_ height of the plot. The number of lines must lie between 10 and 83; the default is 4_3.Note that !the plot occupies three fewer lines than the number specified. The three extra lines iare used to l_bet the diagram
l
vlines(#)c',i es a verti I line of bars (1) to be drawn on the diagam every #-th column, where # is a number between 0 and the column width columns) of the plot. Speci_'ing # as 0. which is the default, results in lno vertical lines.
i
i ! !
I
i
2
Remarks

plot displays a line-printer plot: a scatter diagram drawn using characters available on an ordinary typewriter or line printer. As a result, this scatter diagram can be displayed on any monitor, printed on any printer, and edited by any word processor. The diagram necessarily has a rougher appearance than one designed to be displayed on a graphics monitor.
> Example

We use the plot command to display the function y = x^2 for values of x ranging between -10 and 10. Each point is plotted with an asterisk (*). The minimum and maximum values of yvar and xvar are marked, and the variable names are displayed along the axes.

. set obs 21
obs was 0, now 21
. generate x = _n - 11
. generate y = x*x
. plot y x

(plot: line-printer scatter of y against x; y runs from 0 to 100, x from -10 to 10, tracing a parabola)
> Example

You can reduce the size of a graph by specifying the number of lines and columns to be used. In this version, we plot y = x^2 in 16 lines and 50 columns:

. plot y x, lines(16) columns(50)

(plot: the same parabola drawn in a 16-line by 50-column diagram)

> Example

You can use the hlines() and vlines() options to add horizontal and vertical lines to a graph. We place a horizontal line every 5 lines and a vertical line every 10 columns by typing

. plot y x, hlines(5) vlines(10)

(plot: the parabola with dashed horizontal grid lines every 5 lines and vertical bars every 10 columns)

> Example

Real data can be messier to plot than the simple mathematical function used in the previous examples. The following plot displays the combinations of miles per gallon (mpg) and weight for 74 automobiles:

. plot mpg weight

(plot: Mileage (mpg) against Weight (lbs.); mpg runs up to 41, weight from 1760 to 4840)

Although it is not revealed by this graph, several automobiles have virtually the same mpg-weight combination; some of the asterisks represent more than one observation. The encode option reveals this:
. plot mpg weight, encode

(plot: the same diagram with coincident points shown as 2, 3, and so on)

Each '*' in this diagram represents one point, each '2' represents two points, and so on.
> Example

You can graph up to three y-variables at a time against the x-variable. The first variable is plotted with A's, the second with B's, and the third with C's. Below we graph the price of domestic and foreign cars, dprice and fprice respectively, against weight:

. plot dprice fprice weight

(plot: dprice (A) and fprice (B) against Weight (lbs.); prices run from 3291 to 15906, weight from 1760 to 4840)

The graph indicates that domestic cars typically cost less per pound than foreign cars.
Also See

Related:  Stata Graphics Manual
poisson -- Poisson regression

Syntax

poisson depvar [indepvars] [weight] [if exp] [in range] [, irr level(#) exposure(varname) offset(varname) robust cluster(varname) score(newvar) noconstant constraints(numlist) nolog maximize_options ]

poisgof

by ... : may be used with poisson; see [R] by.
fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.
poisson may be used with sw to perform stepwise estimation; see [R] sw.
poisson shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { n | ir | xb | stdp } nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

poisson estimates a Poisson maximum-likelihood regression of depvar on indepvars, where depvar is a nonnegative count variable.

Persons who have panel data should see [R] xtpois.

poisgof, which may be used following poisson, performs a goodness-of-fit test of the model. If the test is significant, this would indicate that the Poisson regression model is inappropriate. In this case, you could try a negative binomial model; see [R] nbreg.
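A typical workflow, sketched with hypothetical variables y, x1, and x2:

. poisson y x1 x2
. poisgof                     // goodness-of-fit test of the Poisson model
. nbreg y x1 x2               // if poisgof rejects, one alternative; see [R] nbreg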
Options

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. irr may be specified at estimation or when replaying previously estimated results.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1; thus, exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, see [U] 23.13 Weighted estimation, but also see [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvar) creates newvar containing u_j = d lnL_j / d(x_j b) for each observation j in the sample. The score vector is sum u_j x_j, i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

Options for predict

n, the default, calculates the predicted number of events, which is exp(x_j b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j b + offset_j) if offset(varname) was specified; or exp(x_j b) * exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j b), the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
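The equivalence of exposure() and offset() can be seen directly; this sketch uses hypothetical variables y, x, and n:

. generate lnn = ln(n)
. poisson y x, offset(lnn) irr
. poisson y x, exposure(n) irr      // identical results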
Remarks

The basic idea of Poisson regression was outlined by Coleman (1964, 378-379). See Feller (1968, 156-164) for information about the Poisson distribution. See Long (1997, chapter 8), McNeil (1996, chapter 6), and Selvin (1995, chapter 12) for an introduction to Poisson regression. Also see Selvin (1996) for a discussion of the analysis of spatial distributions, including a discussion of the Poisson distribution.

Poisson regression is used to estimate models of the number of occurrences (counts) of an event. The Poisson distribution has been applied to diverse events such as the number of soldiers kicked to death by horses in the Prussian army (Bortkewitsch 1898); the pattern of hits by buzz bombs launched against London during World War II (Clarke 1946); telephone connections to a wrong number (Thorndike 1926); and disease incidence, typically with respect to time, but occasionally with respect to space. The basic assumptions are

1. There is a quantity called the incidence rate that is the rate at which events occur. Examples are 5 per second, 20 per 1,000 person-years, 17 per square meter, and 38 per cubic centimeter.

2. The incidence rate can be multiplied by exposure to obtain the expected number of observed events. For example, a rate of 5 per second multiplied by 30 seconds means 150 events are expected; a rate of 20 per 1,000 person-years multiplied by 2,000 person-years means 40 events are expected; and so on.

3. Over very small exposures epsilon, the probability of finding more than one event is small compared with epsilon.

4. Nonoverlapping exposures are mutually independent.

With these assumptions, to find the probability of k events in an exposure of size E, divide E into n subintervals E_1, E_2, ..., E_n and approximate the answer as the binomial probability of observing k successes in n trials. If you let n go to infinity, you obtain the Poisson distribution.

In the Poisson regression model, the incidence rate for the jth observation is assumed to be given by

    r_j = e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}

If E_j is the exposure, the expected number of events C_j will be

    C_j = E_j\, e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} = e^{\ln(E_j) + \beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}

This model is estimated by poisson. Without the exposure() or offset() options, E_j is assumed to be 1 (equivalent to assuming that exposure is unknown), and controlling for exposure, if necessary, is your responsibility.

One often wants to compare rates, and this is most easily done by calculating incidence rate ratios (IRR). For instance, what is the relative incidence rate of chromosome interchanges in cells as the intensity of radiation increases; the relative incidence rate of telephone connections to a wrong number as load increases; or the relative incidence rate of deaths due to cancer for females relative to males? That is, one wants to hold all the x's in the model constant except one, say the ith. The incidence rate ratio for a one-unit change in x_i is

    \frac{e^{\beta_0 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_k x_k}}{e^{\beta_0 + \cdots + \beta_i x_i + \cdots + \beta_k x_k}} = e^{\beta_i}

More generally, the incidence rate ratio for a \Delta x_i change in x_i is e^{\beta_i \Delta x_i}. The lincom command can be used after poisson to display incidence rate ratios for any group relative to another; see [R] lincom.
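A sketch of the mechanics, with hypothetical variables y, x1, x2, and n:

. poisson y x1 x2, exposure(n)
. lincom x1, irr              // e^{b1}: IRR for a one-unit change in x1
. lincom 2*x1, irr            // e^{2 b1}: IRR for a two-unit change in x1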
> Example

Chatterjee, Hadi, and Price (2000, 164) give the number of injury incidents and the proportion of flights for each airline out of the total number of flights from New York for nine major U.S. airlines in a single year:

. list

       airline   injuries        n   XYZowned
  1.         1         11   0.0950          1
  2.         2          7   0.1920          0
  3.         3          7   0.0750          0
  4.         4         19   0.2078          0
  5.         5          9   0.1382          0
  6.         6          4   0.0540          1
  7.         7          3   0.1292          0
  8.         8          1   0.0503          0
  9.         9          3   0.0629          1

To their data we have added a fictional variable, XYZowned. We will imagine that an accusation has been made that the airlines owned by XYZ Company have a higher injury rate.

. poisson injuries XYZowned, exposure(n) irr

Iteration 0:   log likelihood = -23.027197
Iteration 1:   log likelihood = -23.027177
Iteration 2:   log likelihood = -23.027177

Poisson regression                         Number of obs   =          9
                                           LR chi2(1)      =       1.77
                                           Prob > chi2     =     0.1836
Log likelihood = -23.027177                Pseudo R2       =     0.0370

    injuries        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    XYZowned   1.463467    .406872     1.37   0.171     .8486578    2.523675
           n   (exposure)

We specified irr to see the incidence rate ratios rather than the underlying coefficients. We estimate that XYZ Airlines' injury rate is 1.46 times larger than that for other airlines, but the 95% confidence interval is .85 to 2.52; we cannot even reject the hypothesis that XYZ Airlines has a lower injury rate.
. gen lnN = ln(n)
. poisson injuries XYZowned lnN

Iteration 0:   log likelihood = -22.333875
Iteration 1:   log likelihood = -22.332276
Iteration 2:   log likelihood = -22.332276

Poisson regression                         Number of obs   =          9
                                           LR chi2(2)      =      19.15
                                           Prob > chi2     =     0.0001
Log likelihood = -22.332276                Pseudo R2       =     0.3001

    injuries       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    XYZowned    .6840667   .3895877     1.76   0.079    -.0795111    1.447645
         lnN    1.424169   .3725155     3.82   0.000     .6940517    2.154285
       _cons    4.863891   .7090501     6.86   0.000     3.474178    6.253603

In this case, rather than specifying the exposure() option, we explicitly included the variable that would normalize for exposure in the model. We did not specify the irr option, so we see coefficients rather than incidence rate ratios. We started with the model

    rate = e^{\beta_0 + \beta_1 XYZowned}
    count = n\, e^{\beta_0 + \beta_1 XYZowned} = e^{\ln(n) + \beta_0 + \beta_1 XYZowned}

which amounts to constraining the coefficient on ln(n) to 1. This is what was estimated when we specified the exposure(n) option. In the above model, rather than constraining the coefficient to be 1, we estimated the coefficient ourselves.

The estimated coefficient on ln(n) is 1.42, a respectable distance away from 1, and consistent with our speculation that larger airlines also use larger airplanes. With this small amount of data, however, we also have a wide confidence interval that includes 1.

Our estimated coefficient on XYZowned is now .684, and the implied incidence rate ratio is e^.684 = 1.98 approximately (which we could also see by typing poisson, irr). The 95% confidence interval for the coefficient still includes 0 (the interval for the incidence rate ratio includes 1), so while the point estimate is now larger, we still cannot be very certain of our results.

Our expert opinion would be that, while there is insufficient evidence to support the charge, there is enough evidence to justify collecting more data.
The first step is to enter these data into Stata, which we have done:

. list

       agecat   smokes   deaths   pyears
  1.        1        1       32   52,407
  2.        2        1      104   43,248
  3.        3        1      206   28,612
  4.        4        1      186   12,663
  5.        5        1      102    5,317
  6.        1        0        2   18,790
  7.        2        0       12   10,673
  8.        3        0       28    5,710
  9.        4        0       28    2,585
 10.        5        0       31    1,462
agecat 1 corresponds to 35-44, agecat 2 to 45-54, and so on. The most "natural" analysis of these data would begin by introducing indicator variables for each age category and a single indicator for smoking:

. tab agecat, gen(a)

     agecat        Freq.     Percent        Cum.
          1            2       20.00       20.00
          2            2       20.00       40.00
          3            2       20.00       60.00
          4            2       20.00       80.00
          5            2       20.00      100.00
      Total           10      100.00

. poisson deaths smokes a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -33.823284
Iteration 1:   log likelihood = -33.600471
Iteration 2:   log likelihood = -33.600153
Iteration 3:   log likelihood = -33.600153

Poisson regression                         Number of obs   =         10
                                           LR chi2(5)      =     922.93
                                           Prob > chi2     =     0.0000
Log likelihood = -33.600153                Pseudo R2       =     0.9321

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
      smokes   1.425519   .1530838     3.30   0.001     1.154984    1.759421
          a2   4.410584   .8605197     7.61   0.000     3.009011    6.464997
          a3    13.8392   2.542638    14.30   0.000     9.654328    19.83809
          a4   28.51678   5.269878    18.13   0.000     19.85177    40.96395
          a5   40.45121   7.775511    19.25   0.000     27.75326    58.95885
      pyears   (exposure)

. poisgof

        Goodness-of-fit chi2  =  12.13244
        Prob > chi2(4)        =    0.0164

In the above, we began by using tabulate to create the indicator variables. tabulate created a1 equal to 1 when agecat = 1 and 0 otherwise; a2 equal to 1 when agecat = 2 and 0 otherwise; and so on. See [U] 28 Commands for dealing with categorical variables.

We then estimated our model, specifying irr to obtain incidence rate ratios. We estimate that smokers have 1.43 times the mortality rate of nonsmokers. We also notice, however, that the model does not appear to fit the data well; the goodness-of-fit chi-squared tells us that, given the model, we can reject the hypothesis that these data are Poisson distributed at the 1.64% significance level.

So let us now back up and be more careful. We can most easily obtain the incidence rate ratios within age categories using ir; see [R] epitab:

. ir deaths smokes pyears, by(agecat) nocrude nohet

      agecat        IRR      [95% Conf. Interval]    M-H Weight
           1   5.736638     1.463519    49.39901      1.472169  (exact)
           2   2.138812     1.173668    4.272307      9.624747  (exact)
           3    1.46824     .9863626    2.264174      23.34176  (exact)
           4    1.35606     .9082155     2.09649      23.25315  (exact)
           5   .9047304     .6000946    1.399699      24.31435  (exact)
    Combined   1.424882     1.154704    1.757784
We find that the mortality incidence ratios are greatly different within age category, being highest for the youngest categories and actually dropping below 1 for the oldest. (In the last case, we might argue that those who smoke and who have not died by age 75 are self-selected to be particularly robust.)

Seeing this, we will now parameterize the smoking effects separately for each age category, although we will begin by combining age categories 3 and 4:

. gen sa1 = smokes*(agecat==1)
. gen sa2 = smokes*(agecat==2)
. gen sa34 = smokes*(agecat==3 | agecat==4)
. gen sa5 = smokes*(agecat==5)
. poisson deaths sa1 sa2 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.635422
Iteration 1:   log likelihood = -27.788819
Iteration 2:   log likelihood = -27.573604
Iteration 3:   log likelihood = -27.572645
Iteration 4:   log likelihood = -27.572645

Poisson regression                         Number of obs   =         10
                                           LR chi2(8)      =     934.99
                                           Prob > chi2     =     0.0000
Log likelihood = -27.572645                Pseudo R2       =     0.9443

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
         sa1   5.736638   4.181257     2.40   0.017     1.374811    23.93711
         sa2   2.138812   .6520701     2.49   0.013     1.176691    3.887609
        sa34   1.412229   .2017485     2.42   0.016     1.067343    1.868557
         sa5   .9047304   .1855513    -0.49   0.625     .6052658     1.35236
          a2    10.5631   8.067702     3.09   0.002     2.364153    47.19624
          a3     47.671    34.3741     5.36   0.000     11.60056    195.8978
          a4   98.22766   70.85013     6.36   0.000     23.89324    403.8245
          a5     199.21   145.3357     7.26   0.000     47.67694     832.365
      pyears   (exposure)

. poisgof

        Goodness-of-fit chi2  =  .0774185
        Prob > chi2(1)        =    0.7808
Note that the goodness-of-fit chi-squared is now small; we are no longer running roughshod over the data. Let us now consider simplifying the model. The point estimate of the incidence rate ratio for smoking in age category 1 is much larger than that for smoking in age category 2, but the confidence interval for sa1 is similarly wide. Is the difference real?

. test sa1=sa2

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0

           chi2(  1) =    1.56
         Prob > chi2 =   0.2117

The point estimates may be far apart, but there are insufficient data, and we may be observing random differences. With that success, might we also combine the smokers in age categories 3 and 4 with those in 1 and 2?

. test sa34=sa2, accum

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0
 ( 2)  - [deaths]sa2 + [deaths]sa34 = 0.0

           chi2(  2) =    4.73
         Prob > chi2 =   0.0938

Combining age categories 1 through 4 may be overdoing it; the 9.38% significance level is enough to stop us, although others may disagree.

Thus, we now estimate our final model:

. gen sa12 = (sa1|sa2)
. poisson deaths sa12 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.967194
Iteration 1:   log likelihood = -28.524666
Iteration 2:   log likelihood = -28.514535
Iteration 3:   log likelihood = -28.514535

Poisson regression                         Number of obs   =         10
                                           LR chi2(7)      =     933.11
                                           Prob > chi2     =     0.0000
Log likelihood = -28.514535                Pseudo R2       =     0.9424

      deaths        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        sa12   2.636259   .7408403     3.45   0.001     1.519791    4.572907
        sa34   1.412229   .2017485     2.42   0.016     1.067343    1.868557
         sa5   .9047304   .1855513    -0.49   0.625     .6052658     1.35236
          a2   4.294559   .8385329     7.46   0.000     2.928987    6.296797
          a3   23.42263   7.787716     9.49   0.000     12.20738    44.94164
          a4   48.26309   16.06939    11.64   0.000     25.13068    92.68856
          a5   97.87965   34.30881    13.08   0.000     49.24123     194.561
      pyears   (exposure)

The above strikes us as a fair representation of the data.

Saved Results

poisson saves in e():

Scalars
    e(N)         number of observations
    e(k)         number of variables
    e(k_eq)      number of equations
    e(k_dv)      number of dependent variables
    e(df_m)      model degrees of freedom
    e(r2_p)      pseudo R-squared
    e(ll)        log likelihood
    e(ll_0)      log likelihood, constant-only model
    e(N_clust)   number of clusters
    e(rc)        return code
    e(chi2)      chi-squared
    e(p)         significance
    e(ic)        number of iterations
    e(rank)      rank of e(V)

Macros
    e(cmd)       poisson
    e(depvar)    name of dependent variable
    e(title)     title in estimation output
    e(wtype)     weight type
    e(wexp)      weight expression
    e(clustvar)  name of cluster variable
    e(vcetype)   covariance estimation method
    e(user)      name of likelihood-evaluator program
    e(opt)       type of optimization
    e(chi2type)  Wald or LR; type of model chi-squared test
    e(offset)    offset
    e(cnslist)   constraint numbers
    e(predict)   program used to implement predict

Matrices
    e(b)         coefficient vector
    e(V)         variance-covariance matrix of the estimators
    e(ilog)      iteration log (up to 20 iterations)

Functions
    e(sample)    marks estimation sample
Methods and Formulas

poisson and poisgof are implemented as ado-files.

The Poisson probability is

    \Pr(Y = y) = \frac{e^{-\nu}\,\nu^{y}}{y!}

The log likelihood (with weights w_i and offsets) and scores are given by

    \xi_i = \mathbf{x}_i\boldsymbol{\beta} + \mathrm{offset}_i

    f(y_i) = \frac{e^{-\exp(\xi_i)}\, e^{\xi_i y_i}}{y_i!}

    \ln L = \sum_{i=1}^{n} w_i \left( -e^{\xi_i} + \xi_i y_i - \ln y_i! \right)

    \mathrm{score}(\boldsymbol{\beta})_i = y_i - e^{\xi_i}
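For unweighted data, the log likelihood can be reproduced by hand after estimation. This is a minimal sketch, assuming the dependent variable is named y and a poisson model has just been fit (predict's xb includes any offset, per the options above):

. predict double xi, xb                          // xi = x_j b + offset_j
. generate double lnf = -exp(xi) + xi*y - lngamma(y+1)
. summarize lnf, meanonly
. display "lnL by hand = " r(sum) "    e(ll) = " e(ll)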
References

Bortkewitsch, L. von. 1898. Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Clarke, R. D. 1946. An application of the Poisson distribution. Journal of the Institute of Actuaries 22: 48.

Coleman, J. S. 1964. Introduction to Mathematical Sociology. New York: Free Press.

Doll, R. and A. B. Hill. 1966. Mortality of British doctors in relation to smoking; observations on coronary thrombosis. In Epidemiological Approaches to the Study of Cancer and Other Chronic Diseases, ed. W. Haenszel. National Cancer Institute Monograph 19: 204-268.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

------. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Hilbe, J. and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18-20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186-189.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

McNeil, D. 1996. Epidemiological Research Methods. Chichester, England: John Wiley & Sons.

Poisson, S. D. 1837. Recherches sur la probabilite des jugements en matiere criminelle et en matiere civile, precedes des regles generales du calcul des probabilites. Paris: Bachelier.

Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rothman, K. J. and S. Greenland. 1998. Modern Epidemiology. 2d ed. Philadelphia: Lippincott-Raven.

Rutherford, E., J. Chadwick, and C. D. Ellis. 1930. Radiations from Radioactive Substances. Cambridge: Cambridge University Press.

Selvin, S. 1995. Practical Biostatistical Methods. Belmont, CA: Duxbury Press.

------. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.

Thorndike, F. 1926. Applications of Poisson's probability summation. Bell System Technical Journal 5: 604-624.

Tobias, A. and M. J. Campbell. 1998. sts13: Time-series regression for counts allowing for autocorrelation. Stata Technical Bulletin 46: 33-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 291-296.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi
Related:        [R] epitab, [R] glm, [R] nbreg, [R] svy estimators, [R] xtpois
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
pperron -- Phillips-Perron test for unit roots

Syntax

pperron varname [if exp] [in range] [, noconstant lags(#) trend regress ]

pperron is for use with time-series data; see [R] tsset. You must tsset your data before using pperron.
varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description

pperron performs the Phillips-Perron test for unit roots on a variable. The user may optionally exclude the constant, include a trend term, and/or include lagged values of the difference of the variable in the regression.
Options

noconstant suppresses the constant term (intercept) in the model.

lags(#) specifies the number of Newey-West lags to use in the calculation of the standard error.

trend specifies that a trend term should be included in the associated regression. This option may not be specified if noconstant is specified.

regress specifies that the associated regression table should appear in the output. By default, the regression table is not produced.
Remarks

Hamilton (1994) and Fuller (1976) give excellent overviews of this topic; see especially chapter 17 of the former. Phillips (1986) and Phillips and Perron (1988) present statistics for testing whether a time series has a unit-root, autoregressive component.
> Example

Here, we use the international airline passengers dataset (Box, Jenkins, and Reinsel 1994, Series G). This dataset has 144 observations on the monthly number of international airline passengers from 1949 through 1960.

. pperron air

Phillips-Perron test for unit root          Number of obs   =   143
                                            Newey-West lags =     4

                               Interpolated Dickey-Fuller
             Test        1% Critical    5% Critical    10% Critical
          Statistic         Value          Value           Value
 Z(rho)      -6.564        -19.943        -13.786         -11.057
 Z(t)        -1.844         -3.496         -2.887          -2.577

* MacKinnon approximate p-value for Z(t) = 0.3588

Note that we fail to reject the hypothesis that there is a unit root in this time series by looking either at the MacKinnon approximate asymptotic p-value or at the interpolated Dickey-Fuller critical values.

> Example

In this example, we examine the Canadian lynx data from Newton (1988, 587). Here we include a time trend in the calculation of the statistic.

. pperron lynx, trend

Phillips-Perron test for unit root          Number of obs   =   113
                                            Newey-West lags =     4

                               Interpolated Dickey-Fuller
             Test        1% Critical    5% Critical    10% Critical
          Statistic         Value          Value           Value
 Z(rho)     -38.365        -27.487        -20.752         -17.543
 Z(t)        -4.585         -4.036         -3.448          -3.148

* MacKinnon approximate p-value for Z(t) = 0.0011

We reject the hypothesis that there is a unit root in this time series.
Saved Results

pperron saves in r():

Scalars
    r(N)      number of observations
    r(lags)   number of lagged differences used
    r(pval)   MacKinnon approximate p-value (not included if noconstant specified)
    r(Zt)     Phillips-Perron tau test statistic
    r(Zrho)   Phillips-Perron rho test statistic
Methods and Formulas

pperron is implemented as an ado-file.

In the OLS estimation of an AR(1) process with Gaussian errors,

    y_i = \rho y_{i-1} + \epsilon_i

where \epsilon_i are independent and identically distributed as N(0, \sigma^2) and y_0 = 0, the OLS estimate (based on an n-observation time series) of the autocorrelation parameter \rho is given by

    \hat{\rho}_n = \frac{\sum_{i=1}^{n} y_{i-1} y_i}{\sum_{i=1}^{n} y_i^2}

We know that if |\rho| < 1, then \sqrt{n}(\hat{\rho}_n - \rho) \to N(0, 1-\rho^2). If this result were valid for the case \rho = 1, the resulting distribution would collapse to a point mass (the variance would be zero).

It is this motivation that drives one to check for the possibility of a unit root in an autoregressive process. In order to compute the test statistics, we compute the Phillips-Perron regression

    y_i = \alpha + \rho y_{i-1} + \epsilon_i

where we may exclude the constant or include a trend term (i). There are two statistics, Z_\rho and Z_\tau, calculated as

    Z_\rho = n(\hat{\rho}_n - 1) - \frac{n^2 \hat{\sigma}^2}{2 s_n^2} \left( \hat{\lambda}_n^2 - \hat{\gamma}_{0,n} \right)

    Z_\tau = \sqrt{\frac{\hat{\gamma}_{0,n}}{\hat{\lambda}_n^2}}\; \frac{\hat{\rho}_n - 1}{\hat{\sigma}} - \frac{1}{2}\left( \hat{\lambda}_n^2 - \hat{\gamma}_{0,n} \right) \frac{n \hat{\sigma}}{\hat{\lambda}_n s_n}

    \hat{\gamma}_{j,n} = \frac{1}{n} \sum_{i=j+1}^{n} \hat{u}_i \hat{u}_{i-j}

    \hat{\lambda}_n^2 = \hat{\gamma}_{0,n} + 2 \sum_{j=1}^{q} \left( 1 - \frac{j}{q+1} \right) \hat{\gamma}_{j,n}

    s_n^2 = \frac{1}{n-k} \sum_{i=1}^{n} \hat{u}_i^2

where \hat{u}_i is the OLS residual, k is the number of covariates in the regression, q is the number of Newey-West lags to use in the calculation of \hat{\lambda}_n, and \hat{\sigma} is the OLS standard error of \hat{\rho}.

The critical values (which have the same distribution as the Dickey-Fuller statistic; see Dickey and Fuller (1979)) included in the output are linearly interpolated from the table of values that appear in Fuller (1976), and the MacKinnon approximate p-values use the regression surface published in MacKinnon (1994).
i
BOX,EnglevaoodG. E. P.,cllffs,G. M.Nj:JenkinS,Prentic,__Hall.and G. C. Reinsel. !994. Time Series Analysis: Forecastingand Control.3d ed.
l
Dickey..D A. an_ W. A. Fuller. 1979. Distributionof the estimatorsfor autore_ressive_ time series with a umt root. Journalof theiAmerican Stati;ticalAssociation74: 427-431. Fuller,W. A. 197B.Introduction:o Statistical TimeSeries. New York:John Wiley& Sons. Hakkio.
C
S.
19i)4., sts6:
Apprcximate
p-values
for unit
root
and cointegration
tests.
Srata
Technical
Bulletin
17:
i
25-28. Repriniedin Stata Te_hnical BulletinReprints,vo]. 3, pp. 219-224. Hamilton,J. D. 1}94. Time Seri_s Analysis. Princeton:PrincetonUniversityPress.
i l
MacKinnon.J. G.!1994.Approxilaateasymptoticdistributionfunctionsfor unit-rootand cointegrationtests. Jottrnal(ff Businessand _conomic Statisics 12: 167-176. Newton,H, J. 19,_8,TIMESLAB A Time SeriesLaboratory.PacificGrove.CA: Wadsworth& Brooks/Cole.
I. g
Phillips.P.aC.B. _986. Time series regression,xith a unit root Economemca56: I021-104._."
}
Phillips,P_C. B. _nd R Pen-on. 988. Testingfor a unit root in time series regression.Biomemka 75: 335-346.
!
Also See

Complementary:  [R] tsset
Related:        [R] dfuller
prais
m Prais-Winsten I
regression
and Cochrane-Orcutt
[
I
regression nrl
II
i,
i
:-
Syntax

prais depvar [varlist] [if exp] [in range] [, corc ssesearch rhotype(rhomethod)
    twostep robust cluster(varname) hc2 hc3 noconstant hascons savespace
    nodw level(#) nolog maximize_options ]

prais is for use with time-series data; see [R] tsset. You must tsset your data before using prais.

depvar and varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

prais shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | residuals | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description prais estimates a linear regression of depvar on varlist that is corrected for first-order serially-correlated residuals using the Prais-Winsten (1954) transformed regression estimator, the Cochrane-Orcutt (1949) transformed regression estimator, or a version of the search method suggested by Hildreth and Lu (1960).
Options

corc specifies that the Cochrane-Orcutt transformation be used to estimate the equation. With this option, the Prais-Winsten transformation of the first observation is not performed, and the first observation is dropped when estimating the transformed equation; see Methods and Formulas below.

ssesearch specifies that a search be performed for the value of ρ that minimizes the sum of squared errors of the transformed equation (Cochrane-Orcutt or Prais-Winsten transformation). The search method employed is a combination of quadratic and modified bisection search using golden sections.
rhotype(rhomethod) selects a specific computation for the autocorrelation parameter ρ, where rhomethod can be

    regress    ρ_reg    = β from the residual regression e_t = β e_{t-1}
    freg       ρ_freg   = β from the residual regression e_t = β e_{t+1}
    tscorr     ρ_tscorr = e'e_{t-1} / e'e, where e is the vector of residuals
    dw         ρ_dw     = 1 - dw/2, where dw is the Durbin-Watson d statistic
    theil      ρ_theil  = ρ_tscorr (N - k)/N
    nagar      ρ_nagar  = (ρ_dw N² + k²)/(N² - k²)
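For instance, a minimal sketch of requesting the Durbin-Watson-based computation of ρ (with hypothetical variables y and x) is

    . prais y x, rhotype(dw)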
The prais estimator can use any consistent estimate of ρ to transform the equation, and each of these estimates meets that requirement. The default is regress, and it produces the minimum sum of squares solution (ssesearch option) for the Cochrane-Orcutt transformation; no computation will produce the minimum sum of squares solution for the full Prais-Winsten transformation. See Judge, Griffiths, Hill, Lütkepohl, and Lee (1985) for a discussion of each of the estimates of ρ.

twostep specifies that prais will stop on the first iteration after the equation is transformed by ρ, the two-step efficient estimator. Although it is customary to iterate these estimators to convergence, they are efficient at each step.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). See [U] 23.11 Obtaining robust variance estimates.

Note that all estimates from prais are conditional on the estimated value of ρ. This means that robust variance estimates in this case are only robust to heteroskedasticity and are not generally robust to misspecification of the functional form or omitted variables. The estimation of the functional form is intertwined with the estimate of ρ, and all estimates are conditional on ρ. Thus, we cannot be robust to misspecification of functional form. For these reasons, it is probably best to interpret robust in the spirit of White's (1980) original paper on estimation of heteroskedastic-consistent covariance matrices.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. Specifying cluster() implies robust.

hc2 and hc3 specify an alternative bias correction for the robust variance calculation; for more information, see [R] regress. hc2 and hc3 may not be specified with cluster(). Specifying hc2 or hc3 implies robust.
hascons indicates that a user-defined constant, or a set of variables that in linear combination form a constant, has been included in the regression. For some computational concerns, see the discussion in [R] regress.
sNcifies that pz ais attempt to save as much space as possible by retaining only those
!!
variables for eslimation. Theused original are isrestored after space estimation. This option rarely usedre+uired hnd should g meratty be only data if there insufficient to estimate a modelis
[
without the _ption.
!
nodw suppresses reporting of the Durbin-Watson statistic.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
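As a minimal sketch drawing several of these options together (y, x, and the cluster variable id are hypothetical), each of the following requests a robust variance estimate, since hc3 and cluster() imply robust:

    . prais y x, robust
    . prais y x, hc3
    . prais y x, cluster(id)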
Options for predict

xb, the default, calculates the fitted values, the prediction of x_j b for the specified equation. This is the linear predictor from the estimated regression model; it does not apply the estimate of ρ to prior residuals.

residuals calculates the residuals from the linear prediction.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value. As computed for prais, this is strictly the standard error from the variance in the estimates of the parameters of the linear model, under the assumption that ρ is estimated without error.
Remarks

The most common autocorrelated error process is the first-order autoregressive process. Under this assumption, the linear regression model may be written

$$ y_t = x_t\beta + u_t $$

where the errors satisfy

$$ u_t = \rho u_{t-1} + e_t $$

and the $e_t$ are independent and identically distributed as $N(0,\sigma^2)$. The covariance matrix $\Psi$ of the error term $u$ may then be written as

$$ \Psi = \frac{1}{1-\rho^2}
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{pmatrix} $$
The Prais-Winsten estimator is a generalized least squares (GLS) estimator. The Prais-Winsten method (as described in Judge et al. 1985) is derived from the AR(1) model for the error term described above. Whereas the Cochrane-Orcutt method uses a lag definition and loses the first observation in the iterative method, the Prais-Winsten method preserves that first observation. In small samples, this can be a significant advantage.
Technical Note

To estimate a model with autocorrelated errors, you must specify your data as time series and have (or create) a variable denoting the time at which an observation was collected. The data for the regression should be equally spaced in time.
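For example, if the observations are already in time order but no time variable exists, a minimal sketch is to create one from the observation number and declare it:

    . generate t = _n
    . tsset t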
Example

You wish to estimate a time-series model of usr on idle but are concerned that the residuals may be serially correlated. We will declare the variable t to represent time by typing

    . tsset t

We can obtain Cochrane-Orcutt estimates by specifying the corc option:

    . prais usr idle, corc
    Iteration 0:  rho = 0.0000
    Iteration 1:  rho = 0.3518
     (output omitted)
    Iteration 13: rho = 0.5708

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      29
    -------------+------------------------------      F(  1,    27) =    6.49
           Model |  40.1309584     1  40.1309584      Prob > F      =  0.0168
        Residual |  166.898474    27  6.18142498      R-squared     =  0.1938
    -------------+------------------------------      Adj R-squared =  0.1640
           Total |  207.029433    28  7.39390831      Root MSE      =  2.4862

    (coefficient table omitted; the estimated equation appears below)

    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.4662

The estimated model is

    usr_t = -.125 idle_t + 14.55 + u_t        and        u_t = .5708 u_{t-1} + e_t
We can also estimate the model with the Prais-Winsten method:

    . prais usr idle
    Iteration 0:  rho = 0.0000
    Iteration 1:  rho = 0.3518
     (output omitted)
    Iteration 14: rho = 0.5535

    Prais-Winsten AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      30
    -------------+------------------------------      F(  1,    28) =    7.12
           Model |  43.0076941     1  43.0076941      Prob > F      =  0.0125
        Residual |  169.165739    28  6.04163354      R-squared     =  0.2027
    -------------+------------------------------      Adj R-squared =  0.1742
           Total |  212.173433    29  7.31632528      Root MSE      =   2.458

    ------------------------------------------------------------------------------
             usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            idle |  -.1356522   .0472195    -2.87   0.008    -.2323769   -.0389275
           _cons |   15.20415   4.160391     3.65   0.001     6.681978    23.72633
    -------------+----------------------------------------------------------------
             rho |      .5535
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.476004

where the Prais-Winsten estimated model is

    usr_t = -.1357 idle_t + 15.20 + u_t        and        u_t = .5535 u_{t-1} + e_t

As the results indicate, for these data there is little to choose between the Cochrane-Orcutt and Prais-Winsten estimators, whereas the OLS estimate of the slope parameter is substantially different.
Example

We have data on quarterly sales, in millions of dollars, for five years, and we would like to use this information to model sales for company X. First, we estimate a linear model by OLS and obtain the Durbin-Watson statistic using dwstat; see [R] regression diagnostics.

    . regress csales isales

          Source |       SS       df       MS         Number of obs =      20
    -------------+------------------------------      F(  1,    18) =14888.15
           Model |  110.256901     1  110.256901      Prob > F      =  0.0000
        Residual |  .133302302    18  .007405683      R-squared     =  0.9988
    -------------+------------------------------      Adj R-squared =  0.9987
           Total |  110.390204    19  5.81001072      Root MSE      =  .08606

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1762828   .0014447   122.02   0.000     .1732475    .1793181
           _cons |  -1.454753   .2141461    -6.79   0.000    -1.904657   -1.004849
    ------------------------------------------------------------------------------

    . dwstat

    Durbin-Watson d-statistic(  2,    20) =  .7347276
Noting that the Durbin-Watson statistic is far from 2 (the expected value under the null hypothesis of no serial correlation) and well below the 5% lower limit of 1.2, we conclude that the disturbances are serially correlated. (Upper and lower bounds for the d statistic can be found in most econometrics texts, e.g., Harvey 1993. The bounds have been derived for only a limited combination of regressors and observations.) We correct for the autocorrelation using the ssesearch option of prais to search for the value of ρ that minimizes the sum of squared residuals of the Cochrane-Orcutt transformed equation. Normally the default Prais-Winsten transformation would be used with such a small dataset, but the less efficient Cochrane-Orcutt transformation will allow us to demonstrate an aspect of the estimator's convergence.
    . prais csales isales, corc ssesearch
    Iteration 1:  rho = 0.8944, criterion = -.07298558
    Iteration 2:  rho = 0.8944, criterion = -.07298558
     (output omitted)
    Iteration 15: rho = 0.9588, criterion = -.07167037

    Cochrane-Orcutt AR(1) regression -- SSE search estimates

          Source |       SS       df       MS         Number of obs =      19
    -------------+------------------------------      F(  1,    17) =  553.14
           Model |  2.33199178     1  2.33199178      Prob > F      =  0.0000
        Residual |  .071670369    17  .004215904      R-squared     =  0.9702
    -------------+------------------------------      Adj R-squared =  0.9684
           Total |  2.40366215    18  .133536786      Root MSE      =  .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761624
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419
It was noted in the Options section that, with the default computation of ρ, the Cochrane-Orcutt method produces an estimate of ρ that minimizes the sum of squared residuals, the same criterion as the ssesearch option. Given that the two methods produce the same results, why would the search method ever be preferred? It turns out that the back-and-forth iterations employed by Cochrane-Orcutt can often have difficulty converging if the value of ρ is large. Using the same data, the Cochrane-Orcutt iterative procedure requires over 350 iterations to converge, and a higher tolerance must be specified to prevent premature convergence:

    . prais csales isales, corc tol(1e-9) iterate(500)
    Iteration 0:   rho = 0.0000
    Iteration 1:   rho = 0.5312
    Iteration 2:   rho = 0.5866
    Iteration 3:   rho = 0.7161
    Iteration 4:   rho = 0.7373
    Iteration 5:   rho = 0.7550
     (output omitted)
    Iteration 377: rho = 0.9588
    Iteration 378: rho = 0.9588
    Iteration 379: rho = 0.9588

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS         Number of obs =      19
    -------------+------------------------------      F(  1,    17) =  553.14
           Model |  2.33199171     1  2.33199171      Prob > F      =  0.0000
        Residual |  .071670369    17  .004215904      R-squared     =  0.9702
    -------------+------------------------------      Adj R-squared =  0.9684
           Total |  2.40366208    18  .133536782      Root MSE      =  .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761625
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------

    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419

Once convergence is achieved, the two methods produce identical results.
q
Saved Results

prais saves in e():

Scalars
    e(N)        number of observations
    e(mss)      model sum of squares
    e(df_m)     model degrees of freedom
    e(rss)      residual sum of squares
    e(df_r)     residual degrees of freedom
    e(r2)       R-squared
    e(r2_a)     adjusted R-squared
    e(F)        F statistic
    e(rmse)     root mean square error
    e(ll)       log likelihood
    e(N_clust)  number of clusters
    e(rho)      autocorrelation parameter ρ
    e(dw)       Durbin-Watson d statistic of transformed regression
    e(dw_0)     Durbin-Watson d statistic for untransformed regression
    e(tol)      target tolerance
    e(max_ic)   maximum number of iterations
    e(ic)       number of iterations
    e(N_gaps)   number of gaps

Macros
    e(cmd)       prais
    e(depvar)    name of dependent variable
    e(clustvar)  name of cluster variable
    e(rhotype)   method specified in rhotype() option
    e(method)    twostep, iterated, or SSE search
    e(vcetype)   covariance estimation method
    e(tranmeth)  corc or prais
    e(cons)      noconstant or not reported
    e(predict)   program used to implement predict

Matrices
    e(b)  coefficient vector
    e(V)  variance-covariance matrix of the estimators

Functions
    e(sample)  marks estimation sample
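As a minimal sketch, the saved results can be inspected after estimation in the usual way (y and x are hypothetical variable names):

    . prais y x
    . display e(rho)
    . matrix list e(b)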
Methods and Formulas

prais is implemented as an ado-file.

Consider the command 'prais y x z'. The 0-th iteration is obtained by estimating a, b, and c from the standard linear regression

$$ y_t = a x_t + b z_t + c + u_t $$

An estimate of the correlation in the residuals is then obtained. By default, prais uses the auxiliary regression

$$ u_t = \rho u_{t-1} + e_t $$
This can be changed to any of the computations noted in the rhotype() option.

Next we apply a Cochrane-Orcutt transformation (1) for observations t = 2, ..., n

$$ y_t - \rho y_{t-1} = a(x_t - \rho x_{t-1}) + b(z_t - \rho z_{t-1}) + c(1-\rho) + v_t \qquad (1) $$

and the transformation (1') for t = 1

$$ \sqrt{1-\rho^2}\,y_1 = a\sqrt{1-\rho^2}\,x_1 + b\sqrt{1-\rho^2}\,z_1 + c\sqrt{1-\rho^2} + \sqrt{1-\rho^2}\,v_1 \qquad (1') $$

Thus, the differences between the Cochrane-Orcutt and the Prais-Winsten methods are that the latter uses equation (1') in addition to equation (1), whereas the former uses only equation (1) and necessarily decreases the sample size by one.

Equations (1) and (1') are used to transform the data and obtain new estimates of a, b, and c. When the twostep option is specified, the estimation process is halted at this point, and these are the estimates reported. Under the default behavior of iterating to convergence, this process is repeated until the change in the estimate of ρ is within a specified tolerance.

The new estimates are used to produce fitted values

$$ \widehat{y}_t = \widehat{a}x_t + \widehat{b}z_t + \widehat{c} $$

and then ρ is re-estimated, by default using the regression defined by

$$ y_t - \widehat{y}_t = \rho(y_{t-1} - \widehat{y}_{t-1}) + u_t \qquad (2) $$

We then re-estimate equation (1) using the new estimate of ρ and continue to iterate between (1) and (2) until the estimate of ρ converges. Convergence is declared after iterate() iterations or when the absolute difference in the estimated correlation between two iterations is less than tol(); see [R] maximize. Sargan (1964) has shown that this process will always converge.

Under the ssesearch option, a combined quadratic and bisection search using golden sections is used to search for the value of ρ that minimizes the sum of squared residuals from the transformed equation. The transformation may be either the Cochrane-Orcutt (1 only) or the Prais-Winsten (1 and 1').

All reported statistics are based on the ρ-transformed variables, and there is an assumption that ρ is estimated without error. See Judge et al. (1985) for details.
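To make the transformation concrete, the following minimal sketch performs a single Cochrane-Orcutt step by hand for the usr/idle example of the Remarks; the scalar r and the variables yt and xt are created here only for illustration, and the data are assumed to be tsset:

    * hold the current estimate of rho in a scalar
    . scalar r = .5708
    * apply equation (1) to the dependent variable and the regressor
    . generate double yt = usr - r*L.usr
    . generate double xt = idle - r*L.idle
    * re-estimate the slope and intercept from the transformed data
    . regress yt xt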
The Durbin-Watson d statistic reported by prais and dwstat is

$$ d = \frac{\sum_{j=1}^{n-1}\left(u_{j+1} - u_j\right)^2}{\sum_{j=1}^{n} u_j^2} $$

where $u_j$ represents the residual of the jth observation.
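As a minimal sketch, d can be reproduced by hand after any regression (the data are assumed tsset; u, d2, and u2 are hypothetical variable names created only for illustration):

    . regress y x
    . predict double u, residuals
    * squared successive differences and squared residuals
    . generate double d2 = (u - L.u)^2
    . generate double u2 = u^2
    * summarize saves the column sum in r(sum)
    . summarize d2
    . scalar num = r(sum)
    . summarize u2
    . display "Durbin-Watson d = " num/r(sum)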
Acknowledgment

We thank Richard Dickens of the Centre for Economic Performance at the London School of Economics and Political Science for testing and assistance with an early version of this command.
References

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Cochrane, D. and G. H. Orcutt. 1949. Application of least-squares regression to relationships containing autocorrelated error terms. Journal of the American Statistical Association 44: 32-61.

Durbin, J. and G. S. Watson. 1950 and 1951. Testing for serial correlation in least-squares regression. Biometrika 37: 409-428 and 38: 159-178.

Hardin, J. W. 1995. sts10: Prais-Winsten regression. Stata Technical Bulletin 25: 26-29. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 234-237.

Harvey, A. C. 1993. The Econometric Analysis of Time Series. Cambridge, MA: MIT Press.

Hildreth, C. and J. Y. Lu. 1960. Demand relations with autocorrelated disturbances. Agricultural Experiment Station Technical Bulletin 276. East Lansing, MI: Michigan State University.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Prais, S. J. and C. B. Winsten. 1954. Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.

Sargan, J. D. 1964. Wages and prices in the United Kingdom: a study in econometric methodology. In Econometric Analysis for National Economic Planning, ed. P. E. Hart, G. Mills, and J. K. Whitaker, 25-64. London: Butterworths.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary: [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce, [R] xi
Related:       [R] regress, [R] regression diagnostics
Background:    [U] 16.5 Accessing coefficients and standard errors,
               [U] 23 Estimation and post-estimation commands,
               [U] 23.11 Obtaining robust variance estimates
Title

predict -- Obtain predictions, residuals, etc., after estimation

Syntax

After single-equation (SE) estimators

    predict [type] newvarname [if exp] [in range] [, xb stdp nooffset other_options ]

After multiple-equation (ME) estimators

    predict [type] newvarname [if exp] [in range] [, equation(eqno[, eqno]) xb stdp
        stddp nooffset other_options ]
Description

predict calculates predictions, residuals, influence statistics, and the like after estimation. Exactly what predict can do is determined by the previous estimation command; command-specific options are documented with each estimation command. Regardless of command-specific options, the actions of predict share certain similarities across estimation commands:

1) Typing predict newvarname creates newvarname containing "predicted values", numbers related to the E(y_j|x_j). For instance, after linear regression, predict newvarname creates x_j b and, after probit, creates the probability Φ(x_j b).

2) predict newvarname, xb creates newvarname containing x_j b. This may be the same result as (1) (e.g., linear regression) or different (e.g., probit), but regardless, option xb is allowed.

3) predict newvarname, stdp creates newvarname containing the standard error of the linear prediction x_j b.

4) predict newvarname, other_options may create newvarname containing other useful quantities; see help or the reference manual entry for the particular estimation command to find out about other available options.

5) Adding the nooffset option to any of the above requests that the calculation ignore any offset or exposure variable specified by including the offset(varname) or exposure(varname) options when you estimated the model.

predict can be used to make in-sample or out-of-sample predictions:

6) In general, predict calculates the requested statistic for all possible observations, whether they were used in estimating the model or not. predict does this for standard options (1) through (3) and generally does this for estimator-specific options (4).

7) To restrict the prediction to the estimation subsample, type

    . predict newvarname if e(sample), ...

8) Some statistics make sense only with respect to the estimation subsample. In such cases, the calculation is automatically restricted to the estimation subsample, and the documentation for the specific option states this. Even so, you can still specify if e(sample) if you are uncertain.
9) predict's ability to make out-of-sample predictions even extends to other datasets. In particular, you can

    . use ds1
    (estimate a model)
    . use two              /* another dataset */
    . predict hat, ...     /* fill in the predictions */
Options

xb calculates the linear prediction from the estimated model. That is, all models can be thought of as estimating a set of parameters b1, b2, ..., bk, and the linear prediction is yhat_j = b1 x1j + b2 x2j + ... + bk xkj, often written in matrix notation as yhat_j = x_j b. In the case of linear regression, the values yhat_j are called the predicted values or, for out-of-sample predictions, the forecast. In the case of logit and probit models, for example, yhat_j is called the logit or probit index. It is important to understand that the x1j, x2j, ..., xkj used in the calculation are obtained from the data currently in memory and do not have to correspond to the data on the independent variables used to estimate the model (obtaining b1, b2, ..., bk).

stdp calculates the standard error of the prediction after any estimation command. Here the prediction is understood to mean the same thing as the "index", namely x_j b. The statistic produced by stdp can be thought of as the standard error of the predicted expected value, or mean index, for the observation's covariate pattern. This is also commonly referred to as the standard error of the fitted value. The calculation can be made in or out of sample.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_{1j}b - x_{2j}b) between equations 1 and 2 is calculated.
equation(eqno[, eqno]), synonym outcome(), is relevant only when you have previously estimated a multiple-equation model. It specifies to which equation you are referring.

equation() is typically filled in with one eqno; it would be filled in that way with options xb and stdp, for instance. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours.

If you do not specify equation(), results are as if you specified equation(#1).

Other statistics refer to between-equation concepts; stddp is an example. In those cases, you might specify equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.
nooffset may be combined with most statistics and specifies that the calculation should be made ignoring any offset or exposure variable specified when the model was estimated. This option is available even if not documented for predict after a specific command. If neither the offset(varname) option nor the exposure(varname) option was specified when the model was estimated, specifying nooffset does nothing.

other_options refers to command-specific options that are documented with each command.
Remarks

Remarks are presented under the headings

    Estimation-sample predictions
    Out-of-sample predictions
    Residuals
    Single-equation (SE) estimation
    Multiple-equation (ME) estimation

Most of the examples are presented using linear regression, but the general syntax is applicable to all estimators.

One can think of any estimation command as estimating a set of coefficients b1, b2, ..., bk corresponding to the variables x1, x2, ..., xk, along with a (possibly empty) set of ancillary statistics γ1, γ2, ..., γm. All estimation commands save the bi's and γi's. predict accesses that saved information and combines it with the data currently in memory to make various calculations. For instance, the linear prediction is yhat_j = b1 x1j + b2 x2j + ... + bk xkj. Among other things, predict can make that calculation. The data on which predict makes the calculation can be the same data on which the model was estimated or a different dataset; it does not matter. predict uses the saved parameter estimates from the model and, for each observation in the data, obtains the corresponding values of x, and then combines them to produce the desired result.

Estimation-sample predictions

Example
:
We have a _4-observatio I dataset on automobiles, including the mileage rating (mpg), the car's }
weight (_eigh!),
and wheth._r the car is foreign ffo_eign).
• regres_ _!
mpg weight
SoUrce
I
M(del
I
Resit [ual
I
_ T_tal
I
I
......
Number
of obs =
22
I
427.990298
Prob
> F
=
0.0005
20
24.493666_
R-squared
=
0.4663
2t
43.7077922
Adj R'squared Root MSE
= =
0.4396 4.9491
917.8
_3636
_
!
............ C_ef. -.01 )426
._?ns
48. !)183
To obtain the
MS
489.8 T3338
If we were to ty _e predict
I
df
427,9_0298
weight
1 i_
if foreign _S
mpg
!
We estimate the model
Std. Err. .0024942 5.871851
t
P>[t [
[95Y, Conf.
Interval]
-4.18
0.000
-.0156287
- .0052232
8,3'3
O. 000
36,66983
61,16676
mpg now, we would obtain
e linear predictions for all 74 observations.
_edictions _iusI for the sample on which we estimated the model, we could type
. predict I pmpg
if e(s_unple)
(option (52 missihg x_ assumed; values ge_erated) f:.tted values) !
:
!
In this e×ample_
I
e_;timatedI the nlodel and the: e are no missing values among the relevam variables. Had there been missing ,,,alues._e (sample) ,'ould also account for t_ose.
e(sample)
is true only for foreign cars because we typed
if
foreign
when we
!
I
By the way, the if e(sample) restriction can be used with any Stata command, so to obtain summary statistics on the estimation sample, we could type

    . summarize if e(sample)
    (output omitted)

q
Out-of-sample predictions

By out-of-sample predictions, we mean predictions extending beyond the estimation sample. In the example above, typing predict pmpg would generate linear predictions using all 74 observations.

predict will work on other datasets, too. You can use a new dataset and type predict to obtain results for that sample.
Example

Using the same auto dataset, assume that you wish to estimate the model

    mpg = β1 weight + β2 weight² + β3 foreign + β4

We first create the weight2 variable and then type the regress command:

    . use auto
    (1978 Automobile Data)
    . generate weight2=weight^2
    . regress mpg weight weight2 foreign

          Source |       SS       df       MS         Number of obs =      74
    -------------+------------------------------      F(  3,    70) =   52.25
           Model |  1689.15372     3   563.05124      Prob > F      =  0.0000
        Residual |   754.30574    70  10.7757963      R-squared     =  0.6913
    -------------+------------------------------      Adj R-squared =  0.6781
           Total |  2443.45946    73  33.4720474      Root MSE      =  3.2827

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
         weight2 |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
         foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
           _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913
    ------------------------------------------------------------------------------

Were we to type predict pmpg now, we would obtain predictions for all 74 cars in the current data. Instead, we are going to use a new dataset.

The dataset newautos.dta contains the make, weight, and place of manufacture of two cars, the Pontiac Sunbird and the Volvo 260. Let's use the dataset and create the predictions:

    . use newautos
    (New Automobile Models)
    . list
                make   weight    foreign
     1. Pont. Sunbird    2690   Domestic
     2.     Volvo 260    3170    Foreign

    . predict mpg
    (option xb assumed; fitted values)
    weight2 not found
    r(111);

Things did not work. We typed predict mpg, and Stata responded with the message "weight2 not found". predict can calculate predicted values on a different dataset only if that dataset contains the variables that went into the model. In this case, our data do not contain a variable called weight2. weight2 is just the square of weight, so we can create it and try again:

    . generate weight2=weight^2
    . predict mpg
    (option xb assumed; fitted values)
    . list
                make   weight    foreign    weight2        mpg
     1. Pont. Sunbird    2690   Domestic    7236100   23.47137
     2.     Volvo 260    3170    Foreign   1.00e+07   17.78846

We obtained our predicted values. The Pontiac Sunbird has a predicted mileage rating of 23.5 mpg, whereas the Volvo 260 has a predicted rating of 17.8 mpg. By way of comparison, the actual mileage ratings are 24 for the Pontiac and 17 for the Volvo.
q
Residuals

Example

With many estimators, predict can calculate more than predicted values. With most regression-type estimators, we can, for instance, obtain residuals. Using our regression example, we return to our original data and obtain residuals by typing

    . use auto, clear
    (1978 Automobile Data)
    . predict double resid, residuals
    . summarize resid

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
       resid |     74   -1.78e-15    3.214491  -5.636126   13.85172

Notice that we did this without re-estimating the model. Stata always remembers the last set of estimates, even as we use new datasets.

It was not necessary to type the double in predict double resid, residuals, but we wanted to remind you that you can specify the type of a variable in front of the variable's name; see [U] 14.4.2 Lists of new variables. We made the new variable resid a double rather than the default float.

If you want your residuals to have a mean as close to zero as possible, remember to request the extra precision of double. If we had not specified double, the mean of resid would have been roughly 10^-8 rather than 10^-14. Although 10^-14 sounds more precise than 10^-8, the difference really does not matter.

For linear regression, predict can also calculate standardized residuals and studentized residuals with the options rstandard and rstudent; for examples, see [R] regression diagnostics.
Single-equation (SE) estimation

If you have not read the discussion above on using predict after linear regression, please do so. Also note that predict's default calculation almost always produces a statistic in the same metric as the dependent variable of the estimated model, e.g., predicted counts for Poisson regression. In any case, xb can always be specified to obtain the linear prediction.

predict is also willing to calculate the standard error of the prediction, which is obtained by using the inverse-matrix-of-second-derivatives estimate for the covariance matrix of the estimators.
Example

After most binary outcome models (e.g., logistic, logit, probit, cloglog, scobit), predict calculates the probability of a positive outcome if you do not tell it otherwise. You can specify the xb option if you want the linear prediction (also known as the logit or probit index). The odd abbreviation xb is meant to suggest Xβ. In logit and probit models, for example, the predicted probability is p = F(Xβ), where F() is the logistic or normal cumulative distribution function, respectively.

    . logistic foreign mpg weight
    (output omitted)
    . predict phat
    (option p assumed; Pr(foreign))
    . predict idxhat, xb
    . summarize foreign phat idxhat

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
     foreign |     74    .2972973    .4601885          0          1
        phat |     74    .2972973    .3052979    .000729   .8980594
      idxhat |     74   -1.678202    2.321509  -7.223107   2.175845

Since this is a logit model, we could obtain the predicted probabilities ourselves from the predicted index

    . gen phat2 = exp(idxhat)/(1+exp(idxhat))

but using predict without options is easier.

q
Example

For all models, predict attempts to produce a predicted value in the same metric as the dependent variable of the model. We have seen that for dichotomous outcome models, the default statistic produced by predict is the probability of a success. Similarly, for Poisson regression, the default statistic produced by predict is the predicted count for the dependent variable. You can always specify the xb option to obtain the linear combination of the coefficients with an observation's x values (the inner product of the coefficients and x values). For poisson (without an explicit exposure), this is the natural log of the count.

    . poisson injuries XYZowned
    (output omitted)
    . predict injhat
    (option n assumed; predicted number of events)
    . predict idx, xb
    . gen exp_idx = exp(idx)
    . summarize injuries injhat exp_idx idx

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
    injuries |      9    7.111111    5.487359          1         19
      injhat |      9    7.111111     .833333          6   7.666667
     exp_idx |      9    7.111111     .833333          6   7.666667
         idx |      9    1.955174     .122561   1.791759   2.036882

We note that our "hand-computed" prediction of the count (exp_idx) exactly matches what was produced by the default operation of predict.
If our model has an exposure-time variable, we can use predict to obtain the linear prediction with or without the exposure. Let's verify what we are getting by obtaining the linear prediction with and without exposure, transforming these predictions to count predictions, and comparing them with the default count prediction from predict. We must remember to multiply by the exposure time when using predict ..., nooffset.

    . poisson injuries XYZowned, exposure(n)
    (output omitted)
    . predict double injhat
    (option n assumed; predicted number of events)
    . predict double idx, xb
    . gen double exp_idx = exp(idx)
    . predict double idxn, xb nooffset
    . gen double exp_idxn = exp(idxn)*n
    . summarize injuries injhat exp_idx exp_idxn idx idxn

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
    injuries |      9    7.111111    5.487359          1         19
      injhat |      9    7.111111     3.10936   2.919621   12.06158
     exp_idx |      9    7.111111     3.10936   2.919621   12.06158
    exp_idxn |      9    7.111111     3.10936   2.919621   12.06158
         idx |      9    1.869722    .4671044   1.071454   2.490025
        idxn |      9     4.18814    .1904042   4.061204   4.442013

Looking at the identical means and standard deviations for injhat, exp_idx, and exp_idxn, we see that it is possible to reproduce the default computations of predict for poisson estimations. We have also demonstrated the relationship between the count predictions and the linear predictions with and without exposure.

q
Multiple-equation (ME) estimation

If you have not read the above discussion on using predict after SE estimation, please do so. With the exception of the ability to select specific equations to predict from, the use of predict after ME models follows almost exactly the same form as it does for SE models.

Example

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of ME commands that do not have separate discussions on obtaining predictions would also be well-advised to read the predict section in [R] mlogit, even if their interest is not in multinomial logistic regression. As a general introduction to the ME models, we will demonstrate predict after sureg:

    . sureg (price foreign displ) (weight foreign length)
    Seemingly unrelated regression

    Equation      Obs  Parms      RMSE     "R-sq"       chi2        P
    price          74      2  2202.447     0.4348   45.20554   0.0000
    weight         74      2  245.5238     0.8988   658.8548   0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    price        |
         foreign |   3137.894   697.3805     4.50   0.000     1771.054    4504.735
    displacement |   23.06938   3.443212     6.70   0.000     16.32081    29.81795
           _cons |   680.8438   859.8142     0.79   0.428    -1004.361    2366.049
    -------------+----------------------------------------------------------------
    weight       |
         foreign |   -154.883    75.3204    -2.06   0.040    -302.5082   -7.257674
          length |   30.67594   1.531981    20.02   0.000     27.67331    33.67856
           _cons |  -2699.498   302.3912    -8.93   0.000    -3292.173   -2106.822
    ------------------------------------------------------------------------------

sureg estimated two equations, one called price and the other weight; see [R] sureg.

    . predict pred_p, equation(price)
    (option xb assumed; fitted values)
    . predict pred_w, equation(weight)
    (option xb assumed; fitted values)
    . summarize price pred_p weight pred_w

    Variable |    Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
       price |     74    6165.257    2949.496       3291      15906
      pred_p |     74    6165.257    1678.805    2664.81   10485.33
      weight |     74    3019.459    777.1936       1760       4840
      pred_w |     74    3019.459    726.0468   1501.602   4447.996

You may specify the equation by name, as we did above, or by number: equation(#1) means the same thing as equation(price) in this case.
Methods and Formulas

Denote the previously estimated coefficient vector by b and its estimated variance matrix by V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let us write x_j for the jth observation currently in memory.

The predicted value (xb option) is defined as

$$ \widehat{y}_j = x_j b + \text{offset}_j $$

The standard error of the prediction (stdp) is defined as

$$ s_{pj} = \sqrt{x_j V x_j'} $$

The standard error of the difference in linear predictions between equations 1 and 2 is defined as

$$ s_{dpj} = \sqrt{(x_{1j},\, -x_{2j},\, 0,\, \ldots,\, 0)\; V\; (x_{1j},\, -x_{2j},\, 0,\, \ldots,\, 0)'} $$

See the individual estimation commands for the computation of command-specific predict statistics.
Also See

Related:    [R] regress, [R] regression diagnostics, [P] _predict
Background: [U] 23 Estimation and post-estimation commands
Title

probit -- Maximum-likelihood probit estimation

Syntax

probit depvar [indepvars] [weight] [if exp] [in range] [, level(#) noconstant
    robust cluster(varname) score(newvarname) asis offset(varname) nocoef
    maximize_options ]

dprobit [depvar indepvars [weight] [if exp] [in range]] [, at(matname) classic
    probit_options ]

by ... : may be used with probit and dprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

probit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { p | xb | stdp } rules asif nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

probit estimates a maximum-likelihood probit model.

dprobit estimates maximum-likelihood probit models and is an alternative to probit. Rather than reporting the coefficients, dprobit reports the change in the probability for an infinitesimal change in each independent, continuous variable and, by default, reports the discrete change in the probability for dummy variables. You may not specify the noconstant option with dprobit. probit may be typed without arguments after dprobit estimation to see the model in coefficient form.

If estimating on grouped data, see the bprobit command described in [R] glogit.

A number of auxiliary commands may be run after probit, logit, or logistic; see [R] logistic for a description of these commands. See [R] logistic for a list of related estimation commands.
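As a minimal sketch of how the two commands relate (using the automobile data that appear in the examples below):

    . probit foreign weight mpg
    . dprobit foreign weight mpg
    . probit

The first command reports coefficients in the Z metric, the second reports the marginal effects dF/dx, and the bare probit redisplays the dprobit model in coefficient form.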
Options

Options for probit

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nocoef specifies that the coefficient table is not to be displayed. This option is sometimes used by programmers but is of no use interactively.

noconstant suppresses the constant term (intercept) in the probit model. This option is not available for dprobit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis requests that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variables and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. Specifying this option forces retention of perfect predictor variables and their associated perfectly predicted observations.

offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for dprobit

at(matname) specifies the point around which the transformation of results is to be made. The default is to perform the transformation around x̄, the mean of the independent variables. If there are k independent variables, matname may be 1 x k or 1 x (k+1); that is, it may optionally include a final element 1 reflecting the constant. at() may be specified when the model is estimated or when results are redisplayed.

classic requests that the mean effects be calculated using the formula f(x̄b)b_i in all cases. If classic is not specified, f(x̄b)b_i is used for continuous variables, but the mean effects for dummy variables are calculated as Φ(x̄₁b) - Φ(x̄₀b). Here x̄₁ is x̄ but with element i set to 1, x̄₀ is x̄ but with element i set to 0, and x̄ is the mean of the independent variables or the vector specified by at(). classic may be specified at estimation time or when the results are redisplayed. Results calculated without classic may be redisplayed with classic, and vice versa.

probit_options are any of the options allowed by probit; see Options for probit, above.

Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model.

nooffset is relevant only if you specified offset(varname) for probit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
dependent
(left-
> Example You have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a probit model explaining whether a ca" is foreign based on its weight and mileage. Here is an overview of your data:
r I
_'_
'
!dec I
Contain_
!
size:
_
i
variabl_
i
make mpg
I
data
obs: v_rs:
from
°,3
_uto.dta
7_ 'I 1,9911
name
1978 7 Jul (99,7Z
stora_ type
of
memory
display format
'/,-18s _8.Og
weight
!
int
_8.0gc
foreign
!
byte
_,8.0g
Data
free)
value label
strl int
Aatomobile 2000 13:51
variable
label
Make and Model Mileage (mpg) Weight origin
Car
(Ibs.)
type
S_rted _y: foreign No_e: ! . inspect
dataset
las
changed
since
last
saved
foreign
foreign: Car type
Numberof Observations
i
Total
!
#
*
Negative
# #
'_
#
#
r
# 0 (2 !
|
_ique
NonIntegers
Integers -
Zero Positlve
52 22
52 22
Total
74
74
Missing 1
74
values
f_reign
is
la_eled
and
all
values
ar_
documented
in
the
label.
The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car.

The model you wish to estimate is

    Pr(foreign = 1) = Φ(β0 + β1 weight + β2 mpg)

where Φ is the cumulative normal distribution.

To estimate this model, you type

    . probit foreign weight mpg
    Iteration 0:  log likelihood =  -45.03321
    Iteration 1:  log likelihood = -29.244141
     (output omitted)
    Iteration 5:  log likelihood = -26.844189

    Probit estimates                          Number of obs   =         74
                                              LR chi2(2)      =      36.38
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.844189               Pseudo R2       =     0.4039

    ------------------------------------------------------------------------------
         foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0023355   .0005661    -4.13   0.000    -.0034451   -.0012259
             mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
           _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
    ------------------------------------------------------------------------------

You find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant. See [R] maximize for an explanation of the output.

q
Technical Note

Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus, if your dependent variable takes on the values 0 and 1, 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated as successes.

If you prefer a more formal mathematical statement, when you type probit y x, Stata estimates the model

    Pr(y_j ≠ 0 | x_j) = Φ(x_j b)

where Φ is the standard cumulative normal.
Robust standard errors

If you specify the robust option, probit reports robust standard errors as described in [U] 23.11 Obtaining robust variance estimates. In the case of the model of foreign on weight and mpg, the robust calculation increases the standard error of the coefficient on mpg by almost 15 percent:

    . probit foreign weight mpg, robust nolog

    Probit estimates                          Number of obs   =         74
                                              Wald chi2(2)    =      30.26
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.844189               Pseudo R2       =     0.4039

    ------------------------------------------------------------------------------
                 |             Robust
         foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0023355   .0004934    -4.73   0.000    -.0033025   -.0013686
             mpg |  -.1039503   .0593548    -1.75   0.080    -.2202836    .0123829
           _cons |   8.275464   2.539176     3.26   0.001      3.29877    13.25216
    ------------------------------------------------------------------------------

Without robust, the standard error for the coefficient on mpg was reported to be .052 with a resulting confidence interval of [-.21, -.00].

robust with the cluster() option has the ability to relax the independence assumption required by the probit estimator to being just independence between clusters. To demonstrate this, we will switch to a different dataset.

You are studying unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. For our purposes, we will use the variables age (the women were 14-26 in 1968, and your data thus span the age range of 16-46), grade (years of schooling completed, ranging from 0 to 18), not_smsa (28% of the person-time was spent living outside an SMSA, a standard metropolitan statistical area), south (41% of the person-time was in the South), and southXt (south interacted with year, treating 1970 as year 0). You also have the variable union. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union.
You estimate the following model, ignoring that the women are observed an average of 5.9 times each in these data:

    . probit union age grade not_smsa south southXt
    Iteration 0:  log likelihood =  -13864.23
    Iteration 1:  log likelihood = -13548.436
    Iteration 2:  log likelihood = -13547.308
    Iteration 3:  log likelihood = -13547.308

    Probit estimates                          Number of obs   =      26200
                                              LR chi2(5)      =     633.84
                                              Prob > chi2     =     0.0000
    Log likelihood = -13547.308               Pseudo R2       =     0.0229

    ------------------------------------------------------------------------------
           union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0059461   .0015798     3.76   0.000     .0028496    .0090425
           grade |   .0263901   .0036651     7.20   0.000     .0192066    .0335735
        not_smsa |  -.1303911   .0202523    -6.44   0.000    -.1700848   -.0906975
           south |  -.4027254    .033989   -11.85   0.000    -.4693426   -.3361081
         southXt |   .0033088   .0029253     1.13   0.258    -.0024247    .0090423
           _cons |  -1.113091   .0657808   -16.92   0.000    -1.242019   -.9841628
    ------------------------------------------------------------------------------
The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, you find a large southern effect against unionization and little time trend. The robust and cluster() options provide a way to estimate this model and obtain correct standard errors:

    . probit union age grade not_smsa south southXt, robust cluster(id)
    Iteration 0:  log likelihood =  -13864.23
    Iteration 1:  log likelihood = -13548.436
    Iteration 2:  log likelihood = -13547.308
    Iteration 3:  log likelihood = -13547.308

    Probit estimates                          Number of obs   =      26200
                                              Wald chi2(5)    =     165.75
                                              Prob > chi2     =     0.0000
    Log likelihood = -13547.308               Pseudo R2       =     0.0229

                              (standard errors adjusted for clustering on idcode)
    ------------------------------------------------------------------------------
                 |             Robust
           union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0059461   .0023567     2.52   0.012      .001327    .0105651
           grade |   .0263901   .0078378     3.37   0.001     .0110282    .0417518
        not_smsa |  -.1303911   .0404109    -3.23   0.001     -.209595   -.0511873
           south |  -.4027254   .0514458    -7.83   0.000    -.5035573   -.3018935
         southXt |   .0033088   .0039793     0.83   0.406    -.0044904    .0111081
           _cons |  -1.113091   .1188478    -9.37   0.000    -1.346028   -.8801534
    ------------------------------------------------------------------------------
These standard errors are roughly 50% larger than those reported by the inappropriate conventional calculation. By comparison, another model we could estimate is an equal-correlation population-averaged probit model:

    . xtprobit union age grade not_smsa south southXt, i(id) pa
    Iteration 1: tolerance = .04796083
    Iteration 2: tolerance = .00352657
    Iteration 3: tolerance = .00017886
    Iteration 4: tolerance = 8.654e-06
    Iteration 5: tolerance = 4.150e-07
probit -- Maximum-likelihood probit estimation GEE population-averaged Group variable: Link: Family: Correlation:
model
Scale parameter: .
,
Number of obs Number of groups Obs per group: min avg max Wald chi2(5) Prob > chi2
idcode probit binomial exchangeable 1
union
Coef.
age grade not_smsa south southXt _cons
.0031597 .0329992 -.0721799 -.409029 .0081828 -I.184799
Std. Err, .0014678 .0062334 .0275189 .0372213 .002545 .0890117
z 2.15 5.29 -2.62 -10.99 3.22 -13.31
P>IzI 0.031 0.000 0,009 0.000 0.001 0.000
= = = = = = =
26200 4434 1 5.9 12 241.66 0.0000
[95_ Conf. Interval] .0002829 .020782 -.1261159 -.4819815 .0031946 -1.359259
.0060366 .0452163 -.0182439 -.3360765 .0131709 -1.01034
The coefficient estimates are similar, but these standard errors are smaller than those produced by probit, robust cluster(). This is as we would expect. If the equal-correlation assumption is valid, the population-averaged probit estimator above should be more efficient.

Is the assumption valid? That is a difficult question to answer. The population-averaged estimates correspond to an assumption of exchangeable correlation within person. It would not be unreasonable to assume an AR(1) correlation within person, or to assume that the observations are correlated but that we do not wish to impose any structure. See [R] xtgee for full details.

What is important to understand is that probit, robust cluster() is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation.
dprobit

A probit model is defined

    Pr(y_j ≠ 0 | x_j) = Φ(x_j b)

where Φ is the standard cumulative normal distribution and x_j b is called the probit score or index.

Since x_j b has a normal distribution, interpreting probit coefficients requires thinking in the Z (normal quantile) metric. For instance, pretend we estimated the probit equation

    Pr(y_j ≠ 0) = Φ(.08233 x1 + 1.529 x2 - 3.139)

The interpretation of the x1 coefficient is that each one-unit increase in x1 leads to increasing the probit index by .08233 standard deviations. Learning to think in the Z metric takes practice and, even if you do, communicating results to others who have not learned to think this way is difficult.

A transformation of the results helps some people think about them. The change in the probability somehow feels more natural, but how big that change is depends on where we start. Why not choose as a starting point the mean of the data? If x̄1 = 21.29 and x̄2 = .42, then we would report something like .0257, meaning the change in the probability calculated at the mean. We could make the calculation as follows. The mean normal index is .08233 × 21.29 + 1.529 × .42 - 3.139 = -.7440, and the corresponding probability is Φ(-.7440) = .2284. Adding our coefficient of .08233 to the index and recalculating the probability, we obtain Φ(-.7440 + .08233) = .2541. Thus, the change in the probability is .2541 - .2284 = .0257.
r
probit-- Maximum-likelihood probitestimation
587
In practice, people make this calculation somewhat differently and produce a slightly different number. Rather than make the calculation for a one-unit change in x, they calculate the slope of the probability function. Doing a little calculus, they derive that the change in the probability for a change in x1 is

$$ \frac{\partial p}{\partial x_1} = \phi(\bar{x}b)\, b_1 $$

the height of the normal density multiplied by the x1 coefficient. Going through this calculation, they obtain .0249. The difference between .0257 and .0249 is not much; they differ because .0257 is the exact answer for a one-unit increase in x1, whereas .0249 is the answer for an infinitesimal change, extrapolated out.
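Both numbers can be verified by hand; the following is a minimal sketch using Stata's norm() and normden() functions with the index values from the text:

    * the mean normal index from the example above
    . scalar xb = .08233*21.29 + 1.529*.42 - 3.139
    * exact one-unit change in probability: .0257
    . display norm(xb + .08233) - norm(xb)
    * infinitesimal (slope) answer: .0249
    . display normden(xb)*.08233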
dprobit with the classic option transforms results as an infinitesimal change extrapolated out.

Example

Consider the automobile data again:
    . use auto, clear
    (1978 Automobile Data)
    . gen goodplus = rep78>=4 if rep78~=.
    (5 missing values generated)
    . dprobit foreign mpg goodplus, classic
    Iteration 0:  log likelihood = -42.400729
    Iteration 1:  log likelihood = -27.643138
    Iteration 2:  log likelihood = -26.953126
    Iteration 3:  log likelihood = -26.942119
    Iteration 4:  log likelihood = -26.942114

    Probit estimates                          Number of obs   =         69
                                              LR chi2(2)      =      30.92
                                              Prob > chi2     =     0.0000
    Log likelihood = -26.942114               Pseudo R2       =     0.3646

    ------------------------------------------------------------------------------
     foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
    ---------+--------------------------------------------------------------------
         mpg |   .0249187   .0110853     2.30   0.022   21.2899   .003192  .046646
    goodplus |     .46276   .1187437     3.81   0.000    .42029   .230027  .695493
       _cons |  -.9499603   .2281006    -3.82   0.000         1  -1.39703  -.502891
    ---------+--------------------------------------------------------------------
      obs. P |   .3043478
     pred. P |   .2286624  (at x-bar)
    ------------------------------------------------------------------------------
        z and P>|z| are the test of the underlying coefficient being 0
After estimation with dprobit, the untransformed coefficient results can be seen by typing probit without options:

. probit

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

  foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      mpg |   .0823331   .0358292    2.30   0.022      .0121091    .152557
 goodplus |   1.528992   .4010866    3.81   0.000       .742877   2.315108
    _cons |  -3.138737   .8209689   -3.82   0.000     -4.747807  -1.529668
W,,..L,,t
--
r_ux.llU.l-,iKellnOOO
esUmat|on
proDIt
There is one case in which one can argue that the classic, infinitesimal-change-based adjustment could be improved upon, and that is the case of a dummy variable. A dummy variable is a variable that takes on the values 0 and 1 only; 1 indicates that something is true and 0 that it is not. goodplus is such a variable. It is natural to summarize its effect by asking how much goodplus being true changes the outcome probability over that of goodplus being false. That is, "at the means", the predicted probability of foreign for a car with goodplus = 0 is Phi(.08233 x 21.29 - 3.139) = .0829. For the same car with goodplus = 1, the probability is Phi(.08233 x 21.29 + 1.529 - 3.139) = .5569. The difference is thus .5569 - .0829 = .4740. When we do not specify the classic option, dprobit makes the calculation for dummy variables in this way. Even though we estimated the model with the classic option, we can redisplay results with it omitted:
. dprobit

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

   foreign |     dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
-----------+-------------------------------------------------------------------
       mpg |  .0249187    .0110853    2.30   0.022   21.2899   .003192  .046646
 goodplus* |  .4740077    .1114816    3.81   0.000    .42029   .255508  .692508
-----------+-------------------------------------------------------------------
    obs. P |  .3043478
   pred. P |  .2286624  (at x-bar)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
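The discrete-change figure for goodplus can be verified by hand from the underlying coefficients; a sketch using normprob() and the estimates shown above:

. display normprob(.08233*21.2899 + 1.529 - 3.139) - normprob(.08233*21.2899 - 3.139)

which displays approximately .4740, agreeing with the dF/dx entry for goodplus.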
Technical Note

at(matname) allows you to evaluate effects at points other than the means. Let's obtain the effects for the above model at mpg = 20 and goodplus = 1:

. matrix myx = (20,1)

. dprobit, at(myx)

Probit estimates                                  Number of obs =         69
                                                  LR chi2(2)    =      30.92
                                                  Prob > chi2   =     0.0000
Log likelihood = -26.942114                       Pseudo R2     =     0.3646

   foreign |     dF/dx   Std. Err.      z    P>|z|        x   [    95% C.I.   ]
-----------+-------------------------------------------------------------------
       mpg |  .0328237    .0144157    2.30   0.022       20    .004569  .061078
 goodplus* |  .4468843    .1130835    3.81   0.000        1    .225245  .668524
-----------+-------------------------------------------------------------------
    obs. P |  .3043478
   pred. P |  .2286624  (at x-bar)
   pred. P |  .5147238  (at x)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
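The predicted probability "at x" can likewise be checked against the underlying coefficients; a sketch:

. display normprob(.08233*20 + 1.529 - 3.139)

which displays roughly .5147 (any small discrepancy reflects rounding of the displayed coefficients).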
Model identification

The probit command has one more feature, and it is probably the most useful. It will automatically check the model for identification and, if the model is underidentified, drop whatever variables and observations are necessary for estimation to proceed.
> Example

Have you ever estimated a probit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following small amount of data:

    Outcome y    Independent variable x
        0                   1
        0                   1
        0                   1
        1                   0

Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which in turn means that the probit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem.

Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that, at each step, making the coefficient a little bigger or a little smaller works wonders. It continues on its merry way until either (1) the whole thing comes crashing to the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meanwhile, you have been waiting. In addition, the estimates that you finally receive, if any, may be nothing more than numerical roundoff.
Stata watches for these sorts of problems, alerts you, fixes them, and then properly estimates the model.
Let's return to our automobile data. Among the variables in the data is one called repair that takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data:
             |            repair
    Car type |      1       2       3 |   Total
-------------+------------------------+--------
    Domestic |     10      27       9 |      46
     Foreign |      0       3       9 |      12
-------------+------------------------+--------
       Total |     10      30      18 |      58
Notice that all the cars with poor repair records (repair==1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair==1 category would have to be zero. This in turn means that the probit coefficient must be minus infinity, and that would set most computer programs buzzing.
Let's try Stata on this problem. First, we make up two new variables, rep_is_1 and rep_is_2, that indicate the repair category:

. generate rep_is_1 = repair==1

. generate rep_is_2 = repair==2
The statement generate rep_is_1 = repair==1 creates a new variable, rep_is_1, that takes on the value 1 when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_is_2, which takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate our model:

. probit foreign rep_is_1 rep_is_2
note: rep_is_1~=0 predicts failure perfectly
      rep_is_1 dropped and 10 obs not used

Iteration 0:  log likelihood = -26.992087
Iteration 1:  log likelihood = -22.276479
Iteration 2:  log likelihood = -22.229184
Iteration 3:  log likelihood = -22.229138

Probit estimates                                  Number of obs =         48
                                                  LR chi2(1)    =       9.53
                                                  Prob > chi2   =     0.0020
Log likelihood = -22.229138                       Pseudo R2     =     0.1765

   foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+-----------------------------------------------------------------
  rep_is_2 |  -1.281552    .4297324   -2.98   0.003     -2.123812  -.4392916
     _cons |   1.21e-16     .295409    0.00   1.000      -.578991    .578991
Remember that all the cars with poor repair records (rep_is_1) are domestic, so the model cannot be estimated, or at least it cannot be estimated if we restrict ourselves to finite coefficients. Stata noted that fact. It said, "note: rep_is_1~=0 predicts failure perfectly". This is Stata's mathematically precise way of saying what we said in English: when rep_is_1 is not equal to 0, the car is domestic. Stata then went on to say, "rep_is_1 dropped and 10 obs not used". This is Stata eliminating the problem. First, the variable rep_is_1 had to be removed from the model because it would have an infinite coefficient. Then, the 10 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated.
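The empty cell that causes the problem can be seen directly with tabulate before estimating; a sketch (the counts follow from the repair tabulation given earlier):

. tabulate rep_is_1 foreign

 rep_is_1 |  Domestic    Foreign |     Total
----------+----------------------+----------
        0 |        36         12 |        48
        1 |        10          0 |        10
----------+----------------------+----------
    Total |        46         12 |        58

The rep_is_1==1 row contains no foreign cars, which is exactly the condition probit reported.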
Technical Note

Stata is pretty smart about catching these problems. It will catch "one-way causation by a dummy variable", as we demonstrated above.
Stata also watches for "two-way causation", that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated.

Stata also checks your data for collinear variables; it will say "so-and-so dropped due to collinearity". No observations need to be eliminated in this case, and model estimation will proceed without the offending variable.

It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included age in our model, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
probit (and logit and logistic) will also occasionally display messages such as

    note: 4 failures and 0 successes completely determined.

The cause of this message and what to do if you get it are described in [R] logit.
Obtaining predicted values

Once you have estimated a probit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination x_j b, where x_j are the independent variables in the jth observation and b is the estimated parameter vector. This is known as the index function, since the cumulative density indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data.
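For instance, to obtain the index and then convert it to a probability yourself (a sketch; the new variable names idx and phat are hypothetical):

. predict idx, xb

. generate phat = normprob(idx)

phat will equal, observation by observation, the probability that predict calculates without arguments.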
One can calculate the unadjusted-for-replicated-covariate-patterns diagonal elements of the hat matrix, or leverage, by typing

. predict pred

. predict stdp, stdp

. generate hat = stdp^2*pred*(1-pred)
> Example

In the previous example, we estimated the probit model probit foreign rep_is_1 rep_is_2. To obtain predicted probabilities, we type

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)

. summarize foreign p

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, probit dropped rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules

. summarize foreign p p2
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5
      p2 |      58    .2068966    .2016268         0         .5
predict's asif option will ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif

. summarize foreign p p2 p3

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
 foreign |      58    .2068966    .4086186         0          1
       p |      48         .25    .1956984        .1         .5
      p2 |      58    .2068966    .2016268         0         .5
      p3 |      58    .2931034    .2016268        .1         .5
Which is right? By default, predict uses the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is correct only if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
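A quick way to see which observations are affected (a sketch): the 10 rule-excluded cars are exactly those with a missing default prediction but a nonmissing rules prediction,

. count if p==. & p2~=.
10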
Performing hypothesis tests

After estimation with probit, you can perform hypothesis tests using the test or testnl commands; see [U] 23 Estimation and post-estimation commands.
Saved Results

probit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         probit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
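These saved results can be used directly in later calculations. For instance, after any probit estimation (a sketch, reusing the model from the earlier example):

. quietly probit foreign mpg goodplus

. display "pseudo R-squared = " e(r2_p)

. matrix list e(b)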
dprobit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(pbar)        fraction of successes observed in data
    e(xbar)        average probit score
    e(offbar)      average offset

Macros
    e(cmd)         dprobit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict
    e(dummy)       string of blank-separated 0s and 1s; 0 means the corresponding
                   independent variable is not a dummy, 1 means that it is

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(dfdx)        marginal effects
    e(se_dfdx)     standard errors of the marginal effects

Functions
    e(sample)      marks estimation sample
Methods and Formulas

Probit analysis originated in connection with bioassay, and the word probit, a contraction of "probability unit", was suggested by Bliss (1934). For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), Long (1997), or Powers and Xie (2000).
The log-likelihood function for probit is

    \ln L = \sum_{j \in S} w_j \ln \Phi(x_j b) + \sum_{j \notin S} w_j \ln\{1 - \Phi(x_j b)\}

where Phi is the cumulative standard normal distribution, S is the set of observations with a positive outcome, and w_j denotes the optional weights. ln L is maximized as described in [R] maximize.

If robust standard errors are requested, the calculation described in Methods and Formulas of [R] regress is carried forward with u_j = \{\phi(x_j b)/\Phi(x_j b)\} x_j for the positive outcomes and -[\phi(x_j b)/\{1 - \Phi(x_j b)\}] x_j for the negative outcomes, where phi is the normal density. q_c is given by its asymptotic-like formula.
Turning to dprobit, which is implemented as an ado-file, let b and V denote the coefficients and variance matrix calculated by probit, and let b_i refer to the ith element of b. For continuous variables, or for all variables if classic is specified, dprobit reports

    b_i^* = \phi(\bar{x} b)\, b_i

The corresponding variance matrix is D V D', where D = \phi(\bar{x} b)\{I - (\bar{x} b)\, b \bar{x}\}.
For dummy variables taking on values 0 and 1, when classic is not specified, dprobit makes the discrete calculation associated with the dummy changing from 0 to 1:

    b_i^* = \Phi(\bar{x}_1 b) - \Phi(\bar{x}_0 b)

where \bar{x}_0 = \bar{x}_1 = \bar{x}, except that the ith elements of \bar{x}_0 and \bar{x}_1 are set to 0 and 1, respectively. The variance of b_i^* is given by d V d', where d = \phi(\bar{x}_1 b)\,\bar{x}_1 - \phi(\bar{x}_0 b)\,\bar{x}_0. Note that in all cases, dprobit reports test statistics z_i based on the underlying coefficients b_i.
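The dummy-variable calculation can be replicated from probit's coefficients; a sketch for the goodplus effect in the earlier model (variable names from that example):

. quietly probit foreign mpg goodplus

. quietly summarize mpg if e(sample)

. display normprob(_b[mpg]*r(mean) + _b[goodplus] + _b[_cons]) - normprob(_b[mpg]*r(mean) + _b[_cons])

which reproduces the .4740077 reported by dprobit.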
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357-365.

Bliss, C. I. 1934. The method of probits. Science 79: 38-39, 409-410.

Hamilton, L. C. 1992. Regression with Graphics. Pacific Grove, CA: Brooks/Cole Publishing Company.

Hilbe, J. 1996. sg54: Extended probit regression. Stata Technical Bulletin 32: 20-21. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 131-132.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Powers, D. A. and Y. Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego, CA: Academic Press.
Also See

Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] roc, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] biprobit, [R] clogit, [R] cusum, [R] glm, [R] glogit, [R] hetprob,
                [R] logistic, [R] logit, [R] scobit, [R] svy estimators, [R] xtclog,
                [R] xtgee, [R] xtlogit, [R] xtprobit

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
l
Title priest--iOneI ! . llll !
,
i
a
tw -sTple
i
testsi uof proportions Ii i lll
I
i
-
"
1
Syntax

    prtest varname = # [if exp] [in range] [, level(#)]

    prtest varname1 = varname2 [if exp] [in range] [, level(#)]

    prtest varname [if exp] [in range], by(groupvar) [level(#)]

    prtesti #obs1 #p1 #p2 [, level(#) count]

    prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]

by ... : may be used with prtest (but not prtesti); see [R] by.
Description

prtest performs tests of the equality of proportions using large-sample statistics. In the first form, prtest tests that varname has a proportion of #. In the second form, prtest tests that varname1 and varname2 have the same proportion. In the third form, prtest tests that varname has the same proportion within the two groups defined by groupvar.

prtesti is the immediate form of prtest; see [U] 22 Immediate commands.

The bitest command is a better version of the first form of prtest in that it gives exact p-values. Researchers are advised to use bitest when possible, especially for small samples; see [R] bitest.
Options

by(groupvar) specifies a numeric variable that contains the group information for a given observation. This variable must have only two values. Do not confuse the by() option with the by ... : prefix; both may be specified.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

count specifies that integer counts instead of proportions are being used in the immediate forms of prtest. In the first syntax, prtesti expects #obs1 and #p1 to be counts, #p1 <= #obs1, and expects #p2 to be a proportion. In the second syntax, prtesti expects all four numbers to be integer counts, with #obs1 >= #p1 and #obs2 >= #p2.
Remarks

The prtest output follows the output of ttest in providing a lot of information. Each proportion is presented along with a confidence interval. The appropriate one- or two-sample test is performed, and the two-sided and both one-sided results are included at the bottom of the output. In the case of a two-sample test, the calculated difference is also presented with its confidence interval. This command may be used for both large-sample testing and large-sample interval estimation.
> Example

In the first form, prtest tests whether the proportion in the sample is equal to a known constant. Assume you have a sample of 74 automobiles. You wish to test whether the proportion of automobiles that are foreign is different from 40 percent.

. prtest foreign=.4

One-sample test of proportion               foreign: Number of obs =       74

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
 foreign |  .2972973    .0531331   5.59533   0.0000      .1931583   .4014363

Ho: proportion(foreign) = .4

 Ha: foreign < .4           Ha: foreign ~= .4           Ha: foreign > .4
     z = -1.803                 z = -1.803                  z = -1.803
 P < z = 0.0357             P > |z| = 0.0713            P > z = 0.9643

The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is .40 at the 5% significance level.
> Example

We have two headache remedies that we give to patients. Each remedy's effect is recorded as 0 for failing to relieve the headache and 1 for relieving it. We wish to test the equality of the proportion of people relieved by the two treatments.

. prtest cure1=cure2

Two-sample test of proportion                 cure1: Number of obs =       50
                                              cure2: Number of obs =       59

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   cure1 |       .52    .0706541    7.3598   0.0000      .3815205   .6584795
   cure2 |  .7118644    .0589618   12.0733   0.0000      .5963013   .8274275
---------+--------------------------------------------------------------------
    diff | -.1918644    .0920245                         -.372229  -.0114998
         | under Ho:    .0931155   -2.0605   0.0394

Ho: proportion(cure1) - proportion(cure2) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.060                 z = -2.060                  z = -2.060
 P < z = 0.0197             P > |z| = 0.0394            P > z = 0.9803

You find that the proportions are statistically different from each other at any level greater than 3.9%.
Immediate form
> Example

prtesti is like prtest except that you specify summary statistics rather than variables as arguments. For instance, you are reading an article which reports the proportion of registered voters among 50 randomly selected eligible voters as .52. You wish to test whether the proportion is .7:

. prtesti 50 .52 .70

One-sample test of proportion                     x: Number of obs =       50

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |       .52    .0706541    7.3598   0.0000      .3815205   .6584795

Ho: proportion(x) = .7

 Ha: x < .7                 Ha: x ~= .7                 Ha: x > .7
     z = -2.777                 z = -2.777                  z = -2.777
 P < z = 0.0027             P > |z| = 0.0055            P > z = 0.9973
> Example

In order to judge teacher effectiveness, we wish to test whether the same proportion of people from two classes will answer an advanced question correctly. In the first classroom of 30 students, 40% answered the question correctly, whereas in the second classroom of 45 students, 67% answered the question correctly:

. prtesti 30 .4 45 .67

Two-sample test of proportion                     x: Number of obs =       30
                                                  y: Number of obs =       45

Variable |      Mean   Std. Err.         z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |        .4    .0894427   4.47214   0.0000      .2246955   .5753045
       y |       .67    .0700952   9.55843   0.0000       .532616    .807384
---------+--------------------------------------------------------------------
    diff |      -.27    .1136368                        -.4927241  -.0472759
         | under Ho:    .1169416  -2.30885   0.0210

Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.309                 z = -2.309                  z = -2.309
 P < z = 0.0105             P > |z| = 0.0210            P > z = 0.9895
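The displayed z statistic can be reproduced by hand from the pooled proportion; a sketch:

. display (30*.4 + 45*.67)/(30 + 45)
.562

. display (.4 - .67)/sqrt(.562*(1-.562)*(1/30 + 1/45))

which displays approximately -2.309, the value shown in the output.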
Saved Results

prtest and prtesti save in r():

Scalars
    r(z)      z statistic
    r(P_#)    proportion for variable #
    r(N_#)    number of observations for variable #
Methods and Formulas

prtest and prtesti are implemented as ado-files.

A large-sample (1 - \alpha)100% confidence interval for a proportion p is

    \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}

and a (1 - \alpha)100% confidence interval for the difference of two proportions is given by

    (\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}

where \hat{q} = 1 - \hat{p} and z is calculated from the inverse normal distribution.

The one-tailed and two-tailed tests of a population proportion use a normally distributed test statistic calculated as

    z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}

where p_0 is the hypothesized proportion. A test of the difference of two proportions also uses a normally distributed test statistic, calculated as

    z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p \hat{q}_p (1/n_1 + 1/n_2)}}

where

    \hat{p}_p = \frac{x_1 + x_2}{n_1 + n_2}

and x_1 and x_2 are the total number of successes in the two populations.
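For instance, the one-sample test of the foreign proportion shown earlier can be reproduced by hand; a sketch:

. display (.2972973 - .4)/sqrt(.4*.6/74)

which displays approximately -1.803, matching the reported z.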
Also See

Related:      [R] bitest, [R] ci, [R] hotel, [R] oneway, [R] sdtest, [R] signrank, [R] ttest

Background:   [U] 22 Immediate commands