Fundamentals of Item Response Theory Ronald K. Hanlbleton H. Swaminathan D. Jane Rogers
.1
SAGE PUBLICATIONS The Inter...
104 downloads
1026 Views
6MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
Fundamentals of Item Response Theory Ronald K. Hanlbleton H. Swaminathan D. Jane Rogers
.1
SAGE PUBLICATIONS The International Professional Publishers Newbury Park london New Deihl
Copyright © 1991 by Sage Publications, Inc. All rights reserved. No part of this hook may he reproduced or utilized in any form or by any means. electronic or mechanical, including photocopying. recording. or by any information storage and retrieval system. without permission in writing from the puhlisher.
..
For information address: SAGE Publications, Inc. 24SS Teller Road Newbury Park. California 91320 SAGE Publication. Ltd. 6 Bonhill Street London EC2A 4PU Uniled Kingdom SAGE Publications India Pvt. Ltd. M·32 Market Grellter Kailash I New Delhi 110048 India
Printed in the United Stales of America Library of Congress Cataloging-in-Publication Data
Hambleton, Ronald K. Fundamentals of item response theory I Ronald K. Hambleton. II. Swaminnthan, H. Jane Rogers. p. crn. - (Measurement methods for the socilliscien{:es ; v. 2) Includes bibliographical references and index. ISBN 0-8039-3646-X (cloth). ISBN 0·8039-3647-8 (phk.) I. Item response theory. I. Swami . II. Rogers, H. Jane. III. Title. IV. Series. ~"ERUN/Vfk BFI76JB4 1991 Srtr 91-22005 150'.28'7-dc20 5 JAN 1M
FIRST PRINTING, 1991 Sage Production Editor: Diane S. Fosler
Contents
Series Editor's Foreword Preface I.
vii ix
Background
2. Concepts, Models, and Features
7
3. Ability and Item Parameter ESlimation
32
4. Assessment of Model-Data Fit
53
5. The Ability Scale
77
6. Item and Test Information and Efficiency Functions
91
7. Test Construction
99
8. Identification of Potentially Biased Test Items
109
9. Test Score Equating
123
10. Computeri7.ed Adaptive Testing
145
1L Future Directions of hem Response Theory
153
Appendix A: Classical and JRT Parameter Estimates for the New Mexico Slate Proficiency Exam
156
Appendix B: Sources for IRT Computer Programs
159
References
161
Index
169
About the Authors
173
Series Editor's Foreword
In the last decade we have witnessed a revolution in educational and psychological measurement as the application of classical measurement theory has given way to the use of item response theory (lRT). Today, IRT is used commonly by the largest testing companies in the United States and Europe for design of tests, test assembly, test scaling anti calibration, construction of test item hanks, investigations of test item bias. and otlier common procedures in the test development process. Measurement researchers, public school systems, the military, and several civilian hnlllches of the federal governlllt~nt as well, have endorsed and employed IRT with increasing enthusiasm and frequency. This book provides a lucid hut rigorous introduction to the fundamental concepts of item response theory, followed by thorough. Rccessible descriptions of the application of IRT methods to problems in test construction, identification of potentially hiased test items, test equating, and computerized-adaptive testing. A summary of new directions in IRT research and deVelopment completes the hook. Hambleton. Swaminathan and Rogers have developed IRT theory and application through carefully wrought arguments. appeals to familiar concepts from classical measllfement methods and hasic statistics, and extensively described. step-by-step numerical examples. The book is well illustrated with tables containing the results of actllallRT analyses and figures that portray the influence on IRT results of such fundamental issues as models with differing numbers of item parameters, examinees at differing levels of ability. and varying degrees of congruence ("fit") between sets of data and IRT models. Although equations necessary to compute most IRT statistics are provided in the book, their mathematical derivations have been omilled. Nevertheless, this is not a "cookbook" on IRT methods. The reader will find thorough discussion of altcHHltivC" procedures for estimating IRT vii
viii
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
parameters-maximum likelihood estimation, marginal maximum likelihood estimation. Bayesian estimation. and so on. Knowledge of the underlying calculus is not required to understand the origins of these procedures and the distinctions among them. Hambleton et al. have been faithful to the goal of the Measurement Methods ror the Social Sciences series, to make complex measurement concep.ts, topics, and methods accessible to readers with limited mathematic~1 backgrounds but a keen desire to understand. as well as use, methods that are on the cutting edge of social science assessment. This book introduces powerful new measurement concepts and applications in ways that can be understood and used correctly by thousands for whom IRT heretofore has been no more than a fascinating mystery. RICHARD M. JAEGER
University of North Carolina at Greensboro
\
Preface
The popular (or classical) measurement models and procedures for constructing educational and psychological tests and interpreting test scores have served testing specialists well for a long time. A review of test catalogs and recent editions of the Mental Measurements Yearbook and Test Critiques would reveal that numerous achievement, aptitude, and personal ity tests have been constructed using these classical models and procedures. The ways in which educational and psychological tests usually are constructed, evaluated. and used have many well-documented shortcomings of. however (see, for example. Hambleton, 1989). These shortcomings include (a) use of item indices whose values depend on the particular group of examinees with which they are obtained, and (b) examinee ability estimates that depend on the particular choice of items selected for a test. Psychometricians have advanced II new measurement system. item response theory (IRT), to address these and other shortcomings of common measurement practices. In the 19808, item response theory was one of the dominant topics of study among measurement specialists. Many IRT models of current interest will be described in this book. Because item response theory provides a useful framework for solving a wide variety of measurement problems. many lest publishers, state and provincial departments of education. credentialing agencies, school districts. armed forces. and industries use item response theory to assist in building tesls. identifying potentiaIly biased test items, equating scores from different tests or different forms of the same test, and reporting test scores. Item response theory has many other promising applications as well. Several of these applications will be discussed in some detail in this book. Why publish an IRT book al Ihis lime? Interest in learning about this new measurement theory and in applying it is worldwide. and the need exists for practical instructional material. The purpose of this book, ix
J(
PllNDAMENTALS 01' ITEM RESPONSE TIIEORY
therefore, is to provide a comprehensive and practical illtroduction to the field of item response theory. The limitations of classicaimcaslllc, ment procedures arc addressed to provide a rationale for an alternalive psychometric model. The fundamentals of item respollse theory. il1clll
)
1 Background
Consider a Iypical measurement practitioner. Dr. Testmaker works for a company that specializes in the development and analysis of achievement and aptitude tests. The tests developed by Dr. Teslmaker's company are used in awarding high school diplomas, promoting students from one grade to the next, evaluating the quality of education, identifying workers in need of training, and cretientinling practitioners in a wide variety of professions. Dr. Testmaker knows that the company's clients expect high quality lests, tests that meet their needs and thaI can stand up technically to legal challenges. Dr. Testmaker refers to the AERA/APAINCME Stanf!ards for Educational and Psychological Testing (1985) and is familiar with the details of a number of lawsuits Ihat have arisen because of questions about test quality or tcst misuse. Dr. Testmaker's company uses classical test theory models and methods to address most of its technical problems (e.g., item selection, reliability assessment, test score equating), but recently its clients have been suggesting-and sometimes requiring--that item response theory (IRT) he lIsed wilh their lests. Dr. Tesllllaker has only a rudimentary knowledge of item response theory and no previous experience in appfying it. and consequently he has many questions, such as the following: I. What JRT models are available. ali(I which model should he used'! 2. Which of Ihe many availahle algorilhms should he lIsed to eslimale paramelers'! 3. Which IRT compuler program should he used to analp.e Ihe dala? 4. flow can the fit of the chosen IRT model to the tesl data he determined? 5. What is the relationship bel ween lest lenglh and Ihe precision of abilily estimales? /
2
'\ I
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
6. Ilow can IRT Hem statistics he used to COllslrul'llests 10 meet cOlllent and technical specifications? 7. Ilow can IRT be used to evaluate Ihe statistical consequences of changing items in a test? R. How can IRT be used to assess the relative utility of different tests thaI are measuring the same ability? , • 9. How can IRT be used to detect the presence of potentially biased lest items? 10. How can lRT be used to place test item statistics obtained from nonequivalent samples of examinees on a common scale? 1.
The purpose of this book is to provide an introduction to item response theory that will address the above questions and many others. Specifically. it will (a) introduce the basic concepts and most popular models of item response theory, (b) address parameter estimation and available computer programs, (c) demonstrate approaches to assessing model-data fit, (d) describe the scales on which abilities and item characteristics are reported, and (e) describe the application of IRT to test construction, detection of differential item functioning, equating, and adaptive testing. The book is intended to be oriented practically. and numerous examples are presented to highlight selected technical points.
Limitations of Classical Measurement Models Dr. Testmaker's clients are turning towards item response theory because dl!.~~ical testing methods and measurement procedures have a .!!.umber of shortcomings. Perhaps the most important shortcoming is ~'!..~t examinee characteristics and test characteristics cannot be sepa_rated: e_ach can be interpreted only in the context of the other. The examinee characteristic we are interested in is the "ability" measured by the test. What do we mean by ability? In the classical test theory framework, the notion of ability is expressed by the !!!~e .'H·ore, which is defined as "the expected value of observed performance on the te.rl .of interest." An examinee's ability is derined only in terms of a particular test. When the test is "hard," the examinee will appear to have low ability; when the test is "easy," the examinee will appear to have higher ability. What do we mean by "hard" and "easy" tests? The difficulty of a le.'il item is defined as "the proportion of examines ill a RrouP of
,.
/'
....
·
(;
...
·t·
,,{
~,
" ;;j
~.
'".: ~
cl-,
"
, .r 'fI .1
" v .,.{ ~: >.
illfl~re.~t who answer the item corredl y." Whether an item is hard or easy depends Oil the ability of the examinees being measured, and the ability of the examinees depends on whether Ihe Ie,.,t itellls arc hard or easy! Item discrimination and test score reliability and validity are also defined in terms of a particular group of examinees .. Test and item characteristics change as the examinee context changes, and examinee -characteristics change as the test context challges. Hence, it is very diHicull to compare examinees who take different tests and very difficult to compare items whose characteristics are obtained using different groups of examinees. (This is not to say that such comparisons are impossible: Measurement specialists have devised procedures to deal _wjth these problems in practice, but the conceptual problem remains.) T,et liS look at the practical consequences of item characteristics that depend on the group of examinees from which they art" obtained, that is, are Rroup-dependent. Group-dependent item indices are of limited use when constructing tests for examinee populations th .. t are dissimilar to the population of examinees with which the item indices were obtained. This limitation can be a major one for test developers, who often have great difficulty securing examinees for field tests of new instruments-especially examinees who can represent the population for whom the test is intended. Consider, for example, the problem of fi{~ld-testing items for a state proficiency test administered in the spring of each year. Examinees included in a field tcst in the fall will. necessarily. be less capahle than those examinees tested in the spring.llence, items will appear more difficult in the field tesl Ihan they will appear in the spring test administration:.~ variation on the same problem arises .with item banks. which are becoming widely used in test construction . Suppose the goal is to expand the bank hy adding a new set of test items along with their item indices. If these new item indices lire obtained on a group of examinees different from the groups who took the items already in the bank, the comparability of item indices must be questioned. What are the consequences of examinee scores that depend on Ihe particular sct of items administered, that is, arc 1('.\'I-dl'/)el/(It'fll?_Cl~early. it is difficult to compare examinees who take different tests: The scores ~m the two tesls arc on different scales. and no functional relationship l:xists betweellthc scales. Even if the examinees arc given the same or parnlleltcsts, II prohlem remains. When the examinees are of different ahility (i.e .• the test is more difficult for one group Ihan for the other), their test scores contain different amounts of error. To demonstrate this
4
FUNDAMENTALS 01' ITEM
IH~SI'ONSE TIH~OI{Y
point intuitively. consider an examinee who obtains a score (If I.ern: This score tells us that the examinee's ability is low but provides no information about exactly how low. On the other hand, when an examinee gets some items right and some wrong, Ihe test score contains information about what the examinee can and cannot do, and thus gives a more precise measure of ability. If the test scores for two examInees are not equally precise measures of ability, how may comparisons between the test scores be made? To obtain scores for two examinees that conlain equal amounts of error (i.e., scores Ihat are equally reliable), we can match test difficulty with the approximate ability levels of the examinees; yet, when several forms of a test that differ substantially in difficulty are used, test scores are, again, not comparable. Consider two examinees who perform at the 50% level on two tests that differ substantially in difficulty: These examinees cannot be considered equivalent in ability. How different are they? How may two examinees .--be compared when they receive different scores on lesls that differ in difficulty but measure the same ability? These prohlems are diffieull to resolve within the framework of classical measurement theory. Two more sources of dissatisfaction with classical test theory lie in the definition of reliability and what may be thought of as its conceptual converse, the standard error of measurement. !!eliahility, in a classical _. .test theory framework, is defined as "the correlation between lest scores 'f .-= on parallel forms of a test." In practice, satisfying the definition of .;.-' parallel tests is difficult, if ';oiimpossible. The various reliabilily coefficients available provide either lower bound estimales of reliability or reliability estimates with unknown biases (Hambleton & van der Linden, 1982). The problem wilh the standard error of measuremenl, which is a function of test score reliability and variance, is thai it is assumed to be the same for all examinees. But as pointed out above, _~cores on any test are unequally precise measures for examinees of _different ability. Hence, the assumplion of equal errors of measurement for all examinees is implausible (Lord, 19R4). --A final limitation of classical tesl theory is Ihat it is test oriented _rather than item oriented. The classical trlle score model provides no consideration of how examinees respond 10 a given ilem. Hence. no basis exists for determining how well a particular examinee might do when confronted with a test item. More specificully, classical lest theory does not enable us to make predictions about how an individual or a group of examinees will perform on a given item. Such queslions
as, What is the prohability of an ('xaminl'c answering a given item correclly? arc important in It nUlllher of h'sting applications. Snch information is necessary, for example, if a lest designer wants to predict test ~core characteristics for onc or more populalions of examinees or 10 (ksign tests with particular characteristics for certain populations of examinees. For example, a test intended to discriminate well among scho-Iarship candidates may be desired. In addition to the limitations mentioned above, classical measurement models and procedures have provided less-than-ideal solutions to many testing problems~for example, the design of tests (Lord, 19RO), the identification of biased items (Lord. 1980), adaptive testing (Weiss, 1983), and the equating of test scores (Cook & Eignor, 1983, 1989). For Ihese reasons, psychometricians have sought alternative theories and models of mental measurement. The desirable features of an alternative test theory would include (a) item characteristics that are not group-depelllient, (b) scores describing examinee proficiency that are 110t test-dependent. (c) a modellhat is expressed at the item level rather than at the test level, (d) a modellhat does !lot require strictly parallel tests for assessing reliability, and (e) a model that provides a measure of precision for each ability score. It has been shown that these features can be obtained within the framework of an alternative test theory known as item response theory (Hambleton, 1983; Hambleton & Swaminathan, 1985; Lord, 1980; Wright & Stone, 1979).
Exercises for Chapter I I. Identify four of the limitations of classical test theory that have stimulated measurement specialists to pursue alternative measurement models. 2. Item responses on a tesl item and total test scores for 30 examinees are given in Table 1.1. The first 15 examinees were classified as "low ability" based on their tOlal scorell; the second 15 examinees were classified as "high ability." a. Calculate the proportion of examinees in each group who answered the item correctly (this is the classical item difficulty index in each group). b. Compute the item-total correlation in each group (this is the classical item discrimination index in each group). c. What can you conclude regarding the invariance of the classical item indices?
6
FUNDAMENTALS OF ITEM RESPONSE TllEORY
TABLE l.l Hiflh-Ability Group
Low-Ability Group
__
_ _ _ ••
_w
_ _ ~.
_~ _ _ _ _ _ _ _
----~ -----~~--
tr~m
Examinee
I
2 3 4 :5 6
7 8 9
Responu
0 0 0 0 0 0 0 0 I
to II
0 I
12
1 0 0
13
14 1:5
0
Total Score
8 12 6 12 8
8 8 II
13 4 14 13 10
9 8
Examinu
16 17 18 19 20 21 22 23 24 2S
26 21 28 29 30
tt~m
1'0101
Re,fponse
Score
I
0 I I
0
1 0
..
33 28 29 30 29 28 33 32 32 33 34 35 34 38 37
Answers to Exercises for Chapter I I. hem-dependent ability scores, sample-dependent item statistics, no prob· ability information available about how e)(aminees of specific abilities might perform on certain test items, restriction of equal measurement errors for all examinees. 2. a. Low-scoring group: p = 0.2. lIigh-scoring group:" ::: OJ!. b. Low-scoring group: r 0.68. IIigh-scoring group: r ::: 0.39. c. Classical item indices are not invariant across subpopulations.
2 Concepts, Models, and Features
Basic Ideas
.-
Item response theory (lRT) rests on two basic postulates: (a) The perfonnance of an examinee on a test item can be predicted (or explained) by a set of factors called traits, latent traits, or abilities; and (b) the relationship between examinees' item performance and the set of traits underlying item perfonnance can be described by a monotonically increasing function called an item characteri.uic function or item characteri.ftic curve (ICC). This function specifies that as the level of the trait increases, the probability of a correct response to an item increa~es. Figure 2. t shows an item charucleristic function for the case when only one trait underlies perfonnance on the item, together with distributions of ability for two groups of examinees. Observe that examinees with higher values on the trait have higher probabilities of answering the item correctly than do examinees with lower values on the trait, regardless of group membership. Many possible item response models exist, differing in the mathematical form of the item characteristic function and/or the number of parameters specified in the model. AIlIRT models contain one or more . parameters describing the item and one or more parameters describing the examinee. The first step in any tRT application is to estimate these parameters. Procedures for parameter estimation are discu~sed in chapter 3. Item response models, unlike the classical true score model, are falsifiable models. A given item response model mayor may not be appropriate for a particular set of test data; that is, the model may not adequately predict or explain the data. In any IRT application, it is .... essential to assess the fit of the model to the dala. Procedures for assessing model-data fit Are discussed in chapter 4.
l
7
FUNDAMENTALS OF ITEM RESPONSE THEORY
8 1.0
P r 0
b
•
b I
0,'
,
.
I I
, Y 0
'=~====-:_'-------T'--"-'-'"
'[=';';"";~=,,;:";F
Ability
2
Figure 2.t. An Item Characteristic Curve and Distribulions of Ability for Two Groups of Examinees
When a given IRT model fits the test data of interest, several desirable features lire obtained. Examinee ahility estimates IIrc 110t tcst· dependent, and item indices are not group-dependent. Ability estimates obtained from different sets of items will be the same (except for measurement errors), and item parameter estimates obtained in differ· ent groups of examinees will be the s
...e-
~
-I
COllcel'l.r.
Modl'!". "nd Ftaillru
a single estimate of error for all examinees, as is the case in classical test theory.
The mathematical models employed in IRT specify that an examinee's probability of answering a given item correctly depends on the examin~e's ability or abilities and the characteristics of the item. IRT models include a set of assumptions about the data to which the model is applied. Although the viability of assumptions cannot be determined directly. some indirect evidence can be collected and assessed. and the overall fit of the model to the test data can be assessed as well (see chapter 4). An assumption common to the IRT models most widely used is that only one ability is measured by the items that make up the test. This is called the assumption of unidimcflsionllliry. A concept related to unidimensionality is that of local ;ndelJendl'flce. Unidimensionality and local independence are discussed in the next section. Another assumption made in all IRT models is that the item characteristic function specified reflects the trlle relationship among the unobservahle variables (abilities) and observahle variables (item responses). Assumptions are made also llhout the item characteristics that are relevant to an examinee's performance on nn item. The major distinction among the JRT models in common use is in the number and type of item d1l\racteristics assumcd to affect examinee perfornulIlce. These assumptions will be discussed shortly. Unhlimens;of/aUry
As stated above, a common assumption of IRT models is that only one ubility is measured by a set of items ill II test. This assumption cannot be strictly met because several cognitive, personality, and testtaking factors always affect test performance. at least to some extent. These factors might include level of motivation, test anxiety, ability to work quickly, tendency to guess when in douht about answers, and cognitive skills in addition to the dominant one measured by the set of test itcms\WIHlt is required for the unidimcnsiollality assumption to be met ade(luately by a set of test data is the prcsence of II "dominant" component or factor that influences test performance. This dominant
,--""",
!
10
\
FUNDAMENTALS OF ITEM RESPONSE THEORY
component or factor is referred to as the ahility meajiured hy the test; it should be noled, however, thaI ability is not Ilecellsarily inherent or unchangeable. Ability scores may be expected to change over time because of learning, forgetting, and other factors. ifem response models in which a single dominant ability is presumed sufficient to explain or account for examinee performance are referred to as unidimensional models. Models in which it is assu'o,'ed that more than one ability is necessary to account for examinee test performance are referred to as multidimensional. These latter models are more complex and, to date, have nol been well developed (McDonald, 1981).
Locollndependence /
/
Local independence means tha( when the abilities influencing lest performance are held constant, eXaminees' responses to any pair of items are statistically independent. In other words, after taking examinees' abilities into account, no relationship exists between examinees' responses to different items. Simply put, this means that the ahilities specified in the model are the only factors influencing examinees' responses to test items. This set of abilities represents the complete latent .fpace. When the assumption of unidimensionality holds, the complete latent space consists of only one ability. To state the definition of local independence more formally. let 9 he the complete set of abilities assumed to influence the performance of an examinee on the test. Let Vj be the response of a randomly chosen examinee to item i (i = 1,2 •...• n). Let P(V j , 9) denote the prohability of the response of a randomly chosen examinee with ability 6; P(V; = I 10)1 denotes the probability of a correct response. and P(V; = 0 I e denotes Ihe probability of an incorrect response. The property of local independence can be stated mathematically in the following way: Prob(V,. V 2 ••
•••
V" 19)
= P(V, 19) P(V2 16) ... P(V" 19) "
=np(UjIO) ;~
,
The properly of local independence means that for a given examinee (or all examinees at a given ability value) the probability of a response
II
ConceT'U, Mm/e/s, atld Ftalurl'S
pallern on a set of items is equal to the product of prohabilities associflted wilh the examinee's reSpOI\Sl~S to the individual items, For example. if the response pattern for an examinee on three items is (I t 1.0). that is. VI = I. V 2 I, and U. = 0, then the assumption of local of independence implies that
=
I,U) = 018) = P(U I == Ile)p(lh
P(U I = I,U 2
j
where
~
~
Pi
~
,... ,..~
i
1.1 ,
.:J ~ ~
'-oJ .-.l
t ~
g
119)P(th = Ole)
~
= P( V = I I 9) j
and
Qj = I - Pi
The notion of local independence described above may seem counterintuitive. An examinee's responses to several test items cannot be expected to be uncorrelatcd; that is, the responses nrc unlikely to be independent. In what sense, then, can local independence hold? When variables are correlated, they have some traits in common. When these traits are "partialled out" or "held constant," the variables become uncorrelated. This is the basic principle underlying factor analY5is. Similarly, in item response theory, the relationships among an examinee's responses to several test items are due to the traits (abilities) influencing performance on the items. After "partialling out" the abilities (i.e., conditioning on ability). the examinee's responses to the items lire 1ik.e1y to he independent. For Ihis rellson. the aSsllmlltion of local independence can also be referred to as the assumption of conditional independence . When the assumption of utlidimensionality is true, local independence is obtained: in this sense, the two concepls are equivalent (Lord. 1980; Lord & Novick. 1968), Local independence can be obtained, however, ('ven when the data set is nnt unidimensional Local independence will be obtained whenever the complete lalent space has been specified; that is, when all the ability dimensions innuencing performAnce have been taken into account. Conversely. local independence does not hold when the complete latent space has not been specified. For example, on a mathematics test item that requires a high level of reading sk.ill, examinees with poor reading skills will not answer the item correctly regardless of their
L-
12
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
mathematical proficicncy. lIence. a dimension other than mathl'tnatical proficicnc.:y will influence performance on the item; if a unidimensional IRT lIIodel is filled to the datil, 10(,:111 indepcn
¥
Popular Models in Item Response Theory An item characteristic runction or item characteristic curve (lCC) is a mathematical expression that relates the probahility of success (i.e., giving a correct response) on an ilem to the ahility measured hy the tcst and Ihe characteristics of the itelll. While it is possihle 10 conceive of an infinite nllmber or IRT models, only a few models are in current liSe'. A primary distinction among the most popular IInidil1le'nsional item response models is in the number of parameters used to describe items. The choice of model is up to the user, hut this choice involves assumptions about the data that can be verified later by examining how well the model "explains" the ohserved test results. The three most popular unidimensional IRT models are the one-, two-, and three-parameter logistic models, so named hecause of the number of item parameters each incorporates. These models are appropriate for dichotomous item response data.
One-Parameter
Log;,~tic
I
Model
The one-parameter logistic model is one of the most widely used IRT models. Item characteristic curves ror the one-parameter logistic model are given by the equation
i
P i (9)
:;".p!:./"c-.J/~,.)
= 1,2, ...
,.,...
7{
C{')..,u'·Fr-?-' £~. ~.,,~ .
,n
12.1 )
-::ft'
II
~-~--..-.::.- ,"'"
•
e:
2·1\"5
13
('''''''('I''S. Model.f. and Fl'olur,·,f
where
r,(O) is Ihe prohability Ihlll n rllllliomly chosen naminee with IIhilily (} answers item i correctly, hi
is Ihe item i difficulty parameter,
/I
is the number of items ill Ihe lest.
e
i,.; a Irnnscendental number (like It) whose value is 2.71 R (correct to . three decimals), and _ _ "h _ _ _ _ _ ~ _ _ _ _
r,«(}) is nn S-shllped curve wilh
._~ _ _ _ _
~,.
____
values between 0 and lover Ihe ability
scale.
The hi parameter for an item is the point on the ability scale where .the probability of a correct respOll.~e is 0.5. Thi ... parameter is a location' parameter, indicating the position of the ICC in relation to the ability scale. Jhe greater the value of the hi parameter, the greater the ability that is required for an examinee to have a 50% chance of gelling the item right; hence,the harder the item. Difficult items arc located to the right or the higher end of the ability scale; easy items are located to the .Ieft or the lower end of the ability scale. When the ability values of a group are transformed so that their mean is 0 and their standard deviation is I, the values of b i vary (typically) from about -2.0 to +2.0. Value.~ of hi near·2.0 correspond to items that are very easy, and values of bi near 2.0 correspond to items that are very difficult for the group of examinees. Some sample ICCs for the one-parameter model are shown in Figure 2.2. The item parameters are as follows: for Item I, hI = 1.0; for Item 2. h2 = 2.0; for Item 3, b J =-1.0; and for Item 4, h4 0.0. Note that the curves differ only by their location on the nhility scale. In the oneparameter model. it is assumed tha!.item.~lifficulty is the only iteT characteristic that influences examinee performance. No item parameter corresponds to the classical test theory item discrimination index; in effect, this is equivalent to the assumption that all items are equally discriminating. Note also Ihal the lower asymptote of the ICC is zero: this specifies that examinees of very low ahility IHlVe 7.ero probability of correctly answering the item. Thus, no allowance is made for the possibility that low-ability examinees may guess, as they are likely to do on multiple-choice items. Clearly, the one-parameter mode I is based on restrictive assumplions. The appropriateness of these assumplions depends on the nature of the data and the importance of the intended application. For example. the assumptions may be quite acceptable for relatively easy tests
14
FUNDAMENTALS OF ITEM RESPONSE TI1E!ORY
1
,
~----~-----------.~
------.
p
o
·
0.9 -
b b !
! !
0.8
I
0.7 -
r o
0.6
I
C o
• ••
0.5 OAl
0.3 R
•• •• p
0
•
0.2 .
~~~...
0.1 -0
t..-=::::;:,,-==::::;:,,-==::::,:::,=~_....::::::::::L~ ."." .• __.. ___. j
-4
-3
-2
-1
o
2
.• _.
3
4
Ability
Figure 2.2. One-Parameter lIem Characteristic Curves for Pour Typkalltems
constructed from a homogeneous bank of test items. Such situ,\lions may arise with some criterion-referenced tests following effective instruction. The one-parameter logistic model isofte!lcill!~_ti the Rasch model, in ,honor of its developer. While the form of Rasch's model is different from that presented here, the one-parameter logistic model is mathematically equivalent to Rasch's model. Por details of the development of the Rasch model, refer to Rasch (1960) and Wright and Stone (1979). Two-Parameter LOKistit' Model
Lord (1952) was the first to develop 11 two-parameter item response model, hased on the cumulative normal distribution (normal ogive). Birnbaum (19M\) slIhstituted the two-parameter logistic function f(H the two-parameter normal ogive functioIlllS the form of the item dlllflH.:ter·· istic function. Logistic functions have the important advantage of heing more convenient to work with than normal ogive functions. The logistic model is more mathematically tractable than the normal ogivc model hecause the latter involves integration, whereas the former is lIn explicit function of item and ability parameters and also has important statistical properties.
) 15
('oncepts, Models, Gild fetltltl'e,V
Itelll characteristk curves for thc two-panllllctcr logistic model dcveloped hy Rirnblllllll are given hy th(~ eqlwtion
ena, (6 - b,' P,(O) =
N I
e ~,
II
:Jc
~ .,... <~
~'
1 ~
+
eOn,(O
/0,)
1,2, . , _,1/
12,21
where the parameters Pi (0) and hi are ddined just as ill E(luation 2, I. As is easily seen, the two-parameter logistic__m()del_~semb,-~~ .,the one-parameter model except for the presence of two additional elements, The factor lJ is a scaling factor introdllced 10 make the logistic function as close as possible to the normal ogive function. It has been shown that when D =: 1.7, values of P j (0) for the two-parameter normal ogive and the two-raram~terJpgistic models'dTffcfill7ibsollite"vafue'by less than 0.0 I for all values of O. Thc second additional element of the two-parametcr model is the parameter ai' which is called the item discrimination parameter. The aj 'parameter is proportional to the slope of the I~C al the point hi on the ahility scale. Items with steeper slopes are more useful for separating exam'inees into different ahility levels than are items with less steep slopcs. In fact, the usefulness of an itcm for discriminating among cxaminees near an ability level e (separating examinees with abilities ~ e from examinces with abilities> 0) is proportional to the slope of the ICC at 0, The item discrimination parameter is dcfined, theoretically, 011 the scale (_00, +00). Negatively discriminating items are discarded from ahility tests, however, because something is wrong with all item (such as miskeying) if the probability of answering it correctly decreases as examinee ability increases. Also, it is unusual to ohtain Of values larger than 2. ,Hence, the usual range for itcm discrimination parameters is .\.!h1.Llligh values of al result in itcm characteristic functions that are vcry "sleep," and low values of tli lead to item characteristic functions that increase gradually as a fUllction of ahilily. Readcrs interested in experimenting hy changing values of item parameters to determine their dTcl'ls on ICCs arc referred to some computer software for the IBM PC and the APPLE computers by Raker (19K5), and to 1111 introdut'tory article on logistic models hy Harris (I9!N). The two-parameter model is ohviously a generalization of the oneparameter model that allows for differently discriminating items. Some sample ICC's for the two-paramcter model are shown in Figure 2,), For
FUNDAMENTALS OF ITEM RESPONSE TIIEOI{Y
16
.., 0
1
0.9
b
•
" I I
0.8 -
•
0.7 .
0
0.6
c
0.15 .
,
0.4
t II t
- _.-
~
--
0
r
..• I
0.3
A
•• "
0.2
0
ft
••
o \",.lV
?'\"'"~ ~ W
Abllltyo::::
1
e
234
Figure 2.3. Two-Parameler lIem Characteristic Curves for Four Typical Hems
Item), b, = 1.0 and al = 1.0; for Item 2, b z == 1.0 and a2 = 0.5; for Item 3, b] = -).0 and a3 = ).5; for Item 4, b4 0.0 and a4 = 1.2. The ICes are not parallel, as they were for the one-parameter model. Each ICC has a different slope, renecting the fal:t that the discrimination parameter values are different. Note again that the lower asymptote of each curve is zero; hence, the two-parameter model. like the one-parameter model. makes no allowance for guessing behavior. The assumption of no guessing is most plausible with free-response items. but it often can be met approximately with multiple-choice itcms whcn a test is not 100 difficult for the examinees. For example. this assumption Illay be met when competency tests arc administered to students following effective instruction. An alternative and somewhat more convenient way to write Pi (0) for the two-parameter logistic model (and the three-parameter model. too) is this: If the numerator and denominator of Equation 2.2 are divided by eDa,(9 - h, >, then Pi (9) becomes
"J 17
which
nlll
be wrillen more c.:mnpaclly as Pi(O) = I I .
e/la,lf)
/")1 1
The mathemutkal expression for the three-parameter logistic is
nlO{k~1
I
= 1,2, ... , It
[2.3]
where Pica), hi. ai, and D are defined as for the two-parameter model. The additional parameter in the model, Ci, is <:alledthe p.w.'udo-(·hancelevel parameter. This parameter provides a (possibly) nonzero lower asymptote for-the item characteristic curve and represents the probability of examinees with low ability answering the item correctly. The parameter Ci is incorporated into the model to take into account perfprmance at the low end of the ability continuum, where guessing is a factor in test performance on selected-response (e.g., multiple choice) test items. Typically. ('; assumes values that are smaller than the value that would result if examinees guessed randomly on the item. As Lord (1974) has noted, this phenomenon probably can be attributed 10 the ingenuity of item workers~11I developing attract"lvebut i:ect choices. For this reason, ~.L should no~_ be called tlie~ue::~srng parameter. " ," Six typical three-parameter logistic ICCs arc displayed in figure 2.4. The corresponding item parameters are displayed in Table 2.1. The l~(Jmparison of Items 1 to 3 with Items 4 to 6 (but especially the comparison of Items I and 4) highlights the role of the item difficulty parameter in the location of ICCs. More difficult items (Items I, 2, 3) arc shifted to the higher end of the ability scale, while easier items are shifted to the lower end of the ability scale. The comparison of lIems I (\11(1 2 (or Items I, 3, and 4 with Items 2, 5, and 6) highlights.!!!.!L influellce of the item discriminali
f
FUNDAMENTALS 01; ITEM RESPONSE TllEOny
18
I. ~ ..~J
J
p
'-'
r
•b
•
b j
0.8
I I
, , I
0
,
0.6
,
c
•r r
0•
I
R
••
p 0
•• • 2
3
4
Figure 2.4. Three-Parameter lIem Characteristic Curves for Six Typical hems
The Property of Invariance The property of invarianceof item and ability parameters is the cornerstone of JRT and its major distinction from classical test theory, This property implies that the parameters that charll~~rize an ite,!,"~~_ not depend on the ability distribution of the exami,!~_es. ~l1d_th~_pl!!.~m: .eter that characterizes an examinee does not depend on the set of test items.
TABLE 2.1 Item Parameters (or Six Typical Test flems It!!m rest Item
b,
I 2
1.00 1.00 1.00
3 4
-I.~O
5
-0,50
()
O,~O
Paraltl('trr aj
CI
1.80 (J,80
0.00 0,00
1.80
O.2~
I.RO 1.20 0.40
n.lo
0.00 015 -----~--,--.
l
..i
v:
)
19
j !
As noled carlier. Ihe properly of invariallce of ilem paramelers can hc ohserved in rigure 2.1. When thc IRT modd fils Ihe dalli. the sume ICC is ohtained for Ihe tesl ilem regardless of Ihe distrihlltion of abililY in the group of exumillces used to estimate Ih<: item parameters. Ilence, ;) Ihe ICC is invariant across the two populations. () To some researchers, the property of item invariallt:e may seem § surprising. The property, however, is a well-known feature of the linear regression model. In the linear reglcssioll model. the rcgression line for predicting a variable Y from a variahle X is obtained as Ihe line joining Ihe means or the Y vllriuble for each value of lhe X variahle. When the rcgression model holds. the same regression line will he obtained within lilly restricted range of Ihe X varhlhle, that is, in any subpopulation on t.I) X. meaning that the slope and inlercept of the line will be Ihe SlIme in :S lllly sUbpopulation 011 X. A derived index sllch as the correlation coef.;J ficient, which is not a paramcter Ihnt characterizes the Icgressioilline, ~ '" :> is flot invariant across suopopulalions. The difference between Ihe slope I I parameter and the correlalion coefficient istl~at the slope parameler docs not depend on Ihe characleristics of Ihe subpopuhllion, such as its ~ variability, whereas Ihe correlation cocffident docs (noll', however, Ihat Ihe proper ('.~till/(ft;(lll of the line docs require it heterogeneous sample). '{" The sahle con(~cpts also apply ill item response models, which can be r$ regarded as nonlinear regression models. 1To illustrate the property of invariance of item parameters and to ~ understand the conditions under which invariance holds, consider the lfollowing example, in which the responses of 90 ex llminees to 1I 40-item ~ I, lesl were generated to fit a two-parameter logisti\: item response model (see Equation 2.2). 1\ summary of Ihe responses of the examinees (10 at each of 9 ability levels)-thcir responses to a parlicular ilem on the lest and their total scores on the tesl--is given in Table 2.2. The corresponding plot of probaoility of sliccess on the selected item against ability, e, is given in Figure 2.5. The classical item difficulty. or p-value. for the item of interesl is 0.5, and the classical ilem discrimination, or p(lint-bi~erial£C:l!.rela~~'!....b~ tween the item score and tolal score. is 0.65. As It demonstration of the
~
~
· l I
i
lack of invarTnnc'e (;r"-~la-ssfcaT--iienl indices, let us consider the exami ..
nees as forming two ability groups: examinees at e of -1.72, --1.1 J, and -0.72 forming the low-ability group and the examinees at e of 0.52. 0.92, and 1.52 forming the high-ability group. For Ihe low-ability examinces the ,,-value (based 011 .30 examinees) is 0.2 ami Ihe pointhiseri,11 correl41tion is 0.56. For the high-ahility examinees the ,,-value
2()
FUNDAMENTALS Or: ITEM RESPONSE THEORY
/" *
I.HI
-0.11
-0.40 -0,'0- 0.10
0.8t
0.8'
1,00'
Lit
Ability
Figure 2.S. Relatiollship Between Ability and Probahility of Success
IIIl
an Item
is 0.8 and the point-biserial correlation is 0.47. These calculations demonstrate that the classical difficulty and discrimination indices change when the examinee ability distrihution dtanges (ohviously. restriction of range results in lower point-hiserial correlations for each subgroup than for the total group). Let us now fit separately a two-parameter item response model for the entire group and for the high- and low-ability groups. If invariance holds, the parameters obtained should be identical. Since in the twoparameter model the probability of success for an examinee with ability 9 is given by
P
and
P I-P
I
21
e'I/Wrl".!. Motirls. atl,l Feature.s
TAnLE 2.2 Ahility Level 0, Pruhahilily of Success on all Item, Response 10 the Itelll, and lllial St'OIC for <)() Ex aminces F,uminff
0
---
..
1.71 fl
!
..
2
1'(0) -~
J
4
5
{j
7
If
9
10
.....
0
(l
(I
0
0
0
(I
()
12
I)
12
R
8
8
II
13
4
0
(I
()
0
(I
()
9
8
10
II
IJ
12
7
0 7
()
14
0 13
15
15
0 J3
0 II
15
0 13
0 12
III
12
I 17
10
I 16
15
0 12
19
0
()
0
17
21
25
25
21
19
18
19
20
0 15
(I
I
0
I
TOlul Score:
21
19
21'>
22
25
22
24
24
I 28
0 19
hem Response: Tolal Score:
(I
(I
0
I
1
27
26
25
24
24
JII
28
24
29
29
hem Response: Tolal Score:
:n
211
29
.lO
2()
2X
lJ
.12
.12
33
0.9 hem Response: Tolal Score:
0 34
J5
:'14
:IX
.17
.l7
:'16
.l~
J7
:W
hem Response: T\.lal Score:
0 8
0.2 Irem Response: TollIl Score:
10
14
0.3 Ilem Response: Tolal S('ore:
0 II
15
hem ResponRe: Tolal Score:
13
0.5 hem Response: Total Score:
0.1
"
c·U29
" -fU2J
-0.398
0.100
0.191\
IUD
0,4
0
0.7
)
O.R
()
0
0.6 hem Response:
}
11.919
(I
(I
I
n
(I
()
()
I
()
~~'.
1511'>
I
it follows that
p
In --_.
I-P
=
= [)a(O
b)
where a [)(J lind ~ = -Dah. The ahove relaliortship is a linear function of 0 with two unknowns, a and P (the slope and inlcrcc:-pt of the line. respectively), and, hence, their values call be dClerlllincd exactly if
22
FUNDAMENTALS OF ITEM
Rl~SPONSE
TlIEORY
P and () are known at two points. (In reality. determination of item parameters cannot be carried out in this way since 0 will never he known; this procedure is used here for pedagogical purposes only.) To determine the item parameters based on the rnf;rr range o/ahilify. we can choose (arbitrarily) () = -1.716 and (} = 1.516 with corresponding P values of 0.1 and 0.9. Thu~, the two c(luations to he solved arc
~:~
In
+ [}
= cx(-1.716)
and
Inp>:i = a(1.516) + [}
Subtracting the first equation from the second, we have
In~:~
-
'nK~ = cx(l.5l6)
u(-1.716)
Solving for u, we oblain a.:= 1.360.
Substituting this value in the second equation gives
Il = n. no. The vnlllcs of a and h now clln he determincd: (/ := n.R and h = --0.1. In the low-ahility suhgroup, a. lind Il clln he (ktcrlllincu using the two points (} = -1.716 and (} -0.723 with the corrcsponding r values of (l.1 and 0.3. The equations to be solved arc
=
In
~:!
::
u(-1.716)
+ [}
and
Solving these equations in the same manner as previously. we obtain a::: 1.359 and 13 = 0.136, which in turn yield a = O.R and b -0.1. In the high-ability group. we determine a and /l using the p()illt~ 0::::: 0.523 and 0 = 1.516 with corresponding P values or 0.7 and 0.9. The equations to he solved in this case are In
~:~ = cx(0.523)
+ [}
and
In ()..:~ 0.1
cx( 1.516)
+~
2] Sol ving these equations, we ohtain (1 I. :\59 ami P= 0.1.16. which yield the same (l and b values as hefore. What we have demonstrated is the simple fact thaI (( and 13 are the slope and intercept of the line that relates In P/( I P), the log odds ratio, to O. In any langc of 0, the line is the salllc and hence (land p, and therefore (I and h, must be the same. This example shows that, in contrast with the classical item difficulty and discrimination indices, the parameters of the item response model are invariant across ability slIbpopulations. We must, however, note several points in relation to the property of invariance. Referring back to Figure 2.5, we see lhat an exact relationship exists between the probabilities of success and the 9 values. Furthermore, from Table 2.2 we see that at each 9 level the observed probability of success (observed proportion correct on the item) is exactly equal to P; that is, the model fits the data exactly ill the population. If the model does not fit the data exactly in the population, In P/( I - P) will not be an exact linear function of 9, and, hence, different a and p will be obtained when different sets of points are chosen. In other words, invariance only holds when the fit of the model to the data is exact in the population. This situation is identical to that in linear regression, where the regression coefficients are invariant only when the linear model fits the data exactly in the population. Iv. second point to be noted is thatinvariance is a properly of the I?opulation. Oy definition. the item characteristic curve is the regression of item response on ahility,
P .,. '£(Ul 9) where '£ is the expected value. Hence. P (for a given_9) is the average of all item responses in the subpopulation of examinees with the specified ability value 9. In the low-ability and high-ability suhpopulations described in the example. the observed probability of success at each 9 was exactly equal to 'E(U 19). Therefore, the exact linear relationship between In P/( I P) and 9 held; in other words, the invariance property was observed. On the other hand. if a sample is ohtained from the sUhpopulation of examinees with the specified ahility value 9, it is extremely unlikely that the average of the item responses, or the observed probability of a correct response, will be exactly equal to .£( U 19). Even if. by some chance. rhe observed probahility was equal 10 '£(lJ 19) at one value of e, it would almost never occur at all v allies of 9. Hence, in ~amplcs, em exact linear relationship hetween In P/( t - P) and e will not
24
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
he observed. Therefore, we callnot expect to ohserve illvariance, in the strict sense, in "salllpics even whcn the model fits the data CXlldly in the popUlation from which the samp"lehasbeen drawn. 'his "pr()hlein-{i; furthercxacerblltcdhy Ihe errors intr(),luced when ihc it{:111 iln(i cx",lIIiinee-paramefers are estimated. Ney~.!illereSs._irlsimportant to determine whether illvariance h(.ll
p In I
P
Da(e - /J)
) 25
IIJIU·('(JH, Model.f. and Fe,aura
3
S
a
m
2
II I
e 2
,,
D I
o
..
I
c
u I t
-2
r •. 98
Y -3 "____ -3 -2
~_"L"
_ _ """""i"
-1
__ "_"" _ _ L _______ L __
o
-"--
""-,
•..
2
.1 3
Sample 1 Difficulty
Figure 2.6. Plot of 3P Item Difficulty Values Based on Two Groups of Examinees
if we consider a and b to be variables, then the log odds ratio is a linear function of a and b with 9 being the slope with respect to the variable 0. As (J changes (as we consider items with different discrimination parameters), 0 remains the same, showing that. no mailer which items are used, Ihe ability 0 remains invariant. This is the same argument as was lIsed to explain the invariance of item parameters. The demonstration of invuriam:c of itelll and ahility parameters is ohviously not restril:led to the two-parameter model. Since the oncparameter model is a special case of the two-parameter model. at least rnathcmatic
/
j
26
FUNDAMENTALS OF ITEM RESPONSE THEORY
Other Promising Models In addition to the one-, two-, and three-parameter logistic models, many other IRT models have been developed, including several models that can be applied to nondichotomous test data (see, ror example. Andrich, 1978a, 197Rb, 1978c. 1982; Ma!;ters, 19R2: Masters. & Wright, 1984; McDonald, 1989; Spray, 1990). For example, Bock (1972) developed a two-parameter logistic model that can be applied to all of the answer choices in a multiple-choice test item. The purpose of his nominal response model was to maximize the precision of ability estimates by using all the information contained in the examinees' responses, not just whether the item was answered correctly. Bock (1972) assumed that the probability that an examinee would select a particular item option k (from m available options) to item i could he represented as
P ik (9)
e":' (9 - h;.)
=--m
L
e":' (O·
1,2, ... ,n; k
=:
I, 2, ~ .. , m 12.4 J
II:.)
h=1
At each Il, the sum of probabilities across the m options, L:~ I Pift.. is one. The quantities (hi/;. aik) are item parameters related to the kth option. The model assumes no a priori ordering of the options. The graded response model or Samejima (1969) assumes, in addition to the usual assumptions. that the availahle catagories to which an examinee responds can be ordered. Examples would include a 5-point Likert rating scale or, say, a 4-point rating scale for grading essay.s, or other scales representing levels of accomplishment or partial credit. This model, like the Bock model, attempts to obtain more information from examinees' responses than simply whether they give correct or incorrect answers. With the current interest in polytomous scoring models. Samejima's extension of the two-parameter logistic model to polyloll10llS ordered categori('s is likely to receive inereasing lItient ion. (Also, see Masters & Wright. 19H4, for various extensions of Ihe one-parameter model to handle polylomolls response data.) Suppose the scoring categories for an item arc arranged in order fwm low 10 high and denoted Xi = 0, I, ... , mi where (m; + I) is the !lumber of scoring categories for the ith item. The prob'lbility of an examinee
, Conn~/'H, Modt'h.
(//1(/
27
Ft'llwrt!.t
\'
I
i
I
responding to an item in a particular category or "j~lrn can be given hy a minor extension of the two-parameter logistic model: COa,11)
+
n.,!
enfl,(1l
12.51
h.,)
where b" is the "difficulty level" for category mi' Other parameters in the model were defined earlier. With (mi + I) categories, mi difficulty values need to be estimated for each item, plus one item discrimination parameter. The actual probability of un examinee receiving a score of x, is given by the expression [2.61 With, say, 50 items in a test, and a 5-point proficiency scale for each item, a total of (50 x 4) + 50 = 250 item parameter values would need to he estimated. The field of psychomotor assessment, 100, has been influenced by item response models, and this influence has spawned new applications of relatively unknown IRT models (sec Safrit, COShl, & Cohen. 1989; Spray, 1990). Instead of ability variables such as numerical ability and reading comprehension, variables sut:h as physical fitness, basketball shooting ability. and abdominal strength are of interest in psychomotor assessment. In the simple binomial trials model, for example. P(X
=
= xlD)
= ( ; ) P(O)' Q(O)"
r
[2.7]
where P(X x I 0) represents the probability that an examinee completes x of n trials (e.g., shoots 8 out of 10 haskets). This probability nmld he represented by any of the logistic test models; however, the item parameters in the logistic model that would dcserihe the trials. that is. items, would be equal for clleh trial. and. helice, item-parameter ('slimntion would he t:omliderllbly simplified. Trials would Ilced to be independent and scored as PIISS or fail for this model to he IIpplicllble. If. for example. the binomial trials model is applied to basketball shooting data (e.g .• number of successful shots), !J would he hasketl);11I shooting "bility. A .. with all IRT applications, parallR'tcr invariance
2R
FUNDAMENTALS OF ITEM RESPONSE THEORY
would be critical. Task (item) difficulty should be invariant across different groups of cxaminees. and abilitics should bc invariant across lasks that vary in difficulty. Another IRT model that has been applied successfully is the Poisson counts model: , e-r(O- h) P(X:::::xI9,b)= (9-h) 12.8) x!ce
.
where x is the number of (say) sit-ups or push-ups completed in a minute and b represents the difficulty of the task. These and other IRT models aimed at handling polytomous response data can be expected to receive increasing use in the future, as fewer assessments are based on dichotomously scored data.
Exercises for Chapter 2 I. Item parameter values for six items are given in 'lilble 2.3.
TABLE 2.3 -----IUm
b
a
('
I
1.0 1,0 1.0 -0.5 0.5 0.0
1.11 0.7 1.11 1.2 L2
0.00 0.00
o.n!)
n.s
0.111
2
:\
4 5 6
O.2~
tUO
a. For each item. compute P(O) at I) = -3. -2. -1. O. I. 2. and ). Plot the item characteristic curves. b. Which item is the easiest? c. Which item is the least discriminating'! d. Which item does an examinee with an ability of I) = 0 have the highest probability of answering correctly? What is the examinee's probability of getting this item wrong? 2. Use the ICCs in Figure 2.4 to answer the following questions: a. Which item is the easiest at e = -I.O? b. Which item is the hardest at e = O.O?
c. Which two items are equally difficult at θ = -1.5?
d. Which item is most discriminating at θ = 2.0?
3. Use the four two-parameter ICCs in Figure 2.3 to answer the following questions:
a. What is the value of P₁(θ) at θ = 1.0?
b. Which item is the least discriminating?
c. How do the ICCs in Figure 2.3 differ from those in Figure 2.4?
4. For the three-parameter model, show that the probability of a correct response P(θ) at θ = b is

P(\theta) = \frac{1 + c}{2}
5. The probability of a correct response at certain values of θ for three items is given in Table 2.4.
TABLE 2.4

θ       -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0    0.5   1.0   1.5   2.0   2.5   3.0
Item 1  0.01  0.01  0.02  0.04  0.07  0.13  0.22  0.35  0.50  0.65  0.78  0.87  0.93
Item 2  0.00  0.00  0.01  0.04  0.11  0.26  0.50  0.74  0.89  0.96  0.99  0.99  0.99
Item 3  0.20  0.20  0.20  0.20  0.20  0.21  0.23  0.26  0.32  0.44  0.60  0.76  0.88
Plot the ICCs for the three items.
a. For Items 1 and 2, c = 0. Determine from the plot the b values for these two items.
b. For Item 3, c = 0.2. Determine from the plot the b value for this item.
c. How would you determine the a value of an item from a plot of the ICC? Use this procedure to determine the a value for each of the three items.

6. Responses of 40 examinees at a given ability level to two items are given in Table 2.5.
TABLE 2.5

Item    Examinee Responses
1       (40 dichotomous responses: 28 correct, 12 incorrect)
2       (40 dichotomous responses: 16 correct, 24 incorrect)
Construct a 2 x 2 table of correct and incorrect responses on the two items. Using a chi-square test for independence, determine if local independence holds for these two items at this ability level.
Answers to Exercises for Chapter 2
1. a. See Table 2.6.
TABLE 2.6

                                   θ
Item     -3      -2      -1       0       1       2       3
1       0.000   0.000   0.002   0.045   0.500   0.955   0.998
2       0.009   0.027   0.085   0.233   0.500   0.767   0.915
3       0.250   0.250   0.252   0.284   0.625   0.966   0.998
4       0.205   0.236   0.412   0.788   0.964   0.995   0.999
5       0.000   0.006   0.045   0.265   0.735   0.955   0.994
6       0.165   0.239   0.369   0.550   0.731   0.861   0.935
b. Item 4. c. Item 6. d. Item 4; P(failure) = 1 - P(θ) = 1 - 0.788 = 0.212.
2. a. Item 4. b. Item 1. c. Items 5 and 6. d. Item 2.
3. a. Approximately 0.50. b. Item 2. c. In Figure 2.3, the lower asymptotes of the ICCs are all zero; in Figure 2.4, the lower asymptotes of the ICCs are not all zero.
4. P(θ = b) = c + (1 - c) / [1 + e^{-Da(b - b)}] = c + (1 - c) / (1 + e⁰) = c + (1 - c) / (1 + 1) = c + (1 - c)/2 = (2c + 1 - c)/2 = (1 + c)/2
5. a. Item 1: b = 1.0; Item 2: b = 0.0. b. (1 + c)/2 = (1 + 0.2)/2 = 0.6; b = θ value at which P(θ) = 0.6; b = 2.0.
c. a is proportional to the slope of the ICC at θ = b. Draw the tangent to the curve at θ = b and determine its slope by taking any two points on the tangent and dividing the y increment by the x increment.
6. See Table 2.7.
TABLE 2.7

                          Item 2
                   Correct   Incorrect   Total
Item 1  Correct       8         20        28
        Incorrect     8          4        12
        Total        16         24        40

χ² = N(AD - BC)² / [(A + B)(C + D)(A + C)(B + D)]
   = 40(8 x 4 - 20 x 8)² / [(8 + 20)(20 + 4)(4 + 8)(8 + 8)]
   = 5.08 > χ²₁,₀.₀₅ = 3.84
Since the computed χ² exceeds the tabulated value, we can reject the hypothesis of independence. Local independence does not hold at this ability level. We would, therefore, conclude that a unidimensional model does not fit the data.
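The computations behind several of these answers can be verified with a few lines of code. The sketch below (Python; an added illustration, not part of the original text) reproduces answer 1d using the three-parameter logistic function and answer 6 using the chi-square formula shown above.

```python
import numpy as np

SCALE = 1.7  # the scaling constant D

def p3(theta, a, b, c):
    """Three-parameter logistic ICC."""
    return c + (1 - c) / (1 + np.exp(-SCALE * a * (theta - b)))

# Answer 1d: Item 4 of Table 2.3 at theta = 0
print(round(p3(0.0, a=1.2, b=-0.5, c=0.2), 3))  # 0.788, so P(failure) = 0.212

# Answer 6: chi-square test of independence for the 2 x 2 table in Table 2.7
A, B, C, D = 8, 20, 8, 4                        # cell counts
N = A + B + C + D
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))
print(round(chi2, 2))                           # 5.08 > 3.84; reject independence
```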
Note

1. For convenience, P(Uᵢ = 1 | θ) will be written as Pᵢ(θ); this notation will be used in specifying item characteristic functions.
3 Ability and Item Parameter Estimation
The first and most important step in applying item response theory to test data is that of estimating the parameters that characterize the chosen item response model. In fact, the successful application of item response theory hinges on the availability of satisfactory procedures for estimating the parameters of the model.

In item response models, the probability of a correct response depends on the examinee's ability, θ, and the parameters that characterize the item. Both ability and item parameters are unknown; what is known are the responses of the examinees to the test items. The problem of estimation is to determine the value of θ for each examinee and the item parameters from the item responses. This problem is similar to that encountered in regression analysis where, from observed responses to a variable, the parameters that characterize the regression model, the regression coefficients, must be estimated.

Two major differences distinguish regression models and item response models. First, the regression model is usually linear, while item response models are nonlinear. Second, and most important, the regressor (independent) variable in regression analysis is observable; that is, scores on this variable can be observed. In item response models the "regressor variable" θ is unobservable. If θ were observable or known, the problem of estimation of item parameters, or the "regression coefficients," would simplify considerably, although we would still be dealing with a nonlinear regression model. Similarly, if the item parameters are known, the estimation of ability is reasonably straightforward.

Estimation of parameters can be accomplished in several ways. In the unlikely event that the model fits the data exactly, and when θ is known, the procedure demonstrated in the section on parameter invariance could be used. In this case, only as many points as there are item parameters in the model are needed to solve for the unknown
parameters. When a sample is obtained, the above procedure cannot be used because the model will not fit the data exactly. In this case, our strategy is to find the parameter values that will produce the "best fitting" curve. In linear regression, best fit is defined often in terms of the least squares criterion. In IRT models the least squares criterion is not used because it is difficult to determine the properties of least squares estimates in nonlinear models. Alternatively, the parameters could be estimated using a maximum likelihood criterion. The sampling distributions of maximum likelihood estimates are known in large samples, and this information can be used in a variety of ways in IRT applications. We shall first describe the maximum likelihood procedure for estimating ability when the item parameters are known, and then describe the procedures for estimating item parameters.
Estimation of Ability

Suppose that a randomly chosen examinee responds to a set of n items with response pattern (U₁, U₂, ..., Uᵢ, ..., Uₙ), where Uᵢ is either 1 (a correct response) or 0 (an incorrect response) on item i. By the assumption of local independence, the joint probability of observing the response pattern is the product of the probabilities of observing each item response, that is,

P(U_1, U_2, \ldots, U_i, \ldots, U_n \mid \theta) = P(U_1 \mid \theta)\, P(U_2 \mid \theta) \cdots P(U_i \mid \theta) \cdots P(U_n \mid \theta)
which may be expressed more compactly as

P(U_1, U_2, \ldots, U_n \mid \theta) = \prod_{i=1}^{n} P(U_i \mid \theta)
Since Uⱼ is either 1 or 0, this can be taken into account by writing the likelihood function as

P(U_1, U_2, \ldots, U_n \mid \theta) = \prod_{j=1}^{n} P(U_j \mid \theta)^{U_j}\, [1 - P(U_j \mid \theta)]^{1 - U_j}

or simply as
P(U_1, U_2, \ldots, U_n \mid \theta) = \prod_{j=1}^{n} P_j^{U_j}\, Q_j^{1 - U_j}    [3.1]
where Pⱼ ≡ P(Uⱼ | θ) and Qⱼ = 1 - P(Uⱼ | θ). Equation 3.1 is an expression of the joint probability of a response pattern. When the response pattern is observed, Uⱼ = uⱼ; the probabilistic interpretation is no longer appropriate, and the expression for the joint probability is now called the likelihood function, denoted as L(u₁, u₂, ..., uⱼ, ..., uₙ | θ), where uⱼ is the observed response to item j. Thus,
L(II"
1I 2,···,II,,rO)
=
n"
Pl//Q' J
J
"i
13.21
)=1
Since Pⱼ and Qⱼ are functions of θ and the item parameters, the likelihood function is also a function of these parameters. As an example, consider the responses of five examinees to five items with known item parameter values, given in Table 3.1. The likelihood function for any examinee may be written using the general expression above. For Examinee 3, for example, u₁ = 0, u₂ = 0, u₃ = 0, u₄ = 1, u₅ = 1. Hence, the likelihood function for this examinee is

L(u_1, u_2, \ldots, u_5 \mid \theta) = Q_1 Q_2 Q_3 P_4 P_5
Since P (and hence Q) are item response functions whose forms depend on the item parameters, and the item parameters are known in this example, the exact values of the likelihood function for a given θ can be computed. In particular, a graph of the likelihood function as θ varies can be plotted. Since the likelihood function is a product of quantities, each bounded between 0 and 1, its value will be very small. A better scaling of the likelihood function can be obtained by transforming it using logarithms. Furthermore, because of the following properties of logarithms,

\ln xy = \ln x + \ln y
TABLE 3.1 Item Parameters and Response Patterns for Five Examinees on Five Test Items

              Item Parameters
Item      aᵢ       bᵢ       cᵢ
1        1.27     1.19     0.10
2        1.34     0.59     0.15
3        1.14     0.15     0.15
4        1.00    -0.59     0.20
5        0.61    -2.00     0.01

(Examinee Item Responses: one 0/1 column per examinee, 1 to 5; Examinee 3's response pattern, for example, is 0, 0, 0, 1, 1.)
and

\ln x^a = a \ln x
using logarithms simplifies the computations (and, as we shall see, computation of the first derivative) considerably. Using the above two properties, the general expression for the logarithm of the likelihood function (log-likelihood, for short) may be written as
\ln L(u \mid \theta) = \sum_{j=1}^{n} \left[ u_j \ln P_j + (1 - u_j) \ln (1 - P_j) \right]
Here, u is the vector of item responses. Graphs of the log-likelihoods for Examinees 3, 4, and 5 are given in Figure 3.1. The log-likelihood for Examinee 3 peaks at θ = -0.5, while for Examinee 4 the log-likelihood peaks at θ = 1. For Examinee 5 the peak is at θ = -1.5. The value of θ that makes the likelihood function (or, correspondingly, the log-likelihood) for an examinee a maximum is defined as the maximum likelihood estimate of θ for that examinee. The problem of finding the maximum value of a function is not a trivial one. The graphical procedure described above was used for illustration and is not feasible when many examinees and many items are used. The value that maximizes the function may be found using a search procedure with a computer. More efficient procedures use the fact that, at the point where the function reaches a maximum, the slope of the function (the first derivative) is zero. Thus, the maximum likelihood estimate may be determined by solving the equation obtained by setting the first derivative of the likelihood or log-likelihood function equal to zero.
Figure 3.1. Log-Likelihood Functions for Three Examinees

Again, this equation cannot be solved directly, and approximation methods must be used. The most popular of the approximation methods is the Newton-Raphson procedure described in detail in Hambleton and Swaminathan (1985). Unfortunately, the likelihood (or log-likelihood) function might not have a finite value as its maximum, as when an examinee answers all items correctly or all items incorrectly. In this case, the maximum likelihood estimate will be θ = +∞ or θ = -∞. Some peculiar response patterns (which cannot be discerned as such a priori) may result also in likelihood functions that do not have a finite absolute maximum. The log-likelihood functions for the first two examinees from Table 3.1 are shown in Figure 3.2. For Examinee 2, the log-likelihood function appears to have a maximum at the point θ = 0.9; however, the function has a higher value at θ = -∞ (values of the function are shown in the figure only to θ = -6). For Examinee 1, too, the likelihood function has its maximum at θ = -∞. Hence, for both examinees, maximum likelihood estimates do not exist. The reason for this situation is that the response patterns of these two examinees are aberrant: The
Figure 3.2. Log-Likelihood Functions for Two Examinees with Aberrant Responses

examinees answered some relatively difficult and discriminating items correctly and answered some of the easier items incorrectly. In cases like this the numerical procedures used to find the maximum usually will diverge. The problem noted above with aberrant responses occurs only with the three-parameter model and not with the one- or two-parameter models (see Hambleton & Swaminathan [1985], and Yen, Burket, & Sykes [in press] for discussions of this issue), and may occur even for tests with as many as 40 items.

The maximum likelihood estimates (MLEs), when they exist, have well-known asymptotic (i.e., large sample) properties. Since we are dealing with an examinee, asymptotic refers to increasing test length. As test length increases, the MLE of θ, denoted as θ̂, is distributed normally with mean θ. This implies that the asymptotic distribution of θ̂ is centered on the true value of θ; hence, the MLE θ̂ is unbiased in long tests. The standard deviation of θ̂, or the standard error, denoted as SE(θ̂), is a function of θ and is given as

SE(\hat\theta) = \frac{1}{\sqrt{I(\theta)}}
where I(θ) is what is called the information function. Since θ is not known, the information function must be computed by substituting θ̂ in the above expression. Computation of the information function, its properties, and its role in test construction are described in detail in chapter 6.

The normality of θ̂ can be used to construct a confidence interval for θ. The (1 - α)% confidence interval for θ is given by
\left( \hat\theta - z_{\alpha/2}\, SE(\hat\theta),\;\; \hat\theta + z_{\alpha/2}\, SE(\hat\theta) \right)
where SE(θ̂) is the standard error evaluated at θ̂, and
z_{α/2} is the upper (1 - α/2) percentile point of the normal distribution. For the 95% confidence interval, α = 0.05 and z_{α/2} = 1.96.

The problem of not finding maximum likelihood estimates in some situations can be overcome if a Bayesian estimation procedure is used. The basic idea is to modify the likelihood function to incorporate any prior information we may have about the ability parameters. For example, we may be able to say, based on some previous experience, that θ is distributed normally with mean μ and standard deviation σ. In this case, the prior information can be expressed in the form of a density function and denoted as f(θ). Bayes' theorem states that the probability of an event A given B is
P(A \mid B) \propto P(B \mid A)\, P(A)
where P(A) is the prior probability of event A occurring. The above relationship is also true for density functions, where A is θ and B is the observed item response pattern, u. Bayes' theorem can be written then as

f(\theta \mid u) \propto f(u \mid \theta)\, f(\theta)
Now f(u | θ) is, in fact, the likelihood function and, hence,

f(\theta \mid u) \propto L(u \mid \theta)\, f(\theta)
The revised likelihood function f(θ | u) is called the posterior density, and its mode is the "most probable" value for θ and can be taken as an
estimate of θ. Note that if we assume a uniform prior distribution for θ (i.e., f(θ) = k, a constant), then
f(\theta \mid u) \propto L(u \mid \theta)
In this case the Bayesian estimate is numerically identical to the maximum likelihood estimate. We emphasize numerically because the philosophical basis underlying the Bayesian procedure is very different from the classical or relative frequency notion of probability (see Kendall & Stuart [1961] for details on this issue).

Using a Bayesian approach solves some of the difficulties encountered with the maximum likelihood approach. Bayesian estimates of θ can be obtained for zero-items-correct and perfect response patterns, and for "aberrant" response patterns. The posterior distribution of θ may be described in many ways. The mode of the distribution, the Bayesian modal estimate, provides only one description. The mean of the distribution also may be used as an estimate. The mean can be computed by approximating the posterior distribution of θ in a finite interval with a histogram, that is, forming a frequency distribution with k values of θ. The frequency at the point θⱼ (j = 1, ..., k) is f(θⱼ | u). The mean can then be obtained in the usual way:
\mu(\theta \mid u) = \frac{\sum_{j=1}^{k} \theta_j\, f(\theta_j \mid u)}{\sum_{j=1}^{k} f(\theta_j \mid u)}    [3.3]
Bock and Mislevy (1982) have called this estimate the Expected A Posteriori (EAP) estimate.
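As a concrete illustration, the sketch below (Python; an addition, not the authors' code) computes the EAP estimate of Equation 3.3 for Examinee 3 of Table 3.1, using a standard normal prior evaluated on a grid of points.

```python
import numpy as np

D = 1.7
a = np.array([1.27, 1.34, 1.14, 1.00, 0.61])   # Table 3.1 item parameters
b = np.array([1.19, 0.59, 0.15, -0.59, -2.00])
c = np.array([0.10, 0.15, 0.15, 0.20, 0.01])
u = np.array([0, 0, 0, 1, 1])                  # Examinee 3's responses

theta_k = np.linspace(-4, 4, 81)               # the k values of theta

def likelihood(t):
    p = c + (1 - c) / (1 + np.exp(-D * a * (t - b)))
    return np.prod(p ** u * (1 - p) ** (1 - u))

prior = np.exp(-0.5 * theta_k ** 2)            # standard normal prior f(theta)
post = np.array([likelihood(t) for t in theta_k]) * prior
post /= post.sum()                             # normalized posterior "histogram"
eap = np.sum(theta_k * post)                   # Equation 3.3: the posterior mean
print(round(eap, 2))
```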
Estimation of Item Parameters

In describing the procedures for estimating θ, we assumed that the item parameters were known. At some point, we have to face the fact that the item parameters also must be estimated! For estimating the
ability of an examinee when item parameters are known, we administer many items to the examinee and obtain the likelihood function for the responses of the examinee to n items. Conversely, if we want to estimate item parameters when θ is known for each examinee, we administer the item of interest to many examinees and obtain the likelihood function for the responses of N examinees to the item, that is,
L(u_1, u_2, \ldots, u_N \mid \theta, a, b, c) = \prod_{i=1}^{N} P_i^{u_i}\, Q_i^{1 - u_i}
where a, b, and c are the item parameters (assuming a three-parameter model). The difference between the likelihood function for an examinee and that for an item is that, for an item, the assumption of local independence need not be invoked; we merely assume that the responses of N examinees to an item are independent, a standard assumption in statistics. The assumption of local independence is more stringent in that we must assume that the responses of an examinee to two or more items are independent.

When the θ values are known, the estimation of item parameters is straightforward and is comparable to the procedure described in the previous section. The difference is that the likelihood function for an item, unlike that for an examinee, is multidimensional in the item parameters; that is, it is a function of three parameters. Thus, to find the MLEs of the parameters a, b, and c, we must find the values of a, b, and c that correspond to the maximum value of a surface in three dimensions. This is accomplished by finding the first derivative of the likelihood function with respect to each of the parameters a, b, and c, setting these derivatives to zero, and solving simultaneously the resulting system of nonlinear equations in three unknowns. Obviously, we solve for two unknowns when the two-parameter model is used, and solve for only one unknown when the one-parameter model is used. Again, the Newton-Raphson procedure, in its multivariate form, is used commonly to solve these equations. When the ability of each examinee is known, each item may be considered separately without reference to the other items. Thus, the estimation procedure must be repeated n times, once for each item.
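To make this concrete, here is a minimal sketch (Python; added for illustration) of item parameter estimation with known abilities, using the one-parameter model and the three-examinee data that appear later as Exercise 3 of this chapter. A simple grid search stands in for the Newton-Raphson procedure.

```python
import numpy as np

D = 1.7
theta = np.array([-1.0, 0.0, 1.0])  # known abilities of three examinees
u = np.array([0, 0, 1])             # their responses to the item of interest

def item_likelihood(b):
    """Likelihood of the three responses for a trial difficulty value b."""
    p = 1.0 / (1.0 + np.exp(-D * (theta - b)))
    return np.prod(p ** u * (1 - p) ** (1 - u))

bs = np.arange(0.0, 1.01, 0.1)
best = bs[np.argmax([item_likelihood(b) for b in bs])]
print(round(best, 1))               # 0.6, matching the answer to Exercise 3
```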
Joint Estimation of Item and Ability Parameters

It is apparent that at some point neither θ nor the item parameters will be known. This is the most common situation and presents the most difficult problem. In this case the responses of all the examinees to all the items must be considered simultaneously. The likelihood function when N examinees respond to n items, using the assumption of local independence, is
n n" i' N
L(u"
1l2' .•• ,UN
19, a, b, c)
=
j= I
where uᵢ is the response pattern of examinee i to the n items; θ is the vector of N ability parameters; a, b, and c are the vectors of item parameters for the n-item test. The number of item parameters is 3n in the three-parameter model (2n for the two- and n for the one-parameter model, respectively). Local independence must be assumed since θs are not known. The number of ability parameters is N and, hence, for the three-parameter model a total of 3n + N parameters is to be estimated.

Before the estimation can proceed, however, the problem of indeterminacy must be addressed. In the likelihood function given above, the item and ability parameters are not uniquely determined. In the item response function for, say, the three-parameter model (see Equation 2.3), if we replace θ by θ* = αθ + β, b by b* = αb + β, and a by a* = a/α, the probability of a correct response remains unchanged,
P(\theta) = P(\theta^*)
Since α and β are arbitrary scaling constants, the likelihood function will not have a unique maximum. Any numerical procedure employed to find the maximum of the likelihood function will fail because of this indeterminacy. This problem does not arise in the estimation of θ when item parameters are known, or in the parallel situation in which item parameters are estimated in the presence of known ability parameters, because there is no indeterminacy in these situations.
The problem of indeterminacy may be eliminated by choosing an arbitrary scale for the ability values (or the b values); usually, the mean and standard deviation of the N ability values (or the n item difficulty values) are set to be 0 and 1, respectively. As we shall see later, this scaling must be taken into account when comparing estimates of item parameters for two or more groups. Once the indeterminacy is eliminated, the values of the item and ability parameters that maximize the likelihood function can be determined.

In the simultaneous or joint maximum likelihood estimation procedure, this determination must be done in two stages. In the first stage, initial values for the ability parameters are chosen. The logarithm of the ratio of number-right score to number-wrong score for each examinee provides good starting values. These values are then standardized (to eliminate the indeterminacy) and, treating the ability values as known, the item parameters are estimated. In the second stage, treating the item parameters as known, the ability parameters are estimated. This procedure is repeated until the values of the estimates do not change between two successive estimation stages. This joint maximum likelihood procedure is implemented in LOGIST (Wingersky, 1983) for the one-, two-, and three-parameter models, and in BICAL (Wright, Mead, & Bell, 1979) and BIGSCALE (Wright, Schulz, & Linacre, 1989) for the one-parameter model.

The joint maximum likelihood procedure, while conceptually appealing, has some disadvantages. First, ability estimates for perfect and zero scores do not exist. Second, item parameter estimates for items that are answered correctly (or incorrectly) by all examinees do not exist. Items and examinees exhibiting these patterns must be eliminated before estimation can proceed. Third, in the two- and three-parameter models the joint maximum likelihood procedure does not yield consistent estimates of item and ability parameters. (Swaminathan & Gifford [1983] have shown empirically that consistent estimates may be obtained for item and ability parameters if both the number of examinees and the number of items become large.) Fourth, in the three-parameter model, unless restrictions are placed on the values the item and ability parameters take, the numerical procedure for finding the estimates may fail.
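The two-stage logic is easy to prototype. The following sketch (Python; an illustration only, with a tiny hypothetical data set and grid search in place of Newton-Raphson) alternates between the two stages for the one-parameter model, standardizing abilities each cycle to remove the indeterminacy.

```python
import numpy as np

D = 1.7
resp = np.array([          # hypothetical 6 examinees x 5 items response matrix
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
])
n_exam, n_items = resp.shape
grid = np.linspace(-3, 3, 121)

def loglik(theta, b, u):
    """Log-likelihood of 0/1 responses u under the one-parameter model."""
    p = 1.0 / (1.0 + np.exp(-D * (theta - b)))
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

# Starting values: ln(number right / number wrong), then standardized
right = resp.sum(axis=1)
theta = np.log(right / (n_items - right))
theta = (theta - theta.mean()) / theta.std()

b = np.zeros(n_items)
for cycle in range(10):
    # Stage 1: treat abilities as known, estimate each item difficulty
    for j in range(n_items):
        b[j] = grid[np.argmax([loglik(theta, g, resp[:, j]) for g in grid])]
    # Stage 2: treat item difficulties as known, estimate each ability
    for i in range(n_exam):
        theta[i] = grid[np.argmax([loglik(g, b, resp[i]) for g in grid])]
    theta = (theta - theta.mean()) / theta.std()  # remove the indeterminacy

print(np.round(b, 2), np.round(theta, 2))
```

The small data set was chosen so that no examinee has a perfect or zero score and no item is answered correctly (or incorrectly) by everyone, the cases for which joint estimates do not exist.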
Alternative approaches to estimation are available. One approach is to obtain Bayesian estimates of the parameters using prior distributions. Swaminathan and Gifford (1982, 1985, 1986) have developed Bayesian procedures for the one-, two-, and three-parameter models in which prior distributions are placed on the item and ability parameters. This procedure eliminates the problems encountered in the joint maximum likelihood procedure, namely that of improper estimates for certain response patterns.

The problem of inconsistent joint maximum likelihood estimates occurs because both the item and ability parameters are estimated simultaneously. This problem disappears if the item parameters can be estimated without any reference to the ability parameters. If we consider the examinees as having been selected randomly from a population, then, by specifying a distribution for the ability parameters, we can integrate them out of the likelihood function (integrating out the ability parameters has the same effect as "averaging" over the ability distribution to obtain a marginal likelihood function in terms of the item parameters). The resulting "marginal maximum likelihood estimates" do have desirable asymptotic properties; that is, the item parameter estimates are consistent as the number of examinees increases. This marginal maximum likelihood estimation procedure was developed by Bock and Lieberman (1970), refined by Bock and Aitkin (1981), and implemented in the computer program BILOG by Mislevy and Bock (1984).

The marginal maximum likelihood procedure is computationally more intensive than the joint maximum likelihood procedure because of the integration that is required. Moreover, in order to obtain the marginal likelihood function of the item parameters, it is necessary to approximate the distribution of ability. For a good approximation of the ability distribution, the availability of a large number of examinees is important. Hence, the marginal maximum likelihood procedure should be carried out only with sufficiently large numbers of examinees.

Once the item parameters have been estimated using the marginal maximum likelihood procedure, the item parameter estimates may be treated as known and the abilities of the examinees can be estimated using the method outlined earlier in this chapter. Again, the larger the number of items, the better the ability parameter estimates. Either the
maximum likelihood estimates of ability or, if desired, the EAP estimates of ability may be obtained.

In some situations, even the marginal maximum likelihood procedure may fail; that is, the numerical procedure may fail to yield a satisfactory result even after a large number of iterations. This failure happens primarily in the estimation of the c parameter in the three-parameter model. Poor estimates of c, in turn, degrade estimates of other item parameters and of ability (Swaminathan & Gifford, 1985). Bayesian estimation (Mislevy, 1986) solves this problem (in fact, within the BILOG computer program, a prior distribution is placed on the c parameter values as the default option).

Standard Errors of Item Parameter Estimates

The concept of the information function, briefly introduced earlier, is a generic concept that relates to the variance of a maximum likelihood estimator. When the maximum likelihood estimate of the ability parameter is obtained, its variance is given as the reciprocal of the corresponding information function. Similarly, when maximum likelihood estimates of item parameters are obtained, the variance-covariance matrix of the estimates is given as the inverse of the information matrix of item parameter estimates (since, in the case of the two- and three-parameter models, each item is characterized by two and three parameters, respectively). The elements of the information matrix for the joint maximum likelihood estimates for each item are arranged in the following manner (since the matrix is symmetric, only the upper triangle elements are given):
Standard Errors or Item Parameter Estimates The concept of the information function, briefly introduced earlier, is a generic concept that relates to the variance of a max imum likelihood estimator. When the maximum likelihood estimate of the ability parameter is obtained, its variance is given as the reciprocal of the corresponding information function. Similarly, when mludmum likelihood estimates or'item parameters are obtained, the variance·covariance matrix of the estimates is given as the inverse of the information matrix of item parameter estimates (since, in the case of the two- and three-parameter models, each item is characterized by two and three parameters, respectively). The elements of the information matrix for the joint maximum likelihood estimates for each item are arranged in the following manner (since the matrix is symmetric. only the upper triangle elements are given):
i = 1.2, .... n
The expressions for the elements are given below (Hamhleton & Swaminathan, 1985; Lord, 1980).
45
IIhl/i I\' alld IINII l'amm('It'f f.llima/imt
h,) (Pi) -
('i)
qiJ P Ij
Simple expressions for the variance-covariance matrix of marginal maximum likelihood estimates are not available, but a description of the procedure for obtaining them is given in Mislevy and Bock (1984) and Mislevy (1986). The variance-covariance matrix of the item parameter estimates is important when comparing the item parameters in two groups, a problem that arises in bias or differential item functioning studies (Lord, 1980).
Summary of Parameter Estimation Methods

In the preceding sections, maximum likelihood, marginal maximum likelihood, and Bayesian estimation procedures were described. These are the most widely used estimation procedures. For reviews of current procedures, refer to Baker (1987) and Swaminathan (1983). Several other approaches to estimation were not described in this chapter. A list
of the available estimation procedures with brief descriptions is given below.
• Joint maximum likelihood procedure (Lord, 1974, 1980), applicable to the one-, two-, and three-parameter models. The ability and item parameters are estimated simultaneously.
• Marginal maximum likelihood procedure (Bock & Aitkin, 1981), applicable to the one-, two-, and three-parameter models. The ability parameters are integrated out, and the item parameters are estimated. With the item parameter estimates determined, the ability parameters are estimated.
• Conditional maximum likelihood procedure (Andersen, 1972, 1973; Rasch, 1960), applicable only to the one-parameter model. Here, the likelihood function is conditioned on the number-right score.
• Joint and marginal Bayesian estimation procedures (Mislevy, 1986; Swaminathan & Gifford, 1982, 1985, 1986), applicable to the one-, two-, and three-parameter models. Prior distributions are placed on the item and ability parameters, eliminating some of the problems, such as improper estimation of parameters and nonconvergence, encountered with joint and marginal maximum likelihood procedures.
• Heuristic estimation procedure (Urry, 1974, 1978), applicable primarily to the two- and three-parameter models.
• Method based on nonlinear factor analysis procedures (McDonald, 1967, 1989), applicable to the two-parameter and a modified case of the three-parameter model in which the c-values are fixed.

In addition, when item parameters are known, estimation of ability can be carried out by obtaining the mode of the likelihood function or, in the case of Bayesian procedures, either the mean or the mode of the posterior density function of θ. The procedures summarized above are implemented in computer programs described in the next section.
Computer Programs for Parameter Estimation

Until recently, few computer programs were available for estimation of the parameters of the IRT models introduced earlier. In the 1970s the most widely known and used programs were BICAL (Wright et al., 1979) and LOGIST (Wingersky, Barton, & Lord, 1982). BICAL fits the one-parameter model; LOGIST fits the one-, two-, and three-parameter models. Both programs use joint maximum likelihood estimation pro-
cedures, and both remain widely used. LOGIST remains the standard by which new estimation programs are judged. Other programs available in the 1970s were PML (Gustafsson, 1980a) and ANCILLES (Urry, 1974, 1978). PML fits the one-parameter model using the conditional maximum likelihood procedure, while ANCILLES fits the three-parameter model using a heuristic procedure. PML has not been used widely in the United States, and ANCILLES is not used often because its estimation procedure is not well grounded theoretically and other programs have been shown to produce better estimates.

In the 1980s several new estimation programs were introduced. Most notable of these were BILOG (Mislevy & Bock, 1984) and ASCAL (Assessment Systems Corporation, 1988). BILOG fits the one-, two-, and three-parameter models using marginal maximum likelihood procedures with optional Bayesian procedures; ASCAL fits the three-parameter model using Bayesian procedures. BILOG is available in both mainframe and microcomputer versions, while ASCAL is a microcomputer program. Also available in the early 1980s was the program NOHARM (Fraser & McDonald, 1988), which fits two- and three-parameter models (with fixed c-values) using a nonlinear factor analysis approach. NOHARM has not received much attention in the United States. Other developments included microcomputer programs for fitting the one-parameter model, MICROSCALE (Mediax Interactive Technologies, 1986) and RASCAL (Assessment Systems Corporation, 1988). A microcomputer version of LOGIST is being developed and is expected to be released in 1991 or 1992. RIDA (Glas, 1990) is a new microcomputer program for analyzing dichotomous data using the one-parameter model. Both marginal and conditional maximum likelihood estimation procedures are available. A special feature is the capability of analyzing various incomplete test designs that often arise in test equating (see chapter 9).

Most recently, interest in IRT programs that handle polytomous data (Thissen, 1986; Wright et al., 1989) and multidimensional data (Carlson, 1987) has developed, but work on the latter topic is only just beginning and considerable amounts of research are needed before Carlson's program can be used operationally. A summary of the programs listed above and their advantages and key features is given in Table 3.2. Sources for the programs are given in Appendix B.
TABLE 3.2 Currently Available IRT Parameter Estimation Programs

BICAL (replaced by BIGSCALE)
  Source: Wright et al. (1979); Wright et al. (1989)
  Model: 1P
  Estimation Procedure: Unconditional Maximum Likelihood
  Computer Requirements: Most mainframes
  Pros (+), Cons (-), and Features (•): + Inexpensive. + Gives standard errors. + Gives graphical/statistical fit analysis.

MICROSCALE
  Source: Mediax Interactive Technologies (1986)
  Model: 1P, Multicategory
  Estimation Procedure: Unconditional Maximum Likelihood
  Computer Requirements: PC
  • PC version of BICAL. • Data can be input in a spreadsheet.

PML
  Source: Gustafsson (1980a)
  Model: 1P
  Estimation Procedure: Conditional Maximum Likelihood
  Computer Requirements: Unknown
  + Estimates are consistent. - Computationally intensive. • Not widely used in the U.S.

RASCAL
  Source: Assessment Systems Corp. (1988)
  Model: 1P
  Estimation Procedure: Unconditional Maximum Likelihood
  Computer Requirements: PC
  + Includes analyses of fit. • Incorporated in the MicroCAT package.

RIDA
  Source: Glas (1990)
  Model: 1P
  Estimation Procedure: Conditional or Marginal Maximum Likelihood
  Computer Requirements: PC
  + Provides a complete analysis of examinees and items. + Handles incomplete designs for test equating. + Includes fit analysis.

ANCILLES
  Source: Urry (1974, 1978)
  Model: 3P
  Estimation Procedure: Heuristic
  Computer Requirements: Most mainframes
  + Inexpensive. - Often deletes items/examinees. - Estimation procedure not well grounded theoretically. • Not widely used.

ASCAL
  Source: Assessment Systems Corp. (1988)
  Model: 3P
  Estimation Procedure: Modified Bayesian
  Computer Requirements: PC
  + Includes analysis of fit. + Incorporated in the MicroCAT package. • Uses Bayesian procedures.

LOGIST
  Source: Wingersky (1983); Wingersky et al. (1982)
  Model: 1P, 2P, 3P
  Estimation Procedure: Unconditional Maximum Likelihood
  Computer Requirements: IBM/CDC mainframes (Version IV)
  + LOGIST V gives standard errors. + Flexible, many options. + Allows omits/not reached. - Input specifications are complex. - Expensive to run. - Difficult to convert for non-IBM equipment. - Places many constraints on the parameters to obtain convergence.

BILOG
  Source: Mislevy & Bock (1984)
  Model: 1P, 2P, 3P
  Estimation Procedure: Marginal Maximum Likelihood
  Computer Requirements: IBM mainframe; PC version
  + Optional Bayes's estimates. + Priors prevent extreme estimates. - Expensive to run on mainframe. - Wrong priors may give bad estimates.

NOHARM
  Source: Fraser & McDonald (1988)
  Model: 1P, 2P, 3P
  Estimation Procedure: Least Squares
  Computer Requirements: Most mainframes; PC
  + Fits a multidimensional model. + Includes residual analysis. - c parameter is fixed. • Not widely used in the U.S.

MULTILOG
  Source: Thissen (1986)
  Model: Multicategory
  Estimation Procedure: Marginal Maximum Likelihood
  Computer Requirements: IBM mainframe
  • Generalization of BILOG to handle multicategory data.

MIRTE
  Source: Carlson (1987)
  Model: 1P, 2P, 3P
  Estimation Procedure: Unconditional Maximum Likelihood
  Computer Requirements: IBM mainframe; PC
  + Fits a multidimensional model. + Gives standard errors. • Includes residual analysis. - c parameter is fixed.
Exercises for Chapter 3

1. For the five items given in Table 3.1, the responses of an examinee are (0 0 0 1 1).
a. What is the likelihood function for this examinee? State the assumption that must be made in determining the likelihood function.
b. Plot the likelihood function at θ values from -1 to 0 in increments of 0.1. Based on the graph, determine the maximum likelihood estimate of θ.
2. The item parameters (obtained using a two-parameter model) for four items are given in Table 3.3.

TABLE 3.3

Item      b       a
1        0.0     1.0
2        1.0     1.0
3        1.0     2.0
4        1.5     2.0
The maximum likelihood estimate of ability for an examinee who takes this four-item test is 1.5.
a. Determine the standard error of the estimate.
b. Construct a 95% confidence interval for θ.

3. Consider three examinees with ability values θ = -1, 0, 1. The responses of the three examinees to an item are 0, 0, 1, respectively. Assume that the one-parameter model with a certain (unknown) b value fits the item.
a. Write down the likelihood function in terms of the unknown b value, and state the assumptions that are made.
b. Plot the likelihood function at b values of 0 to 1 in increments of 0.1. Based on the plot, determine the maximum likelihood estimate of b.
4. a. For the one-parameter model, write down the information function and standard error of the item difficulty estimate.
b. Compute the standard error of the difficulty parameter estimate for the data given in Exercise 3.
Answers to Exercises for Chapter 3
1. a. L(u₁, u₂, ..., u₅ | θ) = Q₁Q₂Q₃P₄P₅. Since we are looking at the responses of one examinee on five items, we make the assumption that local independence holds. See Table 3.4.
TABLE 3.4

θ   -1.0   -0.9   -0.8   -0.7   -0.6   -0.5   -0.4   -0.3   -0.2   -0.1    0
L   0.201  0.213  0.225  0.234  0.241  0.244  0.243  0.238  0.228  0.213  0.195

(L = likelihood)

b. θ̂ = -0.45

2. a. I(θ) = D² Σ aᵢ² Pᵢ Qᵢ = 5.19. SE(θ̂) = 1/√5.19 = 0.44.
b. 95% confidence interval for θ = θ̂ ± (1.96)SE = 1.5 ± (1.96)(0.44) = 1.5 ± 0.86 = (0.64, 2.36).

3. a. Since the responses of different examinees are independent, and θ₁, θ₂, and θ₃ are given, P(U₁, U₂, U₃ | θ₁, θ₂, θ₃) = P(U₁ | θ₁) P(U₂ | θ₂) P(U₃ | θ₃). The likelihood function is, therefore,

L(u_1, u_2, u_3 \mid \theta_1, \theta_2, \theta_3) = Q_1 Q_2 P_3 = \frac{1}{[1 + e^{1.7(-1-b)}]\,[1 + e^{-1.7b}]\,[1 + e^{-1.7(1-b)}]}

b. See Table 3.5.
TABLE 3.5

b   0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
L   0.357  0.386  0.411  0.432  0.447  0.455  0.458  0.454  0.444  0.429  0.409
The maximum value of the likelihood occurs at b = 0.6. Therefore, the maximum likelihood estimate of b is 0.6.
4. a. I(b) = D^2 \sum_{i=1}^{N} P(\theta_i)\, Q(\theta_i); \quad SE(\hat b) = \frac{1}{\sqrt{I(b)}}
b. I(b) = 2.89 (0.062 x 0.938 + 0.265 x 0.735 + 0.664 x 0.336) = 1.376; SE(b̂) = 0.85
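The numerical answers to Exercise 2 can be checked with a short script. This sketch (Python; an addition for illustration) computes the test information, standard error, and confidence interval for the Table 3.3 items.

```python
import numpy as np

D = 1.7
a = np.array([1.0, 1.0, 2.0, 2.0])  # Table 3.3 item parameters
b = np.array([0.0, 1.0, 1.0, 1.5])
theta_hat = 1.5

p = 1.0 / (1.0 + np.exp(-D * a * (theta_hat - b)))
info = D ** 2 * np.sum(a ** 2 * p * (1 - p))   # I(theta) = D^2 sum a_i^2 P_i Q_i
se = 1.0 / np.sqrt(info)
print(round(info, 2), round(se, 2))            # about 5.2 and 0.44 (5.19 in the key)
print(round(theta_hat - 1.96 * se, 2),
      round(theta_hat + 1.96 * se, 2))         # approximately (0.64, 2.36)
```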
4 Assessment of Model-Data Fit
Item response theory (IRT) has great potential for solving many problems in testing and measurement. The success of specific IRT applications is not assured, however, simply by processing test data through one of the computer programs described in Table 3.2. The advantages of item response models can be obtained only when the fit between the model and the test data of interest is satisfactory. A poorly fitting IRT model will not yield invariant item and ability parameters.

In many IRT applications reported in the literature, model-data fit and the consequences of misfit have not been investigated adequately. As a result, less is known about the appropriateness of particular IRT models for various applications than might be assumed from the voluminous IRT literature. In some cases goodness-of-fit studies have been conducted using what now appear to be inappropriate statistics (see, for example, Divgi, 1986; Rogers & Hattie, 1987), which may have resulted in erroneous decisions about the appropriateness of the model applied.

A further problem with many IRT goodness-of-fit studies is that too much reliance has been placed on statistical tests of model fit. These tests have a well-known and serious flaw: their sensitivity to examinee sample size. Almost any empirical departure from the model under consideration will lead to rejection of the null hypothesis of model-data fit if the sample size is sufficiently large. If sample sizes are small, even large model-data discrepancies may not be detected due to the low statistical power associated with significance tests. Moreover, parameter estimates based on small samples will be of limited usefulness because they will have large standard errors. In addition, the sampling distributions of some IRT goodness-of-fit statistics are not what they have been claimed to be; errors may be made when these statistics are interpreted in light of tabulated values of known distributions (see, for example, Divgi, 1986; Rogers & Hattie, 1987).
TABLE 4.1 Number of Misfitting Items Detected Using the Q₁ Statistic

Sample Size    Slight Misfit (2% to 3%)    Minor Misfit (4% to 5%)
150                      0                            3
300                      1                            4
600                      1                            6
1200                     5                           10
2400                 11 (22%)                     18 (36%)
The sensitivity of goodness-of-fit statistics to sample size is illustrated in Table 4.1. A computer program, DATAGEN (Hambleton & Rovinelli, 1973), was used to simulate the item responses of 2400 examinees on a 50-item test. The items were described by two-parameter logistic ICCs, and examinee ability was simulated to have a standard normal distribution. Two simulated tests were generated: In the first, item discrimination parameters were set at either 0.90 or 1.10, with equal numbers of items at each value. This difference corresponds to a 2% or 3% difference, on the average, in the ICCs, for ability scores over the interval [-3, 3]. In the second simulated test, item discrimination parameters were set at either 0.80 or 1.20, again with equal numbers of items at each value. This difference corresponds to a 4% or 5% difference, on the average, in the ICCs, over the interval [-3, 3]. With these item discrimination values, the test data represented "slight" and "minor" departures from the assumptions of the one-parameter model. Item difficulty values were chosen to be similar to those commonly found in practice (-2.00 to +2.00).

The one-parameter model was fitted to the generated two-parameter data, and ability estimates were obtained for five overlapping samples of examinees: the first 150, the first 300, the first 600, the first 1200, and the total sample of 2400 examinees. Then, with each of the five data sets, the Q₁ statistic (Yen, 1981) was used to determine the number of misfitting test items at the 0.05 level of significance. The statistics in Table 4.1 clearly show the influence of sample size on detection of model-data misfit. With small samples, almost no items were detected or identified as misfitting the model; considerably more items were detected with large samples. With large sample sizes, however, even minor empirical departures from the model will result in many items being identified as misfitting, although in practice they would function quite acceptably.
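A simulation of this kind is straightforward to reproduce. The sketch below (Python; DATAGEN itself is not available here, so this is only an analogous illustration) generates two-parameter logistic item responses under the "slight misfit" condition; the uniform spread of difficulties is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1.7
n_items, n_exam = 50, 2400
a = np.repeat([0.90, 1.10], n_items // 2)   # equal numbers of items at each a value
b = rng.uniform(-2.0, 2.0, n_items)         # difficulties in (-2.00, 2.00); assumed uniform
theta = rng.standard_normal(n_exam)         # standard normal abilities

p = 1.0 / (1.0 + np.exp(-D * a * (theta[:, None] - b)))  # 2400 x 50 probabilities
responses = (rng.uniform(size=p.shape) < p).astype(int)  # simulated 0/1 item responses
print(responses.shape, responses.mean())    # data to which a 1P model could then be fitted
```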
Fortunately, an alternative exists to placing undue emphasis on the results of significance tests in choosing IRT models. Hambleton and Swaminathan (1985) have recommended that judgments about the fit of the model to the test data be based on three types of evidence:
Some promising goodness-of-fit analyses for amassing the three types of useful evidence have been described by Hambleton (1989) and Hambleton and Swaminathan (J985) and are summarized in Table 4.2. Evidence of the first type, bearing on the assumptions of the model, orten can be helpful in selecting IRT models for use in investigating the second and third Iypes of evidence. Evidence of the second type, involving invesligations of parameter invariance, is essential regardless of the intended application, since all IRT applications depend on this property. Evidence of the third type involves assessment of the extent to which the IRT llIodel accounts for, or explains, the actual test results and helps in understanding the nature or model-data discrepancies and their consequences. Fitting more than one model 10 the test data and comparing the results to the results obillined with simulated data that were generated to fit the model of interest nre especially helpful activilie ... in choosing an appropriate model (see, for example, Hambleton & Rogers, in press).
Checking Assumptions Model selectioll can be "ided hy lUi investiglltinll of the principal lIsslimptions underlying the populnr unidimensionnl item response models. Two Iissumptions common to all these models are that the data are unidimensional and the test administration was lIot speeded. An additional assumption of the two-parameter model is that guessing is minimal; a further assumption for the one-parameter lIlotlcJ is that all item discrimination indices are equal.
. '~.......~.
56
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
TABLE 4.2 Approaches for Assessing
Goodne~s
of Fit
Checking Model AUIIIII"'innJ I.
Unidimensiollnlity • Eigenvalue plot (frol11 largest to smallest) of the interitem correlation matrix (tetrachoric correlations are usually preferable to phi correlado~s). The plot of eigenvalues is studied to determine whether a dominant first factor is present (Reckase, 1979). • Com pari Non of the plots of eigenvalues from the interitem correlation matrix using the test data. and an interitem correlation matrix of random data (the mndolO data consist of random normal deviates in a data set with the same sal11ple size and with the same number of vRriables as the test data). The two eigenvalue plots are compared. If the unidimensionality assumption is met in the test data, the two plots should be similar except for the first eigenvalue of the plot of eigenvalues for the real data. The first eigenvalue should be substantially lalger than its counterpart in the random data plot (110m, 1965). Recent modifi<.:ations and exampleN of this method can be found in the work of Drasgow and Lissak
(1983). • Investigation of the assumption of local independence by checking the variance .. covariance or correlation matril( for el(aminees wilhin different intervals on the ability or test score scale (McDonald, 1981; Tucker, Humphreys, & ROl.l1owski, 1986). The entries in the off-diagonal elements of the matrices will be small and close to zero when the unidimensionality assumption is (approximately) me!. • Filling a nonlinear one-factor analysis model to the interitem correlation matrix and studying the residuals (Hallie, 1985; McDonald, 1981). Promising results from this approach were obtained by Hambleton and Rovinelli (\986). •
Using a method of factor analysis based directly on IRT (Bock, Gibbons, & Muraki, 1988): A multidimensional version of the three-parameter normal ogive model is assumed 10 account for the vector of ilem responses. Estimation of model parameters is time-consuming and complicnted but results can be oblained, and the results to date have been promising. Of special interest is the fit of a one-dimensional solution to the test data. hems that appear likely to vioillte the assumption are checked to see whether they function differently. The b-values for these items are calibrated separately as a subtest and then again in the fulltes!. The context of item calibration is unimportant if model assumptions are me!. If the plot of b-values calibrated in the two contexts is linear with the spread comparable with the standard errors associated with the item parameter estimates, the unidimensionality assumption is viable (Bejar, 1980).
2. Equal Discrimination Indices
• The distribution of item-test score correlations (biserial or point-biserial correlations) from a standard item analysis can be reviewed. When the distribution is reasonably homogeneous, the selection of a model that assumes equal item discrimination may be viable.
3. Minimal Guessing
• The performance of low-ability students on the most difficult items can be checked. If performance levels are close to zero, the assumption is viable.
• Plots of item-test score regressions can be helpful (Baker, 1964, 1965). Near-zero item performance for low-scoring examinees will lend support for the viability of the assumption.
• The test difficulty, time limits, and item format should be reviewed to assess the possible role of guessing in test performance.

4. Nonspeeded (Power) Test Administration
• The variance of the number of omitted items should be compared to the variance of the number of items answered incorrectly (Gulliksen, 1950). The assumption is met when the ratio is close to zero.
• The test scores of examinees under the specified time limit and without a time limit are compared. High overlap in performance is evidence for the viability of the assumption.
• The percentage of examinees completing the test, the percentage of examinees completing 75% of the test, and the number of items completed by 80% of the examinees are reviewed. When nearly all examinees complete nearly all of the items, speed is assumed to be an unimportant factor in test performance.
Checking Expected Model Features

1. Invariance of Ability Parameter Estimates
• Ability estimates are compared for different samples of test items (for example, hard and easy items, or tests reflecting different content categories within the item pool of interest). Invariance is established when the estimates do not differ in excess of the measurement errors associated with the estimates (Wright, 1968).

2. Invariance of Item Parameter Estimates
• Comparisons of model item parameter estimates (e.g., b-values, a-values, and/or c-values) obtained in two or more subgroups of the population for whom the test is intended (for example, males and females; blacks, whites, and Hispanics; instructional groups; high- and low-test performers; examinees in different geographic regions). When the estimates are invariant, the plot should be linear, with the amount of scatter reflecting errors due to the sample size only. Baseline plots can be obtained by using randomly equivalent samples (Shepard, Camilli, & Williams, 1984).
Checking Model Predictions of Actual and Simulated Test Results
• Investigation of residuals and standardized residuals of model fit to a data set. Determination of the nature of model misfit is of value in choosing a satisfactory IRT model (see Hambleton & Swaminathan, 1985; Ludlow, 1985, 1986; Wright & Stone, 1979).
• Comparisons of observed and predicted test score distributions, obtained by assuming all model parameter estimates are correct. Chi-square statistics (or other statistics) or graphical methods can be used to report the results (Hambleton & Traub, 1973).
• Investigations of the effects of item placement (Kingston & Dorans, 1984; Yen, 1980), practice effects, test speededness and cheating (Drasgow, Levine, & McLaughlin, 1987), boredom (Wright & Stone, 1979), curriculum (Phillips & Mehrens, 1987), poor choice of model (Wainer & Thissen, 1987), recency of instruction (Cook, Eignor, & Taft, 1988), cognitive processing variables (Tatsuoka, 1987), and other threats to the validity of IRT results can be carried out and used to provide evidence appropriate for addressing particular IRT model uses.
• Scatterplot of ability estimates and corresponding test scores. The relationship should be strong, with scatter around the test characteristic curve (reflecting measurement error) when the fit is acceptable (Lord, 1974).
• Application of a myriad of statistical tests to determine overall model fit, item fit, and person fit (see, for example, Andersen, 1973; Gustafsson, 1980b; Ludlow, 1985, 1986; Traub & Wolfe, 1981; Wright & Stone, 1979; Yen, 1981).
• Comparisons of true and estimated item and ability parameters using computer simulation methods (Hambleton & Cook, 1983).
• Investigations of model robustness using computer simulation methods. For example, the implications of fitting one-dimensional IRT models to multidimensional data can be studied (Ansley & Forsyth, 1985; Drasgow & Parsons, 1983).
Methods of studying these assumptions are summarized in Table 4.2. Regarding the assumption of unidimensionality, Hattie (1985) provided a comprehensive review of 88 indices for assessing unidimensionality and concluded that many of the methods in the older psychometric literature are unsatisfactory; methods based on nonlinear factor analysis and the analysis of residuals are the most successful. The methods described in Table 4.2 for assessing unidimensionality appear to be among the most promising at this time. Considerable research remains to be conducted on this topic, however.

The checks of other model assumptions are more straightforward. The methods in Table 4.2 use descriptive evidence provided by classical item statistics, but they still can be informative. For example, in an analysis of NAEP mathematics items (Hambleton & Swaminathan, 1985), it was learned that the item biserial correlations ranged from 0.02
",I 1\,1.11''-1'111('1/1
or
Mllt/d·[)(}/(!
.'i9
FII
to 0.97! This information indicated that it was highly unlikely that a one-parameter model would fit the test data.
Checking Invariance
The invariance of model parameters can be assessed by means of several straightforward methods. Two of these methods are highlighted in this section.

The invariance of ability parameters can be studied by administering examinees two (or more) item sets in which the items in each set vary widely in difficulty. The item sets are constructed from the pool of test items over which ability is defined (Wright, 1968). It is common to conduct this type of study by administering both sets of test items to examinees within the same test. Ability estimates are obtained for each examinee, one from each set of items. Then the pairs of ability estimates are plotted on a graph. This plot should define a straight line with a slope of 1 because the expected ability score for each examinee does not depend on the choice of test items (provided the item response model under investigation fits the test data). Some scatter of points about the line is to be expected, however, because of measurement error. When a linear relationship with a slope of 1 and an intercept of 0 is not obtained, or the scatter exceeds that expected from knowledge of the standard errors of the ability estimates, one or more of the assumptions underlying the item response model may not hold for the data set.
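The following sketch (Python; an illustrative addition, with simulated rather than real data) carries out this check: abilities are estimated separately from an easy half and a hard half of a simulated test, and the pairs of estimates are compared.

```python
import numpy as np

D = 1.7
rng = np.random.default_rng(2)
n_items, n_exam = 40, 500
a = rng.uniform(0.8, 1.4, n_items)
b = np.sort(rng.uniform(-2.0, 2.0, n_items))   # sorted so the halves differ in difficulty
theta = rng.standard_normal(n_exam)
p = 1.0 / (1.0 + np.exp(-D * a * (theta[:, None] - b)))
u = (rng.uniform(size=p.shape) < p).astype(int)

grid = np.linspace(-4, 4, 161)

def mle(resp, a_s, b_s):
    """Grid-search MLE of ability from one subset of items."""
    pr = 1.0 / (1.0 + np.exp(-D * a_s * (grid[:, None] - b_s)))
    ll = (resp * np.log(pr) + (1 - resp) * np.log(1 - pr)).sum(axis=1)
    return grid[np.argmax(ll)]

easy, hard = np.arange(20), np.arange(20, 40)
est_easy = np.array([mle(u[i, easy], a[easy], b[easy]) for i in range(n_exam)])
est_hard = np.array([mle(u[i, hard], a[hard], b[hard]) for i in range(n_exam)])
# Plot est_easy against est_hard; invariance implies a line with slope near 1
print(np.corrcoef(est_easy, est_hard)[0, 1])
```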
Checking Model Predictions

Several methods of checking model predictions are described in Table 4.2. One of the most promising of these methods involves the analysis of item residuals. In this method, an item response model is chosen, item and ability parameters are estimated, and predictions about the performance of various ability groups are made, assuming the validity of the chosen model. Predicted results are compared then with actual results (see, for example, Hambleton & Swaminathan, 1985; Kingston & Dorans, 1985).

A residual rᵢⱼ (sometimes called a raw residual) is the difference between observed item performance for a subgroup of examinees and the subgroup's expected item performance:
r_{ij} = P_{ij} - E(P_{ij})
where i denotes the item, j denotes the ability category (subgroup), Pᵢⱼ is the observed proportion of correct responses on item i in the jth ability category, and E(Pᵢⱼ) is the expected proportion of correct responses obtained using the hypothesized item response model. The parameters of the hypothesized model are estimated, and the estimates are used to calculate the probability of a correct response. This probability is taken as the expected proportion correct for the ability category.

In practice, the ability continuum usually is divided into intervals of equal width (10 to 15) for the purpose of computing residuals. The intervals should be wide enough that the number of examinees in each interval is not too small, since statistics may be unstable in small samples. On the other hand, the intervals should be narrow enough that the examinees within each category are homogeneous in terms of ability. The observed proportion correct is obtained by counting the number of examinees in an ability category who got the item right and dividing by the number of examinees in the category. To determine the expected proportion correct in an ability category, a θ-value is needed. One approach is to use the midpoint of the ability category as a representative ability value for the category and to compute the probability of a correct response using this value. Alternatively, the probability of a correct response for each examinee within the ability category can be obtained, and the average of these probabilities can be used as the expected proportion.

A limitation of the raw residual is that it does not take into account the sampling error associated with the expected proportion-correct score within an ability category. To take this sampling error into account, the standardized residual zᵢⱼ is computed by dividing the raw residual by the standard error of the expected proportion correct, that is,

z_{ij} = \frac{P_{ij} - E(P_{ij})}{\sqrt{E(P_{ij})\,[1 - E(P_{ij})]/N_j}}
where N_j is the number of examinees in ability category j. When choosing an IRT model, a study of residuals, standardized residuals (residuals divided by their standard errors), or both, obtained
for several models, can provide valuable information, as will be demonstrated in the next section. Statistical tests, usually chi-square tests, also are applied to determine model-data fit. An extensive review of goodness-of-fit statistics is provided by Traub and Lam (1985) and Traub and Wolfe (1981). The Q1 chi-square statistic (Yen, 1981) is typical of the chi-square statistics proposed by researchers for addressing model fit. The Q1 statistic for item i is given as
Q_1i = Σ_{j=1}^{m} N_j [P_ij − E(P_ij)]² / {E(P_ij)[1 − E(P_ij)]}   [4.1]
where examinees are divided into m ability categories on the basis of their ability estimates. P_ij and E(P_ij) were defined earlier. The statistic Q_1 is distributed as a chi-square with degrees of freedom equal to m − k, where k is the number of parameters in the IRT model. If the observed value of the statistic exceeds the critical value (obtained from the chi-square table), the null hypothesis that the ICC fits the data is rejected and a better fitting model must be found.
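A minimal sketch of these computations follows. The item parameters, category counts, and observed proportions below are hypothetical, and the function name three_pl is ours; the sketch simply applies the raw residual, standardized residual, and Q1 formulas given above.

```python
import numpy as np

def three_pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (Equation 2.3)."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# Hypothetical data for one item: examinees grouped into ability categories.
theta_mid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # interval midpoints
n_j       = np.array([200, 350, 400, 350, 200])      # examinees per category
p_obs     = np.array([0.28, 0.35, 0.55, 0.74, 0.88]) # observed proportions

p_exp = three_pl(theta_mid, a=1.2, b=0.0, c=0.2)     # model-based expectations

raw_residual = p_obs - p_exp                                       # r_ij
std_residual = raw_residual / np.sqrt(p_exp * (1 - p_exp) / n_j)   # z_ij

# Q1 statistic (Equation 4.1), chi-square with m - k degrees of freedom.
Q1 = np.sum(n_j * raw_residual**2 / (p_exp * (1 - p_exp)))
print(std_residual.round(2), round(Q1, 2))
```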
Examples of Goodness-of-Fit Analyses

The purpose of this section is to provide an example of procedures for investigating model-data fit using 75 items from the 1982 version of the New Mexico High School Proficiency Test. The items on this test are multiple-choice items with four choices, and we had access to the item responses of 2000 examinees. Normally, the first steps in the investigation would be as follows:

1. Conduct a classical item analysis.
2. Determine the dominance of the first factor, and check other IRT model assumptions.
3. Make a preliminary selection of promising IRT models.
4. Obtain item and ability parameter estimates for the models of interest.
The results of the item analysis are reported in Appendix A. If we had found substantial variation in the item point-biserial correlations, our interest in the one-parameter model would have been low. If all of the items were relatively easy, or if the test had consisted of short free-response items, we probably would not have worked with the three-parameter model, at least at the outset. The item analysis reveals that the variation in item difficulties and discrimination indices is substantial and, therefore, the one-parameter model may not be appropriate. Nevertheless, for illustrative purposes, we will fit the one-, two-, and three-parameter models and compare the results. In general, comparisons of the fits of different models will facilitate the choice of an appropriate model.

Figure 4.1 clearly shows the dominance of the first factor. The largest eigenvalue of the correlation matrix for the 75 items is over five times larger than the second largest, and the second largest eigenvalue is hardly distinguishable from the smaller ones. Had the plot of eigenvalues produced a less conclusive result, the method of Drasgow and Lissak (1983) should have been used. In this method the plot of eigenvalues resulting from a correlation matrix derived from (uncorrelated) normal deviates is obtained and is used to provide an indication of the eigenvalues that result from chance factors alone. This plot serves as a baseline for interpreting eigenvalues and (ultimately) the dimensionality of the real data.

Appendix A contains the item parameter estimates obtained from fitting one-, two-, and three-parameter logistic models. These statistics were obtained by using LOGIST and scaling the ability scores to a mean of 0 and a standard deviation of 1. The next activity was to investigate the invariance of the item parameters for the three-parameter model. (Similar analyses were carried out with the one- and two-parameter models but are not reported here.) The sample of 2000 examinees was split into two randomly equivalent groups of 1000. In a second split, two ability groups were formed: the top half of the distribution and the bottom half of the distribution. A total of four groups of 1000 was available for subsequent analyses. Through use of the ability scores obtained from the total-group analysis, four three-parameter-model LOGIST analyses were conducted, one with each group, to obtain item parameter estimates. Figure 2.6 provides the baseline information for the b parameter. This figure provides an indication of the variability that could be expected in the item parameter estimates due to the use of randomly equivalent groups of size n = 1000, that is, due to sampling error.
Figure 4.1. Plot of the Largest 15 Eigenvalues

If item difficulty parameter invariance has been obtained, a scatterplot similar to that shown in Figure 2.6 should be obtained from the high- and low-performing groups of examinees. In fact, Figures 2.6 and 4.2 are quite similar, indicating that item parameter invariance is present. What also is revealed in Figure 4.2 is that item parameters for easy items are not well estimated in the high-performing group, nor are those for the hard items in the low-performing group, as demonstrated by the "dumbbell" shaped scatterplots. The implications for parameter estimation are clear: heterogeneous samples of examinees are needed to obtain stable item parameter estimates (see, for example, Stocking, 1990).

Invariance of ability parameters across different samples of items was investigated next. Invariance of ability parameters over randomly equivalent forms (e.g., ability estimates based on examinee performance on the odd-numbered items and on the even-numbered items) indicates the variability due to the sampling of test items. A more rigorous test of invariance would be a comparison of ability estimates over (say) tests consisting of the easiest and hardest items in the item bank.
Figure 4.2. Plot of 3P Item Difficulty Values Based on Samples of Differing Ability
Figures 4.3 and 4.4 provide comparisons between ability estimates obtained with the randomly equivalent subtests and the hard versus easy subtests for the three-parameter model. Item parameter estimates used in the calculation of ability scores were obtained from the total sample (N = 2000) and are reported in the Appendix. These comparisons provide evidence of the invariance of ability parameters over tests of varying difficulty (note that the two plots are similar and scattered about the line with slope 1). These plots also show the generally large errors in ability estimation for low- and high-ability examinees (Figure 4.3) and even larger errors in ability estimation for low-ability examinees on hard tests and for high-ability examinees on easy tests (Figure 4.4). These findings may have more to say about improper test design than parameter invariance, however. Based on the plots, it appears that item and ability parameter invariance was obtained with the three-parameter model. The plots also indicate that satisfactory ability estimation requires that examinees be administered test items that are matched with their ability levels, and that satisfactory item parameter estimation requires heterogeneous ability distributions.
Figure 4.3. Plot of 3P Ability Estimates Based on Equivalent Test Halves (Odd vs. Even Items)
Figure 4.4. Plot of 3P Ability Estimates Based on Tests of Differing Difficulty (Hardest vs. Easiest Items)
Figure 4.5. Observed and Expected Proportion Correct (1P Model) for Item 6
Perhaps the most valuable goodness-of-fit data of all are provided by residuals (and/or standardized residuals). Normally, these are best interpreted with the aid of graphs. Figures 4.5 to 4.7 provide the residuals (computed in 12 equally spaced ability categories between −3.0 and +3.0) obtained with the one-, two-, and three-parameter models, respectively, for Item 6. The best fitting ICCs using the item parameter estimates given in the Appendix also appear in the figures. When the residuals are small and randomly distributed about the ICC, we can conclude that the ICC fits the item performance data. Figure 4.5 clearly shows that the one-parameter model does not match examinee performance data at the low and high ends of the ability scale. The fit is improved with the two-parameter model (Figure 4.6) because the discrimination parameter adjusts the slope of the ICC. The fit is further improved with the three-parameter model (Figure 4.7) because the c parameter takes into account the performance of the low-ability examinees. An analysis of residuals, as reflected in Figures 4.5 to 4.7, is helpful in judging model-data fit. Figures 4.8 to 4.10 show the plots of standardized residuals against ability levels obtained with the one-, two-, and three-parameter models, respectively, for Item 6.
Figure 4.6. Observed and Expected Proportion Correct (2P Model) for Item 6
Figure 4.7. Observed and Expected Proportion Correct (3P Model) for Item 6
Figure 4.8. 1P Standardized Residuals for Item 6
Figure 4.9. 2P Standardized Residuals for Item 6
Figure 4.10. 3P Standardized Residuals for Item 6

The observed pattern of standardized residuals shown in Figure 4.8 is due to the fact that the item is less discriminating than the average level of discrimination adopted for all items in the one-parameter model. Clear improvements are gained by using the two-parameter model. The gains from using the three-parameter over the two-parameter model are much smaller but noticeable. Item 6 was selected for emphasis because of its pedagogical value, but in general the two- and three-parameter models fit data for the 75 test items better than the one-parameter model.

With 12 ability categories and a 75-item test, 900 standardized residuals were available for analysis. The expected distribution of standardized residuals under the null hypothesis that the model fits the test data is unknown, although one might expect the distribution of standardized residuals to be (approximately) normal with mean 0 and standard deviation 1. Rather than make the assumption of a normal distribution, however, it is possible to use computer simulation methods to generate a distribution of standardized residuals under the null hypothesis that the model fits the data, and use this distribution as a basis for interpreting the actual distribution.
Figure 4.11. Frequency Distributions of Standardized Residuals for 1P Real and Simulated Data
To generate the distribution of standardized residuals for the one-parameter model when the model fits the test data, item and ability parameter estimates for the model (reported in the Appendix) are assumed to be true. Item response data then can be generated (Hambleton & Rovinelli, 1973) using these parameter values, and a one-parameter model fitted to the data. Standardized residuals are obtained, and the distribution of standardized residuals is formed. This distribution serves as the empirically generated "sampling distribution" of standardized residuals under the null hypothesis that the model fits the data. This distribution serves as the baseline for interpreting the distribution of standardized residuals obtained with the real test data. In Figure 4.11, the real and simulated distributions of standardized residuals for the one-parameter model are very different. The simulated data were distributed normally; the real data were distributed more uniformly. Clearly, since the distributions are very different, the one-parameter model does not fit the data. Figures 4.12 and 4.13 show the real and simulated distributions of standardized residuals obtained with the two- and three-parameter models, respectively. The evidence is clear that substantial improvements in fit are obtained with the more general models, with the three-parameter model fitting the data very well. The real and simulated distributions for the three-parameter model are nearly identical.
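A rough sketch of this baseline simulation is given below. The item difficulties and ability values are randomly generated stand-ins for the Appendix estimates (in the actual analyses the LOGIST estimates would be used), and for simplicity the expected proportions are computed at interval midpoints rather than from a refitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

def rasch(theta, b):
    """One-parameter model with D omitted."""
    return 1 / (1 + np.exp(-(theta - b)))

# Stand-ins for the estimated parameters, treated as true values.
b_hat = rng.uniform(-2, 2, 75)
theta_hat = rng.normal(0, 1, 2000)

p = rasch(theta_hat[:, None], b_hat[None, :])   # 2000 x 75 probabilities
u = (rng.random(p.shape) < p).astype(int)       # simulated 0/1 responses

# Standardized residuals in 12 equal-width ability categories.
edges = np.linspace(-3, 3, 13)
z = []
for lo, hi in zip(edges[:-1], edges[1:]):
    in_cat = (theta_hat >= lo) & (theta_hat < hi)
    if in_cat.sum() == 0:
        continue
    p_obs = u[in_cat].mean(axis=0)              # observed proportion per item
    p_exp = rasch((lo + hi) / 2, b_hat)         # expected at interval midpoint
    z.append((p_obs - p_exp) / np.sqrt(p_exp * (1 - p_exp) / in_cat.sum()))

z = np.concatenate(z)   # empirical null distribution of standardized residuals
print(z.mean().round(2), z.std().round(2))
```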
Figure 4.12. Frequency Distributions of Standardized Residuals for 2P Real and Simulated Data
Figure 4.13. Frequency Distributions of Standardized Residuals for 3P Real and Simulated Data
Figure 4.14. Plot of 1P Average Absolute Standardized Residuals Against Point-Biserial Correlations
Other types of goodness-of-fit evidence also can be obtained. Figures 4.14, 4.15, and 4.16 show the relationship between item misfit statistics and item point-biserial correlations for the one-, two-, and three-parameter models, respectively. In this analysis, item misfit was determined by averaging the absolute values of standardized residuals obtained after fitting the model of interest to the item data. Figure 4.14 shows the inadequacy of the one-parameter model in fitting items with high or low discrimination indices. Figure 4.15 shows that the pattern of item misfit changes substantially with the two-parameter model. Figure 4.16, for the three-parameter model, is similar to Figure 4.15 except that the sizes of the item misfit statistics are generally a bit smaller. The complete set of analyses described here (and others that were not described because of space limitations) are helpful in choosing an IRT model. For these test data, evidence was found that the test was unidimensional and that the fit of the three-parameter model was very good, substantially better than that of the one-parameter model and somewhat better than that of the two-parameter model. Many of the baseline results were especially helpful in judging model-data fit.
Figure 4.15. Plot of 2P Average Absolute Standardized Residuals Against Point-Biserial Correlations
Figure 4.16. Plot of 3P Average Absolute Standardized Residuals Against Point-Biserial Correlations
Summary
In assessing model-data fit, the best approach involves (a) designing and conducting a variety of analyses designed to detect expected types of misfit, (b) considering the full set of results carefully, and (c) making a judgment about the suitability of the model for the intended application. Analyses should include investigations of model assumptions, of the extent to which desired model features are obtained, and of differences between model predictions and actual data. Statistical tests may be carried out, but care must be taken in interpreting the statistical information. The number of investigations that may be conducted is almost limitless. The amount of effort and money expended in collecting, analyzing, and interpreting results should be consistent with the importance and nature of the intended application.
Exercises for Chapter 4

1. Suppose that a three-parameter model was fitted to a set of test data. The item parameter estimates for a particular item were a = 1.23, b = 0.76, c = 0.25. In order to assess the fit of the model to this item, the examinees were divided into five ability groups on the basis of their ability estimates, with 20 examinees in each group. The item responses for the examinees in each ability group are summarized in Table 4.3.
TABLE 4.3

θ Level    Number of Correct Responses (of 20)
−2.0        4
−1.0        5
 0.0        8
 1.0       15
 2.0       18
a. Calculate the observed proportion correct at each ability level.
b. Calculate the expected proportion correct at each ability level (using the parameter estimates given).
c. Calculate the Q1 goodness-of-fit statistic for this item. What are the degrees of freedom for the chi-square test?
d. Does the three-parameter model appear to fit the data for this item?
2. Suppose that one- and two-parameter models also were fitted to the data. The item parameter estimates for the two models are given below:

One-parameter model: b = 0.17
Two-parameter model: b = 0.18, a = 0.59

a. Calculate the Q1 statistics for assessing the fit of the one- and two-parameter models (assume that the ability intervals are the same).
b. Does the one- or two-parameter model appear to fit the data?
c. Based on your results, which IRT model appears to be most appropriate for the data given?
Answers to Exercises for Chapter 4

1. a. θ = −2: p = 0.20; θ = −1: p = 0.25; θ = 0: p = 0.40; θ = 1: p = 0.75; θ = 2: p = 0.90.
b. P(θ = −2) = 0.25; P(θ = −1) = 0.27; P(θ = 0) = 0.38; P(θ = 1) = 0.72; P(θ = 2) = 0.95.
c. Q_1 = Σ_{j=1}^{5} N_j [p_j − E(P_j)]² / {E(P_j)[1 − E(P_j)]}
       = 20(0.20 − 0.25)²/(0.25 × 0.75) + 20(0.25 − 0.27)²/(0.27 × 0.73) + 20(0.40 − 0.38)²/(0.38 × 0.62) + 20(0.75 − 0.72)²/(0.72 × 0.28) + 20(0.90 − 0.95)²/(0.95 × 0.05)
       = 1.48; degrees of freedom = 5 − 3 = 2.
d. χ²₂(0.05) = 5.99. Since the calculated value does not exceed the tabulated value, we can conclude that the three-parameter model fits the data for this item.

2. a. Q_1 = Σ_{j=1}^{5} N_j [p_j − E(P_j)]² / {E(P_j)[1 − E(P_j)]}
For the one-parameter model,

Q_1 = 20(0.20 − 0.02)²/(0.02 × 0.98) + 20(0.25 − 0.12)²/(0.12 × 0.88) + 20(0.40 − 0.43)²/(0.43 × 0.57) + 20(0.75 − 0.80)²/(0.80 × 0.20) + 20(0.90 − 0.96)²/(0.96 × 0.04) = 38.52

For the two-parameter model,

Q_1 = 20(0.20 − 0.11)²/(0.11 × 0.89) + 20(0.25 − 0.25)²/(0.25 × 0.75) + 20(0.40 − 0.46)²/(0.46 × 0.54) + … = 2.67
b. The one-parameter model does not fit the data, but the two-parameter model does.
c. While the three-parameter model fits the data better than the two-parameter model, the two-parameter model fits the data almost as well. In the interests of parsimony, the model of choice would be the two-parameter model.
5 The Ability Scale
The ultimate purpose of testing is to assign a "score" to an examinee that reflects the examinee's level of attainment of a skill (or level of acquisition of an attribute) as measured by the test. Needless to say, the assigned score must be interpreted with care and must be valid for its intended use (Linn, 1990). In typical achievement measurement, the examinee is assigned a score on the basis of his or her responses to a set of dichotomously scored items. In the classical test theory framework, the assigned score is the number of correct responses. This number-right score is an unbiased estimate of the examinee's true score on the test. In item response theory it is assumed that an examinee has an underlying "ability" θ that determines his or her probability of giving a correct response to an item. Again, based on the responses of the examinee to a set of items, an ability score θ is assigned to an examinee. Unfortunately, θ is not as easily determined as the number-right score. Through use of the methods described in chapter 3, an estimate of ability θ is obtained for an examinee. As we shall see later, the ability θ of an examinee is monotonically related to the examinee's true score (assuming that the item response model fits the data); that is, the relationship between θ and the true score is nonlinear and strictly increasing. The general scheme for determining an examinee's ability is as follows:

1. An examinee's responses to a set of items are obtained and coded as 1 (for correct answers) or 0 (for incorrect answers).
2. When the item parameters that characterize an item are assumed to be known (as happens when item parameters are available for a bank of items), the ability θ is estimated using one of the methods indicated in chapter 3.
3. When the item parameters that characterize the items are not known, the item and ability parameters must be estimated from the same response data, and one of the procedures described in chapter 3 must be employed.
4. The estimated ability value is reported as is, or is transformed using a linear or a nonlinear transformation to a more convenient scale (e.g., without negatives or decimals) to aid in the interpretation of the score. The SAT and NAEP reporting scales are well-known examples of scales obtained by transforming original score scales.
At all stages of analysis and interpretation, the issue of the validity of the ability score must be considered (Linn, 1990). Every attempt must be made to validate the ability score. The validity information available for the number-right score or any transformation of it may not be relevant or appropriate for the ability score and, hence, a validity study specifically designed for the ability score may be needed. Refer to Hambleton and Swaminathan (1985, chapter 4) for more details on this important issue. What is the nature of the ability score? On what scale is it measured? What transformations are meaningful? These important issues are discussed next.
The Nature of the Ability Scale

As mentioned above, the number-right score, denoted as X, is an unbiased estimate of the true score, τ. By definition, E(X) = τ.
The number-right score, X, may be divided by the number of items (i.e., linearly transformed) to yield a proportion-correct score. The proportion-correct score is meaningful and appropriate when the test is subdivided into subtests, each with different numbers of items, measuring (say) a number of different objectives. This is usually the practice with criterion-referenced tests. When the test is norm-referenced, other linear transformations may be used to yield standard scores. In addition, when it is necessary to compare examinees, the score X may be transformed nonlinearly to yield stanines, percentiles, and so on. While the above transformations greatly facilitate the interpretation of the score X, its major drawback remains: the score X is not independent of the items to which the examinee responds, and the transformed scores are not independent of the group of examinees to which they are referenced. The ability score θ, on the other hand, possesses such independence. As described previously, θ is independent of the particular set of items administered to the examinees, and of the population to which the examinee belongs. This invariance property is what distinguishes the θ score from the score X. Since it is possible to compare examinees who respond to different sets of items when using the θ score, the θ scale may be thought of as an absolute scale with respect to the trait or ability that is being measured. It is important, at this point, to discuss the nature or meaning of the term ability or trait. Clearly, these are labels that describe what the set of test items measures.
Transformation of the θ-Scale

In item response models, the probability of a correct response is given by the item response function, P(θ). If, in Equations 2.2 or 2.3, θ is replaced by θ* = αθ + β, b by b* = αb + β, and a by a* = a/α, then

P(θ*) = P(θ)

Thus, θ, b, and a may be transformed linearly without altering the probability of a correct response (the implications of this "indeterminacy" will be discussed further in later chapters), meaning that
the θ-scale may be transformed linearly as long as the item parameter values also are transformed appropriately. Recall that θ is defined in the interval (−∞, ∞). Woodcock (1978), in defining the scale for the Woodcock-Johnson Psycho-Educational Battery, employed the one-parameter model and the scale

W_θ = 20 log₉(e^θ) + 500

that is, he used a logarithmic scale to the base 9. Since

log₉(e^θ) = θ / ln 9

and

20 / ln 9 = 20 / 2.197 = 9.10

then

W_θ = 9.1θ + 500

Thus, the Woodcock-Johnson scale is a linear scale. The item difficulties were transformed in the same manner,

W_b = 9.1b + 500

The W_θ scale has the property that the differences (W_θ − W_b) = 20, 10, 0, −10, −20 correspond to probabilities of correct responses of 0.90, 0.75, 0.50, 0.25, and 0.10, respectively. Wright (1977) modified this scale as
W = 9.1θ + 100

and termed it the WITs scale.

The transformations of the θ-scale described above are linear. Nonlinear transformations of the θ-scale may be more useful in some cases. Consider the nonlinear transformation

θ* = e^θ
and the corresponding transformation of the difficulty parameter

b* = e^b

Then, for the one-parameter model,

P(θ) = θ* / (b* + θ*)
It is of interest to note that Rasch first developed the one-parameter model using the form given above for the probability of success. The probability of an incorrect response on the θ* scale, Q(θ*) = 1 − P(θ*), is

Q(θ*) = b* / (b* + θ*)
The odds O for success, defined as P(θ*) / Q(θ*), are then

O = θ* / b*
Consider two examinees with abilities θ*₁ and θ*₂ responding to an item, and denote their odds for success as O₁ and O₂. Then

O₁ = θ*₁ / b*   and   O₂ = θ*₂ / b*
The ratio of their odds for success is

O₁ / O₂ = θ*₁ / θ*₂
Thus, an examinee with twice the ability of another examinee, measured on the θ*-scale, has twice the odds of successfully answering the item. In this sense, the θ*-scale has the properties of a ratio scale. The same property also holds for items: for an examinee responding to two items with difficulty values b*₁ and b*₂ (measured on the b*-scale), the odds for success are O₁ = θ* / b*₁ and O₂ = θ* / b*₂. The ratio of the odds for the examinee is

O₁ / O₂ = b*₂ / b*₁

If b*₂ = 2b*₁ (i.e., the first item is twice as easy as the second item), the odds for successfully answering the easy item are twice those for successfully answering the harder item. The ratio-scale property for the θ*- and b*-scales, as defined above, holds only for the one-parameter model. For the two- and three-parameter models the scale must be defined differently (see Hambleton & Swaminathan, 1985).

Another nonlinear transformation that is meaningful for the one-parameter model is the "log-odds" transformation. Since, for two examinees responding to the same item,
O₁ / O₂ = e^{D(θ₁ − b)} / e^{D(θ₂ − b)} = e^{D(θ₁ − θ₂)}

so that

ln(O₁ / O₂) = D(θ₁ − θ₂)

where ln is the natural logarithm (to the base e). Typically, in the one-parameter model, D is omitted, so that

P(θ) = e^{(θ − b)} / [1 + e^{(θ − b)}]
Omitting D in "log-odds" expressions, we have

ln(O₁ / O₂) = θ₁ − θ₂

If abilities differ by one point, then

O₁ / O₂ = e¹ = 2.718

Thus, a difference of one point on the ability scale corresponds to a factor of 2.72 in the odds for success on the θ-scale. Similarly, if an examinee responds to two items with difficulty values b₁ and b₂,

ln(O₁ / O₂) = b₂ − b₁
As before, a difference of one unit in item difficulties corresponds to a factor of 2.72 in the odds for success. The units on the log-odds scale are called logits. The logit units can be obtained directly as follows: Since

P(θ) = e^{(θ − b)} / [1 + e^{(θ − b)}]   and   Q(θ) = 1 / [1 + e^{(θ − b)}]

thus

P(θ) / Q(θ) = e^{(θ − b)}

Hence

ln[P(θ) / Q(θ)] = θ − b
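The following fragment illustrates these log-odds relationships numerically. The ability and difficulty values are arbitrary, chosen only to show that a one-point ability difference multiplies the odds by e and shifts the logit by exactly one.

```python
import numpy as np

def p_rasch(theta, b):
    """One-parameter model with D omitted: P = e^(theta - b) / [1 + e^(theta - b)]."""
    return 1 / (1 + np.exp(-(theta - b)))

theta1, theta2, b = 1.0, 0.0, 0.5            # arbitrary illustrative values
o1 = p_rasch(theta1, b) / (1 - p_rasch(theta1, b))   # odds for success
o2 = p_rasch(theta2, b) / (1 - p_rasch(theta2, b))

print(o1 / o2)                   # 2.718...: one ability point = factor e in odds
print(np.log(o1) - np.log(o2))   # 1.0: logit difference equals theta1 - theta2
```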
Transformation to the True-Score Scale

The most important transformation of the θ-scale is to the true-score scale. Let X, the number-right score, be defined as

X = Σ_{j=1}^{n} U_j

where U_j is the 1 or 0 response of an examinee to item j. If we denote the true score by τ, then

τ = E(X) = E(Σ_{j=1}^{n} U_j)

By the linear nature of the expectation operator,

E(Σ_{j=1}^{n} U_j) = Σ_{j=1}^{n} E(U_j)

Now, if a random variable Y takes on values y₁ and y₂ with probabilities P₁ and P₂, then

E(Y) = y₁P₁ + y₂P₂

Since U_j takes on value 1 with probability P_j(θ) and value 0 with probability Q_j(θ) = 1 − P_j(θ), it follows that

E(U_j) = (1)P_j(θ) + (0)Q_j(θ) = P_j(θ)

Thus,

τ = Σ_{j=1}^{n} P_j(θ)

that is, the true score of an examinee with ability θ is the sum of the item characteristic curves. The true score τ, in this case, is called the
test characteristic curve because it is the sum of the item characteristic curves. In the strict sense, the above relationship holds only when the item response model fits the data. To emphasize this, the true score τ is indicated as τ|θ; that is,

τ|θ = Σ_{j=1}^{n} P_j(θ)

When no ambiguity exists, the notation τ|θ will be shortened to τ. The true score τ and θ are monotonically related; that is, the true score may be considered to be a nonlinear transformation of θ. Since P_j(θ) is between 0 and 1, τ is between 0 and n. Hence, τ is on the same scale as the number-right score, except that τ can assume non-integer as well as integer values. The transformation from θ to τ is useful in reporting ability values; instead of the θ values, τ values that lie in the range 0 to n are reported. Alternatively, π, the true proportion correct or domain score, obtained by dividing τ by the number of items n, can be reported. In this case,

π = (1/n) Σ_{j=1}^{n} P_j(θ)
While −∞ < θ < ∞, π lies between 0 and 1 (or, in terms of percentages, between 0% and 100%). The lower limit for π for the one- and two-parameter models is zero. For the three-parameter model, however, as θ approaches −∞, P_j(θ) approaches c_j, the lower asymptote. Thus, the lower limit for π is Σ c_j / n. Correspondingly, the lower limit for τ is Σ c_j.

The transformation of θ to the true score or the domain score has important implications. First, and most obvious, is that negative scores are eliminated. Second, the transformation yields a scale that ranges from 0 to n (or 0% to 100% if the domain score is used), which is readily interpretable. When pass-fail decisions must be made, it is often difficult to set a cut-off score on the θ-scale. Since the domain-score scale is familiar, a cut-off score (such as 80% mastery) is typically set on the domain-score scale. The domain score is plotted against θ, and the θ value corresponding to the domain-score value is identified as the cut-off score on the θ-scale (see, for example, Hambleton & de Gruijter, 1983). Alternatively, all θ values can be converted to domain-score values and the pass-fail decision made with respect to the domain-score scale.
TABLE 5.1 Item Parameters for Five Test Items

Item      b        a        c
1       −2.00     0.80     0.00
2       −1.00     1.00     0.00
3        0.00     1.20     0.10
4        1.00     1.50     0.15
5        2.00     2.00     0.20
To illustrate the conversion of θ values to domain-score values, we shall consider a test with five items. The item parameter values for these items are given in Table 5.1. For each item:

1. The probability of a correct response is computed at θ = −3, −2, −1, 0, 1, 2, 3 using the three-parameter model (Equation 2.3).
2. These probabilities are summed over the five items at each of the θ values to yield

τ|θ = Σ_{j=1}^{5} P_j(θ)
3. The domain score π is obtained at each θ value by dividing the sum in the equation above by 5.
4. The resulting relationship between π and θ at the θ values is tabulated.

We now have the functional relationship between π and θ at θ = −3, −2, −1, 0, 1, 2, 3. This is a monotonically increasing relationship and can be plotted as a graph. The computations are given in Table 5.2.
TABLE 5.2 Relationship Between θ and π

θ      P₁(θ)   P₂(θ)   P₃(θ)   P₄(θ)   P₅(θ)   τ = ΣP_j(θ)   π = τ/n
−3     0.20    0.03    0.10    0.15    0.20    0.68          0.14
−2     0.50    0.15    0.11    0.15    0.20    1.12          0.22
−1     0.80    0.50    0.20    0.16    0.20    1.85          0.37
 0     0.94    0.85    0.55    0.21    0.20    2.75          0.55
 1     0.98    0.97    0.90    0.58    0.22    3.65          0.73
 2     0.99    0.99    0.99    0.94    0.60    4.51          0.90
 3     1.00    1.00    1.00    1.00    0.96    4.96          0.99
The final implication of the θ to τ (or θ to π) conversion is that the true score τ (or π) of an examinee whose ability value is known (or estimated) can be computed on a set of items not administered to the examinee! When the item parameters for a set of items are given, an examinee's true score can be computed as long as his or her θ value is known, as indicated in the illustration, allowing the projection or prediction of an examinee's true score or pass-fail status on a new set of items. This feature is used in "customized testing" (Linn & Hambleton, 1990). The fact that an examinee's true score can be computed on any set of items also is used in the procedure for determining the scaling constants for placing the item parameters of two tests on a common scale (see chapter 9 on equating).
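The conversion in Table 5.2 is easily reproduced. A minimal sketch follows, evaluating the three-parameter model (Equation 2.3) at each θ for the Table 5.1 item parameters; the function name three_pl is ours.

```python
import numpy as np

def three_pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (Equation 2.3)."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# Item parameters from Table 5.1.
a = np.array([0.80, 1.00, 1.20, 1.50, 2.00])
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
c = np.array([0.00, 0.00, 0.10, 0.15, 0.20])

for theta in range(-3, 4):
    p = three_pl(theta, a, b, c)
    tau = p.sum()        # true score: the test characteristic curve at theta
    pi = tau / len(p)    # domain score
    print(f"theta={theta:+d}  tau={tau:.2f}  pi={pi:.2f}")
```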
Summary

The ability θ of an examinee, together with the item parameter values, determines the examinee's probability of responding correctly to an item. Based on the examinee's responses to a set of items, an ability score may be assigned to the examinee. The most important feature of the θ score that distinguishes it from the number-right score is that θ does not depend on the particular set of items administered to the examinee. Examinees who are administered different sets of items can be compared with respect to their θ values. In this sense, the θ-scale may be considered an absolute scale. The θ values may be transformed linearly to facilitate interpretation. The θ-scale, or any linear transformation of it, however, does not possess the properties of ratio or interval scales, although it is popular and reasonable to assume that the θ-scale has equal-interval scale properties. In some instances, however, a nonlinear transformation of the θ-scale may provide a ratio-scale type of interpretation. The transformations θ* = e^θ and b* = e^b for the one-parameter model provide a ratio-scale interpretation for the odds for success. The "log-odds" transformation also enables such interpretations. For the two- and three-parameter models such simple transformations are not available.

The most important nonlinear transformation of the θ-scale is the transformation that yields the true-score scale. When the item response model fits the data, the true score is the sum of the item characteristic curves evaluated at a specified value of θ. Because it is the sum of item characteristic curves, the ability to true-score conversion is also known as the test characteristic curve. The true score is on the same metric as the number-right score. If desired, the true score may be converted to the domain score (or proportion-correct score) by dividing the true score by the number of items. The true score or the proportion-correct score has intuitive appeal and, hence, is often employed to set cut-off scores for making mastery-nonmastery decisions. The true score or the domain score can be computed for any set of items (including those not taken by the examinee) as long as the examinee's ability and the item parameters are known. Such "predictions" of an examinee's true score on a set of "new" items may provide valuable information regarding the use or inclusion of these items in a test.
Exercises for Chapter 5

1. Suppose that ability estimates for a group of examinees on a test are in the range (−4, 4).
a. What linear transformation is appropriate to produce a scale on which the scores are positive integers (assuming that θ is obtained to two decimal places)?
b. What nonlinear transformation is appropriate for producing scores that range from 0 to 100?
2. In Table 5.2, the relationship between θ and π is tabulated.
a. Plot a graph of π (on the y-axis) against θ (on the x-axis). What can you say about the shape of the curve?
b. Suppose that only students who have answered at least 80% of the items correctly are considered "masters"; that is, the cut-off score is set at π = 0.80. If an examinee has an ability score of 1.2, may this examinee be considered a master?
c. What is the θ value that corresponds to a cut-off score of π = 0.80?
3. An examinee has an ability score of θ = 1.5 as determined by his or her performance on a test.
a. What is the examinee's true score on the five-item test with item parameters given in Table 5.1?
b. An examinee must answer at least four items on this test correctly to be considered a master. Would this examinee be considered a master?
c. What is the cut-off score in part b on the θ scale?
4. For a two-parameter model,
a. Show that the odds for success, O, is

O = e^{Da(θ − b)}

b. The odds for success on an item for two examinees with abilities θ₁ and θ₂ are O₁ and O₂, respectively. Show that the odds ratio for the two examinees is

O₁ / O₂ = e^{Da(θ₁ − θ₂)}

c. If the abilities of the examinees differ by one unit, what is the value of the odds ratio? What is the log of the odds ratio?
d. What is the value of the odds ratio and the log of the odds ratio if the examinees differ by k units?
Answers to Exercises for Chapter 5

1. a. y = 100(θ + 4).
b. y = (100/n) Σ_{j=1}^{n} P_j(θ)
2. a. π is a monotonically increasing function of θ. It is bounded between 0.09 and 1. In fact, π(θ) looks like an item characteristic curve.
b. The graph shows that the examinee with ability 1.2 has a domain score less than 0.8. Hence, the examinee may not be considered a master.
c. From the graph, π = 0.8 corresponds to θ = 1.45.
3. a. τ = Σ P_j(θ = 1.5) = 4.5.
b. Yes.
c. θ ≈ 1.
4. a. Q(θ) = 1 / [1 + e^{Da(θ − b)}]. Hence, O = P/Q = e^{Da(θ − b)}.
b. O₁ = e^{Da(θ₁ − b)} and O₂ = e^{Da(θ₂ − b)}. Hence, O₁/O₂ = e^{Da(θ₁ − b)} / e^{Da(θ₂ − b)} = e^{Da(θ₁ − θ₂)}.
c. If θ₁ − θ₂ = 1, then O₁/O₂ = e^{Da}; ln(O₁/O₂) = Da.
d. If θ₁ − θ₂ = k, then O₁/O₂ = e^{Dak}; ln(O₁/O₂) = Dak.
6 Item and Test Information and Efficiency Functions
Basic Concepts

A powerful method of describing items and tests, selecting test items, and comparing tests is provided by item response theory. The method involves the use of item information functions, denoted I_i(θ), where

I_i(θ) = [P′_i(θ)]² / [P_i(θ) Q_i(θ)],   i = 1, 2, . . . , n   [6.1]

I_i(θ) is the "information" provided by item i at θ, P′_i(θ) is the derivative of P_i(θ) with respect to θ, P_i(θ) is the item response function, and Q_i(θ) = 1 − P_i(θ). Equation 6.1 applies to dichotomously scored logistic item response models like those given in Equations 2.1 to 2.3. For the three-parameter model, Equation 6.1 can be written as

I_i(θ) = 2.89 a_i² (1 − c_i) / {[c_i + e^{1.7a_i(θ − b_i)}] [1 + e^{−1.7a_i(θ − b_i)}]²}   [6.2]

From Equation 6.2 it is relatively easy to infer the role of the b, a, and c parameters in the item information function: (a) information is higher when the b value is close to θ than when the b value is far from θ, (b) information is generally higher when the a parameter is high, and (c) information increases as the c parameter goes to zero. Item information functions can play an important role in test development and item evaluation in that they display the contribution items
make to ability estimation at points along the ability continuum. This contribution depends to a great extent on an item's discriminating power (the higher it is, the steeper the slope of P_i), and the location at which this contribution will be realized is dependent on the item's difficulty. Birnbaum (1968) showed that an item provides its maximum information at θ_max, where

θ_max = b_i + [1 / (Da_i)] ln[0.5(1 + √(1 + 8c_i))]   [6.3]
If guessing is minimal, that is, c_i = 0, then θ_max = b_i. In general, when c_i > 0, an item provides its maximum information at an ability level slightly higher than its difficulty. The utility of item information functions in test development and evaluation depends on the fit of the item characteristic curves (ICCs) to the test data. If the fit of the ICCs to the data is poor, then the corresponding item statistics and item information functions will be misleading. Even when the fit is good, an item may have limited value in all tests if the a parameter is low and the c parameter is high. Moreover, the usefulness of test items (or tasks) will depend on the specific needs of the test developer within a given test. An item may provide considerable information at one end of the ability continuum but be of no value if information is needed elsewhere on the ability scale.
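A short sketch of these computations is given below; the item parameters are illustrative assumptions, and the function names are ours. It evaluates the item information in an algebraically equivalent form of Equation 6.2 (this equivalence is derived in the exercises at the end of this chapter) and locates θ_max from Equation 6.3.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Item information for the 3P model, equivalent to Equation 6.2."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def theta_max(a, b, c, D=1.7):
    """Ability at which the item gives maximum information (Equation 6.3)."""
    return b + np.log(0.5 * (1 + np.sqrt(1 + 8 * c))) / (D * a)

a, b, c = 1.2, -0.5, 0.10      # an illustrative item
t = theta_max(a, b, c)
print(round(t, 2), round(info_3pl(t, a, b, c), 2))
```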
Examples

Figure 6.1 shows the item information functions for the six test items presented in Figure 2.4 and Table 2.1. Figure 6.1 highlights several important points:

1. Maximum information provided by an item is at its difficulty level, or slightly higher when c > 0. (This is seen by comparing the point on the ability scale where information is greatest to the b values of the corresponding items.)
2. The item discrimination parameter substantially influences the amount of information for assessing ability that is provided by an item. (This can be seen by comparing the item information functions for Items 1 and 2.)
Figure 6.1. Item Information Functions for Six Typical Test Items

3. When c > 0, other things being equal, an item is less useful for assessing ability. (This can be seen by comparing the item information functions for Items 1 and 3.)
4. An item with low discriminating power is nearly useless statistically in a test (see Item 6).
5. Even the most discriminating items (Items 1 and 4) provide less information for assessing ability in some regions of the ability continuum than do less discriminating items (Item 5). Item 5 would be more useful for assessing abilities of middle-ability examinees than (say) either Item 1 or Item 4.
Clearly, item information functions provide new directions for judging the utility of test items and constructing tests. Because item information functions are generally lower when c > 0 than when c = 0, researchers might be tempted to fit one- or two-parameter models to their test data. The resulting item information functions will be higher; however, the one- and two-parameter item information curves will be useful only when the ICCs from which they are derived fit the test data. The use of ICCs that do not adequately fit the test data, and their corresponding item information curves, is far from
optimal and will give misleading results (see, for example, de Gruijter, 1986).
Test Information Functions

The information function for a test, denoted I(θ) and derived by Birnbaum (1968, chapter 17), is given by

I(θ) = Σ_{i=1}^{n} I_i(θ)   [6.4]
The information provided by a test at θ is simply the sum of the item information functions at θ. From Equation 6.4 it is clear that items contribute independently to the test information function. Thus, the contribution of individual test items can be determined without knowledge of the other items in the test. This feature is not available in classical test theory procedures. The contributions of test items to test reliability and item discrimination indices (e.g., point-biserial correlations) cannot be determined independently of the characteristics of the remaining items in the test. This is true because the test score, which is used in these calculations, is dependent on the particular selection of test items. Changing even one item will have an effect on test scores and, hence, all of the classical item and test indices will change.

The amount of information provided by a test at θ is inversely related to the precision with which ability is estimated at that point:

SE(θ̂) = 1 / √I(θ)   [6.5]

where SE(θ̂) is called the standard error of estimation. This result holds whenever maximum likelihood estimates of θ are obtained. With knowledge of the test information at θ, a confidence band can be found for use in interpreting the ability estimate (see chapter 3). In the framework of IRT, SE(θ̂) serves the same role as the standard error of measurement in classical measurement theory. It is important to note, however, that the value of SE(θ̂) varies with ability level, whereas the classical standard error of measurement does not.
The standard error of θ̂, SE(θ̂), is the standard deviation of the asymptotically normal distribution of the maximum likelihood estimate of ability for a given true value of ability θ. The distribution is normal when the test is long. Even with tests as short as 10 to 20 items, however, the normal approximation is satisfactory for most purposes (Samejima, 1977). The magnitude of the standard error depends, in general, on (a) the number of test items (smaller standard errors are associated with longer tests); (b) the quality of test items (in general, smaller standard errors are associated with highly discriminating items for which the correct answers cannot be obtained by guessing); and (c) the match between item difficulty and examinee ability (smaller standard errors are associated with tests composed of items with difficulty parameters approximately equal to the ability parameter of the examinee, as opposed to tests that are relatively easy or relatively difficult). The size of the standard error quickly stabilizes, so that increasing information beyond a value of (say) 25 has only a small effect on the size of errors in ability estimation (see, for example, Green, Yen, & Burket, 1989).
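To connect Equations 6.4 and 6.5, the sketch below sums item information over a hypothetical ten-item test (the randomly generated parameters are assumptions for illustration only) and converts test information into standard errors at several ability levels.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Item information for the 3P model, equivalent to Equation 6.2."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical ten-item test.
rng = np.random.default_rng(2)
a = rng.uniform(0.8, 2.0, 10)
b = rng.uniform(-2.0, 2.0, 10)
c = np.full(10, 0.2)

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    I = info_3pl(theta, a, b, c).sum()      # Equation 6.4
    print(f"theta={theta:+.0f}  I={I:5.2f}  SE={1 / np.sqrt(I):.2f}")  # Eq. 6.5
```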
Relative Efficiency

Test developers are interested sometimes in comparing the information functions for two or more tests that measure the same ability. For example, a committee assigned the task of developing a national achievement test may want to compare the test information functions provided by tests composed of different items or exercises. Comparing information functions for two or more tests can serve as an aid in test evaluation and selection (see, for example, Lord, 1977). Another example would be a school district or state department of education interested in choosing a standardized achievement test. Based on prior information about student performance, the test that provides the most information in the region of the ability scale of interest would be preferred. (Other factors, however, should be taken into account in the selection of tests, such as validity, cost, content, and test length.) The comparison of information functions is done by computing the relative efficiency of one test, compared with the other, as an estimator of ability at θ:

RE(θ) = I_A(θ) / I_B(θ)   [6.6]

where RE(θ) denotes relative efficiency, and I_A(θ) and I_B(θ) are the information functions for Tests A and B, respectively, defined over a common ability scale, θ. If, for example, I_A(θ) = 25.0 and Test B provides less information than Test A at that point, then RE(θ) exceeds 1.00 and Test A yields the more precise ability estimates there. Relative efficiency can be interpreted in terms of test length: an RE(θ) of 1.25 means that Test B would have to be lengthened by about 25% (with items of comparable quality) to match the precision of Test A at θ.
Exercises for Chapter 6

1. a. For each of the six items given in Table 2.1, determine the value of θ for which the information function is a maximum, and determine the maximum value of the information.
b. Which items would you choose to make up a two-item test that will be most useful for making decisions about examinees at θ = 1.0? What is the value of the test information function for the two-item test at this value of θ?
2. a. Show that if

P = e^{1.7a(θ − b)} / [1 + e^{1.7a(θ − b)}]

then

e^{1.7a(θ − b)} = P / Q

where Q = 1 − P.
b. Show that the expression given by Equation 6.2 may be written as

I(θ) = 2.89a² (Q/P) [(P − c) / (1 − c)]²

c. Deduce that for the two-parameter model,

I(θ) = 2.89a² PQ

3. Item parameters for an "item bank" made up of four items are given in Table 6.1.
TABLE 6.1

Item      b        a        c
1       −0.5     1.25     0.00
2        0.0     1.50     0.00
3        1.0     1.25     0.00
4        1.5     1.00     0.00
Suppose it is necessary to construct a test made up of three items from this bank. Compute the test information function at θ values of −2, −1, 0, 1, 2 for the four three-item tests that can be constructed from the bank. Plot the four test information functions. Which set of items would you use if the test is designed as a mastery test with a cut score set at θ = 1.0?
Answers to Exercises for Chapter 6

1. a. See Table 6.2.

TABLE 6.2

Item      b       a       c       θ_max    I(θ_max)
1        1.0     1.8     0.00     1.00      2.34
2        1.0     0.8     0.00     1.00      0.46
3        1.0     1.8     0.25     1.10      1.29
4       −1.5     1.8     0.00    −1.50      2.34
5       −0.5     1.2     0.10    −0.41      0.85
6        0.5     0.4     0.15     0.81      0.09
b. Since Items 1 and 2 have their maximum information at θ = 1, these would be the items of choice. Item 2 contributes much less than Item 1 and, hence, Item 1 may be sufficient.
2. a. If

P = e^{1.7a(θ − b)} / [1 + e^{1.7a(θ − b)}]

then

Q = 1 − P = 1 / [1 + e^{1.7a(θ − b)}]

Hence, 1 + e^{1.7a(θ − b)} = 1/Q, from which it follows that

e^{1.7a(θ − b)} = 1/Q − 1 = (1 − Q)/Q = P/Q

b. This follows directly from Expression 6.2 and part a.
c. For the two-parameter model, c = 0. Hence, from part b,

I(θ) = 2.89a² (Q/P) P² = 2.89a² PQ
3. See Table 6.3.
TABLE 6.3

θ      Test (1, 2, 3)    Test (1, 2, 4)    Test (1, 3, 4)    Test (2, 3, 4)
−2     0.219             0.219             0.187             0.054
−1     1.361             1.339             0.965             0.540
 0     2.918             2.681             1.486             2.250
 1     1.738             1.215             1.907             2.172
 2     0.492             0.667             1.059             1.076
The test consisting of Items 2, 3, and 4 would be the most useful, since it gives the most information at θ = 1.0.
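The entries in Table 6.3 can be checked with a few lines of code. The sketch below recomputes the test information for each three-item subset of the Table 6.1 bank using the two-parameter result I(θ) = 2.89a²PQ from exercise 2c; the function name info_2pl is ours.

```python
import numpy as np
from itertools import combinations

def info_2pl(theta, a, b, D=1.7):
    """Item information for the 2P model: I = (Da)^2 * P * Q."""
    p = 1 / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1 - p)

# Item bank of Table 6.1.
a = np.array([1.25, 1.50, 1.25, 1.00])
b = np.array([-0.5, 0.0, 1.0, 1.5])

for items in combinations(range(4), 3):
    idx = list(items)
    row = [info_2pl(t, a[idx], b[idx]).sum() for t in (-2, -1, 0, 1, 2)]
    print([i + 1 for i in items], np.round(row, 3))
```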
7 Test Construction

Background
The construction of achievement and aptitude tests using classical test theory techniques involves the selection of items according to their content and characteristics: item difficulty and discrimination. Items with high discriminating power are generally the most desirable, and the appropriate level of item difficulty is determined by the purpose of the test and the anticipated ability distribution of the group for whom the test is intended. As noted in earlier chapters, classical indices are not invariant over populations that differ in ability. Hence, the success of classical item selection techniques depends on how closely the group used to determine the item indices matches the population for whom the test is intended. When the match is poor, the item indices obtained will not be appropriate for the intended population.

In many practical situations, the group for which item indices are obtained and the group for whom the test is intended are quite different. Consider, for example, the common practice in school districts of field-testing items in the fall for use in year-end tests in May or June. While such a field test is useful for detecting gross flaws in the items, the item indices themselves are not likely to be very useful in test development because the ability distribution of the students tested in the fall will differ substantially from the ability distribution of the students tested at the end of the school year.

Another situation in which classical item indices are obtained for groups that may differ from the intended population is in item banking. In developing an item bank, the characteristics of the items to be stored in the bank must be determined. In practice, these items, often called "experimental" items, are embedded in a test and administered to a
group of examinees so that their item indices can be obtained. If the experimental items are numerous, obviously not all can be embedded in one test. Multiple forms of the test are created, each containing different experimental items, and different forms are administered to different groups of examinees. It is usually not possible to ensure that the different forms are administered to equivalent groups; hence, the item indices for experimental items that were given to different groups of examinees may not be comparable. If the items are banked with the assumption that the item indices are comparable, any test constructed from the bank will not be appropriate for a given population.

Apart from the problem of noninvariant item indices, the major drawback of classical procedures for test construction is that even when a well-constructed item bank is available, items cannot be selected to yield a test that meets a fixed specification in terms of precision of measurement. The contribution of an item to the reliability of the test does not depend on the characteristics of the item alone, but also on the relationship between the item and the other items in the test. Thus, it is not possible to isolate the contribution of an item to the reliability and, hence, to the standard error of measurement of a test.

Item response theory offers a more powerful method of item selection than does classical test theory. Item parameters are invariant, overcoming the problems of classical item indices described above. In addition, item difficulty and examinee ability are measured on the same scale, making it possible to select items that are most useful in certain regions of the ability scale; for example, at a cut-off score for separating masters and nonmasters. Perhaps the most important advantage of IRT is that it permits the selection of items based on the amount of information the items contribute to the total amount of information needed in the test to meet the test specifications. Since information is related to precision of measurement, it is possible to choose items to produce a test that has the desired precision of measurement at any ability level, for example, at a cut-off score.
Basic Approach

A procedure for using item information functions to build tests to meet any desired set of test specifications was outlined by Lord (1977). The procedure employs an item bank with item parameter estimates
available for the IRT model of choice, with accompanying information functions. The steps in the procedure suggested by Lord (1977), illustrated in the sketch following this list, are as follows:

1. Decide on the shape of the desired test information function. This was termed the target information function by Lord (1977).
2. Select items from the item bank with item information functions that will fill up hard-to-fill areas under the target information function.
3. After each item is added to the test, calculate the test information function for the selected test items.
4. Continue selecting test items until the test information function approximates the target information function to a satisfactory degree.
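One simple way to automate steps 2 through 4 is a greedy heuristic: at each step, add the bank item that best fills the remaining area under the target. The sketch below uses a randomly generated bank and a flat target of I(θ) = 4.0 on (−2, 2); it is an illustration of the idea under these assumptions, not Lord's exact procedure or the linear programming approach cited later in this chapter.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Item information for the 3P model, equivalent to Equation 6.2."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

rng = np.random.default_rng(3)
bank_a = rng.uniform(0.5, 2.0, 75)     # hypothetical calibrated 75-item bank
bank_b = rng.uniform(-2.5, 2.5, 75)
bank_c = rng.uniform(0.0, 0.25, 75)

grid = np.linspace(-3, 3, 25)
target = np.where(np.abs(grid) <= 2.0, 4.0, 0.0)   # flat target on (-2, 2)

selected, test_info = [], np.zeros_like(grid)
for _ in range(20):
    # Greedy step: pick the item that most reduces the remaining shortfall.
    best, best_gain = None, -np.inf
    for i in set(range(75)) - set(selected):
        item_info = info_3pl(grid, bank_a[i], bank_b[i], bank_c[i])
        gain = np.minimum(target - test_info, item_info).sum()
        if gain > best_gain:
            best, best_gain = i, gain
    selected.append(best)
    test_info += info_3pl(grid, bank_a[best], bank_b[best], bank_c[best])

print(sorted(selected))
print(np.round(test_info, 1))   # compare with the target after each run
```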
These steps are usually implemented within a framework defined by the content specifications of the test. For a broad-range ability test, the target information function should be fairly flat, reflecting the desire to produce a test that would provide (approximately) equally precise ability estimates over the ability scale. For a criterion-referenced test with a cut-off score to separate masters and nonmasters, the desired target information function should be highly peaked near the cut-off score on the ability scale. The use of item information functions allows the test developer to produce a test that precisely fulfills any set of test specifications (assuming that the item bank is sufficiently large and contains items of high quality). An example of how item information functions can be applied in a large test-development project was given by Yen (1983). Procedures for automating item selection to match a target information function also have been developed (see, for example, van der Linden & Boekkooi-Timminga, 1989).
As an illustration of the above procedure, consider an achievement test. On achievement tests, it is common to observe lower performance on a pretest than on a posttest. Knowing this, the test constructor might select easier items for the pretest and more difficult items for the posttest. On each testing occasion, precision of measurement will be maximized in the region of ability where the examinees would most likely be located. Moreover, because the items on both tests measure the same ability and ability estimates do not depend on the particular choice of items, growth can be measured by subtracting the pretest ability estimate from the posttest ability estimate.

Investigations of the effects of optimal item selection on the decision-making accuracy of a test, when the intended cut-off scores or standards for the test are known in advance of test development, were conducted by de Gruijter and Hambleton (1983) and Hambleton and de Gruijter (1983). To provide a baseline for interpreting the results, tests were constructed also by selecting test items on a random basis. Random item selection from pools of acceptable test items is a common practice in criterion-referenced testing. Error rates (probabilities of misclassification) for the test constructed by random item selection procedures were nearly double the error rates obtained with the optimally selected test items. Optimal item selection is made possible within an IRT framework because items, persons, and cut-off scores are reported on the same scale.

The scenario simulated in the Hambleton and de Gruijter studies is not uncommon in testing practice. For example, in the summer of 1990, a group of educators and noneducators from the United States set standards for marginally basic, proficient, and advanced Grade 4 students on the 1990 NAEP Mathematics Assessment. These three standards were mapped onto the NAEP Reporting (Ability) Scale using the test characteristic function defined for the total pool of Grade 4 mathematics items. In 1991, when test items are selected for the 1992 Mathematics Assessment, test items could be chosen to maximize the test information at each of the standards. In this way, more accurate information about the percentage of students in each of the four ability intervals defined by the three standards could be obtained. Similar procedures could be used for the Grades 8 and 12 NAEP Mathematics Assessments. A discussion of the process of setting target information functions and selecting items was provided by Kingston and Stocking (1986).
Several problems, however, remain to be addressed. One problem is that use of statistical criteria for item selection alone will not ensure a content-valid test. Unfortunately, it is easy to overemphasize statistical criteria and not take into account the important role that item content plays in test development. Failure to attend to content considerations might result in a charge that the test lacks content validity. Ways must be found to combine information about item content and statistical criteria in the item selection process. A solution to this problem has been provided by van der Linden and Boekkooi-Timminga (1989), using linear programming techniques. Another problem in using item information functions in test development is that high a values are likely to be overestimated and, hence, the information function may be biased. A test constructed using items with high a values is likely to be different from the expected test (see, for example, Hambleton, Jones, & Rogers, 1990). Since the test information function will be overestimated, the constructed test will provide less precise ability estimates than expected.
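Returning to the linear programming idea, it can be sketched with an off-the-shelf mixed-integer solver. The formulation below is a simplified stand-in for the van der Linden and Boekkooi-Timminga approach, not their actual model: binary variables indicate item selection, the objective minimizes test length, and each constraint row requires the test information at one θ point to reach the target. Content requirements would enter as additional linear constraints.

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    D = 1.7

    def item_information(theta, a, b, c):
        P = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
        return (D * a) ** 2 * ((P - c) / (1 - c)) ** 2 * (1 - P) / P

    rng = np.random.default_rng(2)
    n = 60                                     # hypothetical item pool
    disc = rng.uniform(0.6, 1.8, n)
    diff = rng.uniform(-2.0, 2.0, n)
    guess = rng.uniform(0.10, 0.25, n)

    thetas = np.array([-1.0, 0.0, 1.0])        # points where the target is set
    target = np.array([4.0, 4.0, 4.0])
    A = item_information(thetas[:, None], disc, diff, guess)  # one row per theta

    # minimize sum(x) subject to A @ x >= target, with x binary
    res = milp(c=np.ones(n),
               constraints=LinearConstraint(A, lb=target, ub=np.inf),
               integrality=np.ones(n),
               bounds=Bounds(0, 1))
    if res.x is not None:                      # res.x is None if infeasible
        picked = np.flatnonzero(res.x > 0.5)
        print(len(picked), picked)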
Example 1: Broad-Range Ability Test

Suppose the test developer's intent is to produce a wide-range ability test using the item bank in the Appendix. Suppose also that standard errors of (approximately) 0.50 would be acceptable in the ability range (-2.00, 2.00), with somewhat larger errors outside that interval. A possible target information function is shown in Figure 7.1. If SE(θ) = 0.50, then I(θ) = 4.00. To construct the shortest possible test that meets the target, items with high discriminations, difficulties between -2.00 and +2.00, and low c values must be chosen. Figure 7.1 shows the target information function (which is flat, with I(θ) = 4.00 between θ = -2 and θ = 2) and the test information functions after selecting the best 10, 15, and 20 test items from the item bank for the desired test. Clearly, the resulting 20-item test fairly closely approximates the desired test. The addition of items with difficulties near -2 and 2 would produce an even better match to the target information function.
Figure 7.1. Test Information Functions for 10, 15, and 20 Test Items
Example 2: Criterion-Referenced Test Construction

Suppose the test developer wishes to construct a 15-item criterion-referenced test (drawn from the item pool shown in the Appendix) to provide maximum information at the cut-off score θ = -0.50. The resulting test information function is shown in Figure 7.2. The items selected were 2, 3, 5, 7, 14, 24, 27, 30, 32, 36, 47, 48, 71, 72, and 73. (Others were possible, but these 15 ultimately were chosen.) For the purposes of comparison, a 15-item test (called the standard test) was constructed by drawing items at random (a common practice in criterion-referenced test development), and the test information function for this test also is shown in Figure 7.2. Figure 7.3 shows the relative efficiency of the optimal test compared with the standard test. Clearly, the optimal test provides greater measurement precision in the region of the cut-off score (θ = -0.50). The optimal test performs about 60% better than the standard test in this region. The standard test would need to be lengthened from 15 to 24 test items to perform approximately as well as the optimal test. As can be seen in Figures 7.2 and 7.3, the optimal test does not perform as well as the standard test for high-ability examinees.
Figure 7.2. Test Information Functions for 15-Item Optimal and Randomly Constructed Tests
Figure 7.3. Efficiency Function for 15-Item Optimal Versus Randomly Constructed Tests
Figure 7.4. Standard Error Functions for 15-Item Optimal and Randomly Constructed Tests
This is due to the fact that the optimal test is composed largely of items that discriminate near the cut-off score and does not contain many items that are suitable for high-ability examinees. The standard test includes a more heterogeneous selection of test items. In practice, the more heterogeneous the item pool and the shorter the test of interest in relation to the size of the item pool, the greater the advantages of optimal item selection over random item selection. The standard errors corresponding to the information functions for the two tests are shown in Figure 7.4.
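Relative efficiency is simply the ratio of two test information functions, so the comparison above is easy to reproduce in outline. In the sketch below the two 15-item tests are hypothetical stand-ins (a peaked "optimal" test and a spread-out "standard" test), not the actual items behind Figures 7.2 and 7.3; the lengthening factor reported above, 24/15 = 1.6, is just the reciprocal of the standard test's efficiency at the cut-off score.

    import numpy as np

    D = 1.7

    def test_information(theta, items):
        # items: array with one (a, b, c) row per item; information sums over items
        a, b, c = items.T
        P = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)))
        return ((D * a) ** 2 * ((P - c) / (1 - c)) ** 2 * (1 - P) / P).sum(axis=1)

    theta = np.linspace(-3, 3, 61)
    opt_items = np.column_stack([np.full(15, 1.4), np.full(15, -0.5), np.full(15, 0.2)])
    std_items = np.column_stack([np.full(15, 0.9), np.linspace(-2, 2, 15), np.full(15, 0.2)])

    efficiency = test_information(theta, opt_items) / test_information(theta, std_items)
    k = np.argmin(np.abs(theta - (-0.5)))          # grid point nearest the cut-off
    print(round(float(efficiency[k]), 2))          # > 1 near the cut-off score
    # standard-test length needed to match the optimal test at the cut-off:
    print(int(np.ceil(15 * efficiency[k])))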
Exercises for Chapter 7

1. Suppose you have an item bank containing six test items that provide information at various θ values as shown in Table 7.1.
a. What is the test information and corresponding standard error at θ = 1.0 for a test consisting of Items 2, 3, and 6?
b. How many items like Item 5 are needed to obtain a standard error of 0.40 at θ = -1.0?
TABLE 7.1

                                 θ
Item     -3      -2      -1       0       1       2       3
1       0.02    0.10    0.25    0.40    0.15    0.08    0.04
2       0.00    0.00    0.05    0.25    1.10    0.25    0.10
3       0.00    0.00    0.03    0.10    0.50    0.40    0.15
4       0.15    1.25    1.45    0.50    0.10    0.02    0.00
5       0.10    0.30    0.60    0.30    0.20    0.09    0.00
6       0.00    0.00    0.02    0.20    2.20    0.40    0.15
2. Two tests are constructed from the item bank given in Exercise 1. Test 1 consists of Items 2 and 3; Test 2 consists of Items 1 and 6.
a. Compute the information provided by each test at θ = 0.0, 1.0, and 2.0.
b. Compute the efficiency of Test 1 relative to Test 2 at θ = 0.0, 1.0, and 2.0.
c. What does the relative efficiency analysis indicate about Test 1?
d. How many items like Item 5 need to be added to Test 1 so that Tests 1 and 2 are (approximately) equally informative at θ = 1.0?
3. Suppose that it is desired to construct a criterion-referenced test that is optimally discriminating at θ = -1.0.
a. If the test consists of Items 4 and 5, what is the standard error at θ = -1.0?
b. What is the probability that a candidate with θ = 0.0 will fail the test when the cut-off score is set at θ = -1.0?
Answers to Exercises for Chapter 7

1. a. I(θ = 1.0) = 3.8; SE(θ = 1.0) = 0.51.
b. SE = 0.40 requires I(θ) = 6.25. Since the information provided by Item 5 at θ = -1.0 is 0.60, 11 items like Item 5 are required to produce the desired test.
2. a. See Table 7.2.

TABLE 7.2

                               θ
Test                    0.0     1.0     2.0
1 (Items 2 and 3)      0.35    1.60    0.65
2 (Items 1 and 6)      0.60    2.35    0.48
b. See Table 7.3.

TABLE 7.3

                                θ
                        0.0     1.0     2.0
Efficiency (1 vs. 2)   0.58    0.68    1.35
c. Test 1 is providing about 58% as much information as Test 2 at θ = 0.0, and about 68% as much information as Test 2 at θ = 1.0. At θ = 2.0, Test 1 is providing more information than Test 2.
d. 4
3. a. SE(θ = -1.0) = 0.70
b. The standard error of ability estimation at θ = 0.0 is 1.12. Therefore, the probability of this candidate failing the test (i.e., the probability of making a false-negative error) is 0.197.
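The arithmetic behind these answers can be verified in a few lines, using the Table 7.1 entries as reconstructed above; small discrepancies from the printed answers (e.g., about 0.19 rather than 0.197 in 3b) reflect rounding in intermediate steps.

    import numpy as np
    from scipy.stats import norm

    # 1a: Items 2, 3, and 6 at theta = 1.0 (values from Table 7.1)
    info_1a = 1.10 + 0.50 + 2.20
    print(info_1a, round(1 / np.sqrt(info_1a), 2))   # 3.8, SE ~ 0.51

    # 3a: Items 4 and 5 at theta = -1.0
    info_3a = 1.45 + 0.60
    print(round(1 / np.sqrt(info_3a), 2))            # SE ~ 0.70

    # 3b: candidate at theta = 0.0, cut-off at theta = -1.0
    se = 1 / np.sqrt(0.50 + 0.30)                    # Items 4 and 5 at theta = 0.0
    print(round(float(norm.cdf((-1.0 - 0.0) / se)), 3))  # ~ 0.19 (printed: 0.197)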
8 Identification of Potentially Biased Test Items
Background

Perhaps the most highly charged issue surrounding testing, and certainly the one of greatest importance to the public, is that of test fairness. Claims that tests are biased against racial or ethnic minorities have led to numerous lawsuits and calls by such organizations as the NAACP for moratoria or bans on certain types of tests (Rudner, Getson, & Knight, 1980). Tests and testing practices have come under close public scrutiny, and test publishers and users must demonstrate now that their tests are free of bias against minorities. One of the desirable features of item response theory is that it provides a unified framework for conceptualizing and investigating bias at the item level.

Before IRT procedures for investigating bias can be discussed, some clarification of terminology is necessary. Investigations of bias involve gathering empirical evidence concerning the relative performances on the test item of members of the minority group of interest and members of the group that represents the majority. Empirical evidence of differential performance is necessary, but not sufficient, to draw the conclusion that bias is present; this conclusion involves an inference that goes beyond the data. To distinguish the empirical evidence from the conclusion, the term differential item functioning (DIF) rather than bias is used commonly to describe the empirical evidence obtained in investigations of bias.

Some argument exists as to the appropriate definition of DIF. A definition that has been used in recent legal settlements and legislation concerning fair testing is that "an item shows DIF if the majority and minority groups differ in their mean performances on the item." The
problem with this definition is that it does not take into account the possibility that other variables, such as a real between-group difference in ability, may be responsible for the difference in p-values (see Lord, 1980). The definition of DIF accepted by psychometricians is that "an item shows DIF if individuals having the same ability, but from different groups, do not have the same probability of getting the item right."
IRT Methods for Detecting DIF

Given the accepted definition of DIF, item response theory provides a natural framework for studying DIF. Since the item characteristic function relates the probability of a correct response to the underlying ability and to the characteristics of the item, the definition of DIF may be restated operationally as follows: "An item shows DIF if the item response functions across different subgroups are not identical. Conversely, an item does not show DIF if the item characteristic functions across different subgroups are identical."

Based on the definition given above, DIF may be investigated by comparing the item characteristic functions of two or more groups. Item characteristic functions may be compared in several ways. The first and, intuitively, most direct approach is to compare the parameters that describe the item characteristic curves. Since the ICCs are completely determined by their corresponding item parameters, two ICCs can be different only if the item parameters that describe them are different (Lord, 1980). A second approach is to compare the ICCs by evaluating the area between them (Rudner et al., 1980). If the area between the ICCs is zero, then the ICCs coincide and, hence, no DIF is present. These two approaches for studying DIF are described in the following sections.

Comparison of Item Parameters. If the parameters of two item characteristic functions are identical, then the functions will be identical at all points and the probabilities of a correct response will be the same. The null hypothesis that the item response functions are the same may be stated as

H0: a1 = a2,  b1 = b2,  c1 = c2
where the subscript denotes the group in which the parameter estimates were obtained. If the hypothesis is rejected for a given item, we can conclude that DIF is present for that item.

To test the null hypothesis, estimates of the item parameters and the variance-covariance matrices of the estimates are needed. Recall that when estimating item and ability parameters in each group, a scale for the parameters must be specified (chapter 3); this is done typically by standardizing either the ability estimates or the difficulty estimates in each group. As we shall see later (chapter 9), standardizing the ability estimates usually will result in different scales in each group, and the item parameter estimates will not be on a common scale. Standardizing the difficulty parameters will result in item parameter estimates that are on a common scale.

After the item parameter estimates are placed on a common scale, the variance-covariance matrix of the parameter estimates in each group is computed. First, the information matrix is computed (see chapter 3) for each group and is inverted. The variance-covariance matrices of the two groups are added then to yield the variance-covariance matrix of the differences between the estimates. The statistic for testing the null hypothesis is

χ² = v' Σ⁻¹ v

where

v = (a1 - a2, b1 - b2, c1 - c2)'
and Σ is the variance-covariance matrix of the differences between the parameter estimates. The test statistic is asymptotically (that is, in large samples) distributed as a chi-square with p degrees of freedom, where p is the number of parameters compared. For the three-parameter model, p = 3. For the one-parameter model, the chi-square statistic simplifies to

χ² = b²diff / [v(b1) + v(b2)]

where bdiff = b1 - b2, and v(b1) and v(b2) are the reciprocals of the information functions for the difficulty parameter estimates.

Since the c parameter is often poorly estimated and, hence, has a large standard error, its inclusion in the test statistic may produce a very conservative test, that is, a test that is not powerful in detecting DIF. An alternative is to compare only the a and b parameters and to ignore the c parameters. This approach is reasonable, since if differences exist in the a and b parameters across groups, then the item characteristic functions will be different, regardless of the c parameter values; without differences in the a and b parameters, apparent differences between the c parameters would be too unreliable to warrant the conclusion that the item characteristic functions are different (Lord, 1980).

The comparison of item parameters as a means of comparing item characteristic functions has been criticized on the grounds that significant differences between the parameters may be found when no practical differences exist between the ICCs in the ability range of interest. An example of item parameter values for two groups that produce almost identical ICCs in the ability range (-3, 3) was given by Linn, Levine, Hastings, and Wardrop (1981). The item parameters for the two groups are given below:
Group 1:  a = 1.8;  b = 3.5;  c = 0.2
Group 2:  a = 0.5;  b = 5.0;  c = 0.2

Although significant differences exist between the parameters, the ICCs for the two groups differ by less than 0.05 in the specified ability range. It should be noted, however, that this item was extremely difficult for both groups and, hence, an inappropriate item for these groups. If the two ICCs were compared in the ability range for which this item is appropriate, a considerable difference between the ICCs would be observed. For items of appropriate difficulty for at least one of the two groups of examinees (items with difficulty parameters in the ability range of interest), it is not possible to obtain significant differences between the item parameters for the two groups without a corresponding difference in the ICCs.

A more valid criticism of the comparison of item parameters is that the distribution of the test statistic is known only asymptotically; furthermore, the asymptotic distribution is applicable only when item parameters are estimated in the presence of known ability parameters (Hambleton & Swaminathan, 1985).
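This example is easy to check numerically. The sketch below uses the parameter values as printed above; note that the Group 2 difficulty is a reconstruction from a damaged line of the scan, so treat the exact numbers as illustrative. With these values the two ICCs stay within about 0.05 of each other over (-3, 3).

    import numpy as np

    D = 1.7

    def icc(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    theta = np.linspace(-3, 3, 601)
    p1 = icc(theta, a=1.8, b=3.5, c=0.2)   # Group 1 (b value reconstructed above)
    p2 = icc(theta, a=0.5, b=5.0, c=0.2)   # Group 2 (b value reconstructed above)
    print(round(float(np.max(np.abs(p1 - p2))), 3))  # ~ 0.05 despite very different parameters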
It is not known how large the sample size must be in order for the asymptotic distribution to apply, and it is not known whether the asymptotic distribution applies when item and ability parameters are estimated simultaneously. In addition to this problem, some evidence suggests that the chi-square statistic has a higher than expected false-positive rate (McLaughlin & Drasgow, 1987).

Area Between Item Characteristic Curves. An alternative approach to the comparison of item characteristic functions is to compare the ICCs themselves rather than their parameters. If, after placing the parameter estimates on a common scale, the ICCs are identical, then the area between them should be zero (Rudner et al., 1980); when the area between ICCs is not zero, we can conclude that DIF is present.

In computing the area, numerical procedures were used until recently. The numerical procedure involves (a) dividing the ability range into k intervals, (b) constructing rectangles centered around the midpoint of each interval, (c) obtaining the values of the ICCs (the probabilities) at the midpoint of each interval, (d) taking the absolute value of the difference between the two ICC values at each midpoint, and (e) multiplying each absolute difference by the width of the interval and summing over intervals. The area statistic for item i is thus
Ai = Σ | Pi1(θ) - Pi2(θ) | Δθ

where the sum is taken over the interval midpoints, Pi1(θ) and Pi2(θ) are the ICCs for the two groups, and Δθ is the width of the interval. Raju (1988) derived an exact expression for the area between two ICCs when the two groups have a common c parameter value:

Area = (1 - c) | [2(a2 - a1) / (D a1 a2)] ln{1 + exp[D a1 a2 (b2 - b1) / (a2 - a1)]} - (b2 - b1) |

This expression gives the area between the ICCs over the entire range of ability.
For a finite range of ability, the area is finite; however, no expression has been derived for the area between ICCs in a finite ability range, and so numerical methods must be used. Raju (1990) derived an expression for the standard error of the area statistic and suggested that the area statistic divided by its standard error can be taken as approximately normally distributed. This procedure is based on the assumption that the c parameter values are the same for the two groups and are fixed (i.e., not estimated). When the c parameters for the two groups are not the same, the significance test for the area statistic cannot be carried out.

The problem is to find a "cut-off" value for the area statistic that can be used to decide whether DIF is present. An empirical approach to determining a cut-off is to divide the group with the larger sample size into two randomly equivalent groups, to estimate the ICCs in each group separately, and to determine the area between the estimated ICCs (Hambleton & Rogers, 1989). Since the groups are randomly equivalent, the area should be zero. Nonzero values of the area statistic are regarded as resulting from sampling fluctuations, and the largest area value obtained may be regarded as the largest value that may be expected in terms of sampling fluctuation. Any area value greater than this is assumed to be "significant" and, consequently, indicative of DIF when the majority and minority groups are compared.

One disadvantage of the approach to establishing the cut-off value described above is that, as a result of halving the sample, the parameter estimates may be unstable; consequently, the area statistic may not be a reliable indicator of DIF. An alternative approach is to use simulated data to establish the cut-off value (Rogers & Hambleton, 1989). In this approach, the two groups of interest are combined and parameters are estimated for the total group. The item parameter estimates for the total group and the ability parameter estimates for the majority group are used to generate a set of data of the same size as the majority group. Similarly, the item parameter estimates for the total group and the ability parameter estimates for the minority group are used to generate
a set of data of the same size as the minority group. The two sets of simulated data closely resemble the data for the majority and minority groups in terms of sample sizes, distributions of ability, and item characteristics. The one difference is that the two sets of simulated data are based on the same item parameters and, hence, no DIF is present. Item and ability parameters then are estimated separately for each set of simulated data, and area statistics are computed for each item. Since no DIF is present, nonzero area values are the result of sampling fluctuations; as described above, the largest area value obtained in this comparison may be regarded as a cut-off value for use in the comparison of ICCs for the real majority and minority groups.

The empirical procedure described above for establishing critical values may capitalize on chance because only one replication is performed. Multiple replications may be carried out and a cut-off value might be established for each item; however, such a procedure would be so computer-intensive as to be prohibitive.

A problem common to the IRT approaches described above is that item parameters must be estimated in both groups. For proper estimation, a large number of examinees with a large ability range is needed. In typical DIF studies, the number of examinees in the minority group is usually small (around 300); furthermore, the group may have a restricted ability range. Since item parameters will be estimated poorly in such situations, the DIF statistics may lead to erroneous decisions about the presence of DIF. Because of the problems associated with IRT methods for detecting DIF, alternative methods have been sought. The most popular of the current non-IRT approaches for detecting DIF is the Mantel-Haenszel method (Holland & Thayer, 1988). Unfortunately, this method is not sensitive to nonuniform DIF. More recently, Swaminathan and Rogers (1990) have provided a logistic regression procedure capable of detecting nonuniform, as well as uniform, DIF.
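Both area computations are straightforward to express in code. The sketch below implements the numerical rectangle method and the equal-c form of Raju's (1988) expression (with D = 1.7, and assuming a1 ≠ a2); the Item 43 parameter values from Exercise 2 at the end of this chapter reproduce the worked answer of about 1.06.

    import numpy as np

    D = 1.7

    def icc(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def area_numerical(a1, b1, a2, b2, c, lo=-4.0, hi=4.0, step=0.01):
        # rectangle rule over a finite ability range
        theta = np.arange(lo, hi, step)
        return float(np.sum(np.abs(icc(theta, a1, b1, c) - icc(theta, a2, b2, c))) * step)

    def area_raju(a1, b1, a2, b2, c):
        # exact area over (-inf, inf); requires a common c and a1 != a2
        k = D * a1 * a2 * (b2 - b1) / (a2 - a1)
        return (1 - c) * abs(2 * (a2 - a1) / (D * a1 * a2) * np.log1p(np.exp(k)) - (b2 - b1))

    print(round(area_raju(0.93, 0.90, 0.42, 1.82, 0.2), 2))        # ~ 1.06
    print(round(area_numerical(0.93, 0.90, 0.42, 1.82, 0.2), 2))   # close, finite range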
The IRT approaches to the investigation of bias described in this chapter are illustrated using the New Mexico data set introduced in chapter 4. In this example the majority group is Anglo-American and the minority group is Native American. For the purposes of the example,
a random sample of 1,000 Anglo-Americans and a random sample of 1,000 Native Americans were drawn from the total set of data. Three-parameter item response models were fitted separately to the item responses of each of the two groups. In computing the parameter estimates, the metric was fixed by standardizing the b values. Since the two sets of data consisted of responses to the same items, standardizing the b values in each group automatically placed the item parameter estimates for the two groups on the same scale.

Area statistics were computed for each item. Because the c values for the two groups were unequal for most items, the numerical method of calculating the area values was used. The θ increment used in the calculations was 0.01. The area was calculated over the ability range from three standard deviations below the lower group mean θ to three standard deviations above the upper group mean θ; the resulting ability range was (-3.36, 3.55).

Simulated data were used to determine the cut-off value, as described earlier. Item responses were simulated for two groups for which no item had DIF. To obtain parameter values for generating the item responses, the two groups were combined first and parameter estimates were computed for the total group (these parameter estimates are reported in the Appendix). The ability estimates for the majority group and the item parameter estimates for the combined group were used then to simulate a set of data resembling the majority group; similarly, the ability estimates for the minority group and the item parameter estimates for the combined group were used to simulate a set of data resembling the minority group. Since the same item parameters were used to generate the data for the two groups, the simulated data represent the situation that has no DIF. Three-parameter models were fitted separately to each set of simulated data, with the metric fixed by standardizing the b values in each group. Area statistics were computed for each item, and the largest area value obtained was used as the cut-off value in the Anglo-Native American comparison. The largest area value obtained in the simulated data comparison was 0.498.

In comparing the item parameters for the two groups, two chi-square statistics were calculated. The first chi-square statistic, denoted as χ²ab, was based on only the a and b parameters for the two groups, while the second chi-square statistic, denoted as χ²abc, was based on the a, b, and c parameters. The second chi-square test was carried out primarily for illustrative purposes.
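The data-generation half of the simulated-data procedure is simple to sketch; the calibration of each generated data set is left to external software (LOGIST in the study above). All arrays below are stand-ins for the combined-group item parameter estimates and the group ability estimates, not the actual New Mexico values.

    import numpy as np

    D = 1.7
    rng = np.random.default_rng(4)

    def simulate_responses(theta, a, b, c):
        # Binary 3PL responses for every examinee-item pair; no DIF is built in
        P = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)))
        return (rng.random(P.shape) < P).astype(int)

    n_items = 75
    a = rng.uniform(0.5, 1.5, n_items)            # stand-ins for combined-group estimates
    b = rng.normal(0.0, 1.0, n_items)
    c = np.full(n_items, 0.2)

    theta_majority = rng.normal(0.0, 1.0, 1000)   # stand-ins for group ability estimates
    theta_minority = rng.normal(-0.5, 1.0, 1000)

    data_majority = simulate_responses(theta_majority, a, b, c)
    data_minority = simulate_responses(theta_minority, a, b, c)
    # Each data set is then calibrated separately and area statistics computed;
    # the largest area across items serves as the cut-off value.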
"""'\
l'Olt'lII/IIlIy
117
IJi!l.l"·" /I'M IINIH
TABLE 8.1 Item Parameter Estimates, Area Statistics, and χ² Values for Twenty-Five Randomly Chosen Items

[Table not legible in this scan. For each of 25 items it reports the a, b, and c parameter estimates in the majority and minority groups, the area statistic, and the χ²ab and χ²abc statistics, with values significant at the 0.001 level starred. Critical values: χ²ab = 13.82; χ²abc = 16.27.]
The significance level for each chi-square statistic was set at 0.001 to ensure that the overall Type I error rate was around 0.05. For the χ²abc statistic, the critical value was χ²3,0.001 = 16.27; for the χ²ab statistic, the critical value was χ²2,0.001 = 13.82. The item parameters for the two groups, the area statistics, and the chi-square values for 25 randomly chosen items are reported in Table 8.1. Of the 75 items analyzed altogether, the area statistic flagged 20 items as showing DIF, while the χ²ab statistic flagged 25 items. The χ²abc statistic flagged only 9 items, which represented a subset of those flagged by the χ²ab statistic.
Figure 8.1. Plot of ICCs for Majority and Minority Groups for Item 56

As expected, the χ²abc statistic was more conservative than the χ²ab and area statistics. The degree of agreement between the area and χ²ab chi-square statistics was moderate: 77% of the items were classified in the same way (either showing DIF or not) by the two methods. The rank order correlation between the two methods was 0.71.

Two examples of items flagged by both procedures are given in Figures 8.1 and 8.2. These items differ in the type of DIF observed. In Figure 8.1, the ICCs for the two groups are more or less parallel, differing mainly in their b parameters. This type of DIF is referred to as uniform DIF; the difference in probabilities of success is uniform for the two groups over all ability levels. In Figure 8.2, the ICCs for the two groups cross; the probability of success is greater for the minority group than for the majority group at the low end of the ability scale, but is greater for the majority group at the high end of the ability scale. This type of DIF is referred to as nonuniform DIF, since the difference in probabilities is not uniform across ability levels. One of the advantages of IRT procedures for detecting DIF is their sensitivity to these different types of DIF; this feature is not shared by some of the popular non-IRT procedures for detecting DIF (Holland & Thayer, 1988; Swaminathan & Rogers, 1990).
Figure 8.2. Plot of ICCs for Majority and Minority Groups for Item 13
Of the 25 items flagged by the χ²ab statistic, 11 were not flagged by the area statistic; 6 of the items flagged by the area statistic were not flagged by the χ²ab statistic. Examination of the ICCs for the items inconsistently flagged revealed no reason for the result. This finding demonstrates one of the problems of all methods for detecting DIF: while the agreement among methods is moderate, unexplainable differences occur often.
Summary

Item response theory procedures for detecting DIF involve the comparison of item characteristic functions for two groups of interest. Two ways in which item characteristic functions may be compared are (a) by comparing their parameters or (b) by calculating the area between the curves. To compare item parameters for two groups, a chi-square statistic is computed. The statistic may or may not include the c parameter;
the reason for not including c is that it is often poorly estimated and, hence, is unreliable. An advantage of the chi-square statistic is that it has a known distribution; a possible disadvantage of the procedure is that it may have a high false-positive rate. The area between ICCs can be computed using an exact expression when the c parameters for the two groups are the same, and a significance test for the area is available in this case. When the c parameters are not the same, numerical procedures must be used to calculate the area and no significance test is available. In this case an empirical "cut-off" value must be obtained. This is done using either randomly equivalent samples or simulated data in which there is no DIF. Area values are calculated for this comparison, and the largest value obtained is used as the cut-off for the real analysis.

Several other IRT approaches to the detection of DIF have not been described in this chapter. Linn and Harnisch (1981) suggested evaluating the fit in the minority group of the IRT model obtained in the total group. The procedure is carried out by estimating item and ability parameters for the combined majority-minority group; the item parameter estimates and the ability parameter estimates for the minority group obtained in the combined-group analysis are used to assess the fit of the model to the item response data for the minority group. If no DIF is present, the ICC obtained for the total group should fit the data for the minority group; if DIF is present, the parameters will not be invariant across the two groups and the model obtained for the total group will not fit the minority group. Goodness-of-fit statistics can be computed for each item to determine whether DIF is present. This procedure does not require the estimation of item parameters in the minority group (which is usually small) and, hence, overcomes some of the difficulties encountered in the two approaches described in this chapter.

Another procedure, suggested by Linn et al. (1981), is to calculate the sum of squared differences between the ICCs for every observed value of θ. This procedure may be modified to take into account the error in the estimated probabilities (Shepard, Camilli, & Williams, 1985). Several comparisons of the effectiveness of IRT and non-IRT methods for detecting DIF have been completed. The reader is referred to Mellenbergh (1989); Rudner et al. (1980); Shepard, Camilli, and Averill (1981); Shepard et al. (1985); and Subkoviak, Mack, Ironson, and Craig (1984).
Exercises for Chapter 8

1. In an investigation of DIF, a one-parameter model was fitted to the data separately for the minority and majority groups. For a particular item, the difficulty parameter estimates and the standard errors for the two groups were computed and are given in Table 8.2.

TABLE 8.2

                         Majority Group    Minority Group
Difficulty Estimate:          0.34              0.89
Standard Error:               0.15              0.16
a. Compute the variance of each difficulty estimate.
b. Calculate the chi-square statistic for the difference between the difficulty estimates for the two groups.
c. Does it appear that this item functions differentially in the two groups?
2. In carrying out the DIF analysis on the New Mexico data set, the item parameter estimates for Item 43 contained in Table 8.3 were obtained in the Anglo-American and Native American groups.
TABLE 8.3

Group                 a       b       c
Anglo-American       0.93    0.90    0.2
Native American      0.42    1.82    0.2
a. Using the formula given by Raju (1988), calculate the area between the ICCs for the two groups. Explain why this formula can be used in this situation.
b. Using the cut-off value used in the example for the Anglo-American versus Native American comparison (cut-off value = 0.498), determine if the item shows DIF.
Answers to Exercises for Chapter 8

1. a. For majority group: variance = SE² = 0.15² = 0.0225
For minority group: variance = SE² = 0.16² = 0.0256
b. For the one-parameter model, the χ² statistic simplifies to

χ² = (0.89 - 0.34)² / (0.0225 + 0.0256) = 6.29

c. χ²1,0.05 = 3.84. Since the calculated value exceeds the critical value, we can conclude that this item functions differentially in the two groups.
2. a. Area = (1 - 0.2) | [2(0.42 - 0.93) / (1.7 × 0.42 × 0.93)] ln{1 + exp[1.7 × 0.42 × 0.93 × (1.82 - 0.90) / (0.42 - 0.93)]} - (1.82 - 0.90) |

= 0.8 × | -1.536 ln[1 + exp(-1.20)] - 0.92 |

= 1.06

The formula can be used because the c values for the two groups are the same.
b. The area value exceeds the cut-off value. We can conclude that the item shows DIF.
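Both answers can be checked in a few lines (the area expression is the same Raju formula sketched in the body of the chapter):

    import numpy as np

    # Exercise 1: one-parameter chi-square
    chi_sq = (0.89 - 0.34) ** 2 / (0.15 ** 2 + 0.16 ** 2)
    print(round(chi_sq, 2))        # 6.29, beyond the 3.84 critical value

    # Exercise 2: Raju's area for Item 43 (common c of 0.2, D = 1.7)
    a1, b1, a2, b2, c, D = 0.93, 0.90, 0.42, 1.82, 0.2, 1.7
    k = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    area = (1 - c) * abs(2 * (a2 - a1) / (D * a1 * a2) * np.log1p(np.exp(k)) - (b2 - b1))
    print(round(area, 2))          # 1.06, beyond the cut-off value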
9 Test Score Equating
Background

The comparability of test scores across different tests measuring the same ability is an issue of considerable importance to test developers, measurement specialists, and test takers alike. If two examinees take different tests, how can their scores be compared? This question is particularly important when certification, selection, or pass-fail decisions must be made, since it should be a matter of indifference which test is used to make the decision. To compare the scores obtained on tests X and Y, a process of equating scores on the two tests must be carried out. Through this process a correspondence between the scores on X and Y is established, and the score on test X is converted to the metric of test Y. Thus, an examinee who obtains a score x on test X has a converted score y* on test Y; this score is comparable to the score y of an examinee taking test Y. In making pass-fail, selection, or certification decisions, the cut-off score xc on test X can be converted to the score yc* on test Y, and this converted cut-off score may be used to make the appropriate decision for examinees taking test Y.
Classical Methods of Equating

Classical methods of equating were described in detail by Angoff (1971) and Kolen (1988). In general, the methods fall into two main categories: equipercentile equating and linear equating. Equipercentile equating is accomplished by considering the scores on tests X and Y to be equivalent if their respective percentile ranks in any given group are
equal. Strictly speaking, in order to equate scores on two tests, the tests must be given to the same group of examinees. In practice, the process typically is carried out by giving the tests to randomly equivalent groups of examinees.

In linear equating, it is assumed that the score x on test X and the score y on test Y are linearly related, that is,

y = ax + b

The coefficients a and b may be determined using the relationships

μy = a μx + b

and

σy = a σx

where μx (μy) and σx (σy) are the means and standard deviations of the scores on tests X and Y, respectively. It follows that

a = σy / σx    and    b = μy - (σy / σx) μx

and, hence,

y = (σy / σx)(x - μx) + μy

Using the above expression, a score x may be placed on the metric of test Y. The above expression can be obtained also by equating the standard score on test X to the standard score on test Y:

(y - μy) / σy = (x - μx) / σx
The assumption in linear equating is that the two test score distributions differ only with respect to their means and standard deviations. Therefore, standard scores will be equal in such cases. When this assumption is valid, linear equating becomes a special case of
equipercentile equating; otherwise, it may be considered an approximation to equipercentile equating. The linear equating method has many refinements. Procedures that take into account, for example, outliers and the unreliability of the test scores are given in Angoff (1971). Our purpose here is to describe briefly the classical equating procedures and to note the problems inherent in such approaches.

Lord (1977, 1980) has argued that in equating test scores, it should be a matter of indifference to the examinees at every given ability level whether they take test X or test Y. This notion of equity has several implications (Lord, 1977, 1980):

1. Tests measuring different traits cannot be equated.
2. Raw scores on unequally reliable tests cannot be equated (since otherwise a score from an unreliable test could be equated to the score on a reliable test, thus obviating the need for constructing reliable tests!).
3. Raw scores on tests with varying difficulty cannot be equated, since the tests will not be equally reliable at different ability levels.
4. Fallible scores on tests X and Y cannot be equated unless the tests are strictly parallel.
5. Perfectly reliable tests can be equated.

In addition to the above requirements of equity, two further conditions, symmetry and invariance, must be set for equating tests. The condition of symmetry dictates that equating should not depend on which test is used as the reference test. For example, if a regression procedure is used to determine the constants in the linear equating formula, the condition of symmetry will not be met because the regression coefficients for predicting y from x are different from those for predicting x from y. The requirement of invariance is that the equating procedure be sample independent.

These conditions, particularly those of equity, will usually not be met when using classical methods of equating. Theoretically, item response theory overcomes these problems. If the item response model fits the data, direct comparison of the ability parameters of two examinees who take different tests is made possible by the invariance property. Equating of test scores is obviated in an item response theory framework; what must be ensured, however, is that the item and ability parameter values based on two tests are on a common scale. Thus, in an item response theory framework, scaling rather than equating is necessary.
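Linear equating, and the symmetry property that a regression line lacks, can be illustrated in a few lines with hypothetical score summaries:

    mu_x, sd_x = 48.0, 10.0     # hypothetical test X mean and standard deviation
    mu_y, sd_y = 52.0, 12.0     # hypothetical test Y mean and standard deviation

    def x_to_y(x):
        return sd_y / sd_x * (x - mu_x) + mu_y

    def y_to_x(y):
        return sd_x / sd_y * (y - mu_y) + mu_x

    x = 60.0
    print(x_to_y(x))            # 66.4 on the metric of test Y
    print(y_to_x(x_to_y(x)))    # 60.0: the conversion inverts exactly

A regression of y on x, by contrast, would not invert to the regression of x on y, which is precisely the symmetry failure described above.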
Nevertheless, because of the prevalence of the term equating in the literature, the terms scaling and equating will be used interchangeably in this chapter.
Scaling or Equating in Item Response Theory

According to item response theory, the ability parameter θ of an examinee is invariant across subsets of items. This means that, apart from measurement error, ability estimates also will be invariant across subsets of items. Hence, two examinees who respond to different subsets of items (or different tests) for which the item parameter values are known will have ability estimates that are on the same scale. No equating or scaling is necessary.

When the item and ability estimates are unknown, however, the situation changes. In this case, as explained in the chapter on estimation, θ may be replaced by θ* = αθ + β, b may be replaced by b* = αb + β, and a may be replaced by a* = a/α without affecting the probability of a correct response. (For the one-parameter model, since a = 1, θ need only be replaced by θ* = θ + β, and b by b* = b + β.) This invariance of the item response function with respect to linear transformations introduces an indeterminacy in the scale that must be removed before estimation of parameters can proceed. One way to remove the indeterminacy is to fix arbitrarily the scale of θ (or b); in the two- and three-parameter models, the common practice is to set the mean and standard deviation of the θ (or b) values to be 0 and 1, respectively. For the one-parameter model, the mean of θ (or b) is set to 0. Fixing the scale on θ or b is arbitrary and is dictated sometimes by the computer program used (in BICAL, for example, the mean of b is set to 0). This scaling must be undone when attempting to compare the item parameter values or the ability parameter values across two groups.

To illustrate the procedures and principles underlying scaling, consider the situation in which one test is administered to two groups of examinees (as in studies of DIF). Suppose also that the estimation of item and ability parameters is carried out separately for the two groups. During the estimation phase, it is necessary to fix the scale of the parameter estimates. The two possible ways of fixing the scale are (a) standardizing the item difficulty values, that is, fixing the mean and standard deviation of the difficulty values to be 0 and 1, respectively; and (b) standardizing the ability values.
First, consider the situation in which the scaling is done on the difficulty parameters. Since the same test is administered to the two groups, the item parameter estimates must be identical (except for sampling fluctuations) if the model fits the data. The estimates obtained in the two groups will then be related linearly, that is,

bA = α bB + β

where bA and aA are the difficulty and discrimination parameter estimates in Group A, and bB and aB are the corresponding values in Group B. Once α and β are determined, the item parameter estimates in Group B may be placed on the same scale as the item parameter estimates for Group A.

The more interesting problem is that of comparing the ability parameters in Group A with those in Group B. Using the same relationship as for the b values above, all the ability estimates θB in Group B may be placed on the same scale as those in Group A, using the linear relationship

θB* = α θB + β
where θB* is the value of the parameter θB on the scale of Group A.

The reverse situation to that described above is when one group of examinees takes two tests, X and Y. Since the ability parameter of the examinees taking the two tests must be the same, setting the mean and standard deviation of the θ values to 0 and 1, respectively, places the item parameters for the two tests on a common scale. If, however, the mean and standard deviation of the difficulty parameter values for each test are set to 0 and 1, respectively, the ability parameter values in the two tests will differ by a linear transformation,
θY = α θX + β
The item parameters for tests X and Y must be placed on a common scale using the following relationships:

bY = α bX + β    and    aY = aX / α
These examples indicate that if it is necessary to compare examinees who take different tests, or if it is necessary to place items from different tests on a common scale, the equating study must be designed carefully. Clearly, if different groups of examinees take different tests, no comparison or equating is possible. Designs that permit "linking" of tests and comparison of examinees are discussed next.

Linking Designs. In many situations, the interest is in placing the item parameter estimates from two or more tests on a common scale. This placement enables comparison of the difficulty levels of the tests and also facilitates the development of item banks (see Vale, 1986). The four linking designs that permit the scaling of item parameters (or their estimates) are the following:

1. Single-Group Design. The two tests to be linked are given to the same group of examinees. This is a simple design, but it may be impractical to implement because testing time will be long. Moreover, practice and fatigue effects (if the two tests are administered one after the other) may have an effect on parameter estimation and, hence, on the linking results.

2. Equivalent-Groups Design. The two tests to be linked are given to equivalent but not identical groups of examinees, chosen randomly. This design is more practical and avoids practice and fatigue effects.

3. Anchor-Test Design. The tests to be linked are given to two different groups of examinees. Each test has a set of common items that may be internal or external to the tests. This design is feasible and frequently used, and, if the anchor items are chosen properly (see, for example, Klein & Jarjoura, 1985), it avoids the problems in the single-group or equivalent-groups designs.

4. Common-Person Design. The two tests to be linked are given to two groups of examinees, with a common group of examinees taking both tests. Because the testing will be lengthy for the common group, this design has the same drawbacks as the single-group design.
In the single-group or equivalent-groups design, where one group of examinees (or equivalent groups of examinees) takes the two tests, the methods described in the previous section may be used to place the items on the same scale. In determining the scaling constants in the equivalent-groups design, matched pairs of ability values are needed; this need presents a problem, because the groups consist of different examinees. One possible way to match examinees is to rank order the examinees in the two groups and to consider examinees with the same rank to be equivalent.

In the anchor-test design, the parameters and, hence, their estimates (subject to sampling fluctuations) are related linearly in the two tests, that is,

bYc = α bXc + β
where bYc and bXc are the difficulties of the common items embedded in tests Y and X, respectively. Once the constants α and β are determined, the item parameter estimates for all items in test X may be placed on the same scale as test Y. The adjusted item parameter estimates for the common items in test X will not be identical to the corresponding item parameter estimates in test Y (because of estimation errors) and, hence, they should be averaged. Of the designs described above, the anchor-test design is the most feasible. Hence, determination of the scaling constants will be discussed with reference to this design.
Determination of the Scaling Constants

The methods available for determining the scaling constants α and β (only β when the one-parameter model is used) may be classified as follows:

1. Regression Method
2. Mean and Sigma Method
3. Robust Mean and Sigma Method
4. Characteristic Curve Method
Regression Method. Once pairs of values of item parameter estimates in the two groups are obtained, a regression procedure may be used to determine the line of best fit through the points:

bYc = α bXc + β + e

The term e indicates the error in fitting the line, since not all the points will be exactly on the line. Here bYc and bXc are the item difficulty parameter estimates for the common items in tests Y and X. If common examinees are used, the equation is

θY = α θX + β + e

where θY and θX are the ability estimates for an examinee taking tests Y and X, respectively. The regression estimates of the coefficients are

α = r (sYc / sXc)    and    β = b̄Yc - α b̄Xc
where r is the correlation coefficient between the estimates of the difficulty parameters for the common items, b̄Yc and b̄Xc are the respective means, and sYc and sXc are the respective standard deviations. In the common-examinee design, these values are replaced by corresponding values for the θ estimates.

The problem with the regression approach is that the condition of symmetry is not met. This is true because the coefficients for predicting bYc from bXc are different from those for predicting bXc from bYc and cannot be obtained by simply inverting the prediction equation

bYc = α bXc + β

That is, it does not follow that

bXc = (bYc - β) / α

Therefore, the regression approach is not a suitable procedure for determining the scaling constants.
Mean and Sigma Method. Since

bYci = α bXci + β

it follows that

b̄Yc = α b̄Xc + β    and    sYc = α sXc

Thus

α = sYc / sXc    and    β = b̄Yc - α b̄Xc

Moreover, since

bYc = α bXc + β

the transformation from bYc to bXc may be obtained as

bXc = (bYc - β) / α

Hence, the symmetry requirement is satisfied by the mean and sigma method (when using the common-examinee design, the means and standard deviations of the corresponding θ estimates are used to determine α and β). Once α and β are determined, the item parameter estimates for test X are placed on the same scale as test Y using the relationships

bX* = α bX + β    and    aX* = aX / α

where bX* and aX* are the difficulty and discrimination values of items in test X placed on the scale of test Y. The parameter estimates of the common items are averaged, since they will not be identical, as a result of estimation errors.

For the one-parameter model, the item difficulty estimates for the common items are related as

bYc = bXc + β

that is, α = 1. It follows that

b̄Yc = b̄Xc + β

and, hence,

β = b̄Yc - b̄Xc
Thus, the item difficulty estimates for test X are transformed by adding the difference in the mean difficulty levels of the common items.

Robust Mean and Sigma Method. In the mean and sigma method described above, no consideration was given to the fact that item parameters are estimated with varying degrees of accuracy (i.e., some difficulty estimates have larger standard errors than others). Linn et al. (1981) proposed the robust mean and sigma method to take into account the fact that the parameter estimates have different standard errors. Each pair of values (bYci, bXci) for common item i in tests Y and X is weighted by the inverse of the larger of the variances of the two estimates. Pairs with large variances receive low weights, and pairs with small variances receive high weights. The variance of the difficulty parameter estimate is obtained by first inverting the information matrix (see chapter 3) and taking the appropriate diagonal element. For the three-parameter model the information matrix is of dimension 3 × 3, while for the one-parameter model it is of dimension 1 × 1; that is, it has a single element.
The steps in carrying out the robust mean and sigma method are summarized below; a computational sketch follows the list.

1. For each pair (bYci, bXci), determine the weight wi as

wi = 1 / max[v(bYci), v(bXci)]

where v(bYci) and v(bXci) are the variances of the estimates of the common items.

2. Scale the weights:

wi' = wi / Σj wj    (j = 1, ..., k)

where k is the number of common items in tests X and Y.

3. Compute the weighted estimates:

bYci' = wi' bYci    and    bXci' = wi' bXci

4. Determine the means and standard deviations of the weighted item parameter estimates.

5. Determine α and β using the means and standard deviations of the weighted estimates.
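A minimal sketch of steps 1 through 5 follows. The difficulty estimates are the common-item values from Example 1 later in this chapter; the variances are hypothetical, since in practice they would come from inverting the information matrices.

    import numpy as np

    b_yc = np.array([1.65, 1.20, -0.80, -1.25, 2.50])   # bank (test Y) values
    b_xc = np.array([1.29, 0.75, -1.24, -1.72, 2.17])   # new calibration (test X)
    v_yc = np.array([0.02, 0.03, 0.02, 0.04, 0.05])     # hypothetical variances
    v_xc = np.array([0.03, 0.02, 0.03, 0.05, 0.06])

    w = 1.0 / np.maximum(v_yc, v_xc)        # step 1: robust weights
    w = w / w.sum()                         # step 2: scale the weights

    ym, xm = np.sum(w * b_yc), np.sum(w * b_xc)          # steps 3-4
    ys = np.sqrt(np.sum(w * (b_yc - ym) ** 2))
    xs = np.sqrt(np.sum(w * (b_xc - xm) ** 2))

    alpha = ys / xs                         # step 5: scaling constants
    beta = ym - alpha * xm
    print(round(alpha, 3), round(beta, 3))

    # rescale some (hypothetical) test X difficulties onto the test Y scale
    print(alpha * np.array([0.85, -1.88, 0.19]) + beta)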
Stocking and Lord (1983) have suggested that further improvement in determining α and β may be obtained if outliers are taken into account in the computation of the mean and standard deviation. The weights are made more robust by basing them on the perpendicular distances of the points from the line

bYc = α bXc + β

Starting with initial values for α and β, the process is repeated until the α and β values do not change. For details of this procedure, refer to Stocking and Lord (1983) or Hambleton and Swaminathan (1985).
Characteristic Curve Method. The mean and sigma method (and its robust version) capitalizes on the relationship that exists between the difficulty parameters and ignores the relationship that exists between the discrimination parameters in determining the scaling constants. Haebara (1980) and Stocking and Lord (1983) have proposed the "characteristic curve" method, which takes into account the information available from both the item difficulty and item discrimination parameters.

The true score τXa of an examinee with ability θa on the k common items in test X is

τXa = Σ(i = 1 to k) P(θa; bXci, aXci, cXci)

Similarly, the true score τYa of an examinee with the same ability θa on the k common items in test Y is

τYa = Σ(i = 1 to k) P(θa; bYci, aYci, cYci)

For the set of common items,

bYci = α bXci + β,    aYci = aXci / α,    and    cYci = cXci

The constants α and β are chosen to minimize the function F, where

F = (1/N) Σ(a = 1 to N) (τXa - τYa)²
and N is the number of examinees. The function F is a function of α and β and is an indication of the discrepancy between τXa and τYa. The procedure for determining α and β is iterative, and details are provided in Stocking and Lord (1983).

In using the anchor-test design, the number of anchor items and, more important, their characteristics play a key role in the quality of the linking. For example, if the anchor items are too easy for one group and too difficult for the other, the parameter estimates obtained in the two groups will be unstable, and the linking will be poor. It is important, therefore, that the common items be in an acceptable range of difficulty for the two groups. Empirical evidence suggests that best results are obtained if the common items are representative of the content of the two tests to be linked. In addition, it is important to ensure that the two groups of examinees are reasonably similar in their ability distributions, at least with respect to the common items. A rule of thumb for the number of anchor items is that the number should be approximately 20% to 25% of the number of items in the tests.
Other Linking and Equating Procedures

With the anchor-test design, concurrent calibration using the LOGIST computer program permits placing the item parameter estimates and ability parameter estimates on a common scale without the need for a separate linking and scaling step. (A similar analysis with the one-parameter model can be carried out with the RIDA computer program.) The procedure is as follows:

1. Treat the data as if (NX + NY) examinees have taken a test with (nX + nY + na) items, where na denotes the number of anchor items.

2. Treat the nY items to which the NX examinees did not respond as "not reached" items and code them as such. Similarly, code the nX items to which the NY examinees did not respond as "not reached."

3. Estimate the item and ability parameters.
This procedure is simple to implement. Currently, little information exists regarding the accuracy of this procedure; further investigation of this issue is needed.
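In terms of data layout, the concurrent-calibration setup amounts to a block-structured response matrix; a sketch follows (sizes arbitrary, responses simulated as placeholders, and the not-reached code depends on the calibration program used):

    import numpy as np

    rng = np.random.default_rng(7)
    N_x, N_y = 500, 450            # examinees taking tests X and Y
    n_x, n_y, n_a = 15, 15, 6      # unique items per test, plus anchor items

    NOT_REACHED = -9               # placeholder for the program's not-reached code
    resp = np.full((N_x + N_y, n_x + n_y + n_a), NOT_REACHED, dtype=int)

    # group X responds to its own items and to the anchor items
    resp[:N_x, :n_x] = rng.integers(0, 2, (N_x, n_x))
    resp[:N_x, n_x + n_y:] = rng.integers(0, 2, (N_x, n_a))
    # group Y responds to its own items and to the anchor items
    resp[N_x:, n_x:n_x + n_y] = rng.integers(0, 2, (N_y, n_y))
    resp[N_x:, n_x + n_y:] = rng.integers(0, 2, (N_y, n_a))
    # The NOT_REACHED blocks are ignored during estimation, so all item and
    # ability estimates come out on a single common scale.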
In addition to the "linking" procedures described above, item response theory methods may be used to (a) equate true scores on two tests, and (b) equate two tests using "observed score" distributions generated for given levels of θ. These procedures are described in Lord (1980) and Hambleton and Swaminathan (1985). The reader is referred to these sources for more details.

The steps in carrying out a linking are illustrated using two examples. In the first example, the linking procedure is illustrated in the context of developing an item bank. The second example deals with the problem of linking two tests. In both examples, the linking is carried out using an anchor-test design.
Example 1

Assume that a bank of test items that have been calibrated using the one-parameter model is available. The item difficulty estimates for the item bank are given in the Appendix. It is desirable to add to the existing bank a set of 15 new, uncalibrated items. To add these 15 items to the existing bank, we could use the anchor-test design with five anchor items chosen from the existing bank. Suppose that the b values for these five items are 1.65, 1.20, -0.80, -1.25, and 2.50. These anchor items were chosen carefully to match the content and, it was hoped, the difficulty levels of the 15 experimental items. Since the 15 items are untested, however, it is difficult to assess their difficulty levels a priori. This information could be obtained from a pilot test. In determining the scaling constants, the mean and sigma method is used for illustrative purposes because of its simplicity. The steps are as follows:

1. The 20-item test (15 experimental and 5 anchor items) is administered to an appropriate sample of examinees (200 in this example).

2. An appropriate IRT model is chosen; this must be the same as the model on which the existing item bank is based. Since, for illustration, we have assumed that the items in the bank fit a one-parameter model, a one-parameter model is fitted also to the 20-item test (with appropriate checks on model-data fit).

3. The mean difficulty level b̄Yc of the five anchor items (from the item bank, designated as test Y) is computed from their known item parameter values; here, b̄Yc = 0.66.
1~7
4. The :W·ilelll lest is calihraled. using (say) Ihe rom(lllier program BleAL. The mean of tilt' dirficullies hased on the 20 hem lest will be sel 10 lero in the estimlltioll process. The lIlean of the five '1Il~lor items that is pan of the 20·itel11 test is computed and designated as bxc • with a computed value of 0.25. 5. Since the ilem difficulties of the commoll items are related linearly according to
Beta is calculated 8S bYe - hxc- (Note that u = I because the model used is the one-parameter model.) In this example. p :::: 0.66 0.25 0041. 6. The item di fficulty estimates of the 15 experimental Hems are adjusled by adding (bYe hXc ) 0041 10 each difficulty estimate.
=
=
7. The com..!llon it~ms that are part of the experimental set are adjusted by adding (bYe "Xc) toeach item difficulty value. Sinre the adjusted values will be different from the values fur the common ilems in Ihe item bank. the adjusted difficulty volues ore average.\! with the difficulty values for the common items in the item bank. 8. The 15 experimental items an: on the same scale as the items in the item bank and are added to Ihe item hank. The eslimates for the common items 8re revised. These calculations are summarized in Table 9.1. The new items and their difficulty values have been added to the item bank (items 76 to 90) reported in the Appendix.
Example 2 In this ex.ample, two proficiency tests, each with 15 items, were administered to samples of New Mexico high school students during two consecutive spring terms. Unfortunately, none of the test items came from the item bank in the Appendix. and. therefore, parameter estimates for all of the test items had to be obtained. It was desired to place the items in the test that was administered first on the same scale as the items in the test administered second. Hence. the test administered first is labeled as test X and the second as lest Y. An anchor lest design was used for the linking. The anchor test, with six. items. was conslructed to be representative in content of both lest X
us
"llNI)AMENTALS OF ITEM RESPONSE TIIEORY
TABLE 9.1 Linking Procedure for Placing E)(perimenlal Items (Texl X) on the SlIme Scale as hems in all Hem Rank nest Y)"
hx
Test Y Difficultv Cnmmon IterlU bye
1.29 0.75 -1.24 -1.72 2.17
1.65 1.20 -0.80 -1.25 2.50
Test X Di//iculty Item
1 2
3 4 5 6 7 8 9 10 II 12 13 14 15 16 t7 18
19 20
Scaled Test X Oi//it'ulty
Staled Test X lJi//irulty
hx -t (hyc - h x;) , /Revi,Ud)b
0.85 -1.88 -2.02 0.19 0.22 -1.116 -1.32 -1.10 0.74 0.61 0.50 -0.80 1.70 1.37 1.55
1.70 I.Hi -0.83 -Ul 2.58
1.67 1.18 -0.82 -1.28 2.54
1.26 -1.47 -1.61 0.60 0.63 -1.4.'1 -0.91 -0.69
1.26 --1.47 -1.6t 0.60 0.63 -1.4.'1
-0.91 0.69 1.15 1.02 0.91 ·-0.39 2.11 1.78 1.96
1.1.'1
1.02 0.91 -0.39 2.11 1.78 1.96
.-~-~~-
h xc
= 0.25
hyc
= 0.66
IlYe -
h Xc
--
..
0.41
Common item~ ace in bold. b. Common item difficulties for tests X And Y have been Rve.a!!ed.
ft.
and test Y. The tests were administered to 500 examinees on each occasion. In choosing an item response model, based on pilot studies, it was decided to use a three-parameter model with a fixed c value of 0.2. The item and ability parameters were estimated using the LOGIST computer program; in the estimation phase, the mean and standard deviation of e were set to be 0 and I, respectively.
139
In carrying out the linking, the mean and sigma method was used. primarily for pedagogical purposes. The mhllst mean al\(I sigma or the characteristic curve methods are mOle appropriate hut are not used here hecallse of the nature of the comp"tations involved. The steps ill carrying out the linking are as follows: I. Compute the mean and standard deviation of the difficulty estimates for the common items embedded in tests X and Y. 2. Determine the constants (l and p (since the three-parameter model was used). 3. Scale the difficulty estimates for test X by multiplying them by (l and adding p. 4. Average the difficulty values for the common items. S. Scale the discrimination parameter estimates for test X by dividing them by n.
6. Average the discrimination parameter values for the common items. The difficulty and discrimination parameter estimates fortest X are now on the same scale as those in test Y. The calculations are summarized in Tables 9.2 and 9.3. The constants a and f} can be used to place the ability values of the examinees taking tests X and Yon a common scale. Since
9y =
aa x
+
~
:::: 0.95 Ox - O.IR,
the mean ability or the examinees who took test X may be converted to a mean on test Y, had they taken it. enahling a comparison of the mean ahilities of the two groups even though they took different tests. For the group who took test X. the mean 9 value was set to zero in the estimation phase. Converting this mean to a mean on the scale of tcst Y. we obtain
(ly = 0.95(0) - 0.18
-0.18
This implies that the difference in the mean abilities for the Iwo groups tak ing tests X and Y is -0.18; the group tuking Icsl X had a lower mean ahility than the grollp taking test Y. This inrormation could be used for academic or program evaluation purposes.
FUN()AMENTALS OF ITEM IUiSPONSE TIIEORY
140
TABtE 9.2 Determination of Scaling Constants amI SCliled Difficulty for Tests X and Y·
- - - - - - - - - - - - - - - - - - - _ . _.._--_._-----
Itl'n! I
2 3 4 5 6 7 8 9
10 II 12 13 14 15 16 17
18 19 20 21
Tnt X Diffintlt.l'
Tnt Y Difficliity
1.20 -1.75 -0.110 ··1.211
1.20 1.75 -0.80 -1.28 1.35 1.40 1.20 0.50 0.72 -1.95 -2.20 2.40 1.80 1.45 0.80 1.10 1.85 2.30 -l.50 -1.110 0.40
1.35
1.40 1.20 0.50 0.72 -1.95 -2.20 2.40 1.80 1.45
UO
bYe = 0.39 1.56
a. Common ilems are in bold. b. Common hems .re averaged;
0.80 1.03 1.8:1 2.36
2.10 2.75 -1.40 -1.65 0.60 1.81 2.20 2.70 1.86 -0.90 -1.10 -2.30 0.511 0.92 0.88 1.92 2.10 2.52 1.60 -1.20
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
SYc
Scalrd f)IJfi.~'lt). All Itml.<
.cal~d
-UI
-1.78 O.4()
1.54
1.91 2.38 l.59 -1.04 -1.23 -2.37 0.31 0.69 0.66 1.64
\,112 2.21 1.34
1.32
hxc '" 0.60 Sxc
(1
difficulty value. for leo! X = o."x +
=
0.95
11.= -0.18
1.65
p.
141 TAULE 9.] Discrimination Values for Tests X alill ya
II,''''
I
2 .1
4 :'I 6
7 R 9 10 II 12 IJ
14 1:'1 16 11
IR 19 20 21
'/".fI y /)isniml'llIt;on
S,'",,'" IJ,.I
Tal X
A/lII~m.vh
ni.v,.rimillati""
1.01 121
1.02 1.21 0.90 0.72 1.2:'1 lAO 1.12 0.75 0.92 0.62 0.52 1.98 1.90 1.62 1.01 0.95 1.23 2.00 0.611 0.45 0.70
().90
0.72 1.25 1.40 1.12 n.75 0.92 0.62 0.52 1.98 1.90 1.62 1.01
1.15
0.95 1.22
L116 0.55 0.40 0.65
1.911 O.til () 44 lI.til)
!.6R 1.95
0.90
22
1.60
11
I.X5
24
I .I}() I.ti2
1.On
OJ< I n.62 0.40
n.ll:'i
25 26 27 2X
lSi 1.72
0.65 0.42 0.67 0.R4 11.79 1.29 IAl 1.111
l.l2 042
0.44
OM
29 .10 11
{),l{O
075 1.2.'
.12 .1.1 '4 Vi 10
.._._.. _--------_._ .. - - - - - - - -
- - - - - - - - - - - - _ .... a ("ommnn items me in hold h Common itcUl'l: are a"('m~(~(I; scaled
1 70
I.IN
__ .._.__
d.~crimio,;uiun \tt1hH.~s
for
I{"~'
a= 0,95
..
X :'::
_-------
flllX.
142
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
Summary Classical methods for equating have several shortcomings; most important, the condition of equity usually will not be met when using classical methods. Itcm response theory methods ohviate the need for equating because of the property of invariance of item and ability parameters. Because of the scaling that is needed 10 'eliminate the indeterminacy in item response models, item and ability parameters will be invariant only up to a linear transformation; that is. the item and ability parameters of the same items and same examinees w ill be related linearly in two groups. Oncc the linear relationship is identified, item parameter estimates and ability parameter estimates may be placed on a common scale. This procedure, known as linking or scaling. may be completed using several designs. The most important design is the anchor test design. where two tests containing a common set of items are administered to two different groups of examinees. Using the common items and one of several methods. the coefficients of the linear transformation relating the item purameters for the two tests !.:an be determined. With knowledge of the linear transformation, the item and ability parameter estimates may be placed on a common scale. An excellent review of various designs for linking items to a common scale is provided by Vale (1986).
Exercises ror Chapter 9 I. In DIF studies, the slime lest is administered to Iwo dirferenl groups anti
the item parameters are estimaled separately. Bdore comparing the item parameters ror the two groups, Ihey mUSI he placed on the same scale. Explain how you would ensure that the ilem parameters are on a common scale. 2. Suppose that in an equating study two different tests are given to two dirrerent groups or examinees, with a common suhset or examinees taking both tests. Explain the procedure you would use to place the item and ability parameter estimates for the two tests and Ihe two groups on the same scale. 3. In Example I or chapter 9, it was assumed Ihat Ihe one·parameter model filled t he data. a. Detennine the scaling conslants ror placing the experimental items on the Same scale as the items in the bank, assuming that a two-parameter model fits the data.
'
....
'.
141
1'nt Scarl' EqlltltinK
h, Place the h values of the common items in Ihe l~xf1t'lilllentaltcst on the saine scale as the b values of the common items in the hank, c. !low similar are the difficulty values of the cummon items for the oneand two-parameter models? Carry oul this comparison by IJlolting the !'Icaled difficulty values for the one- and two-parameter models against the "true" ilem bank values. 4. Two tests, A and B, wilh 10 common items were administered to two groups of examinees, and a three-parameter model was fitted to the data. The means and standard deviations for the b values of the common items are given in Table 9.4.
TABLE 9.4 Tnt A
Mean SD
3.5 1.8
4.2 2.2
The difficulty and discrimination values for an item in test Bare -1.4 and 0.9, respectively. Place these values on the same scale as test A.
Answers to Exercises for Chapter 9 I. Standardize the item difficulty parameter estimates. 2. Since a common set of examinees have laken both tests,their abilities must be the same. Because of standardization during the estimation phase. however, the I) values will be related linearly according to
The means and SDs of the common 9 values are used to determine a and Ii, as indicated for the anchor item equating procedure. With the relationship established. the abilities of examinees taking test Y and the item difficulties for test Y can be mapped onto the scale defined for test X. The item discrimination indices for test Yare mapped onto the test X scale using the transformation
ax
Oy
=
a
144
FIINDAMENTALS OF ITEM RESPONSE THEORY
J. 1l.IX=O.97.P=0.42. h. See Tahle 9.5.
TABLE 9.S --~~---------
-
....
Comman Ilems , ,
Scaled common items (2P): Scaled common items (I P): Common items from bank:
\.67 1.70 1.65
2
3
4
.5
1.15
-0.78 -0.83 -0.110
-1.25 -1.31 ·-1.25
2.52 2.58 2.S0
1.16
1.20
c. lne estimates of item difficulty for the one- and two-parameter models are fairly similar, bUI the estimates for the two-parameter model are closer to the values in the bank. 4. The scaling constants for placing items in lest B on the same scale as test A (let X :::: test Band Y = lest A) are IX = 0.82 and P = 0.06. The scaled item difficulty values are 1.09 and 1.10.
1() --------------- ..---------------------
Computerized Adaptive Testing
Background In previous chapters, it was shown that a test provides the most precise measurement of an examinee's ability when the difficulty of the test is malched to the ability level of the examinee. Any single lest administered to a group of euminees cannot provide the same precision of measurement for every examinee. The ideal testing situation would be to give every examinee a test that is "tailored," or adapted, 10 the examinee's ability level. The earliest application of tailored or adaptive testing was in the work of Binet on intelligence testing in 1908 (Weiss, 1985). Little additional work on adaptive testing took place. however, until Fred Lord at the Educational Testing Service began a comprehensive research program in the late 1960s (for a review of his work, see Lord, 1980). Lord pursued adaptive testing because he felt fixed-length tests were inefficient for most examinees, but especially for low- and high-ability examinees. Lord felt that tests could be shortened without any loss of measurement precision if the test items administered to each examinee were chosen so as to provide maximum information about the examinee's ability. In theory, each examinee would be administered a unique set of items. Adaptive testing became feasible only with the advent of computers. The computer's immense power to store test information (e.g., test items and their indices) and for producing, administering. and scoring tests has enabled the potential of adaptive testing to be fully realized (Bunderson, Inouye. & Olsen, 1989; Wainer, 1990). Since the late 1960s a substantial amount of research has been supported by the U.S. Armed Services, the U.S. Office of Personnel Management, and other federal agencies; special conferences have been held, and hundreds of papers 145
146
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
on adaptive testing have bcen puhlished (sec, for l~XlIrnplc, Wainer, 1990; Weiss, 1983). In computerized adaptive testing (CAT), the sCllllcnce of items administered tn an eXllmin('e (kpencls on the cltllmin('c's performann' on earlier items in the test. Based on the examinee's prior performance. items that arc maximally informative about the examinee's ability level are administered. In this way, tests may be shonened without any loss of measurement precision. High-ability examinees do not need to be administered relatively easy items, and low-ability examinees do not need to be administered the most difficult items, because such items provide little or no information about the examinee's ability. Afler an examinee responds to a set of lest items (sometimes only two or three items) presented at a computer terminal, an initial ability estimate for the examinee is obtained. The computer is programmed to select the next set of administered items from the available item bank that will contribute the most information about the examinee's ability, based on the initial estimate. Details of how test items are selected and ability estimates are obtained are provided in the following sections. The administration of items to the examinee continues unlil some specified number of items is administered or a desired level of measure· ment precision (i.e .• standard error) of the ability estimate is attained.
Promise or IRT Item response models are particularly suitahle for adaptive testing because it is possibie to obtain ability estimates that are independent of the particular set of test items administered. In fact. adaptive testing would not be feasible without item response theory. Even though each examinee receives a different set of items, differing in difficulty. item response theory provides a framework for comparing the ability esti· mates of different examinees. In applying item response theory to measurement problems, as was mentioned in chapter 2. lit common assumption is that one dominant factor or ability accounts for item performance. This assumption is mllde. for example, in nearly all of the current applications of adaptive testing. The IRT model most appropriate in adaptive testing is the three-parameter logistic model (Green, Bock. Humphreys. Linn, & Reckase, 1984; Lord. 1980; Weiss. 1983). The main reason for choosing
147
ComI'1I1('r;uti Ada"t;I'.- Tt'stinR
the threc-paranlt'tl'r model is that it generally fils 1ll11Itipk-t.:hoi<.;c item dala helter than the one- or two-jlarameter models. The item information fllllctioll p!;IYs Ill'l itil'al role in mlaptivc testing. Items that <:<JIltrihl1le mallimally Itl tilt' precision of measurement (see dlllplt'rs 6 IIlld 7) lire "eircled fot adminislration. 1It'IIls providing the Illost information are, in general, items on whiLh the C'xaminee has an (approximately) 50% to 60% chance of answering correctly.
Uasie Approach In adaptive testing within an IRT frlll1lework, an allempt is made to Illllteh the difficulties of lest items to the ability level of the examinee being measured. To match test items to ability levels requires a large pool of items whose statistical characterislics are known. so Ihal suitable items may be drawn (Millman & Arter. 1984). According to Lord (1980). the computer must he programmed to accomplish the following in order 10 tailor a test to an examinee. I. Predict from the ellaminee's previous responses how the eJl(alllinee would respond to various test items not yet administered.
2. Make effectivc use of this knowlc(lge istered nex t.
10
seled the leSI item to be admin-
J. Assign at the end of testing a numerical scorc thai relJrest'nts the ahility of Ihe eJl(aminee tested.
The advantages of computerized adaptive testing, ill addition to shortening tests without loss of measurement precision, are numerous. Some of these advantages are • enhanced test security • testing on demand • no need for answer sheels • test pace Ihal is keyed to the individual • immediate test scoring and reporting • the minimization of test frustration for some eJl(8111inees • greater test standardization • easy removal of "defective items" from thc item bank when they are identified
1411
HINDAMENTALS OF ITEM RESPONSE TIIEOI{Y
• greater nexibility in Ihe choice of item formals
• reductioll in test supervision time
Adaptive testing research to dale has been focused in six areas: choke of IRT model, item bank, starting point for testing, selectioll of suhsequent test items, scoring/ability estimation, and choice of method for deciding when to terminate the test administration. Refer to Itimhlcton, ZUlli, and Pieters (1991) for a discussion of rcsl~arch in these six areas. A brief discussion of two of these-item selection and ahility estimat ion-follows. Two procedures are used currently for item selection in an adaptive mode (Kingsbury & Zara, 1989). The first, maximum information (Weiss, 1982). involves the selection of an item that provides maximum information (Le.• minimizes the standard error) at the examinee's ability level. To avoid the same items being selected time and time again (items with the highest levels of discriminating power, in general, provide the most information) and thereby (possihly) affecting test security and, subsequently, test val idity, Green et al. (1984) have suggested that items be selected on a random basis from among items that provide the greatest information at the ability level of interest. Thus, for practical reasons, slightly less than optimal items usually are administered to examinees. The second method, flllyesiofl item selection (Owen. 1975). involves the selection of the test item that minimizes the variance of the posterior distribution of the examinee's ability (see chapter ..'\). As more test items are administered, the posterior distribution becomes more concentrated, reflecting the precision with which the examince's ability is estinHlted. Bayesian methods require specification or II prior helief ahout the examinee's ability; hence, the success of the mcthod depends ill part Oil the appropriateness of the prior distrihution. Thc impact of the prior distribution diminishes as more items arc administered. An important advantage of computerized adaptive testing is that test scoring/ability estimation is carried out while the test is heing administered; thus, feedback of resulls to examinees may be rrovide(1 at the completion of testing. In obtaining ahility estimates, the two estimatioll procedures commonly used are maximum likelihood and Bayesian (see Weiss. IQR2. and chapter 3. this volume). Maximum likelihood estimation poses problems when the numher of test items is smilll. Bayesian procedures overcome the problems encountered with maximullI Iikeli-
149
('omI'IIII'li:cd A(/III'III'r '{('Iling
hood procedures but may produce biased estimatcs or ability if inappro"rillte prior distrilmtions arc choscn.
This example. which highl ights the features of CAT ability estimation and item selection, was prepared by Reshelar (1990). For the purposes or the example, Reshetar created a bank or I J test items. contained in Table 10.1. TABLE 10.1 IUm Paraml'/('r I/('m
b
a
c
1 2 3 4 5 6 7 8 9 10 II t2 IJ
0.09 0.47 -0.55 1.01 -1.88 -0.82 1.77 1.92 0.69 -0.28 1.47 0.23 1.21
1.11 1.21 1.78 1.39 1.22 1.52 1.49 0.71 1.41 0.98 1.59 0.72 0.58
0.22 0.24 0.22 0.08 0.07 0.09 0.02 0.19 0.1) 0.01 0.04 0,02
0.17
Source: Prom Com/,utu IIdap'/v~ Tut/nll: O,vrt"pmnll and Application (p. 9) by R. Re.hetar. 1990. Amhcl'IIl: Uni,crMity or MIIUachulehR. School of educalion.
In practice. an item bank. would consist of hundreds. and possibly thousands. of test items. A setluence of events that might occur in computerized adaptive testing is as follows: I. hem 3 is selecled; Ihis ilem is of average difficulty and high discrimination. Suppose Ihe cuminee answers Ilem :\ correctly. A maximum likelihood estimale of abilily cannol be oblained unlil Ihe examinee hilS answered 8t least one item correctly and one ih:m incorreclly. (Zero or perfect scores correspond 10 _00 and +00 ability eSlimah:s, respeclively.)
150
FUNDAMENTALS OF ITEM RESI'ONSE TIIEORY
TABLE 10.2 Maximum-Likelihood Ability Estimates ami Standard Error for Olle Examinee al the End of E'lCh CAT Stage
Stall I'
fIrm Nllmfll'f'
2
12
3 4 5 6
1 4 II 9
0 1 0 I
7 8 9
2 I R
I 1
1/011
Rnlwnu
"
"
"
(I
1(0)
.... E(O)·
1.0]
0.91 2.35 3.55 4.61
1.02 0.65 0.55 0.47 0.45 0.44 0.43
~
0
1.46 1.11 1.24 1.29 1.31 1.25
5.05 5.27 5.47
•. SE(8) = II ~ /(8)
"
, , i
2. Another item i~ selected. Ilem 12 is chosen because it is more dimcull Ihan the previously administered item. Suppose the examinee correctly answers Item 12. Again, a maximum likelihood estimate of ability cannot be obtained. 3. Item 7 is chosen next; it is more difficult than hems 3 and 12. Suppose the examinee answers this item incorrectly. The examinee's item response vector for the Ihree items may be represented as (I. I, 0). Through use of the maximum likelihood procedure for estimating ahilily with known item parameters. an ability estimate can be obtained (8" 1.03). The test information for the three items at this abiliiY level is I(e" = 1.03) 0.97. and the corresponding standard error is SE(O) = 1.02. These values appear in Table 10.2. 4. Next. the information provided by each of the remaining itcms in the bank is computed at e = 1.03. These values are reported in Tahle 10.3. ItCIlI 4 is selected next because it provide!! the most infnnnation at 9 = 1.03. Suppose that Item 4 is administered and then is answered correctly by the examinee. A new ability estimate is ~}btail1cd for the response pllllern (I. I, 0, I). The new ability est imale is 9 '" 1.46. 5. The item informatioll at e '" 1.46 for the rcmaining items is cmnputcd, The process described above for administering an item, estimating ability, determining the information provided hy unadministered items. and choosing an item to be administered next based on the infonnation it provides is continued. To continue this procedUre. Item II is chosen next. following by Item 9. then by Items 2, I. and finally. 8. The procedure stops when the standard error of the examinee's abil ity estimate stops decreasing
TABLE 10.3 lnfonnation Provided by Unadministered Items at Each CAT Stage
....
4
5
Infor17Ultion 6
1.192
0.010
0.319
0.292
1.24
0.:';49
8
1.29
0.232
9
10
l\
Stag~
9
4
1.03
0.034
0.547
5
1.46
0.179
6
1.13
7
Provid~d
by
ft~m
13
8
9
10
11
0.051
0.143
1.008
0.251
l.UlI
0.166
0.004
0.017
0.205
0.579
0.136
1.683
0.175
0.494
0.008
0.039
0.159
0.917
0.219
0.170
0.433
0.006
0.029
0.175
0.187
0.173
0.006
0.026
0.182
0.175
0.174
1.31
0.005
0.024
0.186
0.168
0.174
1.25
0.006
0.028
0.184
0.173
2
3
7
12
/
152
FllNDAMENTALS OF ITEM RESPONSE THEORY
by a specified amoun!. As can be seen from Table 10.2. the decrease in the standard error when Item 8 is administered in stage 9 compared with the standard error at stage 8 is 0.0 I. The proced~re stops at this poinl. The estimate of the examinee's ability is taken as e :: 1.25.
Weiss and Kingsbury (1984) described several other examples of application or CAT to educational testing problems. ••
Exercise for Chapter 10 For the example in the chapter. suppose that an examinee was administered Items 3, 12, and 7 and responded (I, I. 0). Item 4 was chosen to be administered next, and the examinee answered it incorrectly. The maximum likelihood estimate of ability was computed to be 0.45. Compute the infonnation function for the remaining items at this e vallie. Which item sbould be administered to the examinee nexfl
Answer to Exercise for Chapter 10 The item information values at 9 = 0.45 are given in Table 10.4.
TARLE 10.4 Ilem Inrormalion
I 0.50
256 0.66 0.0.1 0.19
K 0.1 R
9
10
1.()6
OAK
II I.l 0.4.') O. 16
---------.--hem 9 has the highest infonnation at e =0.45. It is administered nexl.
11 Future Directions of ltetTI Response Theory
We hope that Dr. Testmaker and other applied measurement specialists will find the contents of this book helpful. Many important concepts. models. features, and applications were introduced. and many examples were provided; this material should prepare our readers for the next steps in the learning process. No book, hy itself, can prepare measurement specitdists to use IRT models successfully in their work. Applied work with various data sets and IRT computer programs is an essential component of training in IRT. The practitioner must be ready to handle the many problems that arise in practice. Although IRT provides solutions to many testing prohlems that previously were unsolved, it is not a magic wand that can be waved to overcome such deficiencies as poorly written test items and poor test designs. In the hands of careful test developers, however, IRT models and methods can become powerful tools in the design and construction of sound educational and psychological instruments, and in reporting and interpreting test results. Research on JRT models and their applications is being conducted at a phenomenal rate (see Thissen & Steinberg, 1986, for a taxonomy of models). Entire issues of several journals huve been devoted to developments in IRT. For the future, two directions for research appear to be especially important: polytomous unidimensional response models and both dichotomous and polytomous multidimensional response models. Research in hoth directions is well underway (Bock, 1972; Masters & Wright, 1984; Samejima, 1969, 1972, 1973, 1974). With the growing interest in "authentic measurement," special attention must be given to I RT models that can handle polytomolls scoring, since authentic measurement is linked to performance testing and to nondichotomous scoring of examinee performance. I'D
154
IllJNI)AMI~Nl'ALS
OF ITliM RESPONSE TIIEORY
Multidimensional IRT models were introduced originally by Lord and Novick (1968) and Samejima (1974) and, more recently. by Embretson (1984) and McDonald (1989). Multidimensional mmJcls offer the prospect of better filling current test data and providing multidimensional representations of both items and examinee abilities. It remains to be seen whether parameters for these multidimensional models call be estimated properly and whether multidimensional representations of items and examinees are useful to practitioners. Goldstein and Wood (1989) have argued for more IRT model building in the future but feel that more attention should be given to placing IRT models within an explicit linear modeling framework. Advantages, aQcording to Goldstein and Wood, include model parameters that are silnpler to understand, easier to estimate, and that have well-known shltistical properties. 'In addition to the important IRT applications addressed in earlier chapters, three others are likely to draw special attention from educators and psychologists in the coming years. First, large-scale state, national, and international assessments are attracting considerable attention and will continue to do so for the foreseeable future. Item response models are being used at the all-important reporting stages in these assessments. It will be interesting to see what technical controversies arise from this type of application. One feature that plays an important role in reporting is the ICC. Are ICCs invariant to the nature and amounts of instruction? The assumption is that ICCs are invariant, but substantially more research is needed to establish this point. Second. cognitive psychologists such as Embretson (1984) arc interested in using IRT models to link examinees' task performances fo their abilities through complex models that attempt to estimate parameters for the cognitive components needed to complete the fasks. This line of research is also consistent with Goldstein and Wood's (1989) goal of seeking more meaningful psychological models that help explain examinee test perrormance. Much or the IRT research to date has emphasi7.ed the use of mathematical models that provide little in the way of psychological interpretations of examinee item lind test performance. Third, educators and psychologists are making the argument for using test scores to do more than simply rank order examinees on their abilities or determine whether they have met n particular achievement level or standard. Diagnostic inrormation is becoming increasingly important to users of test scores. Inappropriateness mea.mrement developed by M. Levine and F. Drasgow (see, for example, Drasgow
f'u/llrl'
Oi't'Clitl/u ollll'm R",'IWIU(! l'h"(}I'Y
155
el aI., 1987), which incorporales IRT models, provides a framework for idenlifying aherranl responses of examinees and special groups of examinees on individual items and groups of items. Such information may he helpful in successful diagnoslic work. Gretl1er use of IRT models in providing diagnostic inrormation is anlicipafed in the coming years.
..
Appendix A
Classical and IRT Parameter Estimates for the New Mexico State Proficiency Exam
TABLE A.I Classical and IRT Item ParHtncler ESlimates for IIIC Olle-, Two-, lind Three-Parameter Models --~-----~
1'1.'111 P"raml.'la E.,lilllille.' ..."-.-'--
~
llelll
1 2 :\
4 5 6
P
0.45 0.70 0.65 0.77 0.75
r
a
c
----.~~~----~--
(121
LOU
-U-fn
·0.75 -IA5 -1.34
{),(,O
{),1)2
lUll
1 . III
--2,25 -1.25
11.14 0,66
·1,69 -0.97
IU7 n,M
0.17 0.1" n,17 n,17 n,17
O,W
1.11 -II.')()
O,M 0.79 0.67 O.S] 0,62
0.17 0,17 0,17 1l.1'I 0.17
14 IS 16 17 18 19 20
0.62 0.80 0.65 0.49 0.64
1:'i6
JP I>
0.22
0.61 0.59 055 0.73 0.38
D
1I
0.45 0,50 n.20 0.:\7
0,35 0.29 0.32
II 12
"
b
OAI
0.]9 0.76 0.60 0.78 0.55
7 R 9 10
-~.
--.-.~"---. 0.61 0.1l2
O.~R
0.51
0.R4 ()QI
,-0.27
0.71 1.67 -O,S6 -1.70 -0,32
!l47
-U6 O.IQ
0.37 0,21 0,30 0.44 0.54
-0.53 ··0,4 7 n.25 I.IR O.5X
-U.S5 -(l,R I -0 ..10 -0.97 O,4Q
0.56 (1,29 0,4.1 O,R2 0,75
·n.14 -0. II 0,22 ··(J,67 0,76
O.6S IU7 1),56 O.RR UO
017 0.17 n,17 0.17 n.IS
0.54 0.34 0.45
-0.51\ -1.67 -0.74 0.04 -0.7n
-0.45
1.04 0.67 0,5<1 0,6R 0.65
-0,04
I.:'i.l 0.06 O.M
0,21 0.17 0.17 0,22 0.17
n.27
OAO
OA3 OAO
n.52 ' 1.3(, -11.52 -1.51
~
-1.5] -0.78
O,OJ n.M
0.75 (),52
050
. n,o'J
,1..12
·(1.32 0.51 - 0.31
1.2J O.7:l
'1,/ \57
Ap/)I'I/(UX A
TABU': A.I continued . _ . _ - - _.....
lP
-
/tt'm
P
r
b
b
a
b
a
('
21 22 23 24 25
0.69 0.67 0.46 0.74
~.99
-1.07 ~.78
0.18 -1.26
0.20 -0.89 -0.48
0.53 0.68 0.50 1.15 0.80
~.68
~.85
0.61
0.34 0.41 0.35 0.52 0.47
4).46 0.63 -0.64 -0.12
0.59 0.74 0.74 1.25 0.98
0.17 0.10 0.17 0.17 0.17
26 27 2R 19 3()
0.34 0.70 0.6\ 0.73 0.74
0.30 0.50 OA4 0.35 0.44
0.78 -1.05 -0.56 -1.23 -1.28
-1.24 -1.0:1
0.44 0.99 0.71 O.5R 0.85
1.18 -0.52 () 12 -H.91 --O.R I
0.65 1.08 0.91 0.62 0.R6
0.12 0.17 0.17 0.17 0.17
."32
0.57 0.74 0.44 0.60 02R
(U2 O.3R
-0.35 -1.20 0.29 -0.52 1.14
0.41 1.17
OA6 ()H~
O.OR 0.'10 O.7R
OAt. 1.'17
(I.5R 0.6R 0.R7 UO 1'<)4
0.17 0.17 0.19 0.20
-0.99 1.11 -1.4 \ -0.50 03J
-n.R2 1.46 '.19 n.51 n.26
0.:1:1 -0.51 0.17 -0.12 1.24
IU ·11.45 11.1 R -0.19 1.53
~).6R
-0.61 --1.16
.H .14 35
30
n JR
'"
40 41 41 41 44 4.'\
46 47 4R 49 :')0
51
52 53 :')4
55
OYi
OA5 0.29
O.M 1129 0.77 0.(,0 043
0.46 n.27 (US
n.4:1 n.()(1 O.4() (1.52 0.26
OAI
O.M (US 079 0.76 n.57 OA9 (U4 O,,'\() 0.74 OAR
(UK
OAR
0.46 (),l7 n.2l 0.2R 0.44 OAO 0.39 (U6 (U()
IU5
n.n 0.39 IUJ 0.61
~.56
-J.J4 -1.57 -1.:17 4).34 0.04 n.R 1 4).114 -1.26 0.12
0.97 -O.RO -~1.50
o.n
-un -1.2R ~).43
051 n.7S
(UI.l
OAt.
lAO
II.R3
· n..'III
nAI II"'L 051
1.54 1111 · not)
OM
(1.111
O.5K
UO
0.94 n.63 0.69
n.IS
n.l7 0.10 0.17 0.17 0.17
O.fi2
O."R
0.99
(U.')
· (I.ot)
().')]
0:')6
(1.70 0.44 1.46
I.H 0.4 , 1.14
0.25
(U2 OA5 O.D n.74 0.79 OM OA!
- 11.2:') {1.R9 ·I.OR -1.00 O.!O
0.R4 O.7R O.RO 0.51
0.17 n.17 0.17 0.17 0.17
0.94 1.06 I.Ol 0.01 1.41
0.20 0.14 0.23 0.17 0.08
(1.115
053
(un
11.59
·OA! -1.32 0.0:')
n.59
(1..'\7 1.01 0.5]
(J.55
·0.94
1.21
n.n
O.oK
n.17 0.17 0.17 0.15
158
FUNDAMENTALS OF ITEM RESPONSE THEORY
TAIILE A.I conlinut!d
__~ ______ JIl-I ltfn~fa,.,!"!.e!e!~ E.I!inlllleJ Cla.uiml p r
_ _ _ _ _• _ _ _ _ _
" __ c
IP
2P
3P
h
b
a
b
(I
('
0.47
0.34 0.32 0.43 0.26 0.35
-0.03 -0.71 -0.02 -1.1\8 0.15
-0.03 -0.82 -{J.03 -2.18 0.18
0.48 0.49 0.66 0.48 0.49
0.43 -0.37 0.35 1.82 0.61
0.67 O:5ti 0.85 0.52 0.70
0.17 0.17 0.17 0.17 0.17
61 62 63 64 65
0.71 0.73 0.79 0.63 0.59
0.35 0.38 0.30 0.23 0.47
-1.09 -1.21 -1.57 -0.63 -0.43
-1.13 -1.15 -1.69 -0.97 -0.38
0.56 0.64 0.53 0.33 0.77
0.77 -0.85 -1.37 --0.34 -0.05
0.62 0.68 0.56 0.40 0.89
0.17 0.17 0.17 0.17 0.17
66 67 68 69 70
0.77 0.54 0.66 0.72 0.53
0.16 0.52 0.41 0.37 0.21
-·1.45 -0.20 ·-0.80 -1.12 -0.14
-2.85 -0.17 -0.75 -1.10 {J.26
0.26 0.90 0.65 0.61 0.26
1.97 0.17 ·-0.40 0.46
0.31 1.22 0.74 0.66 0.35
0.17 O. I 7 0.17 0.17 0.17
71
0.78 0.78 0.64 0.60 0.46
0.41 0.37 0.53 0.28 0.23
.. 1.49
-1.53 -0.68 -0.48 0.17
.. 1.21 -1.34 -0.53 -0.62 0.3\
0.83 0.72 0.98 0.41 0.30
0.911 -1.06 0.23 -0.1l7 0.91
0.114 0.76 1.14 0.52 1.41
0.17 0.17 0.17 0.17 0.17
111'/1/
56 57 58 59 60
72
73 74 75
0.51 0.64 0.50
<-O.l!:tl
76 77 78 79 80
1.26 -1.47 -1.61 0.60 0.63
81 82 83 84 85
-1.45 -0.91 -0.69 Ll5 1.02
86 87 88 89 90
0.91 -0.39 2.11 1.78 1.96
-1-1.77
Appendix II Sources for IRT Computer Programs
Program
SQUIl:t'
BICAL, BIGSCALE
Dr. Benjamin Wright University of Chicago Slatislicallaboratory Department of Education 58:15 Kimllark Ave. Chicago, IL 60637 U.S.A.
MICROSCALE
Mediu Interactive Technologies
21 CharleH Street W('istpurt, CT 06880 U.S.A. PML
Dr. lan·Eric Gustafsson University of (llUellorg Institute of Educatitm Flick S·431 20 Molndal SWEDEN
RASCAL. ASCAL
Asseument Systems Corporation 2233 Univeuily Avenue Suite 440 St. Paul, MN 55114 U.S.A.
RillA
Dr. Cees Glas Nlltional Institute for Ihlucaliollal Measurement P.O. Box 1034 680 I MG Arnhem The Netherlllnds
159
160
FIJNDAMENTALS OF ITEM
IU~.')rONSE
TUEORY
SmtiH'
LOG 1ST
Educational Testing Service Rosedale Road Princeton. NJ 01l'i41
U.S.A. ,
TULOG. MUl:nLOG
Scienlific Sor,ware, Inc. I 36Q Neil1:el ROlld Mnoresville, IN 461511 U.S.A.
NOBARM
Dr. Colin Fraser Cenlre ror Behavioral Studies University or New England Armidale. N.S.W. AUSTRALIA 2351
MIRTE
Dr. Mark Rechse American College Testing Program P.O. Bolt 1611 Iowa City. IA S2243 U.S.A.
<
...
,\
References
American Educalional Re~earch A!'ISocialion, American Psycholngical Associalion, &: N.uional Council on Measurement in Educali()n (198~). Sltmdards for t'dllmlinnal am/I'sydwlogical tuling. Washinglon, DC: American Psychological Association. Andersen. E. B. (1972). The numerical IIOlulion of II set of conditional eSlimation equlltions. }mmliJl of tht' Royal Slati.ttiml Sodl!ly, Ser;r.l H. 34. 42-54. Andersen, E..R. (1973). A goodness of fil lest for Ihe Rasch modeL P.tydfllnlellika. 'ul. 123·140. Andrich, D. (I978a). Application of a psychometric raling model 10 ordered categories which are scored wilh successive integers. Applied Psycholollicol Me{/.ru"enltnt. 2. 581-594. Andrich. O. (1978b). A hinomiallalenl Irait model for the siudy of Likert-slyle allilUde questionnaires. British Journal Malht'l'lUltical /llId Stllli.
ot
161
162
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
Bejar. I. I. (1980). A procedure for investigllting the unidimensionali!y of achievement II'~I~ bRsed on ilem parameter e~lillliltes, .Imunlll of {'.'d,It'IIliollal M"II,UfI'I·m"I1I. 17. 2113-296, Airnbnum. A. (19611), Some latent trllit n10deh lind their use in inferring 1111 examinee '5 ability. In F. M. Lord and M. I{, Novick. Slllli.~lkllll"f'()ril'.f of melllalirsl .H'(/I'I',< (chaplers 17-20), Reading. MA; Addiwn.We~ley. Bock, R. D. (1972), Estimating item parameters and latent ability when re~ponses are scored in two or more nominal categories. P.fychomnrilw. 37.29·51. , , Bock, R. Do, & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychomttrilt.tl, 46. 443-459. Bock, R. D., Gibbons, R., & Muraki, E, (19811). Full information item factor analysis. Applit'd Psycllo/(>!lical Ml'a.
Bock, R. D., & Lieberman, M. (1970). Filling a response curve model for dichotomously scored item~. P.fychomttrika. 35, 179-1911. Bock. R. D., & MisJevy. R. J. (1982). Adaptive EAP estimation of ability in a microcOlnputer environment. App/itd P.fychological Mta.turtmtnt, 6(4), 431-444. Bunderson, C. V., Inouye, D. K., & Olsen. J. B. (19P-9). The four generations of computerized educational measurement. In R. L. Linn (Ed.), Educational mta.fUrtmtnt (3rd ed.) (pp. 367·407). New York: Macmillan. Carlson. J. E. (1987). Mullidimtn.fional iltm rtSpOn,ft tlltory tstimation: A COfl/pUltl' program (Research Repol1 ONR87-2). Iowa City, IA: American College Testing, Cook, L. L., & Eignor. D. R. (1983). Practical considerations regarding the use or item response theory to equate tests. In R. K. Hambleton (Ed.), Applications af iftm rt.fpOnst th~ory (pp. 175-195). Vancouver, BC: Educational Research Institute of British Columbia. Cook. L. L.. & Eignor, D. R. (1989). Using item response theory in test score equating. Intunalional Journal of Edllwtional Rf'uard/, 13(2), 161·173.
Cook, L. L., Eignor, D. R.• & Taft. H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Mtasu/,tmtnt. 25( I), 31-45. de Gruijter, D. N. M. (1986). Small N dou not alway~ justify the Rasch model. Apptitd Psychological Mta.turtmtlll, 10, 187·194.
de Gruijler. D. N. M., & Hambleton, R. K. (19113). Using item response models in crilerion-referenced lest item selection. In R. K. Hamhleton (Ed.), Applications of ilf'm I'f'.fpon.ft thtol'Y (pp. 142-154). Vancouver, Be: EducationRI Research Institute of British Columbia. Divgi, D. R. (1986). Does the Ra.~ch model really work for multiple choice items? Not if you look closely. Jo",.",,! of Educational Mt'osul't'mt'ni. 23. 283-2911. Drasgow. F.• Levine, M. v.. & McLaughlin. M. E. (19117). Detecting inapproprillte test scores with optimal and practical appropriateness indices. Applied P,f.vdlOl,,~;{'(/1 Mta.turl'ml'nt, 11(1),59-79. Drasgow, E, & Lissak, R. I. (19113). Modified parallel analysis; A procedure for examining the latent dimensionality of dichotomously scored item responses. JouI'nal IIf Applied P,'y(·holo1(Y. 611. 363-373.
Drasgow, E. & Parsons. C. K. (19113). Application of unidimensional item response theory models to multidimensional data. Applied P.r,lI('holo~i('(J1 MeOml'l'mnli. 7, IR9-199. Embretson, S. E. (1984). A general laten! trait model for response processes. I'sydw· mel/ilea. 49. 175-186.
Rrjrll'/w('s
16]
Fraser, C. & McDonald. R. I'. (1988), NOHARM: Least squares ilem faclor analysis, M"I/;I'(l/';tlft' B,'IIm';oral Rt'.tl'aFl'h. :B. 267 ·26'), OIas. C. (1990). RUM: RtHI" inl'ompll'u llt'.";1(1I ll/ltlly,fi,\, Alllhelll, The Nelhe!ltlnds: National In~titule for Elducational MellNurelllenl. (loftlsteill. II" & Wood. R, (1989). Five decadC!'s of hem reslJ,mse modelling. Bfitl.th hmrnal of Mat/It'matka/ and SfaliJI;('al P,fyd",/oIlY. 42, 1]9-167. Green. B. E. Bock. R. D.... umphreys. L. G .. Linn. R. L.. & Recklls<". M. D. (1984), Technical guidelines for assessing computerized adaptive tests. Jtmma/ of E:ducafional Mrasurl'mml. 2/(4), 347·360.
Green. 0, R .• Yen. W. M., & Burket, G. R. (1989). Experience~ in Ihe application of item respoose theory in test construction. Applit'd Mfa.tUrfmeni in Edu/'afi(ln. 2(4). 297-312. Gullihen, II. (1950). Tht'ory of meniallt'.tIS. New York: John Wiley, GustafslIon, J. E. (1980a). A solution of the conditional eslimalion problem for long teslS in lhe Rasch model for dichotomous items. Edllcational anll P.fycholoRical MeaSIIreml'tlt. 40. 377-385.
Guslafsson. 1. E. (l9110h). Teslina and obtaining fit of dala 10 Ihe Rasch model. Briti.th ) ournal of Mathematical and Statistical P,fycholoRY. 33, 205-233. Haebara, T. (1980). Equaling loailltic ability scales by weighted least squares method. )al'ont'u P.Tycho/agicol Rtsearch. 22.144-149. Hambleton. R. K. (Ed.). (1983). Application,f of iUIfI rnponu theory. Vancouver, BC: Educational Re!learch Institute of British Columbia. Hamblelon, R. K. (1989), Principles and selected applications of item response theory. In R. L. Linn (Ed.). Educationa/mt'asurt'lfIt'nl Ord ed.) (pp. 147-200). New York: Macmillan. Hambleton. R. K., & Cook. L. L. (1983). Robustness of item response model~ lind effects of leM length lind sample size on the precision of abilhy eslimates. In D. Weiu (Ed.), Nt'w horizon.' in testing (pp. 31-49). New York: Academic Press. Hambleton. R. K., & de Gmijter, D. N. M. (1983). Application of item response models to criterion-referenced test item selection. )(1l1rnal IIf Educational Ml'osuft'mt'nt.
20.355·367. Hambleton, R. K.• JOl1e~, R. W., &: Rogers. II. 1. (1990. August). Inflllmee of ittm paramt'ft'r t'.ttimation t'rmr., in ft',U deve/opmt'nt, Paper presented at the meeting of American P~ychologicftl ASlIOCilltion, Boston. lIambleton. R. K., &: Rogers, H. J. (19119). Detecting potentially hiRsed test items: Comparison of IRT area and Mantel·lJaenszel methods. AI'plitd Ml'a,fllrt'mt'nl in EduC'Olion. 2(4). 313-334. lIambleton, R. K .• & Rogers. H. J. (in press). Assessment of IRT model fit. Applit'd P.tYI·holog;"a/ Mt'Dsurt'nlt'nf.
Ilalnblelon. R. K., & Rovinelli. R. J. (1973). A Fortran IV program for generaling examinee response data rrom logistic lest models, Behavioral Sci"nct'. 17, 73·74. Ilaml1lelon. R. K.• & Rovinelli. R. J. (1986). Assessing the dimensionality of a set oflesl items. Applied P.~ych(/lollical Mta.tureml!tlt. 10, 287·J02. f1amblelOn. R. K .• & Swaminathan. H. (1985). I",m rt'spon.lf. theor.... : Prin";pl"s and ''{If/liclilions. 80Slon: Kluwer. Ifftmhlel
164
/
FUNDAMENTALS OF ITEM RESI'ONSE TIIEORY
lI"mblelOn. R. K .. &. van tier Linden. W. J. (19112). Advances in item teSflon~e theory lind applications: An introduction. A,'plie,/ Psychlliogimi Mm.mr"mnlt, 0, 37.\-3711. lIambleton. R. K.• Zaal. J. N .• &. Pielers. J. M. P. (1991). Computeril.ed adaptive testing: Theory. applications. and standards. In R. K. Hambleton &. J. N. Znal (Ed.~.). Advanas in edt/catlonal and psy.. hological testing: 1'I,('(}ry and applicatiml.t (PI'. 341-366). Boslon: Kluwer. lIarris. D. (1989). Comparison of 1-. 2-, and 3-parameter IRT models. Educational ' , Measurement: Issu~s and Practice. 8, 35-41. lIallie. J. A. (1985). Methodological review: Assessing unidimensionality of tests and items. Applied Psychological Mea.mrl'ment. 9. 139-164, Holland. P. W.• &. Thayer. D. T. (1988). Differential item performance and the MantelHaenszel procedure, In H. Wainer &. H. I. Braun (Eds.). T(',vt validity (pp, 129-145), Hillsdale. NJ: Lawrence Erlbaum. Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. PsycfromnriKa, 30, 119-185. Kendall, M. G .• &. Stuart. A. (1961). The advanct'd theory ofstati.vtic.v (Vol. I). New York: Hafner. Kingsbury, G. G .. &. Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applil'd Mea.mrt'mt'nt in Education, 2(4). 3~9-375. Kingston. N. M., &. Dorans. N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied P.fychologiml Mt'asuremt'nt, 8. 141154. Kingston. N. M.• &. Dorans. N. J. (19115). The analysis of item-Ability regressions: An exploratory IRT model fit tool. Apl,/i,.d Psyclrolo(liral Mell.wft'ItIt'"t, 9, 281-2118. Kingston. N. M .• &. Stocking. M. (1986. August). P.~y('homl'tri(' i.l.fllt:s;n IRT·ooud tnt con.flr/wtian. Paper presented at the meeting of American Psychological Association. Washington. DC. Klein, L W.• &. Jarjoura. D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Mt'asurement, 22(3), 197-206. Kolen. M, J. (19811). Traditional equaling methodology. Edrlt'otiorlal Mt'a.vuremt'FIt: 1s.IIIt'S and Pro('/ice, 7(4),29-36, Linn. R. L (1990). Has item response theory increased the validity of achievement test scores? Applit'd Mea.ml'emt'nt in Edllwtion, 3(2). 115-141. Linn. R. L.. &. Hambleton. R. K. (1990). Cu.flilmizu/ tNt.~ and (,I/.fIomiuti rwrnl.f (CRESST Technical Report). Los Angeles: UCLA. School of Education. Linn, R. L.. &. Harnisch. D. L. (1981). Interactions between item content and group membership on achievelJlent test items. Journal of Educational Mea.fUremt'n1. 18. 109-118. Linn. R. L.. Levine, M. V.• Hastings. C. N .• &. Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applit'd P.tycholo(lica/ Mt'a.furt'mt'lII. 5. 159-1 n. Lord. F. M. (1952). A theory of test SCOrt!.f (Psychometric Monograph No.1). IOWA City. IA: Psychometric Society. Lord. F. M. (1974). Estimation of latent ability and item parameten when there are omilled responses. P.~ychometrika. 39, 247-264. Lord. F. M. (1977). Practical applications of item characteristic I:urve theory. JOllrnal of EdllcationaIMt'asllremenl,14,117-138.
165 Lord. F M. (l9RO). Applications of item re.fpmHI' theory /tI1'/'(It·t;'ol tf'Stillll problm,.f. IIitl~dalr. NJ: Lawrence Erlbaum. Lord, F M. (19114). Standard errors of measurement at differt"nl ability level~. Journal"f Fducolim,ol Ml'o.fUrl'ml'lIt, 21, 2]9·243. Lord, F M., & Nov ick, M. R. ( 1968). St(lti.uit'Olll,etlnes 11/ mellwlll!'.,1 H me.'. Reading, MA: Addison·Wesley. Ludlnw, L. II. (198.5). A slrlltegy for the graphical representation of Rasch model residuals. Edu('OliOllol and P.rycholollical Mro.IIII'l'nlent, 45, 85 I -8W. Ludlow, L. H. (1986). Graphical analysis of item response theory residuals. Applied P.fycholollkal Ml'a.furl'mt!nt. /0,217-229. Maslers. G. N. (19R2). A Rasch model for parlial credit scoring. P.t}chonletrika, 47.
149-174. Masters, O. N .• & Wrighl, B. D. (1984). The essential process in II family of measurement models. P.fJ'l'homl'lri/ca, 49. S29-.544. McDonald, R. P. (1967). NOli-linear factor anaIY.fi.f (Psychometric Monograph No. IS). Iowa City: IA: P~ychometric Society. McDonald, R. P. (1981). The dimensionality of leslS lind ilems. British Journal of Malht!matical and Sfatistieal P.fycholoIlY. 34. I ()()·I 17. McDonald. R. P. (1989). Fuilire directions for item response theory.llllernational }(mrnal of E""('(1tiollol Re.ft!arch. IJ(2), 20.'1·220. McLaughlin, M. B .. & Drasgow. F. (1987). Lord's chi-square test of item bia~ with estimated and with known peraon parameters, Af)I'lied Psychological Mt!asurt!ment. 11,161·173, Medin Interactive Technologies, (1986). Micro.fl·ale. Black Rock, CT: Author. MeJlenbergh, G. (19119). Item billS lind item res(lOn~e theory. International Journal of Educatioflal Rt!.ftOI'Ch. 0(2). 127- 143. Millman, J.. <1 Arter, J. A. (1984). Issues in item banking. Journal of Educational Mea.fu/'emt'flt, 21. 315·330. Mi~levy, R. J. (1986). Bayes modal eslimalion in item response models. P.f.'vchometriJ:a. 51.177·19S. Mislevy, R. J., &, Bock, R. D. (1984). BILOG: Ma,rimltn' likelihood item anaIY.fi.f and tnt ,f('oring with logi.ftie modtl." Mooresville. IN: Scientific Software. Owen, R. J. (1975). A Bayesian !lequenlial procedure for quantal res(lOnse in the context of adaptive menial testinl. Journal of tht American Statistical Association. 70,
351·356. Phillips, S. E., & Mehrell~. W. A, (1987). Curricular differences and ullidimen5ionaJity of achievernenl test d8ta: An exploratory analysis. Journal of Educational Measur<'·
menl.24.1·16. Rajl!, N. S. (1988). The IIrea between two item characteristic curves. P.tychometri!.:a. 53,
495-502. Raju. N. S. (1990). Determining Ihe significance of estimaled signed and unsigned areas between two item response functions. Af'pli~d P.t),c/wlollical Mea.Hlreml'nt. 14(2),
197·207. Rasch, O. (1960). Probabilistic modt!l.f far some intel/illl'nl'{, ami attoi"melll t{'.Us. Cupenhagen: Danish Institute for Educationol Resear!:h. Reckase, M. D. (1979). Unifactor latent trait models Applied to multi-faclor tl"st5: Results and implications. Jm/mal of Edllm/imllil Slatistin, 4. 207·2.m.
FUNDAMENTALS OF ITEM RESPONSE TIIEORY
166
R. (1990). Compult'r a(lapl;V~ lotinfr D(',,~/(}pm..nl and opplictllion (Lahora· tory of P~ycbOlnetrjc and Evaluative Research Report No. 2(4). Amherst: IIniversity of Musachusells. School of Education. Rogers, II. J., &. Hambleton. R. K. (1989). Evaluation of COlllpul('!r simulated haseline statislics for uSe in item bias studies. Educalional and Psycholol(it'al M~o.Tur('mmt. Re~hetar.
49. 355·369. Rogers. H. J., &. Hallie. J. A. (1987). A Monte Carlo investigation of several person and item fit statistics for item response models. IIppli~d Psycholol(ical,M.~asur('m('nt. 11.47-57. Rudner, L. M.• Getson, P. R .• &. Knight, O. L. (1980). Biased item detection techniques. JourNlI of Educot;ONlI Statistics. 5.213-233. Safrit. M. J., COllta, M. G .• &. Cohen, A. S. (1989). Item respollllC theory and the mea&urement of motor behavior. R~s~an:h Qllarl~rly for E:urds~ and Sporl. MJ.
325-335. Samejima, F, (1969). Estimation of latent ability u~;ng a ","sf'Onu pallern of I(radt'd ~eor~~ (Psychometric Monograph No. 17). Iowa City.IA: Psychometric Society. Samejima, F. (1972). II gt!nerol model for fru rt!.fpon.ft! dOlO (Psychometric Monograph No. 18). Iowa City. IA: Psychometric Society. Samejima. F. (1973). Homogeneous case or the continuous response model. PsyrflOm~triA:a. 38. 203·219. Samejima. P. (1974). Normal ogive model on the continuous response level in the multidimensional latent space. Psychom~tr;A:a. 39. I I 1-121. Samejima. F. (1977). A use of the information function in tailored te5ting. IIpp/it'd Psyeholol(ical Uemurement. I. 233-247. Shepard, L. A .• Camilli, 0., &: Averill. M. (198 I). Comparison of procedures for detecting test·item bi ... with both internal and eJlternal abili ty criteria. Journal ()f Educalional Slal;.II;e.T. 6. 317·375. Shepard. L. A., Camilli. G .• &: Williams, D. M. (1984). Accounting ror statistical ani facts in item bias research. Journal of Educalional Slali.YI;es. 9. 93-128. Shepard. L. A .• Camilli, G., &. Williams. D. M. (1985). Validity or approximation technique~ for detecting item bias. Journal of EduHllional Ut'asur~mt'nt. 11(2). 77-105. Spray, J. (1990). One-parameter ilem respon!le theory models for psychomotor tests involving repealed. independent attempts. Reuarc:l, Quartuly for E,url'iu and Sporl.6/(2), 162·168. Stocking, M. L. (1990). Specifying optimum ell.8minees for item parameter estimation in item response theory. Psych,mrttriA:a. 55(). 461·475. Stocking. M. L .• &. Lord. F. M. (1983). Developing a common metric in item response theory. IIpplil!d Psyt'holol(ical Ul!asllrt!mt'flt. 7.201 -210. Subkoviak, M. 1., Mack. J. S .• Ironson. G. H .• &. Craig. R. D. (1984). Bmpirical comparison of selected item bias detection procedures with bias manipulation. Journal of EducalionoI Ueo.fu","me,,/. 1 I( 1).49-.58. Swaminathan. II. (1983). Parameter estimation in Item ·respunse model •. In R. K. flambleton (Bd.). IIpplicalions of item r~~f'OllSe Iheory (PI" 24·44). Vancouver. DC: Educational Research Institute of Brilish Columbia. Swaminathan. II.. &: Gifford, J. A. (1982). Baye5ian estimation in the Rasch model. Journal of Educalional Slatislics. 7. 175-191.
167 i Swaminathan. 11.. & Gifford. 1. A. (198]). Estimation of parameters in the three· parameter latent trait llIodel. In I). Weiss (Ed.). N~w hmizmu in ,,,stinK (pp. 13· 30). New York: Academic Press. Swaminalhan, B.o & Gifford. J. A. (I98S). RayesiUIl eSlillllllion in the two'parameler logistic model. PJychomt'lrika. 50. 349·364. SW8min8than. If .. & GiUord, J. A. (1986). Bayesian estimatiun in the three-parameter logistic model. Psyr/wmt'lrika, 51. 589-60 I. Swaminathan. II., & Rogers. B. J. (1990). Detecting differenlial ilem funclioning using logistic regression procedure~. Jmmw/ 01 Edu('otiona/ Mea.uII-emf'IfI, 27(4), 361· 370. Talsuoka. K. K. (l9l!7). Validation of cognitive sensitivity for item response curves. Journal 01 Edllrat;onal M('alfurt'mmt. 24. 233·245. Thissen. D. M. (1986). MULTlLOG: IUm onaly.,i.l al/(I .
\ I 1
168
FUNDAMENTALS OF ITEM RESPONSE THEORY
Wingersky, M. S., Barton, M. A., & lord. F. M. (19112). I.OGIST uur's Kuit/e I'rincelOn, NJ: Educational Tesling Service. Woodcock. R. W. (1978). Developm"nl and stl1ndardiZfllion of Ihe WoodcocK-Johnson P.rycho-Educational Battt!fy. Hingham, MA: Teaching Resources Corporation. Wright, B. D. (1968). Sample-free test calibration and penon measurement. PmcudinR.r of Iht! 1967 Invitational Conft!ft!nu on Tt!5tinK Prohlems. Princeton, NJ: Educational Testing Service. Wright. 9. D. (1977). Solving measurement problems with the R~s~h model. Journal of Educational Measurt!menl, 14,97-116. Wright. 9. D., Mead. R. J .• &. Bell. S. R. (1979). BICAL: Calibraling ill'm.r wilh Ihe Rasch modd (Statistical Laboratory Research Memorandum No. 239). Chicago: University of Chicago, School of Education. Wright. B. D .• Schulz, M., &. Linacre, J. M. (1989). BIGSC ALE: Rasch analysis compultr program. Chicago: MESA Press. Wright, B. D., &. Stone, M. H. (1979). Besl 11'51 dt!.tign. Chicago: MESA Press. Yen, W. M. (1980). The extent. causes, and importance of context effects on item parameter~ for two latent trllit modeh. Joltrntll of Educalional Mea.tllreml'nt. 17, 297-311. Yen, W. M. (1981). Using simulation results In choose a latent trait model. AI'I!lied Psychological Mt!'osureml'nl, 5, 245·2ti2. Yen. M. W. (1983). Use of the three-parameter logistic model in the development of a standardized achievement test. In R. K. Hamhleton (Ed.), Ap"lif'otiot/,f of ilem rnponse theory (pp. 123-141). Vancouver, BC: Educational Research Institute of British Columbia. Yen, W. M" Burket, G. R., & Sykes. R. C. (in press). Nun-unique stllutions to the likelihood equation fur the three-purillneter lugistic lIlodel. l'.nc!wmr/riAtI.
Index
ability scores, 10, 77·88 ability scores, validity of, 78 achievement tests, 102 adaptive testing. See computerized adaptive testing Aitkin, M., H, 46 American Educational Research AsS()ciation, I American l'sydlOlogical Association, I anchor test design, 128-129. 136-141 ANCILLES, 47,49 Andersen, E. B. 46.58 Antlridl. 1> .• 26 Anl\off, W. II., 123, 125 Ansley, T_ N.• 58 area statistic, 113 115 Arter. J. A .. 147 ASCAL. 47.4') AssessfIlt'nt Syslrm~ Corpnrlltion. 4749 IIssumpti()n.~. 55· 59 assumptions. (;hc('king of. 55-59.61-6.1 authentic measurement, 15.1 Averill. M, 120 Haker. F. H.. 15.45,57 nfll'lnn, M. A., 4(; Heyesian estinllllion procruures, J8-:\<), 4.1-44 Hayesiall item selection. 148 Hejar, I. I.. 56 Dell. S. R.. 42 AlC'AL, 42. 46.48 OIGSCALE. 42, 48 RlLOG, 41. 47, 50
hinomial trials model. 27 Birnbaum, A., 14,91-92,94 Rock, R. n., 26.37,43,47-47,49,56, 146,I5J Boekkooi-Timminga, E., 10 I, 103 Bunurrson, C. V., 145 Burket, O. R .• 37.95 Camilli, G .• 57. 120 Carlson, J. E.• 47.49 characteristic curve method. 134·135 classical item tli friculty. 2, 5 dassicHI item discrimination. 5 clussical measurement, 2-5 dassicaltrue scort' model. 4, 7 co!(nilive <':ollll'ollenis. 154 ('"hrll. A. S .. ,'Oflllllnll !,t'''OIlS (ksi!!lI. 1211 (:(llllllllll.'ri/rd Hllal'live tt'~tinl/., 14S-1 S2 cOllcurrell1 (lIlihrlllinll. IJS· 136 condilional nla~irnum likelihood e~lim"tion.46 Cook, L. L.. 5. 58 Cmta, M. G., 27 Cfll ill , R. D., 12() ,'I ilt'rinn . n,I'('re1Il'l."d le~lS. 78, 102 (,lIshlmi/.,'d 1"'ling. 117 cUI-off score, 1<5
n
DATAGEN,54 de Gruijler, D. N, M .. 86, 94,102 differemial ilem funclioning. 109-120 Divgi, I). R., 5.1 domain ~(nre, R.~
Jot)
170 Ullr;IIIS.
FUNDAMENTALS OF ITEM RESPONSE TIII'ORY
N. 1. .'111·.'\9
Drasgow.
r .. ~6. ~II, 62.
113. I ~4
EAP estimates. Su expC'cted
1\ posteriori estimales eigenvalue plot, 56 Eignor, D. R., .'I, 58 Embretson, S. E., 154 equating, classical. 123·126 equating, definition, 123 equaling. IRT, 126·142 equipercentlle equaling, 123 124 equivalent group design, 1211 expected a posteriori estimates, 39,44
Forsyth, R. A., 58
Fraser, C., 47, 49
Getson, P. R., 109
Gibbons, R., 56
Gifford, J. A., 42-44, 46
Glas, C. A. W., 47-48
Goldstein, H., 154
goodness of fit, 24, 53-74
goodness-of-fit statistics, 61
graded response model, 26-27
Green, B. F., 146, 148
Green, D. R., 95
group dependent item statistics, 3
growth, measurement of, 102
Gulliksen, H., 57
Gustafsson, J. E., 47-48, 58
Haebara, T., 134
Hambleton, R. K., 4-5, 36-37, 43, 54-59, 70, 78, 82, 86-87, 102-103, 112, 114, 133, 136, 148
Harnisch, D. L., 120
Harris, D., 15
Hastings, C. N., 112
Hattie, J. A., 53, 56, 58
heuristic estimation, 46
Holland, P. W., 115, 119
Horn, J. L., 56
Humphreys, L. G., 56, 146
inappropriateness measurement, 154-155
indeterminacy problem, 41-42
tllrorlllnli"n hllll'l iUII . .'il'l' Il'S! 01 ;relit information function informotion nlalrix, 44 Inouye, D. K., 145 il1yuriance, R<sessmcnt of. <;7, ~'i'). 63-66 invariance. property of, 18·25 invariant model parameters, '8 • /ronson, G. II .. 120 item billS. 109 item characteristic curve. 7 item charncteristic function, 7 item difficulty parameter, 13 item discrimination parameler, 15 item inforrnalinn runction, 91-94,99106,147 item misfil slatistics, 61, 72·14 item selection, 99·106 Jarjollra, D., 128 joint Bayesian estimation, 46 joint muimlllll likelihood estimation, 41· 44. 46 Jones, R. w., 10.\ Kendall, M. G .• 39 Kingsbury, G. G .. 148, 152 Kingston, N. M., 58-59. 102 Klein, L. W., 12K Knight, D. L., 109 Kolen, M. J., 123 Lam. R., 61 IRtenl ~pace, 10 Levine, M. V.• SR. ItO, 154·155 Lieberman, M., 43 likelihood function, 34 Linacre. J. M .• 42 linear eqUAting, 124-12.5 linear programming, 103 linking designs, 128·129 Linn, R. L., 1R, 87. 110, 120, 132. 146 Lissak. R. I.. 56, 62 Incal independence, 9-12 LOGIST. 42. 46-47, 49. 135 IOl!i~tic models; one·parameter. 12·14, RI-Kl two·parnmeler. 14-17 three-parameter. 17 ·18
Owen, R. J., 148
parallel tests, 4
Parsons, C. K., 58
Phillips, S. E., 58
Pieters, J. P. M., 148
PML, 47-48
Poisson counts model, 28
polytomous response models, 26-28, 153-154
predictions, 59-61
pseudo-chance-level parameter, 17
Q1 statistic, 54-61
Raju, N. S., 113-114, 121
RASCAL, 48
Rasch, G., 14, 46
Rasch model, 14
Reckase, M. D., 56, 146
regression method, 130
regression models, 19, 32
relative efficiency, 95-96
reliability, classical, 4, 94
Reshetar, R., 149
residuals, 59-61, 66-67
RIDA, 47-48, 135
robust mean and sigma method, 132-133
Rogers, H. J., 53, 55, 103, 114-115, 119
Rovinelli, R. J., 54, 56, 70
Roznowski, M. A., 56
Rudner, L. M., 109-110, 113
Safrit, M. J., 27
Samejima, F., 26, 95, 153-154
scaling, 125-126
scaling constants, 129-135
Schulz, M., 42
score transformations, 78-87
Shepard, L. A., 57, 120
single-group design, 128
speededness, assessment of, 57
Spray, J., 26-27
standard error. See standard error of estimation
standard error of estimation, 44-45, 94-95, 106
standard error of measurement, 4, 94
standardized residuals, 59-61, 68-72
Steinberg, L., 153
Stocking, M. L., 63, 102, 133-135
Stone, M. H., 5, 14, 57-58
Stuart, A., 39
Subkoviak, M. J., 120
Swaminathan, H., 5, 36-37, 42-46, 55, 57-59, 78, 112, 119, 133, 136
Sykes, R. C., 37
Taft, H. L., 58
tailored testing. See computerized adaptive testing
target information function, 101-102
Tatsuoka, K. K., 58
test characteristic curve, 85
test dependent ability scores, 5
test development, 92-96, 99-106
test fairness, 109
test information function, 38, 44, 94-95, 100-106
Test Standards, 1
Thayer, D. T., 115, 119
Thissen, D. M., 47, 49, 58, 153
Traub, R. E., 58, 61
true proportion-correct score, 85
true score, 2, 84-87
Tucker, L. R., 56
unidimensionality, 9-10, 57
Urry, V. W., 46-49
Vale, C. D., 128, 142
van der Linden, W. J., 4, 101, 103
Wainer, H., 58, 145-146
Wardrop, J. L., 112
Weiss, D. J., 5, 145-146, 148, 152
Williams, D. M., 57, 120
Wingersky, M. S., 42, 46, 48
WITS scale, 80
Wolfe, R. G., 58, 61
Wood, R., 154
Woodcock, R. W., 80
Woodcock-Johnson Psycho-Educational Battery, 80
Wright, B. D., 5, 14, 26, 42, 46-48, 57-59, 78, 153
Yen, W. M., 37, 54, 58, 61, 95, 103
Zaal, J. N., 148
Zara, A. R., 148
About the Authors
Ronald K. Hambleton is Professor of Education and Psychology and Chairman of the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts at Amherst. He received his Ph.D. in psychometric methods from the University of Toronto in 1969. His principal research interests are in the areas of criterion-referenced measurement and item response theory. His most recent books are Item Response Theory: Principles and Applications (co-authored with H. Swaminathan), A Practical Guide to Criterion-Referenced Testing, and, forthcoming, Advances in Educational and Psychological Testing (co-edited with Jac Zaal). He has served as an Associate Editor to the Journal of Educational Statistics (1981-1989) and currently serves on the editorial boards of Applied Measurement in Education, Multivariate Behavioral Research, Applied Psychological Measurement, Journal of Educational Measurement, Educational and Psychological Measurement, Evaluation and the Health Professions, and Psicothema. He has served also as President of the International Test Commission (1990-1994) and as President of the National Council on Measurement in Education (1989-1990).

H. Swaminathan is Professor of Education and Psychology at the University of Massachusetts at Amherst. He received his Ph.D. in psychometric methods and statistics from the University of Toronto in 1971. He has held the positions of Associate Dean of Academic Affairs and Acting Dean of the School of Education. He has served as an Associate Editor to the Journal of Educational Statistics and currently is an Associate Editor of Psicothema and Revista Portuguesa de Educação. He has served also as the President of Educational Statisticians, a special interest group of the American Educational Research Association; co-program chair of Division D of AERA; and as a member of the Graduate Record Examinations Board. His principal research interests are in the areas of item response theory, multivariate statistics, and Bayesian analysis.
H. Jane Rogers is Assistant Professor at Teachers College, Columbia University. She received her Ph.D. in psychometric methods from the University of Massachusetts in 1989. Her research interests include item response theory, large-scale assessment, methods for the detection of differential item functioning, Bayesian methods, and multivariate statistics.